We showed off a re-skinned version of our Ångströ project at in San Diego that I hope makes a bit more sense as microsearch, you know, for . :)

We pointed at a live instance at http://tpd.angstro.net:19988/, so go search, grab the miffy bookmarklet, and start adding microformats to our big shared pile of bits!

Congratulations on an applause-winning demo to Ben Sittler, the mad JavaScript genius behind the whole system, and to Elias Sinderson, who added semi-structured XQuery to it!

Herewith, some notes from our slides…


What is “Atomic-Scale”?

Web pages contain chunks of information

A natural consequence of growing adoption of template languages & content management tools

Feeds create the illusion of immediacy

As chunks of information change, we can expect notification (in the form of updated feed files)

Microformats create the illusion of structure

Even if it’s HTML all the way down, we can read it

… so maybe REST will make more sense for atoms than for pages

How miffy works

walks through the document looking for the ‘root classes’ of the µfs it knows about

places green anchor boxes in front of them
using css — no graphics, since we want it to work offline

‘capturing’ clones those DOM nodes, then walks the tree to “reformulate” it

the only data structure that can represent all future µfs is the DOM itself
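
The first step above (walking the document for root classes) can be sketched roughly as follows. This is not miffy's actual code, which is a JavaScript bookmarklet working on the live DOM; it is a minimal Python illustration that assumes three well-known root classes (vcard, vevent, hreview), and the names ROOT_CLASSES, RootClassFinder and find_roots are made up for the example.

```python
from html.parser import HTMLParser

# Root class names of a few published microformats (an assumed selection;
# the slides do not say which formats miffy actually knows about).
ROOT_CLASSES = {"vcard": "hCard", "vevent": "hCalendar", "hreview": "hReview"}


class RootClassFinder(HTMLParser):
    """Collect (tag, microformat) pairs for elements carrying a known root class."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list of tokens.
        tokens = (dict(attrs).get("class") or "").split()
        for token in tokens:
            if token in ROOT_CLASSES:
                self.found.append((tag, ROOT_CLASSES[token]))


def find_roots(html):
    """Return every element in the markup that starts a known microformat."""
    finder = RootClassFinder()
    finder.feed(html)
    finder.close()
    return finder.found


sample = (
    '<div class="post">'
    '<span class="vcard"><a class="fn url" href="/">Example Person</a></span>'
    '<div class="hreview item">Nice demo</div>'
    '</div>'
)
print(find_roots(sample))  # [('span', 'hCard'), ('div', 'hReview')]
```

A real capture step would then clone the matching subtree rather than just record its tag name, which is where keeping the DOM as the data structure pays off.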

For More Information

Is Open Source —
but not yet an open-repository

grab snapshots of the code and our Subversion archive from our wiki: https://commerce.net/wiki/tpd

Uses Open Source

Depends on some other open source projects you’ll need:

  • DBXml from Sleepycat Software
  • BeautifulSoup by Leonard Richardson
  • Feedparser by Mark Pilgrim
  • … and (not least!) Twisted by TwistedMatrix

Test Service

Running at http://tpd.angstro.net:19988

This afternoon, PubSub and Broadband Mechanics are announcing a “structured blogging initiative” at the Syndicate conference. The press release even includes a quote in support from us here at CommerceNet:

CommerceNet believes strongly in the vision of bootstrapping a more intelligent Web by embedding semi-structured information with easy-to-author techniques like microformats. Through our own research in developing tools for finding, sharing, indexing, and broadcasting microformatted data, we appreciate the challenges these companies have overcome to offer tools that will interoperate as widely as possible. We applaud their recent decision to support the microformats.org community in all of the core areas where commonly accepted schemas already exist, such as calendar entries, contact information, and reviews.

Given that we’re strong supporters of microformats.org, why did we take this stand? First and foremost, for the reasons stated above: because they’re committing to shipping tools that make it easier to produce microcontent using microformats. Even if they were supporting any number of other formats, we’d be glad to welcome any new implementations to the fold.

Of course, we’d prefer to minimize any confusion, too. Many other implementations exist for microformats and are copiously documented and discussed in public forums at microformats.org. Clearly, the (re-)launch of a public .org site titled StructuredBlogging with aspirations to non-profit status of its own could lead to perceptions that there’s some sort of “vs.” battle going on.

That might even have been true, a few months ago when the idea-of-structured-blogging was still conflated with a debatable proposal for structured-blogging-the-format that hid chunks of isolated XML within otherwise readable documents using a <SCRIPT> tag. The major news here today that we’d like to celebrate is that they’re in favor of using microformats for all of their core, commonly-used schemas like reviews, events, and lists.

Now, is the old format still in their code tree when you grab their alpha plugin? Sure, and there will always be room for developers who really, really want to cons up their own schema out of thin air. The microformats-rest mailing list is grappling with the same problem, focusing on XOXO as a solution for now.

The more intriguing implication of their work at StructuredBlogging.org is their microcontent description (MCD) format. Even if it’s all hReview at the bottom, there’s room for custom UIs for reviewing movies that are different from reviewing restaurants, and we’ll see if that’s where these explorations lead…

Tim Bray had an (in)famous series of blog postings he called the Technology Predictor Success Matrix, or TPSM. It came up again recently in a debate over the validity of the Web2.0 moniker. I thought I should try applying it to see how Microformats might fare…

In his original series, Tim Bray evaluated 7 big winners from the last few decades of the computer industry, and 7 big losers according to 9 metrics. The only one that (IHNSHO) was a useful predictor was the 80/20 rule. Nonetheless, the complete list includes:

Compelling idea

The idea that the web is “full of data” just waiting to be harnessed by computers seems so compelling that the entire Semantic Web movement was founded on it. Tantek originally got quite a bit of mileage out of coining the term “lowercase semantic web” to explicitly signify how closely the microformats vision drafts behind the great expectations for the Semantic Web.

More specifically, however, the microformats idea is to weave semantics into ordinary web pages — using a feature primarily intended for style sheets to encode hints about meaning, too. So we can’t simply share the, say, 7/10 score of Semantic Web.

Personally, I think that microformats come even closer than XML ever did to the vision of “imagine a web where you could express <price> as easily as boldface.” So for the idea of “semantic highlighters” (the idea that anything you can select in a browser can also be marked up as meaningful), I’d rate it a 9/10.

WWTBT? I don’t think Tim would agree — his standard for a 10/10 is AI, and a 9/10 is VRML; he’d rate even breakthrough ideas like the Web or Java in the 5-6 range because the meme is NOT possible to transmit in a few words. So perhaps I’m wildly off by conflating compelling-to-web-geeks with compelling-to-grandma. Because all of the microformats work to date is still less compelling to grandma than dragging-a-satellite-map-image around.

Technical elegance

As an advocate, I’d like to say microformats are an elegant re-imagining of the role of the CLASS, REL, and REV attributes in HTML, but it’s still a hack. Perhaps I could award a 9/10 for cleverness, but that’s not exactly the criterion. To wit: “an entry gets a ten if the inventors are up for the Turing Award; a zero goes to glue-and-string, duct tape and sweat, the things that only work despite themselves.”

Well, there’s no way a microformat’s going to win the Turing Award; it’s not even clear it could ever win the System Software Award. Heck, given the CS/AI community’s thrall to the Semantic Web, I’m not even sure a grad student should pursue a grand-unified-theory of microformats in pursuit of a Doctoral Dissertation Award (instead, look to past winners such as machine learning to translate existing XML data sets).

So, while I’d like to think microformats are elegant, I’d have to score it closer to 3/10.

WWTBT? He might rate it even lower, since an especially-important indicator of elegance for data-formats is the ease of ‘downstream’ processing. It’s spectacularly easy to ‘access’ members of a microformatted data structure for formatting using CSS selector syntax, but frankly, quite painful to do so from XPath or from the DOM. It may be unfair that so many developers have to sit down and write their own getElementsByClassName() function, but it still detracts further from microformats’ elegance score.
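
To make the pain concrete, here is roughly the helper everyone ends up writing. In CSS the same lookup is just `.vcard .fn`. This is a minimal sketch using Python's standard xml.dom.minidom rather than a browser DOM; the function name get_elements_by_class_name and the sample markup are made up for illustration.

```python
from xml.dom import minidom


def get_elements_by_class_name(node, class_name):
    """Return descendant elements whose class attribute contains class_name as a token."""
    matches = []
    for element in node.getElementsByTagName("*"):
        # getAttribute returns "" when the attribute is absent, so split() gives [].
        if class_name in element.getAttribute("class").split():
            matches.append(element)
    return matches


doc = minidom.parseString(
    '<div class="vcard">'
    '<span class="fn n">Example Person</span>'
    '<span class="org">Example Org</span>'
    '</div>'
)
names = [e.firstChild.data for e in get_elements_by_class_name(doc, "fn")]
print(names)  # ['Example Person']
```

The token split matters: a plain substring test on the attribute would wrongly match a class like "fnord" when looking for "fn".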

Standardization process

What process? No, really, just kidding. Actually, it’s significant that the microformats movement is not aligned around any existing standards body. CommerceNet, to the degree we can be helpful, provides a neutral home for it (and develops some software of its own), but is not a ‘standards body,’ if indeed it ever was one.

The social norms that have developed around microformats so far emphasize the need for research into working systems and an interest in codifying what’s common practice already. It does not pay much of a strategy tax yet, because there aren’t so many existing formats that new ones are beholden to the past. And it might be said that the whole philosophy favors ease-of-authoring over ease-of-parsing, but at least it’s open and upfront about its tastes. I’d give it a 3/10.

WWTBT? Tim quoth “Open Source should really have a question mark rather than a zero, because it’s entirely oblivious to standards, it just cares about what works” — that’s probably closer to the spirit of microformats than, say, the Semantic Web (which might be in the sixes, based solely on the profusion of specifications! :)

(Apparent) Return on Investment

The posited return (to readers) of investment (by writers) is that microformatted data can be reused more easily. This is certainly possible today — particularly on the Mac, where a bit of XSLT wizardry turns blogs into live calendar feeds in iCal, or exports hCards to AddressBook.

However, the ROI proposition here has two weaknesses: the costs and benefits are allocated to different actors, and many of the ‘intelligent apps’ that consume microformats don’t exist yet. [To be sure, not many exist for the Semantic Web yet, either.] Since the key is the apparent ROI, I have to admit that many folks have adopted microformats more for the cool factor, or for how low the “I” is (‘just tweak your templates!’), than for a compelling, documented return. I’d guess it’s 3/10 as yet, but I’d hope it hits the 5/10 range of XML or the Web soon.

WWTBT? As he said of the Web: “the return on viewing everything as net-hyperlinked text through a document rendering engine was far from obvious.” I think that he might even place the ROI closer to that of SGML (1/10, in the it’s-good-for-you! eat-your-vegetables! sense). I wouldn’t be that harsh. That’s where I’d place the ROI on RDF at present :) <duck/>

Management Approval and Investor Support

The main limitation to measuring approval and support from these two classes of ‘suits’ is actually being aware of this technology’s existence in the first place… In fact, we’ve done a reasonable bit of exposing the VC community to microformats (e.g. our workshop at Supernova, persuading various darling startups to adopt them); and at least the scientifically-inclined bits of Corporate America may have seen the Semantic Web piece in Scientific American. But as yet, it’s probably closer to 1/10.

WWTBT? These two predictors in particular can be accused of circular logic: these scores vary significantly across the technology adoption lifecycle. I believe his scores were for the peak of each movement, so maybe he’d cut microformats a bit more slack a few years from now…

Good Implementations and Happy Programmers

You’d think these two qualities would go together hand-in-hand, but in fact there were a few divergent cases. Not least was XML, which he gave a 9 for ready-to-ship and a 3 for fun-to-work-with. Since microformats aren’t a single monolithic technology with an accompanying test suite, it’s that much harder to claim there are good implementations out there (2/10), but there are happy hackers (4/10), and in a transparent attempt to slip in a new category, I’d venture there are happy authors, too (6/10).

WWTBT? “A zero would go to something that arrived as idea-ware and then turned out to be hard to build.” I think that he might classify microformats as closer to idea-ware than I’d be comfortable admitting, but part of the evidence of that is the movement’s self-conscious mantle of promulgating philosophy as much as specifications.

80/20 Point

And so we meander to the punchline — because while the scores haven’t been pretty so far, they’ve all been measured against factors that were largely uncorrelated with eventual success. This one proved key — whether developers could enjoy “80%” of the benefits of the technology after only the first “20%” of effort.

Early in the adoption cycle, the great appeal of microformats is how well-integrated they are with existing idioms for CSS formatting and XHTML hyperlinking. Furthermore, any author can adopt them because they’re plain, inline HTML: no fussy file attachments or external links, no separate languages to parse or scripting languages to learn.

And then there’s the ace in the hole: search engines. Unlike the early Web, this time around we already have lots of investments in scalable, publicly-available, and essentially free services that can adapt once to take advantage of microformats innovations and add value to authors overnight.

Consider the explosive growth of tagging — simply because Technorati/Del.icio.us/etc provide easy aggregation of all the other content out there that hadn’t been connected directly before. Or of vote links, or no-follow, or the XFN friend-mapper, or …

Since microformats appear to be one of those technologies that, rather annoyingly, manage to “work in practice, if not in theory,” I’d award it a 7/10. That puts it behind the stripped-down origins of the Web and the PC (10s), but just shy of SQL or XML.

WWTBT? The 80/20 Tribe’s offerings are denounced as “Just a toy!”, while they hurl back accusations of pedantry, big-system disease, and so on. Amen!

Next time, I’d like to hear back comments (by email, unfortunately), and see if we can’t tackle Everett Rogers’ Diffusion of Innovations model…

PS. For extra credit, someone might want to debate why I didn’t use hReview for this review; and if so, how exactly would I want to express this? Would I use Tim’s original posts as “tags”?

As part of the zMarket project, zLab is considering building a research platform for electronic markets. This is the sort of shared infrastructure a 1997 NSF workshop called for, which we discussed in an earlier blog entry.

Unfortunately, the original link to the Netlab workshop report died in the last month. It’s odd how content can survive seven years, but not seven years and a month. The University of Iowa redid its webpages and now there are at least 29 pages with dead links to it. Amazingly, there were no other copies of it on the web. Not even on Google’s cache, since the original page now returns 404.

After some initial shock and a half-hour of searching, it occurred to me that even though I use a Mac, Adam’s handy Google Desktop Search might have kept a copy. However, it seems like he never clicked through on that blog entry.

Then I realized that our zSearch experiment existed for just this purpose: indexing the collective knowledge of our group by crawling the content we link to as well. And the Nutch cache still had the page. So now we can both send a gentle reminder of this problem to Iowa and point you all at the resurrected National Science Foundation NetLab Workshop Report.

Gary Potter posted about an article in MIT’s Magazine of Innovation, Technology Review, in which Mark Frauenfelder of boingboing interviews Tim Berners-Lee, creator of the World Wide Web:

On why people aren’t excited about the Semantic Web…

It’s not the first time I’ve had this paradigm-shift problem.

On getting past it…

…we are just starting by putting applications onto the Semantic Web one by one and linking them up…. what’s exciting is the network effect.

On how the Semantic Web understands data…

Suppose you’re browsing the Web and you find a seminar advertised, and you decide to go. Now, there is all sorts of information on that page, which is accessible to you as a human being, but your computer doesn’t know what it means. So you must open a new calendar entry…..Then get your address book and add new entries for the people involved in the seminar.

If there were a Semantic Web version of the page, it would have labeled information on it that would tell the computer “this is an event” and what time and date it is.

If you want to be part of the Semantic Web, you’ll need a friend-of-a-friend file. A FOAF file is a formatted rendition of your personal data – first name, last name, email address, etc. To create the file, you need to fill out a form and an automated process creates the file. Links on how it works and how to implement it on your page are also there.
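
For the curious, the kind of file such a form produces can be sketched in a few lines. This is not the generator referred to above, just a minimal illustration using Python's standard library and the published FOAF and RDF namespaces; the function name build_foaf and the sample person are made up.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"


def build_foaf(given_name, family_name, email):
    """Serialise one foaf:Person as RDF/XML from the fields a form would collect."""
    # Ask ElementTree to use the conventional prefixes instead of ns0, ns1.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("foaf", FOAF_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    person = ET.SubElement(root, f"{{{FOAF_NS}}}Person")
    ET.SubElement(person, f"{{{FOAF_NS}}}name").text = f"{given_name} {family_name}"
    ET.SubElement(person, f"{{{FOAF_NS}}}givenName").text = given_name
    ET.SubElement(person, f"{{{FOAF_NS}}}familyName").text = family_name
    # foaf:mbox points at a mailto: URI rather than holding the address as text.
    ET.SubElement(
        person,
        f"{{{FOAF_NS}}}mbox",
        {f"{{{RDF_NS}}}resource": f"mailto:{email}"},
    )
    return ET.tostring(root, encoding="unicode")


foaf_xml = build_foaf("Ada", "Example", "ada@example.org")
print(foaf_xml)
```

A fuller file would usually add foaf:knows entries linking to other people's FOAF files, which is where the friend-of-a-friend part comes from.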

Slowly we evolve toward a more understanding, more understandable web…

Tomorrow begins the WWW@10 conference on “the visions, technologies, and directions that characterized the Web’s first decade… a forum in which scholars and practitioners of all disciplines — cultural, historical, and technical — can share perspectives, concerns, and innovative ideas about the World Wide Web.” The program looks quite promising.

Shouldn’t there be more than one Web page in the Universe that uses the phrase “Semantic Web Disease”?

Oh.

Well now I guess there are two. Isn’t this how all diseases start to spread?

Matt Haughey talks about the switch of the Creative Commons search engine (used to record the semantics of web page metadata) to use Nutch: “We flipped the switch last week and have been testing it ever since. Compared to the last version of our search engine, this one is blazingly fast to return results, the results are much more specific to what you’re looking for, and it is constantly keeping up to date on over 1 million pages with Creative Commons license info in them.”

Notes Doug Cutting about the newly launched Creative Commons Search: “It crawls CC-licensed pages, indexing license properties, making them searchable. I did most of the initial development, using it as a motivating case when adding metadata support to Nutch… This is cool in several ways. It demonstrates how easily Nutch can be extended to do stuff that would be hard to do with any other search engine. (This is all of the CC-specific code.) It’s also cool since it helps folks find content they can reuse, like songs that can be sampled, art that can be clipped and text that can be excerpted.”

It’s clear that the Semantic Web will happen not all-at-once but little by little as cool efforts such as this one create a groundswell.