A simple thought experiment in weblog semantics
1:37 PM – Joe makes a post on his blog about his golf vacation on Prince Edward Island
3:24 PM – Joe makes a post on his blog about hang gliding in Newfoundland
Two weeks pass.
Sam searches Google for hang gliding in Prince Edward Island. Joe’s blog is the first result.
The trouble with this picture is that Joe never wrote anything about hang gliding on Prince Edward Island. The thought never even crossed his mind.
This is a fundamental problem with searching by keywords – the page is not necessarily the finest unit of web content. Often, especially on weblogs, any given page will have dozens of completely independent posts – made by different authors, on different days, about different topics. Google has no way to tell one post from another and can only link to general archive pages rather than to individual posts.
With all the talk about separating design from content and encoding semantics, it occurred to me that this current level of separation isn’t particularly useful to most of us. It can help with accessibility, which is important; but the cruel fact is that for truly semantic code to become universal, people are going to have to see concrete results (i.e. cool stuff happening).
BlogML (weblog markup language) doesn’t exist yet (as far as I know), but I think it could save us.
The average weblog has a relatively simple set of fields for each post: title, author, date/time, permanent URL, # of replies, URL of replies, and main content (I’m sure I’m missing some, but you get the idea).
If we could somehow code our weblogs with this structure, Google and other services would be able to see the content as it really is: a loose collection of independent posts. When Google indexes weblog archives, search results could include individual blog posts with the appropriate links (rather than linking to an archive page with 30 posts).
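To make the difference concrete, here is a minimal Python sketch of Joe's page matched first as one lump of text and then post by post. The Post fields follow the list above, but the field names, the example.com permalinks, and the naive all-words-present matching are assumptions made for illustration, not a description of how Google works.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """One weblog post, using the fields listed above (names are illustrative)."""
    title: str
    author: str
    posted: str       # date/time kept as plain text, e.g. "1:37 PM"
    permalink: str
    body: str


def matches(text: str, query: str) -> bool:
    """True when every query word appears somewhere in the text."""
    words = set(text.lower().split())
    return all(term in words for term in query.lower().split())


def search_page(posts: list[Post], query: str) -> bool:
    """Page-level matching: all posts are lumped into one blob of text."""
    return matches(" ".join(p.title + " " + p.body for p in posts), query)


def search_posts(posts: list[Post], query: str) -> list[str]:
    """Post-level matching: each post is checked on its own."""
    return [p.permalink for p in posts if matches(p.title + " " + p.body, query)]


# Hypothetical data standing in for Joe's archive page.
joes_page = [
    Post("Golf", "Joe", "1:37 PM", "http://example.com/joe/1",
         "My golf vacation on Prince Edward Island"),
    Post("Hang gliding", "Joe", "3:24 PM", "http://example.com/joe/2",
         "Went hang gliding in Newfoundland"),
]

query = "hang gliding Prince Edward Island"
print(search_page(joes_page, query))   # True: the page as a whole matches
print(search_posts(joes_page, query))  # []: no single post matches
```

Searching the same posts for "hang gliding Newfoundland" returns only the permalink of the second post, which is the kind of result the structure is meant to make possible.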
This is not a new idea – it’s the semantic web and it’s been coming for a while. Here’s the key: The weblog community is in the unique position to effect significant change through nimble and collective action.
Think about it – if some kind of markup could be defined for this, it wouldn’t take years to be adopted (like most standards). Rather, it would take the cooperation of a few key weblog players. If Blogger, Moveable Type, GreyMatter, and the UserLand suite all started pumping out BlogML-enhanced HTML, it would be instant critical mass. The majority of weblogs would be on board in a matter of days and it wouldn’t take long for the rest of us to jump on the bandwagon. Search tools like Google, BlogDex, and DayPop would be able to offer better search results to their customers.
Webloggers are in a unique position to take collective leaps and bounds towards the semantic web.
Ok, so how do you actually do this? What is BlogML? What does it look like? I’m not sure – I’m shooting from the hip here. Perhaps a simple markup could be hidden in HTML comment tags. Or perhaps a set of reserved keyword DIV and SPAN titles could be established (e.g. <div id="BlogMLtitle">), as this would be an appropriate use of the ID attribute according to the W3C specs.
Implementation can be dealt with. What I want to know first is: Does this make sense? Why hasn’t it already been done? Has it already been done? What am I overlooking? I look forward to your feedback.
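As a rough illustration of the reserved-keyword idea, the Python sketch below pulls individual posts out of a page whose DIV and SPAN elements carry marker names. Everything specific here is an assumption: no BlogML vocabulary exists, the names BlogMLpost, BlogMLtitle and BlogMLbody are invented, and the markers are put on the class attribute rather than id because a page holding several posts would have to repeat them.

```python
from html.parser import HTMLParser

# Hypothetical marker names; no BlogML vocabulary has actually been defined.
POST, TITLE, BODY = "BlogMLpost", "BlogMLtitle", "BlogMLbody"


class PostSplitter(HTMLParser):
    """Collects the title and body text of each marked-up post on a page.

    Assumes div and span elements are properly nested and closed.
    """

    def __init__(self):
        super().__init__()
        self.posts = []   # one dict per post found
        self.stack = []   # class value of each currently open div/span

    def handle_starttag(self, tag, attrs):
        if tag not in ("div", "span"):
            return
        cls = dict(attrs).get("class", "")
        self.stack.append(cls)
        if cls == POST:
            self.posts.append({"title": "", "body": ""})

    def handle_endtag(self, tag):
        if tag in ("div", "span") and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if POST not in self.stack:
            return
        # The innermost title/body marker decides where the text goes.
        for cls in reversed(self.stack):
            if cls in (TITLE, BODY):
                field = "title" if cls == TITLE else "body"
                self.posts[-1][field] += data
                return


def split_posts(html: str) -> list[dict]:
    """Return one {'title', 'body'} dict per marked-up post in the page."""
    parser = PostSplitter()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the page layout.
    return [{k: " ".join(v.split()) for k, v in p.items()} for p in parser.posts]


page = """
<div class="BlogMLpost">
  <span class="BlogMLtitle">Golf on PEI</span>
  <div class="BlogMLbody">My golf vacation on Prince Edward Island.</div>
</div>
<div class="BlogMLpost">
  <span class="BlogMLtitle">Hang gliding</span>
  <div class="BlogMLbody">Hang gliding in Newfoundland.</div>
</div>
"""
for post in split_posts(page):
    print(post)
```

A crawler that could do this much would know where one post ends and the next begins, without the weblog author typing anything extra, provided the publishing tool emits the markers.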
There is no doubt that the blog world should be separated somehow from the rest of the webworld, or else we are going to be flooded with less and less relevant information.
I think a great first step would be the ability to exclude blogs from my Google search if I choose (à la DayPop)…
The idea of a BlogML is very cool. I would think that the DIV tag makes a lot of sense there. I don’t think it’s the place for XML, because people aren’t going to step up very soon and actually deliver content in XML.
At the same time: I wish my search engine wouldn’t return a page that mentions my search once at the very top when there are probably sites out there which have a much stronger focus on what I want. But if a blog mentions hang gliding in Newfoundland a dozen times, then by all means, show it to me.
How present is this problem?
Check out RSS. I think it pretty much encompasses what bloggers need. Of course, it would be nice if Radio supported reading RSS 1.0 files. Why can’t it do it??
XHTML’s modularization facilities address Jevon’s question about people unwilling to switch from HTML markup. If you’re willing to use a model where the editorial component of weblog posting is a container with only a couple of methods (getContent, setContent) then you could wrap the posting in a semantically rich XML container.
If writers were willing to use div and span elements in their postings, then we could post interesting queries to the editorial content container, but that would require writers, tool vendors, and search engines to agree on a limited vocabulary.
Aw fudge… I’m going to go write this up.
Radio does support RSS 1.0, just not with embedded HTML like that feed (the one Tony mentions here) has. That document is also strange because it declares xhtml as the “html” namespace then doesn’t use the “html” namespace identifier on the HTML tags, which doesn’t sound right.
I would go back to basics and question if it’s valid XML first.
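A quick way to run that first check is to hand the feed text to an XML parser and see whether it complains. The Python sketch below does only that: it tests well-formedness, not whether the document is valid RSS, and the three sample strings are made up for illustration rather than taken from the feed under discussion.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """True if the text parses as XML; says nothing about RSS validity."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Made-up samples, not the actual feed.
good = '<rss version="0.91"><channel><title>AOV</title></channel></rss>'
bad = '<rss><channel><title>AOV</channel></rss>'   # <title> is never closed
undeclared = '<rss><html:p>hi</html:p></rss>'      # prefix without xmlns:html

print(is_well_formed(good), is_well_formed(bad), is_well_formed(undeclared))
# True False False
```

Note that declaring a namespace prefix and then never using it is still well-formed, so a feed like the one described would pass this check even though its HTML tags end up outside the XHTML namespace.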
Good points, Mark… I think I’ll bug Ugo to serve up some valid feeds 🙂 Anyway, there’s been some discussion over on the cocoon-dev list about duplicating functionality of Radio in Cocoon. My idea is to just use RSS as the “native” XML schema for storing blog entries and archiving stories, since RSS has been created for this sort of thing anyway. Ugo and I are throwing around ideas for an open-sourced version of Radio, with a Mozilla frontend. If anyone’s interested, drop me a message and we can talk about this more.
I think this is a great idea. I think it has a bigger scope than just weblogs. It’s about describing a section of a page.
I think it should be done as a separate namespace in XHTML, possibly as RDF metadata. If you find this interesting I can write up a short example?
I don’t know if approaching this as simply a technical problem is really right. Of course it’s not hard to provide categorization or metadata facilities so that individual posts are better “positioned.” MoveableType 2.0 for example will allow postings in multiple user-defined categories.
The problem with metadata/categorizing/increased tagging options is that most people don’t care to use them. Even one more step beyond the “think it, type it, post it” conceptual model of blogs is more than most people want to bother with. Maybe not people here, who care about technology, blogs, tags, metadata, etc., but Joe the hangglider and PEI golfer probably doesn’t care enough to do it.
Being the Joe non-techie, would I not be better served, and 90% of these incidents solved, if I could use a proximity limiter on my Google searching? Maybe I am missing something, but while I can use quotation marks to place words together, I cannot use “/5” to place them within five words, a common tool in legal research search engines like Quicklaw. If Lord Goog allowed this I would not get all those results where one word is separated by 6597 from the other.
You certainly can’t use the id attribute. id’s have to be unique within a page.
If you haven’t already taken a look, you might find the discussions around an imagined (common) weblog API to be of interest:
http://groups.yahoo.com/group/weblog-devel/messages
There are a few problems with any kind of proposed markup language for weblogs.
The first is that there are at least 3-4 different schemas (as in, not XML) for chunking out the parts of a single blog post [1]. You could say that this is essentially the pithy-comment vs. 3000-words argument all over again (if you don’t know what I’m talking about, trust me, you’re not missing anything…)
Once you get past that, you’re left with trying to figure out which parts of a text to assign tags/meaning to. On the one hand, it seems sort of silly to re-invent DocBook [2]. However, there is the very real question of whether you really want to write and maintain a tool that has to parse DocBook [3]. Even if you do, most people would rather poke their eyes out than write DocBook; those of us that sort of enjoy it shouldn’t do so with any kind of illusions.
There is something to be said for “extending” XHTML since you get to assign blog specific markup *and* keep the presentation markup[4], thus avoiding the argument about whether or not to include html in the <description> tags[5].
And if you want to use XHTML without using a million different class names (which is conceptually broken[6] since the class attribute is supposed to denote style) you’re going to have to be careful not to get bitten by some of the peculiarities surrounding parameter entities[7].
I am not opposed to a BlogML. If anything, it would make writing a common API easier. But even if you solve the hard problems, you still have to make the solutions reasonable *and* thorough enough that people will want to devote time and energy to write tools for them.
Lots of us wrote tools to implement the Blogger API mostly because it was so easy. That many of us are also now trying to figure out a more nuanced API says something, I think.
[1] http://groups.yahoo.com/group/weblog-devel/message/117
[2] http://www.docbook.org
[3] http://www.nwalsh.com/docs/articles/dbdesign/
[4] Although, there is reason to believe that subsequent versions of XHTML will do away with “form” (as in, not html forms) related tags altogether which ultimately begs the question…
[5] I wrote about this here (http://groups.yahoo.com/group/reallySimpleSyndication/message/81) but the archives have since been taken down
[6] I mention it only in passing since we’re all busy trying to do the “Right Thing”.
[7] http://aaronland.net/weblog/archive/3891
Tony: I do have an RSS feed for AOV. Dave Winer asks, What’s wrong with RSS? Good questions – but my RSS feed doesn’t help Google know one post from another. However, if you could somehow twin your site with an RSS feed (including the archives) then you might be able to do something really useful there.
Niklas: I agree – the ideas here aren’t limited to weblogs. The reasons I thought it would be a good idea to start with weblogs are: 1) you gotta take it one step at a time, and 2) since much of the code generation for the majority of weblogs happens in central locations (blogger, moveabletype, radio, etc.) there would be a real possibility of critical mass adoption. Then we take it to the rest of the web.
Andrew: You’re absolutely right that people won’t do more than they need to (especially if the payoff isn’t immediate). But if all this did was let google know where one post ends and another begins, it would be a huge step. So Joe the hangglider and PEI golfer just writes his insipid little posts and Blogger does the rest.
DrBeat: right you are.
Thanks for the great feedback everyone. If there are more discussions or action following from this, please come back and post a link on this thread.
I’ve already been thinking about this. Long way from finished, but would appreciate your thoughts …
Further thoughts on what’s wrong with RSS: it’s primarily a headline syndication service, not suitable as a native data format for weblogs. I want a format to hold my data, so that I can use a number of different tools to manipulate and display it.
i’d rather not use a BlogML markup on my webpages, even if they were indexed (they are not according to robots.txt). converting webpages to another format is not a task to be taken lightly.
i think the issue here isn’t the issue of variable relevance in weblogs to a specific topic; really it’s google’s, or any search engine’s, faulty assumption that one page is specific to a finite set of topics. google can refine their search processes quite simply by determining distance metrics between words, in quite the same way that networks determine metrics or “hops” between server A and server B. the results with the shortest metrics should appear first, in ascending order. you sidestep the messy issue of XML altogether, and, superficially, that would seem to be the thing to do to improve the relevance and quality of searches.
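A minimal Python sketch of that distance idea follows, assuming the "metric" is simply the smallest run of consecutive words that contains every query term at least once. The function names, the sample pages, and the choice of metric are all assumptions for illustration; nothing here reflects how Google or any real engine scores pages.

```python
import re


def min_span(text: str, query: str):
    """Smallest number of words spanning one occurrence of every query term.

    Returns None if some term never appears in the text.
    """
    words = re.findall(r"\w+", text.lower())
    terms = set(re.findall(r"\w+", query.lower()))
    last_seen = {}   # term -> index of its most recent occurrence
    best = None
    for i, word in enumerate(words):
        if word not in terms:
            continue
        last_seen[word] = i
        if len(last_seen) == len(terms):
            # Tightest window ending here starts at the oldest "latest" term.
            span = i - min(last_seen.values()) + 1
            if best is None or span < best:
                best = span
    return best


def rank(pages: dict, query: str) -> list:
    """Page names ordered by ascending span; pages missing a term are dropped."""
    scored = [(min_span(text, query), name) for name, text in pages.items()]
    return [name for span, name in sorted(s for s in scored if s[0] is not None)]


# Hypothetical pages for the example.
pages = {
    "joe": "Golf vacation on Prince Edward Island was relaxing and sunny all "
           "week. Then I tried hang gliding in Newfoundland.",
    "sam": "Hang gliding on Prince Edward Island is windy.",
    "bob": "Nothing but fishing.",
}
query = "hang gliding Prince Edward Island"
print(min_span(pages["joe"], query))  # 14
print(min_span(pages["sam"], query))  # 6
print(rank(pages, query))             # ['sam', 'joe']
```

Joe's page still matches, but it ranks below a page where the words actually sit together, which is the behaviour being asked for without any new markup.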
I’m with Alan & moz here. While I could agree a mark-up language aimed at content such as a weblog wouldn’t be a bad thing (but that’s what XML should be all about), I think it’s in the hands of the search engines to get back the relevant information.
You’re right, it really can be dismissed as that simple (I like the Idea actually)…
But why not have MT, Blogger, etc… throw in the indexing… If it has some value to the communities they serve, then they should consider it. And these are the types of people (companies?) who really work hard to serve, which is very cool. I don’t think the idea is without merit.
But Google should have proximity operators.
Steven –
Here’s the error I get when I try to subscribe to your RSS feed in Radio.
Not sure what is going on. Anyway, yes, this is all for naught if Google or Daypop aren’t smart enough to realize what is going on. Maybe we need to come up with our own search engine that uses this blog format, too =]
Tony
Tony, you are not alone. It’s an issue with the http header that my CF server produces. It will be fixed when I move AOV to PHP in the next few months. In the mean time – my apologies.
Hm, that gives me an idea. I’m going to see if I can get PHP to readfile() your RSS feed. Maybe it’ll spit out a friendly header. Be right back.
Bingo. The following PHP will let me subscribe to AOV using Radio:
<?php
// Fetch the feed server-side and echo it back, so Radio sees this server's HTTP headers.
$url = "https://www.actsofvolition.com/rss.cfm";
readfile($url);
?>
All I do is subscribe to the URL of that script and it lets me at the goods. Awesome!
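For anyone without PHP handy, the same relay trick can be sketched in Python. The thread only says the problem was "an issue with the http header", so this sketch assumes that re-serving the bytes with plain headers of its own is enough. The port number, the text/xml content type, and the function names are assumptions, not part of the original fix.

```python
from urllib.request import urlopen
from wsgiref.simple_server import make_server

FEED_URL = "https://www.actsofvolition.com/rss.cfm"   # URL from the PHP above


def fetch(url: str) -> bytes:
    """Download the raw bytes at url."""
    with urlopen(url) as response:
        return response.read()


def make_app(url: str, fetcher=fetch):
    """Build a WSGI app that re-serves url with a plain XML content type."""
    def app(environ, start_response):
        body = fetcher(url)
        start_response("200 OK", [
            ("Content-Type", "text/xml"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app


def serve(port: int = 8000):
    """Serve the relayed feed on http://localhost:<port>/ until interrupted."""
    make_server("", port, make_app(FEED_URL)).serve_forever()
```

Calling serve() starts the relay; the aggregator is then pointed at the local address instead of the original feed URL.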
Nice Tony. I’m going to see if I can do just that from one of our linux servers here and have a solution that can be enjoyed by everyone.
> A simple thought experiment in weblog semantics:
>
> 1:37 PM – Joe makes a post on his blog about his golf
> vacation on Prince Edward Island
>
> 3:24 PM – Joe makes a post on his blog about hang gliding
> in Newfoundland
>
> Two weeks pass.
>
> Sam searches Google for hang gliding in Newfoundland.
> Joe’s blog is the first result.
>
> The trouble with this picture is that Joe never wrote
> anything about hang gliding on Prince Edward Island. The
> thought never even crossed his mind.
Shouldn’t that read “Sam searches Google for hang gliding on Prince Edward Island. Joe’s blog is the first result.” That would make more sense as a problem that needs to be solved via BlogML.
Yup. Fixed. Thanks.
I wonder how many people that confused – or did their brains automagically fix that – like mine did.
Update on the RSS feed problem that Tony mentioned: It’s fixed (thanks for the help, Nrf). New URL for the RSS feed:
http://rss.actsofvolition.com
I did a rough proposal for Semantic Web Markup Language that takes account of some of the issues here:
http://groups.yahoo.com/group/syndication/message/2283
Aaron Swartz’s excellent script: http://logicerror.com/blogifyYourPage takes a link mentioned in a weblog posting and
assumes that the surrounding commentary is about that link (if there is more
than one link in a posting then you have to choose which one the commentary
is about). If you like, the XML that is produced is metadata about another
page other than the one where the metadata is published within span tags.
The more generalized approach assumes that you want to automatically
generate metadata about the page that you are running the script over, be
able to generate richer metadata and allow for multiple links.
In addition you will want to use namespaces defined by URI’s to avoid
collisions and it may be useful to use the concept of weblog style
permanent archiving to attach the default namespaces at the level of
individual postings as opposed to web pages which are a transient thing. In
other words metadata should be attached to a piece of content (a posting
which may be a paragraph or several pages) as it was authored, as opposed to
a page which is merely a rendering of part of some content or a collection
of different pieces of content.
Although ID’s must be unique on a page, name attributes aren’t. So you could still embed whatever you liked in the HTML. Of course, the browser will ignore any attribute that it doesn’t know about so you could add blogML attributes to everything if you wanted to.
This article/thread hit #9 on the DayPop top 40 yesterday. Now we’re slipping. Fleeting but flattering.
Hi Steven,
I talked about a similar structure (http://www.wrongwaygoback.com/articleone/blogxml.asp) a little while ago, but didn’t hear much feedback about it.
Tony, who’s posted on this thread, sent an email to a few of us who’ve been discussing the concept of BlogML. With his permission, I’m posting parts of his email and my response (both with some minor formatting changes) as some of it may be of general interest.
Thanks Tony. Here’s my response:
Did someone mention Radio clone?
Now, that’s something I bet some people would sink their teeth into.
I’d love to see an open source clone of Radio. I’m tired of the 30-day trial periods, but I don’t want to pay $40 for it. $30, maybe, but $40’s ridiculous.
Also, as to the group of people interested in blogML: I’d be willing to donate some time to it. If someone could email me about that, it’d be great. 🙂 Thanks again…
For all you guys who are interested in this other Radio clone side-project I’ve been thinking about, get a hold of me here and we’ll talk.
Just to chuck another BlogML idea in – http://www.blogworks.com/blogml.asp
First draft and open to changes and suggestions.
I’ve set up a discussion forum for BlogML. Follow the link and tell your friends. Let’s get a good spec started and see where it takes us.
Tony
Sorry, that should have been a link:
http://www.blogworks.com/blogml.asp
Sorry for jumping in the discussion so late. I am having problems with my Internet connection, was without mail for the last 24 hours and still my blog is unreachable :(.
By the way, yes, there are problems with my RSS feed and namespaces. They are probably due to a bug in Xerces or Xalan. If I prefix HTML elements with “html:”, all link href’s are changed from:
html:a href="some-URI" to
html:a some-URI="some-URI"
Bizarre!
While you get this blogML thing going, you might want to look at how scottandrew.com has implemented sort of the reverse: allowing quick Google searches on post-specific topics.
Radio has implemented it as a macro: http://radio.userland.com/googleItMacro
The original idea’s at http://www.scottandrew.com “Google This!”
(This is to anyone interested in this, but directed mainly at Mr. Frost):
The idea drawn up on BlogWorks (http://www.blogworks.com/blogml.asp) looks very nice–simple, elegant, definitely gets the job done, so it just might work. However, it brought one question to mind for me: where would the comments go? I noticed there was a commentable option (yes or no), but there was no actual place for comments in the file itself.
So, I took the draft Mr. Frost had already drawn up and added in a few tags–nothing major at all, by any means, but it works inline comments into this draft. Check it out here: http://www.interalia.org/m3ta/filebin/blogML.xml. Also, the link just under my name goes to the post where I mentioned this, along with an explanation of the tags I added and what purpose they serve.
Comments are welcome. Flames are not. 😉
Cheers,
Phil Ulrich
The problem cited (bad search engine matches) is not unique to blogs. The same thing will happen to any page which includes summaries of multiple base articles. Like, uh, the navigation pages of any website.
The one element most blogs contain in a permalink is a # mark trailing at the end of the URL with a unique item number to identify it.
Why isn’t Google or any other search service set up to identify the ‘#’+number nearest to the entry and jump straight to it?
Or to the nearest anchor?
In blogger, you can set an anchor to sit at the top of the entry with the a name=blogitemnumber
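Here is a small Python sketch of what "jump to the nearest anchor" could look like on the search side. It assumes each entry is preceded by a named anchor of the form a name=blogitemnumber, as described above, and it matches the phrase against the raw HTML for simplicity. The function name, the sample page, and the example.com URL are hypothetical.

```python
import re

# Matches <a name="1234"> or <a name=1234>, capturing the anchor name.
ANCHOR = re.compile(r'<a\s+name="?([\w-]+)"?', re.IGNORECASE)


def deep_link(page_url: str, html: str, phrase: str):
    """URL pointing at the named anchor closest above the first match of phrase.

    Falls back to the bare page URL when no anchor precedes the match,
    and returns None when the phrase is not on the page at all.
    """
    hit = html.lower().find(phrase.lower())
    if hit == -1:
        return None
    # Only anchors that start and end before the match are considered.
    names = [m.group(1) for m in ANCHOR.finditer(html, 0, hit)]
    return f"{page_url}#{names[-1]}" if names else page_url


# Hypothetical archive page with one anchor per entry.
archive = (
    '<a name="1001"></a><p>Golf on Prince Edward Island</p>\n'
    '<a name="1002"></a><p>Hang gliding in Newfoundland</p>'
)
print(deep_link("http://example.com/joe/archive.html", archive, "hang gliding"))
# http://example.com/joe/archive.html#1002
```

This gets the reader to the right entry, though it does not stop a page from matching on words spread across different entries.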
I use Atomz search locally on my site, The Copydesk, and it can jump straight to an individually indexed entry searched for on my site.
I’d say this is less about webloggers having to change practice, and more about Google et al doing so.
(I know this is slightly off-topic, but there should be a ‘description’ attribute for every link. Perhaps the ‘title’ attribute does this anyway. If present, Google would log this rather than the link-text. This way, blogs could still use links in flowing paragraphs – using words like ‘this’ and so on as anchor text – whilst also adding to the Google database of knowledge about sites.)