BlogML: The Weblog Markup Language

A simple thought experiment in weblog semantics

1:37 PM – Joe makes a post on his blog about his golf vacation on Prince Edward Island

3:24 PM – Joe makes a post on his blog about hang gliding in Newfoundland

Two weeks pass.

Sam searches Google for hang gliding on Prince Edward Island. Joe’s blog is the first result.

The trouble with this picture is that Joe never wrote anything about hang gliding on Prince Edward Island. The thought never even crossed his mind.

This is a fundamental problem with searching by keywords – the page is not necessarily the finest unit of web content. Often, especially on weblogs, any given page will have dozens of completely independent posts – made by different authors, on different days, about different topics. Google has no way to tell one post from another and can only link to general archive pages rather than to individual posts.

With all the talk about separating design from content and encoding semantics, it occurred to me that this current level of separation isn’t particularly useful to most of us. It can help with accessibility, which is important; but the cruel fact is that for truly semantic code to become universal, people are going to have to see concrete results (i.e., cool stuff happening).

BlogML (weblog markup language) doesn’t exist yet (as far as I know), but I think it could save us.

The average weblog has a relatively simple set of fields for each post: title, author, date/time, permanent URL, # of replies, URL of replies, and main content (I’m sure I’m missing some, but you get the idea).

If we could somehow code our weblogs with this structure, Google and other services would be able to see the content as it really is: a loose collection of independent posts. When Google indexes weblog archives, search results could include individual blog posts with the appropriate links (rather than linking to an archive page with 30 posts).

This is not a new idea – it’s the semantic web and it’s been coming for a while. Here’s the key: The weblog community is in the unique position to effect significant change through nimble and collective action.

Think about it – if some kind of markup could be defined for this, it wouldn’t take years to be adopted (like most standards). Rather, it would take the cooperation of a few key weblog players. If Blogger, Movable Type, GreyMatter, and the UserLand suite all started pumping out BlogML-enhanced HTML, it would be instant critical mass. The majority of weblogs would be on board in a matter of days and it wouldn’t take long for the rest of us to jump on the bandwagon. Search tools like Google, BlogDex, and DayPop would be able to offer better search results to their customers.

Webloggers are in a unique position to take collective leaps and bounds towards the semantic web.

Ok, so how do you actually do this? What is BlogML? What does it look like? I’m not sure – I’m shooting from the hip here. Perhaps a simple markup could be hidden in HTML comment tags. Or perhaps a set of reserved keyword DIV and SPAN titles could be established (e.g. <div id="BlogMLtitle">), as this would be an appropriate use of the ID attribute according to the W3C specs.
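To make the DIV idea a little more concrete, here is a minimal sketch of how an indexer might split an archive page into individual posts. It assumes a hypothetical convention where each post is wrapped in <div class="BlogMLpost">, with the title and permanent link marked by BlogMLtitle and BlogMLpermalink classes. None of these names are an established standard, and classes are used instead of IDs only because an ID may appear once per page while an archive page holds many posts.

```python
from html.parser import HTMLParser


class BlogMLSplitter(HTMLParser):
    """Collects one record per <div class="BlogMLpost"> (hypothetical class name)."""

    def __init__(self):
        super().__init__()
        self.posts = []      # finished post records
        self._post = None    # record currently being filled, if any
        self._depth = 0      # div nesting depth inside the current post
        self._field = None   # "title" while inside a BlogMLtitle element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and self._post is None and "BlogMLpost" in classes:
            self._post = {"title": "", "permalink": None, "text": ""}
            self._depth = 1
            return
        if self._post is None:
            return  # markup outside any post is ignored
        if tag == "div":
            self._depth += 1
        if "BlogMLtitle" in classes:
            self._field = "title"
        if tag == "a" and "BlogMLpermalink" in classes:
            self._post["permalink"] = attrs.get("href")

    def handle_endtag(self, tag):
        if self._post is None:
            return
        self._field = None  # simplification: any closing tag ends the title
        if tag == "div":
            self._depth -= 1
            if self._depth == 0:  # the post's own div just closed
                self.posts.append(self._post)
                self._post = None

    def handle_data(self, data):
        if self._post is None:
            return
        if self._field == "title":
            self._post["title"] += data.strip()
        else:
            self._post["text"] += data


def split_posts(html):
    """Return a list of post records found in an archive page."""
    parser = BlogMLSplitter()
    parser.feed(html)
    parser.close()
    return parser.posts


def search(posts, *words):
    """Return posts whose own title or text contains every search term."""
    hits = []
    for post in posts:
        haystack = (post["title"] + " " + post["text"]).lower()
        if all(w.lower() in haystack for w in words):
            hits.append(post)
    return hits


# Joe's archive page from the thought experiment, marked up per the assumed convention.
ARCHIVE = """
<div class="BlogMLpost">
  <h3 class="BlogMLtitle">Golf vacation</h3>
  <p>Played eighteen holes on Prince Edward Island.</p>
  <a class="BlogMLpermalink" href="/archive/101">link</a>
</div>
<div class="BlogMLpost">
  <h3 class="BlogMLtitle">Hang gliding</h3>
  <p>Went hang gliding in Newfoundland.</p>
  <a class="BlogMLpermalink" href="/archive/102">link</a>
</div>
"""

posts = split_posts(ARCHIVE)
print([p["permalink"] for p in search(posts, "hang gliding", "Newfoundland")])  # ['/archive/102']
print(search(posts, "hang gliding", "Prince Edward Island"))                    # []
```

Because each post is matched on its own, Sam’s search for hang gliding on Prince Edward Island finds nothing on Joe’s page, while a search for hang gliding in Newfoundland can link straight to that post’s permanent URL.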

Implementation can be dealt with. What I want to know first is: Does this make sense? Why hasn’t it already been done? Has it already been done? What am I overlooking? I look forward to your feedback.

 

41 thoughts on “BlogML: The Weblog Markup Language”

  1. There is no doubt that the blog world should be separated somehow from the rest of the webworld, or else we are going to be flooded with less and less relevant information.

    I think a great first step would be the ability to exclude blogs from my Google search if I choose (à la Daypop)…

    The idea of a BlogML is very cool. I would think that the DIV tag makes a lot of sense there. I don’t think it’s the place for XML, because people aren’t going to step up very soon and actually deliver content in XML.

    At the same time: I wish my search engine wouldn’t return a page that mentions my search once at the very top when there are probably sites out there which have a much stronger focus on what I want. But if a blog mentions hang gliding in Newfoundland a dozen times, then by all means, show it to me.

    How present is this problem?

  2. Check out RSS. I think it pretty much encompasses what bloggers need. Of course, it would be nice if Radio supported reading RSS 1.0 files. Why can’t it do it??

  3. XHTML’s modularization facilities address Jevon’s question about people unwilling to switch from HTML markup. If you’re willing to use a model where the editorial component of weblog posting is a container with only a couple of methods (getContent, setContent) then you could wrap the posting in a semantically rich XML container.

    If writers were willing to use div and span elements in their postings, then we could post interesting queries to the editorial content container, but that would require writers, tool vendors, and search engines to agree on a limited vocabulary.

    Aw fudge… I’m going to go write this up.

  4. Radio does support RSS 1.0, just not with embedded HTML like that feed (the one Tony mentions here) has. That document is also strange because it declares xhtml as the “html” namespace then doesn’t use the “html” namespace identifier on the HTML tags, which doesn’t sound right.

    I would go back to basics and question if it’s valid XML first.

  5. Good points, Mark… I think I’ll bug Ugo to serve up some valid feeds 🙂 Anyway, there’s been some discussion over on the cocoon-dev list about duplicating functionality of Radio in Cocoon. My idea is to just use RSS as the “native” XML schema for storing blog entries and archiving stories, since RSS has been created for this sort of thing anyway. Ugo and I are throwing around ideas for an open-sourced version of Radio, with a Mozilla frontend. If anyone’s interested, drop me a message and we can talk about this more.

  6. I think this is a great idea. I think it has a bigger scope than just weblogs. It’s about describing a section of a page.

    I think it should be done as a separate namespace in XHTML, possibly as RDF metadata. If you find this interesting I can write up a short example?

  7. I don’t know if approaching this as simply a technical problem is really right. Of course it’s not hard to provide categorization or metadata facilities so that individual posts are better “positioned.” Movable Type 2.0 for example will allow postings in multiple user-defined categories.

    The problem with metadata/categorizing/increased tagging options is that most people don’t care to use them. Even one more step beyond the “think it, type it, post it” conceptual model of blogs is more than most people want to bother with. Maybe not people here, who care about technology, blogs, tags, metadata, etc., but Joe the hangglider and PEI golfer probably doesn’t care enough to do it.

  8. Being the Joe non-techie, would I not be better served and 90% of these incidents solved if I could use a proximity limiter on my Google searching? Maybe I am missing something but I can use quotation marks to place words together but I cannot use “/5” to place them within five words, a common tool in legal research search engines like Quicklaw. If Lord Goog allowed this I would not get all those results where one word is separated by 6597 words from the other.

  9. If you haven’t already taken a look, you might find the discussions around an imagined (common) weblog API to be of interest:

    http://groups.yahoo.com/group/weblog-devel/messages

    There are a few problems with any kind of proposed markup language for weblogs.

    The first is that there are at least 3-4 different schemas (as in, not XML) for chunking out the parts of a single blog post [1]. You could say that this is essentially the pithy-comment vs. 3000-words argument all over again (if you don’t know what I’m talking about, trust me you’re not missing anything…)

    Once you get past that, you’re left with trying to figure out which parts of a text to assign tags/meaning to. On the one hand, it seems sort of silly to re-invent DocBook [2]. However, there is the very real question of whether you really want to write and maintain a tool that has to parse DocBook [3]. Even if you do, most people would rather poke their eyes out than write DocBook; those of us that sort of enjoy it shouldn’t do so with any kind of illusions.

    There is something to be said for “extending” XHTML since you get to assign blog specific markup *and* keep the presentation markup[4], thus avoiding the argument about whether or not to include html in the <description> tags[5].

    And if you want to use XHTML without using a million different class names — which is conceptually broken[6] since the class attribute is supposed to denote style — you’re going to have to be careful not to get bitten by some of the peculiarities surrounding parameter entities[7].

    I am not opposed to a BlogML. If anything, it would make writing a common API easier. But even if you solve the hard problems, you still have to make the solutions reasonable *and* thorough enough that people will want to devote time and energy to write tools for them.

    Lots of us wrote tools to implement the Blogger API mostly because it was so easy. That many of us are also now trying to figure out a more nuanced API says something, I think.

    [1] http://groups.yahoo.com/group/weblog-devel/message/117
    [2] http://www.docbook.org
    [3] http://www.nwalsh.com/docs/articles/dbdesign/
    [4] Although, there is reason to believe that subsequent versions of XHTML will do away with “form” (as in, not html forms) related tags altogether which ultimately begs the question…
    [5] I wrote about this here (http://groups.yahoo.com/group/reallySimpleSyndication/message/81) but the archives have since been taken down
    [6] I mention it only in passing since we’re all busy trying to do the “Right Thing”.
    [7] http://aaronland.net/weblog/archive/3891

  10. Tony: I do have an RSS feed for AOV. Dave Winer asks, What’s wrong with RSS? Good questions – but my RSS feed doesn’t help Google know one post from another. However, if you could somehow twin your site with an RSS feed (including the archives) then you might be able to do something really useful there.

    Niklas: I agree – the ideas here aren’t limited to weblogs. The reasons I thought it would be a good idea to start with weblogs are, 1) you gotta take it one step at a time, and 2) since much of the code generation for the majority of weblogs happens in central locations (Blogger, Movable Type, Radio, etc.) there would be a real possibility of critical mass adoption. Then we take it to the rest of the web.

    Andrew: You’re absolutely right that people won’t do more than they need to (especially if the payoff isn’t immediate). But if all this did was let Google know where one post ends and another begins, it would be a huge step. So Joe the hangglider and PEI golfer just writes his insipid little posts and Blogger does the rest.

    DrBeat: right you are.

    Thanks for the great feedback everyone. If there are more discussions or action following from this, please come back and post a link on this thread.

  11. Further thoughts on what’s wrong with RSS: it’s primarily a headline syndication service, not suitable as a native data format for weblogs. I want a format to hold my data, so that I can use a number of different tools to manipulate and display it.

  12. i’d rather not use a BlogML markup on my webpages, even if they were indexed (they are not according to robots.txt). converting webpages to another format is not a task to be taken lightly.

    i think the issue here isn’t the issue of variable relevance in weblogs to a specific topic; really it’s google’s, or any search engine’s, faulty assumption that one page is specific to a finite set of topics. google can refine their search processes quite simply by determining distance metrics between words, in quite the same way that networks determine metrics or “hops” between server A and server B. the results with the shortest metrics should appear in ascending order. you sidestep the messy issue of XML altogether, and — superficially — that would seem to be the thing to do to improve the relevance and quality of searches.
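    a minimal sketch of that distance idea, assuming single-word search terms and a made-up pair of pages. the function names and sample text are hypothetical, and this says nothing about how google actually ranks results; it only shows pages being ordered by the smallest gap, counted in words, between the two terms.

```python
import re


def min_distance(text, word_a, word_b):
    """Smallest gap, in word positions, between word_a and word_b; None if either is missing."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    last_a = last_b = None  # most recent position of each term
    best = None
    for i, tok in enumerate(tokens):
        if tok == word_a:
            last_a = i
        if tok == word_b:
            last_b = i
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best


def rank_by_proximity(pages, word_a, word_b):
    """Return page names ordered by ascending distance between the two terms."""
    scored = []
    for name, text in pages.items():
        d = min_distance(text, word_a.lower(), word_b.lower())
        if d is not None:  # pages missing either term are dropped
            scored.append((d, name))
    return [name for d, name in sorted(scored)]


# Hypothetical pages: an archive where the terms sit far apart, and a focused page.
pages = {
    "joes-archive": "Golf on the island was great. "
                    + "Filler words here. " * 50
                    + "Later I tried gliding in Newfoundland.",
    "gliding-club": "Our club offers gliding lessons on the island every summer.",
}

print(rank_by_proximity(pages, "gliding", "island"))  # ['gliding-club', 'joes-archive']
```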

  13. I’m with Alan & moz here. While I could agree a mark-up language aimed at content such as a weblog wouldn’t be a bad thing (but that’s what XML should be all about), I think it’s in the hands of the search engines to get back the relevant information.

  14. You’re right, it really can be dismissed as that simple (I like the Idea actually)…

    But why not have MT, Blogger, etc… throw in the indexing… If it has some value to the communities they serve, then they should consider it. And these are the types of people (companies?) who really work hard to serve, which is very cool. I don’t think the idea is without merit.

    But Google should have proximity operands.

  15. Steven –

    Here’s the error I get when I try to subscribe to your RSS feed in Radio.

    Not sure what is going on. Anyway, yes, this is all for naught if Google or Daypop aren’t smart enough to realize what is going on. Maybe we need to come up with our own search engine that uses this blog format, too =]

    Tony

  16. Tony, you are not alone. It’s an issue with the http header that my CF server produces. It will be fixed when I move AOV to PHP in the next few months. In the meantime – my apologies.

  17. Hm, that gives me an idea. I’m going to see if I can get PHP to readfile() your RSS feed. Maybe it’ll spit out a friendly header. Be right back.

  18. Bingo. The following PHP will let me subscribe to AOV using Radio:

    <?php
    // Fetch the remote feed and echo it unchanged (needs allow_url_fopen enabled).
    $url = "http://www.actsofvolition.com/rss.cfm";
    readfile($url);
    ?>

    All I do is subscribe to the URL of that script and it lets me at the goods. Awesome!

  19. Nice Tony. I’m going to see if I can do just that from one of our Linux servers here and have a solution that can be enjoyed by everyone.

  20. > A simple thought experiment in weblog semantics:
    >
    > 1:37 PM – Joe makes a post on his blog about his golf
    > vacation on Prince Edward Island
    >
    > 3:24 PM – Joe makes a post on his blog about hang gliding
    > in Newfoundland
    >
    > Two weeks pass.
    >
    > Sam searches Google for hang gliding in Newfoundland.
    > Joe’s blog is the first result.
    >
    > The trouble with this picture is that Joe never wrote
    > anything about hang gliding on Prince Edward Island. The
    > thought never even crossed his mind.

    Shouldn’t that read “Sam searches Google for hang gliding on Prince Edward Island. Joe’s blog is the first result.”? That would make more sense as a problem that needs to be solved via BlogML.

  21. Yup. Fixed. Thanks.

    I wonder how many people that confused – or did their brains automagically fix it, like mine did?

  22. I did a rough proposal for Semantic Web Markup Language that takes account of some of the issues here:

    http://groups.yahoo.com/group/syndication/message/2283

    Aaron Swartz’s excellent script, http://logicerror.com/blogifyYourPage, takes a link mentioned in a weblog posting and assumes that the surrounding commentary is about that link (if there is more than one link in a posting then you have to choose which one the commentary is about). If you like, the XML that is produced is metadata about another page other than the one where the metadata is published within span tags.

    The more generalized approach assumes that you want to automatically generate metadata about the page that you are running the script over, be able to generate richer metadata and allow for multiple links. In addition you will want to use namespaces defined by URIs to avoid collisions, and it may be useful to use the concept of weblog style permanent archiving to attach the default namespaces at the level of individual postings as opposed to web pages, which are a transient thing. In other words, metadata should be attached to a piece of content (a posting which may be a paragraph or several pages) as it was authored, as opposed to a page which is merely a rendering of part of some content or a collection of different pieces of content.

  23. Although IDs must be unique on a page, name attributes aren’t. So you could still embed whatever you liked in the HTML. Of course, the browser will ignore any attribute that it doesn’t know about so you could add blogML attributes to everything if you wanted to.

  24. Tony, who’s posted on this thread, sent an email to a few of us who’ve been discussing the concept of BlogML. With his permission, I’m posting parts of his email and my response (both with some minor formatting changes) as some of it may be of general interest.

    From: Tony Collen
    To: A bunch of people interested in BlogML

    To start things off, I’m Tony, I was posting about BlogML over on actsofvolition. It seems like there’s at least somewhat of a need in the community for a “standard” markup language for blogs. The main benefit that I can see in this case is interoperability. If someone has a legacy blog from Blogger, they won’t have a problem moving it over to MT or Radio. However, this all depends on the “big guys” taking up these ideas and supporting them.

    The main reason I was thinking about this was because I was coming up with some ideas for an open-source clone of Radio (and possibly the RCS, too) and would use Mozilla + XUL as an interface. I was looking for some good archiving system that didn’t rely on a relational database for storage.

    Anyway, I suggest we get a BlogML working group formed where we can toss around ideas and mull over possible DTDs, good things/bad things, changes, etc.

    Some things to consider:

    • Simplicity. Ease of use is key here.
    • Flexibility. Can we come up with a DTD that everyone will support? (I think Dan has already come up with a good starting point)
    • Ease of conversion to RSS. It’s XML so in theory, if the DTD is designed correctly it should be trivial to convert to things like RSS. Perhaps we could even release XSL stylesheets to convert from BlogML to RSS 0.92/1.0, or other “proprietary” blog archival formats, like the one that Radio keeps.

    That’s all I have for now. I’d like to see the ball get rolling on this project. Hopefully it’ll take off and get some support from the “important” people. Feel free to post anything in this message to your respective blogs.

    TC


    Tony Collen
    Ham Journalism – http://radio.weblogs.com/0100630/

    Thanks Tony. Here’s my response:

    From: Steven Garrity
    To: The same bunch of people interested in BlogML

    First, I would suggest setting up a web-based home for this discussion so we don’t end up clogging each other’s inboxes and so others can participate. Any suggestions? dan [dan@d-log.net] emailed me about the availability of blogml.org – could be a good place for discussion – and the domain would lend some credibility.

    My general thoughts on the prospects of something like BlogML:

    Tony nails it when he says “Simplicity. Ease of use is key here.” We could work on a beautifully complete schema and it would be worthless if people can’t use it quickly and easily. Balance is key here – if you try and make it robust and flexible and it becomes too complex and doesn’t catch on, you’ve shot yourself in the foot. My thoughts: keep it bone-head simple. Aim low and surprise yourself. Do one thing and do it well, blah, blah. All I was asking for was a way for search engines to tell one post from another and know where to link.

    I also think that the key to something like this actually having an impact is adoption by major Blog players (Blogger, Radio, MT, etc.). Otherwise, it would take years to catch on, and in years (hopefully) we’ll all be working in XHTML anyhow. For fast and broad adoption, you need backwards compatibility (doesn’t screw up my blog in NS4), and simplicity. Dave Winer (master of all things Userland) did post a link to the article, but in response to an email I sent him he asked “What’s wrong with RSS?”. I’d love to hear what Evan Williams (Blogger) would think about it since his work on the Blogger API brushes up against these concepts.

    On my participation: let me be clear, I’m one of those annoying people who talk a lot and never do anything (actually, I don’t think this is my typical behaviour, but it is in this case I’m afraid). I’m very interested in the idea – and I’ll talk about it as long as people will listen, but I don’t plan to dedicate much time to this (I’d say I “don’t have time”, but I’m no busier than any of the rest of you – so the truth is, I’m just not interested in putting a lot of time into this). I’m not bailing, I’m looking forward to the discussions, but don’t expect much (I’m no expert anyhow).

    I was pleasantly surprised that the post about BlogML on Acts of Volition got lots of attention (last night it was #9 on the DayPop top 40). The purpose of the post was to ask why this hadn’t been done already (I knew that if I thought of it, I couldn’t be the first – and I wasn’t), what’s out there like this, and whether it makes any sense. I got some great answers. Yes, people did think of it before me. People have put some work and thought into it. Let’s hope pooling the thoughts and efforts of these people can make something happen.

    Thanks to all of you for your feedback on aov – I guess this is what the internet is all about – a group of unrelated individuals converging on a shared concept. Cool.

    (I’m gonna post some of Tony’s original mail and my response on the thread on actsofvolition.com so any other interested parties will know what’s going on)

    Thanks,
    Steven Garrity

  25. I’d love to see an open source clone of Radio. I’m tired of the 30-day trial periods, but I don’t want to pay $40 for it. $30, maybe, but $40’s ridiculous.

    Also, as to the group of people interested in blogML: I’d be willing to donate some time to it. If someone could email me about that, it’d be great. 🙂 Thanks again…

  26. Sorry for jumping in the discussion so late. I am having problems with my Internet connection, was without mail for the last 24 hours and still my blog is unreachable :(.
    By the way, yes, there are problems with my RSS feed and namespaces. They are probably due to a bug in Xerces or Xalan. If I prefix HTML elements with “html:”, all link href’s are changed from:

    html:a href="some-URI" to

    html:a some-URI="some-URI"

    Bizarre!

  27. (This is to anyone interested in this, but directed mainly at Mr. Frost):

    The idea drawn up on BlogWorks (http://www.blogworks.com/blogml.asp) looks very nice–simple, elegant, definitely gets the job done, so it just might work. However, it brought one question to mind for me: where would the comments go? I noticed there was a commentable option (yes or no), but there was no actual place for comments in the file itself.

    So, I took the draft Mr. Frost had already drawn up and added in a few tags–nothing major at all, by any means, but it works inline comments into this draft. Check it out here: http://www.interalia.org/m3ta/filebin/blogML.xml. Also, the link just under my name goes to the post where I mentioned this, along with an explanation of the tags I added and what purpose they serve.

    Comments are welcome. Flames are not. 😉

    Cheers,
    Phil Ulrich

  28. The problem cited (bad search engine matches) is not unique to blogs. The same thing will happen to any page which includes summaries of multiple base articles. Like, uh, the navigation pages of any website.

  29. The one element most blogs contain in a permalink is a # mark trailing at the end of the URL with a unique item number to identify it.

    Why isn’t Google or any other search service set up to identify the ‘#’+number nearest to the entry and jump straight to it?

    Or to the nearest anchor?

    In blogger, you can set an anchor to sit at the top of the entry with the a name=blogitemnumber

    I use Atomz search locally on my site, The Copydesk, and it can jump straight to an individually indexed entry searched for on my site.

    I’d say this is less about webloggers having to change practice, and more about Google et al doing so.
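    A rough sketch of what that could look like on the search side, assuming each entry is preceded by an a name anchor as described above. The page URL, the markup, and the function name are made up for illustration, and the phrase is matched against the raw HTML, which a real indexer would not do.

```python
import re

# Matches <a name="101"> as well as the unquoted <a name=101> form.
ANCHOR = re.compile(r'<a\s+name="?([^"\s>]+)"?', re.IGNORECASE)


def link_to_nearest_anchor(page_url, html, phrase):
    """Return page_url plus the last named anchor that appears before the phrase."""
    pos = html.lower().find(phrase.lower())
    if pos == -1:
        return None  # phrase is not on this page at all
    nearest = None
    for match in ANCHOR.finditer(html):
        if match.start() > pos:
            break  # anchors after the phrase belong to later entries
        nearest = match.group(1)
    return f"{page_url}#{nearest}" if nearest else page_url


# Hypothetical archive page with one anchor per entry.
PAGE_URL = "http://example.com/archive.html"
PAGE = (
    '<a name="101"></a><p>Golf on Prince Edward Island.</p>\n'
    '<a name=102></a><p>Hang gliding in Newfoundland.</p>'
)

print(link_to_nearest_anchor(PAGE_URL, PAGE, "hang gliding"))
# http://example.com/archive.html#102
```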

  30. (I know this is slightly off-topic, but there should be a ‘description’ attribute for every link. Perhaps the ‘title’ attribute does this anyway. If present, Google would log this rather than the link-text. This way, blogs could still use links in flowing paragraphs – using words like ‘this’ and so on as anchor text – whilst also adding to the Google database of knowledge about sites.)

Comments are closed.