Programming Reality

Life in C#

RSS aggregation bottleneck and the Bloglines sync API

Dare Obasanjo, creator of RSS Bandit, wrote an article on RSS Bandit and the Bloglines sync API. It can be found here. Basically, Bloglines is pitching this API as a way to eliminate the bandwidth hogging that RSS aggregators are known for. Here's a snippet of what he's talking about:

There are two aspects of this press release I'm skeptical about. The first is that having desktop aggregators fetch feeds from Bloglines versus the original sources of the feeds somehow "eliminates the RSS bandwidth bottleneck". It seems to me that the Bloglines proposal does the opposite. Instead of thousands of desktop aggregators fetching tens of thousands to hundreds of thousands of feeds from as many websites instead it is proposed that they all ping the Bloglines server. This seems to be creating a bottleneck to me, not the other way around

The only time it would not create a bottleneck is if the service would do something like this:

  • User 1 caches a feed on the caching server.
  • The feed is checked by Bloglines' internal do-dah. Each feed is only checked once by its internal do-dah, not once more for each of Users 500, 498, and 928, who all have the same feed.
  • Users 500, 498, and 928 query the feed.
  • User 500 already has the latest RSS, so Bloglines says "Bad cop, no doughnut" and doesn't return anything.
  • Users 498 and 928 need the feed because their copies are outdated.
  • Bloglines' internal do-dah sends the RSS feed to them.
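
To make the steps above concrete, here is a minimal Python sketch of that kind of shared cache. This is only my guess at the shape of such a service, not how Bloglines actually works: `SharedFeedCache`, `fetch_from_origin`, and the version strings are all made-up names for illustration.

```python
class SharedFeedCache:
    """One cached copy per feed URL, shared by every subscriber (hypothetical)."""

    def __init__(self, fetch_from_origin):
        # fetch_from_origin is an assumed callable: url -> (version, body)
        self.fetch_from_origin = fetch_from_origin
        self.store = {}
        self.origin_fetches = 0

    def refresh(self, url):
        """Check the original site once, on behalf of all subscribers."""
        self.store[url] = self.fetch_from_origin(url)
        self.origin_fetches += 1

    def query(self, url, client_version=None):
        """Return (version, body); body is None when the client is up to date."""
        version, body = self.store[url]
        if client_version == version:
            return version, None  # "no doughnut": nothing new, but not an error
        return version, body


cache = SharedFeedCache(lambda url: ("v2", "<rss>...</rss>"))
cache.refresh("http://example.com/rss")          # one fetch for everybody
print(cache.query("http://example.com/rss", "v2"))  # User 500: already current
print(cache.query("http://example.com/rss", "v1"))  # User 498: gets the feed
```

The important part is that `refresh` runs once per feed no matter how many users subscribe, while `query` needs the client to say which version it already holds.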

In this scenario, the aggregator would most likely report that User 500 has an error with the feed. I haven't dug much into RSS Bandit, but I believe all aggregators expect a feed back when they query for one. They return error messages when the feed is broken, but most of them wouldn't know how to handle a "Hey, you already have that item cached, nothing has been updated, FOOL" response from the server. I don't think the server, or RSS in general, is in a position to make that kind of response, because how would a server know whether you have a cached copy? Bloglines might, since it could use its internal structure to tell whether a user's copy is the latest “version” or not.

So what is a solution to the RSS aggregator bottleneck?

To find a solution we have to first look at a sample RSS file. I'll use my full text RSS feed as an example. Items aren't listed as they're not really relevant.

channel
  title Programming Reality
  link http://geekswithblogs.net/jbrayton/
  description Life in C#
  managingEditor Jeremy Brayton 
  dc:language en-US 
  generator .Text Version 0.95.2004.102
  item
  item
  item
/channel

The above snippet is not an actual RSS feed because .Text wasn't showing the XML correctly when I saved the post. Rather than hope it showed up in publication, I thought it'd be best to change the example; it's not like the XML proves the point anyway.

This is a typical RSS feed and uses pretty much only the required fields. There are 2 very important channel fields that would be perfect for tackling RSS bandwidth bottlenecks: pubDate and lastBuildDate. pubDate is the date the content of the channel was published. This field ties heavily into the item's pubDate field and is used in the same fashion. lastBuildDate is a little more obscure to me and seems to be used mainly for updates to existing content items. So pubDate would be used to tell when new content has been added to the feed, and lastBuildDate would be used to tell whether any updates were made between the pubDate and whenever the user checked the feed.
 
This doesn't solve the problem though. Why? Because you have to download the entire RSS feed and parse it before you can tell whether the channel has been updated and when. The RSS aggregator would have to do the work here, comparing its locally cached feed to the feed it just downloaded. If the pubDate or lastBuildDate is different, then it should apply the new feed to the cached copy, overwriting it entirely. This may be what RSS aggregators do already, if a pubDate or lastBuildDate is even included in the feed.
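
As a rough illustration of that comparison, here is a small Python sketch using only the standard library. It assumes an RSS 2.0 layout (a channel element directly under the root) and dates in the usual RFC 822 style; `channel_dates` and `feed_changed` are names I made up, and I have no idea whether any real aggregator does it this way.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def channel_dates(rss_text):
    """Return (pubDate, lastBuildDate) as datetimes, with None for a missing field."""
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        raise ValueError("no <channel> element found")
    dates = []
    for tag in ("pubDate", "lastBuildDate"):
        text = channel.findtext(tag)
        dates.append(parsedate_to_datetime(text.strip()) if text else None)
    return tuple(dates)


def feed_changed(cached_rss, downloaded_rss):
    """True when the downloaded copy should overwrite the cached one."""
    new_dates = channel_dates(downloaded_rss)
    if new_dates == (None, None):
        return True  # the feed carries no dates, so we cannot rule out a change
    return channel_dates(cached_rss) != new_dates
```

Note that both copies are already fully downloaded by the time `feed_changed` runs, which is exactly why this saves no bandwidth on its own.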
 
The solution here would be to have 2 feeds. The main feed would be a shell RSS feed containing only the necessary channel information and the dates in question. All bulk content and items would be stripped out to save space. The secondary feed would be the full text feed, including everything from the main feed. RSS aggregators would first query the main feed, then compare its dates with those of the cached copy of the secondary feed. If the dates on the aggregator's cached copy don't match the dates in the main feed, then and only then would it download the full text secondary feed and cache that copy. Comments are viewed in RSS Bandit in roughly this same fashion, with a separate feed for the comments. Something like this wouldn't be too much of a stretch to implement from an aggregator standpoint. The problem would be getting those who build RSS feeds to include pubDate and lastBuildDate, which may or may not be an easy thing to do.
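
Here is a minimal Python sketch of how an aggregator might handle the two feeds. Everything in it is an assumption on my part: `fetch` stands in for whatever download function the aggregator uses, the cache is a plain dict, and the shell feed is assumed to be RSS 2.0 with the dates on the channel.

```python
import xml.etree.ElementTree as ET


def shell_dates(shell_rss):
    """Pull the raw pubDate and lastBuildDate strings out of the shell feed."""
    channel = ET.fromstring(shell_rss).find("channel")
    return (channel.findtext("pubDate"), channel.findtext("lastBuildDate"))


def sync_feed(fetch, shell_url, full_url, cache):
    """Return (full_feed, downloaded); the full feed is fetched only on a date change.

    fetch: hypothetical callable, url -> feed text
    cache: dict remembering "dates" and "full" from the previous sync
    """
    dates = shell_dates(fetch(shell_url))      # small download, every time
    if "full" in cache and cache.get("dates") == dates:
        return cache["full"], False            # cached copy is still current
    cache["full"] = fetch(full_url)            # big download, only when needed
    cache["dates"] = dates
    return cache["full"], True
```

The aggregator still makes one request per check, but the request is for the tiny shell feed rather than the full text of every item.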
 
Conclusion
 
To be honest, I don't know why this issue hasn't been tackled and solved already. When I open Outlook, I don't want to download every single email in my Inbox multiple times, so why on earth would I want to re-download the exact same RSS feed multiple times? In the case of my blog, I check it regularly in RSS Bandit, but it rarely gets updated due to my busy schedule. It seems like a waste for me to download a feed I know hasn't been updated multiple times in the same day. I'd rather my RSS aggregator check the feed every so often and go “Nope, nothing's changed” than go “Oh, I'll download this for the 50th time today even though I know it's the same thing I already have”. It just doesn't seem to make much sense, but RSS is still somewhere between its infancy and its teenage years. I expect issues like this to be worked through eventually.

posted on Wednesday, September 29, 2004 4:42 PM | Filed Under [ Information Technology Software ]

Feedback

# re: RSS aggregation bottleneck and the Bloglines sync API

Most aggregators, including RSS Bandit, support HTTP conditional GET requests, meaning that the feed is only downloaded when it has changed. So RSS Bandit does NOT download your feed multiple times a day if it hasn't changed. It does ask your web server multiple times a day whether it has changed, though.
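
For readers who haven't seen one, a conditional GET can be sketched in Python roughly as follows. This is not RSS Bandit's code, just an illustration with the standard `urllib` module; the function names are made up, and it assumes the server sent an ETag or Last-Modified header on an earlier download.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying the validators saved from the last download."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when nothing has changed."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: keep using the cached copy
            return None, etag, last_modified
        raise
```

A 304 response has no body, so an unchanged feed costs only the request and response headers.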

People like Sam Ruby are engaging in proposals to make this even more optimal by suggesting that if the feed has changed only the items the aggregator hasn't seen are shown in the feed so there is no redundancy in the feed. This is basically requiring servers to support a degree of query ability when fetching RSS feeds.
9/30/2004 1:27 PM | Dare Obasanjo

# re: RSS aggregation bottleneck and the Bloglines sync API

As I read more of Sam's blog, RFC 3229 with feeds looks like the best way to do it. Getting delta changes over HTTP would solve the bandwidth problem for both XML (feed) and normal HTTP content.

I suppose the aggregator could specify how far back it wants to see content. There are a couple of blogs that I wouldn't mind archiving, going back as far as I can without having to go to the web for it. On the flip side, there may be some blogs where I'd rather see only the last 15 items, following the RSS feed exactly and caching nothing. I'm sure a degree of this is in RSS Bandit already, so implementation would be minimal.

The server side may be fun, but I wouldn't think it'd take much. I believe RFC 3229 is implemented in Apache 1.x and IIS 5.x, but I'm probably wrong. Introducing it would be a bit of a challenge, but I suspect one could just write an RFC 3229 with feeds implementation for whatever server they need.
9/30/2004 3:27 PM | Jeremy Brayton

# re: RSS aggregation bottleneck and the Bloglines sync API

I believe you are only viewing the bandwidth problem from the client perspective. As far as those hosting individual blogs, having Bloglines as the main aggregator of feeds, which desktop aggregators then query, would reduce the bandwidth issues seen by the individual blog hosters. Bloglines does, though, become the single point of failure.

I'm pretty sure Bloglines does maintain a consolidated feed repository, which is referenced by each subscriber. It would be silly and more expensive to maintain duplicate feed content per subscriber.
1/5/2005 4:39 PM | Ryan Cromwell