Dare Obasanjo, creator of RSS Bandit, wrote an article on RSS Bandit and the Bloglines sync API. It can be found here. Basically, Bloglines is pitching the API as a way to eliminate the bandwidth hogging that RSS aggregators have been known for. Here's a snippet of what he's talking about:
There are two aspects of this press release I'm skeptical about. The first is that having desktop aggregators fetch feeds from Bloglines versus the original sources of the feeds somehow "eliminates the RSS bandwidth bottleneck". It seems to me that the Bloglines proposal does the opposite. Instead of thousands of desktop aggregators fetching tens of thousands to hundreds of thousands of feeds from as many websites instead it is proposed that they all ping the Bloglines server. This seems to be creating a bottleneck to me, not the other way around
The only way it would not create a bottleneck is if the service did something like this:
- User 1 caches the feed on the caching server.
- The feed is checked on whatever schedule Bloglines' internal do-dah uses. Each feed is checked only once by that do-dah, not once each for Users 500, 498, and 928, who all have the same feed.
- Users 500, 498, and 928 query the feed.
- User 500 already has the latest RSS, so Bloglines says "Bad cop, no doughnut" and doesn't return anything.
- Users 498 and 928 need the feed because their copies are outdated.
- Bloglines' internal do-dah sends the RSS feed to them.
In this scenario, the aggregator would most likely report that User 500 has an error with the feed. I haven't dug much into RSS Bandit, but I believe all aggregators expect a feed when they query for one. They return error messages when the feed is broken or whatever, but most of them wouldn't know how to handle a "Hey, you already have that item cached, nothing has been updated, FOOL" response from the server. I don't think the server, or RSS in general, is in a position to make that kind of response, because how would a server know whether you have the cached copy? Bloglines might, though: it could use its internal structure to tell whether a user's copy is the latest “version” or not.
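The steps in the list above can be sketched in a few lines of Python. This is only my guess at the shape of such a service, not how Bloglines actually works: the class names, the integer version counter, and the `fetch_upstream` callback are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class CachedFeed:
    version: int  # hypothetical counter, bumped whenever the content changes
    body: str     # the RSS XML as last fetched from the original site


@dataclass
class SharedFeedCache:
    # fetch_upstream stands in for whatever really downloads a feed
    fetch_upstream: Callable[[str], str]
    feeds: Dict[str, CachedFeed] = field(default_factory=dict)
    upstream_fetches: int = 0

    def refresh(self, url: str) -> None:
        """Check the original site once, however many users subscribe."""
        body = self.fetch_upstream(url)
        self.upstream_fetches += 1
        cached = self.feeds.get(url)
        if cached is None:
            self.feeds[url] = CachedFeed(version=1, body=body)
        elif cached.body != body:
            self.feeds[url] = CachedFeed(version=cached.version + 1, body=body)

    def query(self, url: str, client_version: int) -> Optional[CachedFeed]:
        """Return the feed only when the client's copy is out of date."""
        cached = self.feeds[url]
        if client_version >= cached.version:
            return None  # the "nothing has been updated" answer
        return cached


# Two upstream checks; the content changes between them.
bodies = iter(["<rss>old</rss>", "<rss>new</rss>"])
cache = SharedFeedCache(fetch_upstream=lambda url: next(bodies))
url = "http://example.com/rss.xml"
cache.refresh(url)  # version 1
cache.refresh(url)  # content changed, version 2

# User 500 is current; Users 498 and 928 are behind.
user_versions = {500: 2, 498: 1, 928: 0}
replies = {user: cache.query(url, v) for user, v in user_versions.items()}
```

User 500 gets `None` back, which is exactly the reply an aggregator would have to learn to treat as "no change" rather than as a broken feed.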
So what is a solution to the RSS aggregator bottleneck?
To find a solution we have to first look at a sample RSS file. I'll use my full text RSS feed as an example. Items aren't listed as they're not really relevant.
channel
title Programming Reality
link http://geekswithblogs.net/jbrayton/
description Life in C#
managingEditor Jeremy Brayton
dc:language en-US
generator .Text Version 0.95.2004.102
item
item
item
/channel
The above snippet is not an actual RSS feed because .Text wasn't showing the text correctly when I saved the post. Rather than hope it showed up in publication, I thought it best to change the example; it's not as though the XML proves the point anyway.
This is a typical RSS feed and uses pretty much only the required fields. There are two very important channel fields that would be perfect for attacking the RSS bandwidth bottleneck: pubDate and lastBuildDate. pubDate is the date the content of the channel was published. This field ties heavily into each item's pubDate field and is used in the same fashion. lastBuildDate is a little more obscure to me and seems to be used mainly for updates to existing content items. So pubDate would be used to tell when new content has been added to the feed, and lastBuildDate would be used to tell whether any updates were made between the pubDate and whenever the user checked the feed.
This doesn't solve the problem, though. Why? Because you have to download the entire RSS feed and parse it, and only then can you tell whether the channel has been updated and when. The RSS aggregator would have to do the work here, comparing its locally cached feed to the feed it just downloaded. If the pubDate or lastBuildDate is different, then it should apply the new feed to the cached copy, overwriting it entirely. This may be what RSS aggregators do already, if a pubDate or lastBuildDate is even included in the feed.
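Here is a rough sketch of that comparison in Python, using only the standard library. The function names and the sample feed are mine, and I'm assuming the dates are in the RFC 822 style that RSS 2.0 uses. Note that a feed carrying neither date will always look unchanged to this check.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def channel_dates(rss_xml: str) -> dict:
    """Pull pubDate and lastBuildDate out of the channel, if present."""
    channel = ET.fromstring(rss_xml).find("channel")
    dates = {}
    for tag in ("pubDate", "lastBuildDate"):
        text = channel.findtext(tag)
        # Missing fields are recorded as None rather than raising an error.
        dates[tag] = parsedate_to_datetime(text) if text else None
    return dates


def feed_changed(cached_xml: str, downloaded_xml: str) -> bool:
    """True when either channel date differs between the two copies."""
    return channel_dates(cached_xml) != channel_dates(downloaded_xml)


# Made-up sample data: the same channel before and after a rebuild.
OLD = """<rss version="2.0"><channel>
<title>Programming Reality</title>
<pubDate>Mon, 06 Sep 2004 10:00:00 GMT</pubDate>
<lastBuildDate>Mon, 06 Sep 2004 10:00:00 GMT</lastBuildDate>
</channel></rss>"""
NEW = OLD.replace("<lastBuildDate>Mon, 06", "<lastBuildDate>Tue, 07")

unchanged = feed_changed(OLD, OLD)
changed = feed_changed(OLD, NEW)
```

The catch is visible in the signature: `feed_changed` needs the whole downloaded feed before it can say anything, so the bandwidth has already been spent.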
The solution here would be to have two feeds. The main feed would be a shell RSS feed that includes only the necessary channel information and the dates in question. All bulk content and items would be stripped out to save space. The secondary feed would be the full-text feed, including everything from the main feed. RSS aggregators would first query the main feed, then compare it with the cached copy of the secondary feed. If the date on the aggregator's cached copy doesn't match the main feed's date, then and only then would it download the full-text secondary feed and cache that copy only. Comments are viewed in RSS Bandit in roughly this same fashion, with a separate feed for the comments. Something like this wouldn't be too much of a stretch to implement from an aggregator standpoint. The problem would be getting those who build the RSS feeds to include pubDate and lastBuildDate, which may or may not be an easy thing to do.
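A minimal sketch of the two-feed check might look like this. Everything here is hypothetical: the URLs, the `refresh_feed` and `read_stamp` names, and the idea of passing in a `fetch` function (so the example runs without a network) are my own choices, not anything an existing aggregator does.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Tuple


def read_stamp(shell_xml: str) -> str:
    """Combine the shell feed's two dates into one comparable string."""
    channel = ET.fromstring(shell_xml).find("channel")
    return "|".join(
        channel.findtext(tag, default="") for tag in ("pubDate", "lastBuildDate")
    )


def refresh_feed(
    fetch: Callable[[str], str],
    shell_url: str,
    full_url: str,
    cached: Optional[Tuple[str, str]],
) -> Tuple[Tuple[str, str], bool]:
    """Return ((stamp, full_feed), downloaded_full_feed).

    `cached` is the (stamp, full_feed) pair kept from last time, or None.
    """
    stamp = read_stamp(fetch(shell_url))
    if cached is not None and cached[0] == stamp:
        return cached, False  # dates match, so skip the big download
    return (stamp, fetch(full_url)), True


# A fake fetcher that records which URLs were requested.
SHELL = "http://example.com/rss-shell.xml"
FULL = "http://example.com/rss-full.xml"
DATE = "<pubDate>Mon, 06 Sep 2004 10:00:00 GMT</pubDate>"
pages = {
    SHELL: f"<rss><channel>{DATE}</channel></rss>",
    FULL: f"<rss><channel>{DATE}<item/></channel></rss>",
}
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return pages[url]


cached, downloaded = refresh_feed(fake_fetch, SHELL, FULL, None)
cached, downloaded_again = refresh_feed(fake_fetch, SHELL, FULL, cached)
```

The first call fetches both feeds; the second fetches only the shell, which is the whole point. Comparing the dates as plain strings keeps the sketch short, at the cost of treating two differently formatted spellings of the same moment as a change.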
Conclusion
To be honest, I don't know why this issue hasn't been tackled and resolved already. I know when I open Outlook, I don't want to download every single email in my Inbox multiple times, so why on earth would I want to re-download the exact same RSS feed multiple times? In the case of my blog, I check it regularly in RSS Bandit, but it rarely gets updated due to my busy schedule. It seems like a waste for me to download a feed I know hasn't been updated multiple times in the same day. I'd rather my RSS aggregator check the feed every so often and go “Nope, nothing's changed” than go “Oh, I'll download this for the 50th time today even though I know it's the same thing I already have”. It just doesn't seem to make much sense, but RSS is still somewhere between its infancy and its teenage years. I expect issues like this to be worked through eventually.