without an e

feed list quality [07/30/2007 04:38:27]

Sam Ruby ran an analysis of the Persai Feed Corpus, showing how many of each HTTP status code he got back when he requested each feed.
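
A tally like that takes only a few lines. What follows is a rough Python sketch, not Sam's actual code: `fetch_status` and `tally_statuses` are made-up names, the 10 second timeout is an arbitrary choice, and `http.client` is used because it does not follow redirects, so a 301 or 302 is counted as such rather than silently resolved.

```python
import http.client
from collections import Counter
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """Return the raw HTTP status code for url, or None if no response arrived.

    Hypothetical helper: redirects are NOT followed, so 301/302 are reported as-is.
    """
    try:
        parts = urlsplit(url)
        if not parts.hostname:
            return None  # not an absolute http(s) URL
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
        else:
            conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("GET", path)
            return conn.getresponse().status
        finally:
            conn.close()
    except (OSError, http.client.HTTPException, ValueError):
        # DNS failure, refused connection, timeout, bad port, malformed reply.
        return None


def tally_statuses(urls, fetch=fetch_status):
    """Count how many URLs came back with each status code."""
    return Counter(fetch(url) for url in urls)
```

Calling `tally_statuses(list_of_urls)` returns a `Counter` keyed by status code, with `None` collecting the URLs that never answered. The `fetch` parameter is only there so the counting can be tried out without touching the network.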

Even after looking at the site, I don't really know what Persai is, but I have some experience with long lists of blog URLs, because I wrote the first blog aggregator, linkwatcher.

Linkwatcher predates RSS, so only a portion of the URLs it checks are feeds. (It doesn't know or care about feeds - it just hits the page and compares it to the last version it saw.) Also, the submission form hasn't worked in years, and there's no autodiscovery, so it's a small list. The first blog was added Jun 6 2000 and the last blog was added Jan 3 2001. So basically the list of linkwatcher blogs is ancient data.
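
The hit-and-compare idea can be sketched in a few lines of Python. This is an illustration only, not linkwatcher's actual code: the function names, the in-memory dict, and the choice of a SHA-1 fingerprint instead of storing the whole page are all assumptions.

```python
import hashlib


def page_digest(body):
    """Reduce a page body (bytes) to a short fingerprint."""
    return hashlib.sha1(body).hexdigest()


def has_changed(url, body, last_seen):
    """Compare a freshly fetched body against the stored fingerprint.

    last_seen is a dict mapping URL -> digest and is updated in place.
    A URL seen for the first time counts as changed.
    """
    digest = page_digest(body)
    changed = last_seen.get(url) != digest
    last_seen[url] = digest
    return changed
```

A check this naive flags a page as updated whenever any byte changes, including rotating ads or a "page generated at" timestamp, so a real aggregator would probably want to strip that kind of noise before hashing.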

I thought it would be interesting to compare the relative quality of the two lists. So I ran Sam's code against my list, and made a little table:

Message                 Status   Persai   Linkwatcher
OK                      200       45692           411
No Content              204           1             -
Multiple Choices        300          57             1
Moved Permanently       301       42569            75
Found                   302        7589            95
See Other               303          14             -
Not Modified            304           7             -
Temporary Redirect      307         338             1
Bad Request             400          83             7
Unauthorized            401          95             1
Payment Required        402           1             -
Forbidden               403         702            13
Not Found               404        1437           315
Not Acceptable          406           9             -
Request Timeout         408       16284           288
Gone                    410          45             -
Precondition Failed     412           7             -
Locked                  423           3             1
Internal Server Error   500        3559           140
Bad Gateway             502           4             -
Service Unavailable     503          28             1

Better yet, here are some bar charts.

[Bar chart: Status Codes for 118,254 Persai Feeds]

[Bar chart: Status Codes for 1,349 Linkwatcher Blogs]

[Bar chart: Persai Feeds vs Linkwatcher Blogs]

I'm not surprised to see that the linkwatcher list has so many 4xx and 5xx codes. That is only to be expected, since many of the early blogs expired long ago. I suspect many of the blogs returning status code 200 now point to spam sites created when the old bloggers let their domains lapse, but I haven't quantified this.

I don't know what any of this proves, other than that the Persai list is much bigger and quite a bit better than the linkwatcher list. I am surprised at the huge number of 301s in their list compared to mine, though (42,569 versus 75). Weird.
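
To put a rough number on "quite a bit better", the table can be collapsed into status classes (2xx, 3xx, 4xx, 5xx) and turned into percentages. A sketch in Python follows; the counts are copied from the table, and rows that show only one count are assumed to be Persai-only, an assumption that at least makes the linkwatcher column add up to the 1,349 blogs in the chart title. `COUNTS` and `share_by_class` are names invented for this example.

```python
from collections import defaultdict

# status code: (Persai count, linkwatcher count), copied from the table.
# Assumption: rows with a single count belong to Persai; linkwatcher gets 0.
COUNTS = {
    200: (45692, 411), 204: (1, 0),
    300: (57, 1), 301: (42569, 75), 302: (7589, 95),
    303: (14, 0), 304: (7, 0), 307: (338, 1),
    400: (83, 7), 401: (95, 1), 402: (1, 0), 403: (702, 13),
    404: (1437, 315), 406: (9, 0), 408: (16284, 288),
    410: (45, 0), 412: (7, 0), 423: (3, 1),
    500: (3559, 140), 502: (4, 0), 503: (28, 1),
}


def share_by_class(column):
    """Return {status class: percent of the list} for column 0 (Persai) or 1 (linkwatcher)."""
    totals = defaultdict(int)
    for status, row in COUNTS.items():
        totals[status // 100] += row[column]  # 404 -> 4, 301 -> 3, ...
    grand_total = sum(totals.values())
    return {f"{cls}xx": 100 * n / grand_total for cls, n in sorted(totals.items())}


for name, column in (("Persai", 0), ("linkwatcher", 1)):
    shares = share_by_class(column)
    print(name, {cls: round(pct, 1) for cls, pct in shares.items()})
```

By this count roughly 57 percent of the linkwatcher URLs come back 4xx or 5xx, against roughly 19 percent for Persai.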
