kottke.org - home of fine hypertext products

PHP5, DOM, scraping web pages, and McSweeney's Lists RSS

In preparation for a larger project, I recently spent some time playing around with PHP5's DOM support to scrape web pages. Basically you point your script at a page and use the DOM methods to root around in it. This little chunk of code gets you a list of all the <p> elements in document.html, each one a node you can dig into further:

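// Parse the HTML file into a DOM tree.
$dom = new DOMDocument();
$file = 'document.html';
$dom->loadHTMLFile($file);
// Collect every <p> element in the document as a DOMNodeList.
$pgs = $dom->getElementsByTagName("p");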
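
To actually get at the text, you loop over that list. Here's a minimal sketch of how that might look. The inline $html string is a made-up stand-in for document.html, and the libxml_use_internal_errors() call is my assumption about how you'd want to quiet the parser warnings that messy real-world markup tends to produce.

```php
<?php
// Hypothetical inline markup standing in for document.html.
$html = '<html><body><p>First paragraph.</p><div><p>Second <b>one</b>.</p></div></body></html>';

// Collect parser warnings instead of printing them; scraped HTML is rarely clean.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$texts = array();
foreach ($dom->getElementsByTagName('p') as $p) {
    // textContent flattens the element and all of its descendants to plain text.
    $texts[] = trim($p->textContent);
}

print_r($texts);
```

Each item in the list is a full DOM node, so you could just as easily call getAttribute() on it or walk its childNodes rather than flattening it to text.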

I never learn anything like this without a little project to do, so I decided to use the above to make an RSS feed for McSweeney's Lists. The site doesn't currently have one, and now that I'm using a newsreader to keep up with the web, I never remember to visit on anything resembling a regular basis. I've got a cron job set up that fetches the lists page each night (using Tidy to convert their circa-1999 HTML to proper XHTML that can be easily parsed with the DOM), scans it for new lists (putting any it finds in a DB), and then writes an RSS file.
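
The nightly job boils down to three steps: tidy, scan, write. What follows is a rough sketch of that shape and not the actual script. The $page markup, the assumption that each list is a plain <a> link, the $seen array standing in for the DB, and the example.com feed details are all invented for illustration, and the Tidy step only runs if the tidy extension happens to be installed.

```php
<?php
// Hypothetical stand-in for the fetched lists page; the real markup is not shown here.
$page = '<html><body><a href="/lists/one.html">List One</a>'
      . '<a href="/lists/two.html">List Two</a></body></html>';

// Step 1: clean up the markup with Tidy when the extension is available.
if (function_exists('tidy_repair_string')) {
    $page = tidy_repair_string($page, array('output-xhtml' => true), 'utf8');
}

// Step 2: scan the page for links that have not been seen before.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();

// Stand-in for the database of already-recorded lists, keyed by URL (assumption).
$seen = array('/lists/one.html' => true);

$fresh = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '' && !isset($seen[$href])) {
        $fresh[] = array('title' => trim($a->textContent), 'link' => $href);
        $seen[$href] = true;
    }
}

// Step 3: write a minimal RSS 2.0 document containing the new items.
$rss = new DOMDocument('1.0', 'UTF-8');
$rss->formatOutput = true;
$root = $rss->appendChild($rss->createElement('rss'));
$root->setAttribute('version', '2.0');
$channel = $root->appendChild($rss->createElement('channel'));
$channel->appendChild($rss->createElement('title', 'Example lists feed'));
$channel->appendChild($rss->createElement('link', 'http://example.com/lists/'));
$channel->appendChild($rss->createElement('description', 'Hypothetical scraped feed'));

foreach ($fresh as $entry) {
    $item = $channel->appendChild($rss->createElement('item'));
    // Text nodes are escaped on output, so titles containing & or < stay valid XML.
    $item->appendChild($rss->createElement('title'))
         ->appendChild($rss->createTextNode($entry['title']));
    $item->appendChild($rss->createElement('link'))
         ->appendChild($rss->createTextNode($entry['link']));
}

$xml = $rss->saveXML();
echo $xml;
```

A real feed would want absolute URLs in each <link> and probably a pubDate per item, and the $seen array would be replaced by a lookup against whatever DB you're using.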

Anyway, here's the RSS feed for McSweeney's Lists. Since it relies on screen scraping, my meagre PHP skills, and the good graces of McSweeney's in not asking me to shut it down, there's no guarantee this will work forever, so enjoy it while you can. I'm trying out Feedburner as well, so we'll see how that goes.

Update: my code snippet was incorrect and is now fixed. Thanks to Eliot for pointing that out.

Update: As some of you may have noticed, the above RSS feed has not worked for some months now...it broke at some point and I never got around to fixing it. Additionally, McSweeney's has contacted me and asked me to discontinue the feed, so it won't ever be fixed. They're looking at doing their own RSS feeds and hopefully that will happen sooner rather than later.

More about this page

This entry was published on April 25, 2005 at 11:07 am.

kottke.org is a weblog about the liberal arts 2.0 edited by Jason Kottke since March 1998. You can read about me and kottke.org here. If you've got questions, concerns, or an interesting link for me, send them along. Here's the kottke.org RSS feed.
