kottke.org

...is a weblog about the liberal arts 2.0 edited by Jason Kottke since March 1998.

PHP5, DOM, scraping web pages, and McSweeney's Lists RSS

In preparation for a larger project, I recently spent some time playing around with PHP5's DOM support to scrape web pages. Basically you point your script at a page and use the DOM methods to root around in it. This little chunk of code gets you a list of all the <p> elements in document.html:

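// Parse the HTML file into a DOM tree.
$dom = new DOMDocument();
$file = 'document.html';
$dom->loadHTMLFile($file);
// Collect every <p> element in the document as a DOMNodeList.
$pgs = $dom->getElementsByTagName("p");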
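To actually get at the text, you loop over the node list. Here is a minimal sketch of that, assuming a small inline HTML string in place of document.html (the sample markup and the $texts variable are made up for illustration) so it runs on its own:

```php
<?php
// Hypothetical sample markup standing in for document.html.
$html = '<html><body><p>First paragraph.</p>'
      . '<div><p>Second <b>one</b>.</p></div></body></html>';

// Collect parse warnings quietly instead of printing them;
// real-world HTML is rarely clean.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName() returns a DOMNodeList that foreach can walk.
$texts = array();
foreach ($dom->getElementsByTagName('p') as $p) {
    // textContent flattens the element and its children to plain text.
    $texts[] = trim($p->textContent);
}

print_r($texts);
```

Nested markup such as the <b> tag is folded into the paragraph's text, which is usually what you want when scraping.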

I never learn anything like this without a little project to do, so I decided to use the above to make an RSS feed for McSweeney's Lists. It currently doesn't have one, and now that I'm using a newsreader to keep up with the web, I never remember to visit there on anything resembling a regular basis. I've got a cron job set up that fetches the lists page each night, using Tidy to convert their circa-1999 HTML to proper XHTML that can be easily parsed with the DOM. The script scans the page for new lists, puts any it finds in a DB, and then writes an RSS file.
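The scan-and-write part of that routine could look roughly like the sketch below. This is not the actual script: the sample markup, the example.com URLs, the feed title, and the $seen array (a stand-in for the database) are all assumptions, and the Tidy step is skipped because the sample markup is already clean.

```php
<?php
// Hypothetical stand-in for the tidied lists page.
$xhtml = '<html><body><ul>'
       . '<li><a href="/lists/one.html">List One</a></li>'
       . '<li><a href="/lists/two.html">List Two</a></li>'
       . '</ul></body></html>';

// Hypothetical stand-in for the DB of lists already seen (href => title).
$seen = array('/lists/one.html' => 'List One');

libxml_use_internal_errors(true);
$page = new DOMDocument();
$page->loadHTML($xhtml);
libxml_clear_errors();

// Scan every link and record the ones that are not stored yet.
$new = array();
foreach ($page->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (!isset($seen[$href])) {
        $seen[$href] = trim($a->textContent);
        $new[] = $href;
    }
}

// Build the RSS file with the same DOM API.
$rss = new DOMDocument('1.0', 'UTF-8');
$rss->formatOutput = true;
$root = $rss->createElement('rss');
$root->setAttribute('version', '2.0');
$rss->appendChild($root);
$channel = $rss->createElement('channel');
$root->appendChild($channel);
$channel->appendChild($rss->createElement('title', 'Example lists feed'));
$channel->appendChild($rss->createElement('link', 'http://example.com/lists/'));
$channel->appendChild($rss->createElement('description', 'Feed built from scraped links'));

foreach ($seen as $href => $title) {
    $item = $rss->createElement('item');
    // createTextNode() escapes &, < and > in scraped text.
    $t = $rss->createElement('title');
    $t->appendChild($rss->createTextNode($title));
    $item->appendChild($t);
    $l = $rss->createElement('link');
    $l->appendChild($rss->createTextNode('http://example.com' . $href));
    $item->appendChild($l);
    $channel->appendChild($item);
}

$feed = $rss->saveXML();
echo $feed;
```

In a cron job you would probably swap the final echo for $rss->save() with a file path, and replace $seen with real database reads and writes.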

Anyway, here's the RSS feed for McSweeney's Lists. Since it relies on screen scraping, my meagre PHP skills, and the good graces of McSweeney's in not asking me to shut it down, there's no guarantee this will work forever, so enjoy it while you can. I'm trying out Feedburner as well, so we'll see how that goes.

Update: my code snippet was incorrect and is now fixed. Thanks to Eliot for pointing that out.

Update: As some of you may have noticed, the above RSS feed has not worked for some months now...it broke at some point and I never got around to fixing it. Additionally, McSweeney's has contacted me and asked me to discontinue the feed, so it won't ever be fixed. They're looking at doing their own RSS feeds and hopefully that will happen sooner rather than later.

By Jason Kottke    Apr 25, 2005 at 11:07 am
