Tag frequency and popularity acceleration  NOV 03 2006

As many of you don't know, I've been working less-than-diligently[1] on a project with the eventual goal of adding tags to kottke.org. I posted some early results back in August of 2005. The other day, I started thinking about how tags could help people get a sense of what's been talked about recently on the site, like Flickr's listing of hot tags. I started by compiling a list of tags from the last 300 entries and ordering them by how many times they were used over that period. Here are the top 20 (with # of instances in parentheses):

photography (33), books (26), art (26), science (22), tv (21), movies (21), lists (20), video (17), nyc (16), weblogs (15), design (14), interviews (13), bestof (13), business (12), thewire (12), food (11), sports (11), games (10), language (10), music (9)
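For the programmers, here is a minimal sketch of that first pass in Python. The shape of the data (a list of entries, oldest to newest, each one a list of tag strings) and the function name are assumptions for illustration; the actual kottke.org data and code are not shown in this post.

```python
from collections import Counter

def top_recent_tags(entries, window, top_n=20):
    """Count tag usage over the most recent `window` entries.

    `entries` is assumed to be ordered oldest to newest, with each
    entry given as a list of tag strings (a hypothetical data shape).
    """
    counts = Counter()
    for tags in entries[-window:]:
        counts.update(set(tags))  # count a tag at most once per entry
    return counts.most_common(top_n)

# Toy data, not real kottke.org entries.
entries = [
    ["photography", "art"],
    ["books"],
    ["photography", "nyc"],
    ["photography", "books"],
]
result = top_recent_tags(entries, window=3, top_n=2)
print(result)
```

The toy call only looks at the last three entries, so the "art" tag from the oldest entry never enters the count.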

The items in bold also appear in the top 50 of the all-time popular tags, so obviously this list isn't telling us anything new about what's going on around here. To weed those always-popular tags from the list, I compared the recent frequency of each tag with its all-time frequency and came up with a list of tags that are freakishly popular right now compared to how popular they usually are. Call this list a measure of the popularity acceleration of each tag. The top 20:

blindside (3), pablopicasso (3), ghostmap (3), davidsimon (5), poptech2006 (4), thewire (12), andywarhol (3), michaellewis (4), education (4), youtube (4), richarddawkins (5), realestate (3), crime (8), working (8), school (3), dvd (4), georgewbush (4), stevenjohnson (5), writing (4), photoshop (3)

(Note: I also removed tags with fewer than three instances from this list and the ones below.) Now we're getting somewhere. None of these appear in the top 50 all-time list. But it's still not that accurate a list of what's been going on here recently. I've posted 3 times about Photoshop, but you can't discount entirely the 33 posts about photography. What's needed is a mix of the two lists: generally popular tags that are also popular right now (first list) + generally unpopular tags that are popular right now (second list). So I blended the two lists together in different proportions:

75% recent / 25% all-time:
davidsimon (5), poptech2006 (4), ghostmap (3), pablopicasso (3), blindside (3), thewire (12), andywarhol (3), michaellewis (4), education (4), photography (33), art (26), youtube (4), tv (21), richarddawkins (5), books (26), crime (8), video (17), working (8), realestate (3), science (22)

67% recent / 33% all-time:
davidsimon (5), poptech2006 (4), pablopicasso (3), ghostmap (3), blindside (3), thewire (12), andywarhol (3), photography (33), art (26), michaellewis (4), education (4), tv (21), books (26), youtube (4), video (17), science (22), richarddawkins (5), crime (8), movies (21), lists (20)

50% recent / 50% all-time:
thewire (12), davidsimon (5), photography (33), poptech2006 (4), blindside (3), ghostmap (3), pablopicasso (3), art (26), books (26), tv (21), science (22), movies (21), lists (20), andywarhol (3), video (17), michaellewis (4), education (4), nyc (16), weblogs (15), crime (8)

25% recent / 75% all-time:
photography (33), art (26), books (26), tv (21), science (22), movies (21), lists (20), thewire (12), video (17), nyc (16), weblogs (15), davidsimon (5), poptech2006 (4), design (14), interviews (13), bestof (13), blindside (3), ghostmap (3), pablopicasso (3), business (12)

The 75% and 67% recent lists look like a nice mix of the newly & perennially popular and a fairly accurate representation of the last 3 weeks of posts on kottke.org.

Digression for programmers and math enthusiasts only: I'm curious to know how others would have handled this issue. I approached the problem in the most straightforward manner I could think of (using simple algebra) and the results are pretty good, but it seems like an approach that makes use of an equation that approximates the distribution of the popularity of the tags (which roughly follows a power law curve) would work better. Here's what I did for each tag (using the nyc tag as an example):

# of recent entries: 300
# of total entries: 3399
# of recent instances of the nyc tag: 16
# of total instances of the nyc tag: 247
# of instances of the most frequent recent tag: 33
# of instances of the most frequent tag, all-time: 272

Calculate the recent and all-time frequencies of the nyc tag:
16/300 = 0.0533
247/3399 = 0.0726

Then divide the recent frequency by the all-time frequency to get the popularity acceleration:
0.0533/0.0726 = 0.7342
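The two steps above can be sketched in Python like so. The function names are mine, not from any existing code. Carrying full precision gives roughly 0.7339 rather than 0.7342, since the hand calculation works from the shortened four-digit frequencies.

```python
def frequency(tag_count, entry_count):
    """Fraction of entries that carry the tag."""
    return tag_count / entry_count

def acceleration(recent_count, recent_entries, total_count, total_entries):
    """Recent frequency divided by all-time frequency.

    Above 1.0: the tag is used more often lately than usual.
    Below 1.0: the tag is used less often lately than usual.
    """
    recent = frequency(recent_count, recent_entries)
    all_time = frequency(total_count, total_entries)
    return recent / all_time

# nyc tag, using the counts listed above
nyc_accel = acceleration(16, 300, 247, 3399)
print(round(nyc_accel, 4))
```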

That's how much more popular the nyc tag is now than it has been all-time. In other words, the nyc tag is 0.7342 times as popular over the last 300 entries as it has been overall, or roughly 1/4 less popular than it usually is. To get the third list with the 75% emphasis on popularity acceleration and 25% on all-time popularity, I started by normalizing the popularity acceleration and all-time frequency by dividing the nyc tag values by the top value of the group in each case (11.33 is the popularity acceleration of the blindside tag and 0.11 is the recent frequency of the photography tag (33/300)):

0.7342/11.33 = 0.0647
0.0533/0.11 = 0.4845

So, the nyc tag has a popularity acceleration of 0.0647 times that of the blindside tag and has a recent frequency that is 0.4845 times that of the most popular recent tag. Then:

0.0647*0.75 + 0.4845*0.25 = 0.1696
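Here is a rough Python version of the normalize-and-blend step for the nyc tag. The function and parameter names are made up for this sketch. Note that the second term in the worked numbers is the normalized recent frequency (0.0533/0.11), so that is what the sketch uses. At full precision the score comes out near 0.1698 instead of 0.1696; the difference is only the shortened intermediate values.

```python
def blended_score(accel, top_accel, recent_freq, top_recent_freq,
                  accel_weight=0.75):
    """Weighted mix of normalized acceleration and normalized recent frequency.

    Each value is divided by the largest value in its group, so both
    terms fall between 0 and 1 before they are weighted.
    """
    norm_accel = accel / top_accel
    norm_freq = recent_freq / top_recent_freq
    return accel_weight * norm_accel + (1 - accel_weight) * norm_freq

nyc_accel = (16 / 300) / (247 / 3399)   # popularity acceleration of nyc
score = blended_score(
    accel=nyc_accel,
    top_accel=11.33,             # blindside, the top acceleration (from the text)
    recent_freq=16 / 300,        # nyc recent frequency
    top_recent_freq=33 / 300,    # photography, the top recent frequency
)
print(round(score, 4))
```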

Calculate this number for each recent tag, rank them from highest to lowest, and you get the third list above. Now, it seems to me that I may have fudged something in the last two steps, but I'm not exactly sure. And if I did, I don't know what got fudged. Any help or insight would be appreciated.
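For anyone who wants to poke at the ranking step, here is a minimal sketch of the whole procedure over a handful of tags. Several things here are assumptions: the function name, the dict-based inputs, the all-time count of 3 for blindside (inferred from its 11.33 acceleration, not stated above), and the all-time count of 272 for photography (a placeholder; 272 is the top all-time count, but I haven't said which tag it belongs to). Treat the output as an illustration of the mechanics, not as the real lists.

```python
def rank_tags(recent, all_time, recent_entries, total_entries,
              accel_weight=0.75, min_recent=3):
    """Rank tags by a blend of acceleration and recent frequency.

    `recent` and `all_time` map tag -> number of entries using it.
    Tags with fewer than `min_recent` recent instances are dropped.
    """
    freq, accel = {}, {}
    for tag, n in recent.items():
        if n < min_recent:
            continue
        freq[tag] = n / recent_entries
        accel[tag] = freq[tag] / (all_time[tag] / total_entries)
    if not accel:
        return []
    top_accel = max(accel.values())
    top_freq = max(freq.values())
    scores = {
        tag: accel_weight * accel[tag] / top_accel
        + (1 - accel_weight) * freq[tag] / top_freq
        for tag in accel
    }
    return sorted(scores, key=scores.get, reverse=True)

recent = {"nyc": 16, "blindside": 3, "photography": 33}
# blindside and photography all-time counts are assumptions (see above)
all_time = {"nyc": 247, "blindside": 3, "photography": 272}

mostly_accel = rank_tags(recent, all_time, 300, 3399, accel_weight=0.75)
mostly_freq = rank_tags(recent, all_time, 300, 3399, accel_weight=0.25)
print(mostly_accel)
print(mostly_freq)
```

With these inputs, shifting the weight from 0.75 to 0.25 moves blindside from the top of the ranking to the bottom, which is the same kind of reshuffling visible between the blended lists above.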

[1] Great artists ship. Mediocre artists ship slowly.

