Okay, I'll chase ONE new story today. But it's about this fundamental problem of converting old media objects into new ones, and I get to dig up some old blog posts too, I feel like I'm still in character.
Google's counting method relies entirely on its enormous metadata collection--almost one billion records--which it winnows down by throwing out duplicates and non-book items like CDs. The result is a book count that's arrived at by a kind of process of elimination. It's not so much that Google starts with a fixed definition of "book" and then combs its records to identify objects with those characteristics; rather, the GBS algorithm seeks to identify everything that is clearly not a book, and to reject all those entries. It also looks for collections of records that all identify the same edition of the same book, but that are, for whatever reason (often a data entry error), listed differently in the different metadata collections that Google subscribes to.
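The elimination-and-merge logic described above can be sketched in a few lines of Python. To be clear, this is a toy illustration, not Google's actual pipeline: the record fields and the `(title, author, year)` matching key are my own assumptions, standing in for whatever far more elaborate rules GBS actually uses.

```python
# Toy sketch of a "count by elimination" approach: reject records
# that are clearly not books, then collapse records that describe
# the same edition under slightly different entries.
# (Field names and the dedup key are illustrative assumptions.)

def count_books(records):
    """Count distinct book editions by elimination and deduplication."""
    NON_BOOK_FORMATS = {"cd", "dvd", "map", "microform"}
    seen = set()
    for rec in records:
        # Step 1: throw out entries that are clearly not books.
        if rec.get("format", "").lower() in NON_BOOK_FORMATS:
            continue
        # Step 2: merge records for the same edition. Normalizing the
        # key means data-entry variants (stray spaces, case differences)
        # collapse into one entry.
        key = (
            rec.get("title", "").strip().lower(),
            rec.get("author", "").strip().lower(),
            rec.get("year"),
        )
        seen.add(key)
    return len(seen)

records = [
    {"title": "Moby-Dick", "author": "Melville", "year": 1851, "format": "book"},
    {"title": "moby-dick ", "author": "MELVILLE", "year": 1851, "format": "book"},  # duplicate via entry errors
    {"title": "Whale Songs", "author": "Various", "year": 1999, "format": "CD"},    # not a book
]
print(count_books(records))  # → 1
```

The interesting design point, and the source of Google's troubles, is step 2: the count is only as good as the matching key, and a key built from error-riddled metadata will split one edition into many or fuse different books into one.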
But the problem with Google's count, as is clear from the GBS count post itself, is that GBS's metadata collection is riddled with errors of every sort. Or, as linguist and GBS critic Geoff Nunberg put it last year in a blog post, Google's metadata is a "train wreck: a mish-mash wrapped in a muddle wrapped in a mess."
It's not just Google that has a problem. I wrote a post for Wired.com last week ("Why Metadata Matters for the Future of E-books") about how increased reliance on metadata is affecting publishers of new books, who also depend heavily on digital search -- and, more generally, about how the bibliographic and legal arcana around e-books affect what we see, and how we come to see it, more than you'd think.
But I wish I'd added Google's woeful records to the piece. It's not like I didn't know about it; here's the title of a post I wrote a year ago, also citing Nunberg's post when it first appeared at Language Log: "Scholars to Google: Your Metadata Sucks".