Advertise here with Carbon Ads

This site is made possible by member support. 💞

Big thanks to Arcustech for hosting the site and offering amazing tech support.

When you buy through links on kottke.org, I may earn an affiliate commission. Thanks for supporting the site!

kottke.org. home of fine hypertext products since 1998.

Beloved by 86.47% of the web.

🍔  💀  📸  😭  🕳️  🤠  🎬  🥔

Normalized data is for sissies

I’m not a hot shot programmer by any means, but I’ve done quite a bit of playing around with getting data in and out of databases programmatically. Something that’s always confused me is the near-religion of data normalization among programmers and database admins. Every developer I’ve ever worked with has told me that when you’re building a database, you need to normalize your data — basically this means organizing your data in such a way that removes redundancy — and failure to do so would result in public ridicule and possible revocation of access to any computing device. But I’ve always wondered, given that hard drives are cheap and getting cheaper, what’s the problem with using more storage space in exchange for greater speed?

I was delighted when I read Cal Henderson’s take on normalized data from a recent talk he gave about Flickr (from page 27):

- Normalised data is for sissies
- Keep multiple copies of data around
- Makes searching faster
- Have to ensure consistency in the application logic

To which I would add: hard drives are cheap.

Cal presents normalization as a trade-off that, depending on your circumstances, might be worth looking at…which is a much more useful way of approaching the situation than what I’ve typically heard (normalize or die!). Want faster access to your data? Replicate it in the database but be aware that it’ll cost you some storage space and you’ll need to keep track of the extra data in your application (which can be a pain in the ass). In Flickr’s case, they have 13 SELECTs for every INSERT, DELETE, and UPDATE statement hitting their database. Normalization can slow SELECT speed down while denormalization makes your I/D/Us more complicated and slower. Since the application part of Flickr depends so heavily on SELECTs from the database, it makes sense for them to denormalize their data somewhat to speed things up.

Here’s an analogy for the smokers in the audience…what sucks worse than realizing you left your lighter at home and you’re stuck in traffic on the way to the office? The solution is to buy a bunch of lighters, one for your car, one for your coat pocket, one for the drawer in your office, one for your purse, etc. It’s a trade-off. Your initial cash investment is greater (but lighters, like hard drives, are cheap) and you need to be diligent about leaving each lighter in its proper place, but you’re never without a lighter when you need one.