I'm not a hot shot programmer by any means, but I've done quite a bit of playing around with getting data in and out of databases programmatically. Something that's always confused me is the near-religion of data normalization among programmers and database admins. Every developer I've ever worked with has told me that when you're building a database, you need to normalize your data -- basically this means organizing your data in such a way that removes redundancy -- and failure to do so would result in public ridicule and possible revocation of access to any computing device. But I've always wondered, given that hard drives are cheap and getting cheaper, what's the problem with using more storage space in exchange for greater speed?
I was delighted when I read Cal Henderson's take on normalized data from a recent talk he gave about Flickr (from page 27):
- Normalised data is for sissies
- Keep multiple copies of data around
- Makes searching faster
- Have to ensure consistency in the application logic
To which I would add: hard drives are cheap.
Cal presents normalization as a trade-off that, depending on your circumstances, might be worth looking at...which is a much more useful way of approaching the situation than what I've typically heard (normalize or die!). Want faster access to your data? Replicate it in the database but be aware that it'll cost you some storage space and you'll need to keep track of the extra data in your application (which can be a pain in the ass). In Flickr's case, they have 13 SELECTs for every INSERT, DELETE, and UPDATE statement hitting their database. Normalization can slow SELECT speed down while denormalization makes your I/D/Us more complicated and slower. Since the application part of Flickr depends so heavily on SELECTs from the database, it makes sense for them to denormalize their data somewhat to speed things up.
Here's an analogy for the smokers in the audience...what sucks worse than realizing you left your lighter at home and you're stuck in traffic on the way to the office? The solution is to buy a bunch of lighters, one for your car, one for your coat pocket, one for the drawer in your office, one for your purse, etc. It's a trade-off. Your initial cash investment is greater (but lighters, like hard drives, are cheap) and you need to be diligent about leaving each lighter in its proper place, but you're never without a lighter when you need one.
In some ways, normalizing databases is functionally similar to semantic markup, and we know how you feel about that.
Of course, this is the argument of people who preach abstinence, too, so maybe it's not the best technique.
True, but you will end up spending more cash in the long run because you'll be smoking more cigarettes!
Sure, I denormalize for speed, but note that denormalizing implies that you have normal data in the first place.
Before preaching "oh, don't worry about it", I suggest you fix a few screwed up schemas full of ambigious data first.
Which is a quote taken out of context...I followed that up with "which can be a pain in the ass"...that's hardly glossing. But the point is that whether to normalize or not is a trade-off between several factors which depends on the situation at hand. And yes, keeping tabs on duplicate data is quite an important factor, probably the most important one in most situations.
I think the reason normalization is preached as religion is because it's a rule that you ought not break until you understand where/when to break it.
Yeah, my developer friends probably kept me in the kiddie pool on this one. I wish they hadn't though...the bigger picture is much more interesting, worthy of discussion, and is an opportunity for real creativity in database/app design.
Admittedly, I don't know squat about db architecting, but it seems like database normalization is aimed at fixing a lot of problems up front. Keeping track of all the redundant insert/delete/updating leaves a loooooooot of room for error if you (or the several people working on the db) forget something. In the cigarette lighter analogy, it seems like once the db got big enough, you'd have to carry around a crate of lighters and drop them behind you Hansel-and-Gretel-style.
There is one justification for denormalization in limited cases, and that is when you have a historical table that should not change when the normalized data changes. For instance, in an invoice table, you do not want the invoices to change retroactively if your pricing changes, so you will copy a snapshot of the pricing data at invoice time. In this case, you specifically do NOT want changes to be propagated, and the two columns are actually slightly different in semantics anyway (frozen vs. mutable values), so strictly speaking this isn't really a form of denormalization.
There are probably many situations where such an in-depth analysis of a problem is not required to attain a solution, but most systems still benefit the most from this type of architecural process. Implying that the concept of data normalization is one that should be tossed out the window is going overboard in the same way that it would be to say every acceptable schema needs to acheive Third Normal Form. Finally, in an enterprise-level system where a huge amount of data is operated on by large pre-defined sets of queries that work together to build the output, stored procedures would improve the efficiency of those transactions.
What about the perf impact of locking and concurrency? Is Flickr executing their data modifications in transactions? Are these transactions holding locks on 13 tables while they are executing? How does that impact the perf of the select queries?
I always start out db design by normalizing as much as possible, and then denormalize only when actual testing indicates it will boost performance.
Good comment, Fazal. One of my colleagues always makes fun of me for spending time thinking out how to save cycles in my string manipulation code, while ignoring the fact that the app might be making several remote machine calls via db access or web services ;).
Also with the explosion of web development becoming accessible to the masses via easier tools you have alot of folks out there who just don't know any better. Top that off with even more customers who know even less about what to ask when a software company comes in and pitches a product and I imagine that many(most?) web app db's are not normalized appropriately.
We had the option of buying one superfast kick-ass ridiculous webserver or load balance several pretty fast, fairly kick-ass servers. We opted for the cluster because with our traffic and the content we have, it makes much more sense. If a box fails, it's not catastrophic. In our day to day, yes, it is a pain in the ass to update the code but the heart and soul of our site is our databases and with the amount of new content we are generating daily, it serves us best to address speed issues with relatively cheap hardware than with having to deal with ballooning redundancies in the db.
I haven't run into anybody that is a hardcore normalization nut with no room to budge on the matter. Most understand that you normalize as far as it makes sense to. It has little to do with hard drive space and more to do with a managable and flexible database design.
Then:
1. Are you able to get acceptable performance without denormalizing (indexes, clusters)?
2. Will denormalization make the system be less reliable?
3. Will system performance after denormalizing still be unacceptable?
Any denormalized database creates administrative overhead. This includes but is not limited to documenting decisions and ensuring data integrity across dupes programatically. Storage shouldn't be a reason for denormalization. If it is a primary justification, it is an excuse for lazy (and often poor) design. Take into account the I/O saved, CPU saved, complexity of update programming, cost of returning to a normalized design and end user performance.
Denormalization is what is says: denormalization. It is a tool in database design, not the design itself. Normalize it first, then look to increase performance by looking at all the tools available to you. Denormalization has higher maintenance costs (often in people time) than other techniques. Furthering what Tom said, most applications won't need analysis after normalization. Chances are, if yours is one of them, you've already got a DBA who knows the trade-offs. If you don't need performance tweaks, the benefits from normalization will far outweigh glorified Excel spreadsheets.
But in the real world, where you need to generate accurate reports from reliable information about sales, inventory and customers, you favor reliability over performance every time. I don't normalize data because I'm looking to conserve disk space - I do it because I'm trying to preserve consistency and accuracy.
It's already enough that you have to worry whether USERS are entering accurate information. You shouldn't also have to worry about the app keeping track of it consistently across multiple redundant tables.
On complexity, a lot of developers I've worked with take the attitude that that programmers should make the extra effort in order to produce the most efficient program. But, complexity has serious consequences: less portability (see pointer arithmetic), more bugs, and far slower development.
If efficiency was really king, we'd develop web applications in C or even assembler. Cheap disk space, processor time, and memory lets us us high level tools and practices (like normalization) to build bigger and more robust applications that we could otherwise.
the bigger picture is much more interesting, worthy of discussion, and is an opportunity for real creativity in database/app design.
The thing is, there isn't a tremendous amount of room for creativity in database design. Most of the problems have not only been solved, they've been solved for 25 or 30 years. Normalizing your data is still best practice in most cases.
It makes more sense for simplicity, accuracy, and data integrity to keep one piece of data in one place. This applies to markup as well: think back to the web before PHP was common (or JSP, or ASP, or whatever) and each and every page needed to be altered to fix one change. Modifying 13 tables or thousands of records is even more nightmarish.
Moreover, what is the performance cost of all the triggers (or other mechanism used to similar effect) necessary to enforce integrity constraints within your denormalized database? "Ensure consistency in the application logic" is not going to be free.
You can jump out of a plane without a parachute and say parachutes are for sissies and they're a heavy burden. That might make sense if you're just 10 feet above water and the plane is not moving too fast. In most cases however, I know which side of the "tradeoff" I want to be on... Normalizing or not is a bit more complex than how you described it.
I don't think there has ever been a software project that, when it changed hands, the new coders told the end client how great a job the previous programmers did and what a wise set of choices they made! So I just try to comment any time I'm doing something unorthodox and hope that the next guy will sympathize with me.
Well said.
And yet, we have this discussion. Can't be all that devoid of creativity...
The primary goal in preventing redundancy through normalization is not to save storage space but to enforce data integrity.
Yeah, I get it...like 20 people have said this. Doesn't anyone read the thread before posting?
The current discussion has really been about welcoming new users to the complex world of databases (which many programmers take for granted) rather than any creative discussion (Jason, I dont mean that in a snarky way!) The cutting edge stuff in the database world these days is probably around stuff like data/design optimization for hardware clusters.
The system described in the slides is scalable for reads but that isnt hard to do - you can always build downstream read-only slaves and in fact this hub-and-spoke model is how many large businesses provide data access to many of their departments. The thing that is glossed over is that as transaction volume increases (even assuming the percentage of I/U/Ds stays the same) there is no equivalent easy way to replicate the Master. The term "Replication" as it is used in the presentation is not what is meant by Replication in the database world. Creating read-slaves is just a copy. Replication involves multiple, distributed Masters with all the design and conflict resolution that that involves.
Cal does recognize this when he writes "We can scale horizontally, at least for a while"
The only reason there are so many posts is that your readership is loyal and want to set you straight. By the way, what happens when a blogger changes his or her opinion?
Yes, every time a newbie db designer starts asserting statements, we have this discussion. Every single dba has gone through this.
Yes, hard drives are cheap.
Yes, processors are cheap.
Yes, bandwidth is cheap.
Quality of service and scalability, however, are not cheap. Assume the following table:
PHOTOID int NOT NULL PRIMARY KEY,
FILENAME varchar(255) NOT NULL,
FILESYSTEM varchar(255) NOT NULL
vs this table:
PHOTOID int NOT NULL PRIMARY KEY,
FILESYSTEMID int NOT NULL,
FILENAME varchar(255) NOT NULL
Assume you have an index on all three columns. Inserting a row in the first table will have a non-trivial amount of overhead (for updating the index on the FILESYSTEM varchar column, as opposed to updating an int column in the second table). While processors are certainly fast, this sort of wasteful database design lends itself to enormous scalability problems. What happens when 10 inserts per second are happening on that table? The overhead of updating a varchar index will degrade your quality of service, as more and more queries get thrown at that table.
And just throwing another machine at the database isn't exactly that easy. Ask any MySQL developer about creating a cluster of database that can all handle inserts. They'll laugh at you, because (as of version 4.X) it can't be done. You can do it in Oracle, but then you have to deal with the cost of Oracle licenses, which aren't exactly cheap.
If you're concerned that the (select filesystem.filesystemname,photo.* from filesystem,photo where photo.filesystemid = filesystem.filesystemid) query would run slow, don't be. Most modern databases will cache those sorts of joins in memory, or if you're really concerned then just build a view that encompasses that query, and select from the view. You can additionally do logic on the application server to cache those results at the app server layer. Which gets back to my original point-- scalability. A well-built database (with correct normalization) will scale much much better than a database with just flat data.
the actual strength of a database lies its individual design and relationship schema. if the DB designer can look far enough ahead to include flexibility and inter-connectivity in his DB, chances are a little redundancy will not affect the final implementation.
normalization is almost never practiced in lower level DB2 based databases (especially those running on mid-range servers), simply because processing speed and hard disk space factors are not a concern.
By the way, what happens when a blogger changes his or her opinion?
There is a theory which states that if any blogger discovers that he is wrong and changes his opinion, that he will instantly disappear and be replaced by another blogger that is even more bizarre and inexplicable. Then there is a theory which states that this has already happened.
This was just a throwaway line used to make a point (it was a powerpoint bullet in a talk on PHP, not a position paper on DB design). None of us advocate doing things all crazy and stupid. And selective de-normalization (underscoring the point that things are normalized to begin with) is a design decision.
We're not using a database to store and manipulate relational data, we're using it to make an app run. I mean, of *course* we're doing the former, but only as a means to doing the latter. (Sort of like saying that it's not computer science, it's making a product. Or, I'm not putting one foot in front of the other while continuously falling, I'm walking to the store to get some milk.)
Also - you can do some pretty fancy things with replication: replicated data doesn't need to use the same table types or even necessarily the same tables - you can run a de-normalized mirror of your primary database just for searching, which is pretty nifty.
Also - you don't just go de-normalizing everything, and don't touch things that are hard to reliably manage in business logic. Flickr is definitely not a bank or a weapons control system or anything like that, but people's photos (and their privacy preferences and all of that) are still important. When you have paying customers, it's never really ok to screw up :)
Also: Slow? Is not!
(OK .... sometimes image serving is slow, and we're working on that - it's a symptom of needing to scale faster than we originally imagined. But that has nothing to do with the database - even pages that are doing a lot of DB hits render sub-second.)
I believe most people find it obvious that you don't store data in an RDBMS to conserve disk space. That would be stupid. And neither because you have extreme speed requirements, in which case misusing SQL is the *wrong* thing to do, the right thing would be to look beyond that and using some distributed embedded engine.
Just because this guy made a big shot web site, burning a lot of cash along the way, doesn't make him right.
And Jonas - "The whole *point* of using a RDBMS is to keep all logic that ensures consitent data in *one* central point." No, see, that's idiotic. If that was the *point* of you using a RDB then who was paying you? And what for? The *point* comes from the application.
Contrary to what rg assumes, there is no such absolute dichotomy. You can be a hotshot programmer and a terrific DB architect :)/
And you'd normalize your database unless there were specific, compelling scenarios in which denormalization is a better alternative.
Case in point:
Take a database, large number of interrelated tables, optimally indexed. If you wish to make a large number of those fields in those tables searchable, it is not practical in MySQL to do a combination of SELECTs and UNION ALLs across them. Even with UNION ALL, where MySQL is not supposed to sort through the data, each additional UNION ALL adds significant overhead and the time taken for the query increases non-(worse than)-linearly.
For a solution, now we can consider trading storage space for speed, namely by pulling searchable data out of those tables into a UNION'ed table, index the fields on it, and just query THAT when needed. The search table can be recreated whenever need be, and will not be relied upon for other uses, so there is little problem with data inconsistency.
Knowing when there are exceptions to the guidelines of data normalization is the wisest position. Spouting ridiculous and somewhat pretentious (either that, or at least intentionally inciteful) mantras such as "Normalization is for sissies" is as much a admission of newbiehood as is a statement such as "Always normalize, no matter what the situation". Which no one ever really says, anyway. It's a straw man.
And Mr. Kottke, in algorithms class, way back when, we learned that no matter how much you increase the quality and quantity of machines you have working for you, they can never replace a bad algorithm. Unless you can disprove that, I don't think the majority of your veteran DB design readership will find Cal Henderson's presentation very compelling: 13 SELECTs per I/U/D is ridiculous, when you could interrelate all your tables (heh, normalize) such that you don't need a 13 separate database transactions overheading an I/U/D. You might not even need any.
I'm glad Stewart somewhat clarified what they're doing over at Flickr. It appears that it's mostly senor Kottke's take on Cal's presentation that is the problem. There are fewer issues at stake here than this thread would immediately imply; most counterargument is directed to Jason's interpretation and hasty conclusion, (hopefully) rather than a haphazard implementation of a database architecture in use at Flickr.
Let me tell you a story. I used to work for an academic institution where the data was (and still is) embarrassingly de-normalised. In fact, I hesitiate to use that phrase as it implies that once upon a time in the dim and distant mists of the past it approached normalisation. I can guarantee that this is as far from the truth as the data is from "normal".
So, that's the setup. Let's chat about the data. It holds student records. So badly that it was possible for students to be marked as graduates WITHOUT THEIR NAME BEING PASSED TO ANYBODY FOR INCLUSION IN THE GRADUATION MAILING LIST. This would never be caught until the student called up to ask why they had their pass results but hadn't been told that the graduation was last week...
They also created a new record for each student FOR EACH YEAR OF STUDY. They did this by duplicating records and changing the year. Wanna change something across a post-doc student's history? That's potentially 10 years worth of massively duplicated data. And they didn't have a system for doing this in a managed way.
Oh, and the real kicker? Each department had rw access to this database, and they all had different interfaces and different limits on what they could and couldn't edit. Want to add a new table to reference a student's background records? Gotta remember to grant the right access to the right people - and if you forget? No checks. And if THEY forget (becuase, of course, they got no training - how do you train someone to nail jelly to the wall?) then potentially everyone's data's screwed.
Trust me, just because denormalisation is possible and - in some, specifically defined and understood circumstances - can actually lead to performance improvements, that doesn't mean we should say "Aha! Someone important said it doesn't matter! So it must be okay?"
I once logged into our server as root and didn't type in anything dangerous, though I was terrified. Does that mean I should start handing out root logins to all our clients as their FTP details? I mean, it would make the password managment process so much simpler, right?
You must realize that denormalizing won't save you any time during inserts and updates, and will likely create extra overhead for those operations. Is this worth it? Before you say yes, test your database's performance, QUANTITATIVELY, before and after denormalizing. It all depends on your particular data and usage, but I think the cases where denormalizing helps are very specific and very rare. And in most of these cases the tiny performance gain isn't even worth the extra bugs and coding hassle.
This is the same argument we used to have about tables and css. There will always be purists, and those who simply don't care, and then there is the larger middle ground. Because in the real world, solutions are rarely perfect and as is the case with all design, the end result is usually a compromise between flexibility and speed.
I think Jason had a point in bringing up the fact that denormalization is a valid performance tuning technique. Its not often discussed. Its not so much the conclusion drawn here, but the fact that the discussion was published so nubies can find it later. ;)
All things aside, knowing with 100% certainly that all unique values are, in fact, unique and all keys relate properly to data in related tables provides a level of certainty that is the whole point of storing data in a relational database. If you have normalized data, you can create complex queries at will without worry. Your results will be accurate and you will save a lot of time spent wondering about the accuracy of your results.
I manage an esoteric contract with an MS Access database (tables and a few stored procedures only) and an ASP front end. The size of the database is not large, but the relationships border on the bizarre at times. This is a reflection of how things are run at the management level and I have to mimic this in a structural way. Without hard-core normalization, I would have shot myself by now trying to roll out user modules. I can confidently use a variety of SQL statements and session variables to fire up all kinds mini-applications that keep my users happy (ish ;-). At times I am required to export data to spreadsheets and always there are questions. Any of these exports I have been able to move on-line in real time cause the questions disappear. This is a human factor. This is what I design for.
So I don't have experience with huge databases and can see a point where performance might be an issue, so I can't rule out de-normalization in these cases. If I had to do that, I would probably allow data retrieval and exporting from the de-normalized data, but never any updates to the core data itself. I may not be reading the issue correctly and I apologize if I have glossed over anyone's valid argument.
Note: I know a lot of you will scoff at MS Access, but by keeping the design tight and the file free of code, I can honestly say that most of you would be very surprised to find out how stable it can be. If you store tables only, you actually rarely, if ever, need to compact/repair your database, especially with a browser-based front end.
It's about making sure you don't have information about the same thing in 2 places and with different values.
The lighter analogy doesn't address this issue at all. A better analogy would be having one calendar at home, another one in your car and yet another one at your office so you would always have one at hand. But you would never be sure which one has the correct information.
How totally stupid and unuseful can you get? Sure, we could play the 'my abstraction level is superior to yours' game all day ("no you're not, you're using it to allow people to share pictures with each other over the internet! neener neener!") but who cares?
The point of databases from a programmatic view is to provide an optimal data storage experience. In the experience of people who have implemented thousands more databases than you, including web sites on the Internet that are or were hundreds of times more popular than Flickr, data normalization is the best way unless (a) the db implementation is pretty bad (e.g., mysql), or (b) you need to do an optimization and you don't have the option of paying for a db service. Note that of the 58 mysql databases I've put up, only 5 required munging. Flickr-like functionality is utterly not within that realm.
Well said. This is kind of a niche arguement, apparently, but lack of normalization makes relational analysis of data REALLY difficult. If you're into that kind of work, of course. For example, let an average user get a copy of Access, and they'll wind up building a database to store all kinds of neat stuff, like serial murder investigations and terrorist money laundering. Problem is, the data is often stored in such a nasty, denormalized, unrelational form that analysis, six degrees of Kevin Bacon style, is practically impossible. We have plenty of data, we're just powerless to do anything useful with it.
True. This thread is mostly about DB guys defending their turf. I would advise Stewart to stay out since this isnt really about Flickr. :) Based on a set of slides and bare sketches, I don't think anyone here is qualified to judge further on whether they are doing the right thing. Cal is a smart guy and just based on his track record, what they are doing is probably the right decision.
Of course, denormalization is the right thing to do under certain circumstances. I think there is agreement on that too. The only "problem" is that Database Design is a very mature field and Jason stepped right in and made some uninformed comments which he presented as a revelation. Saying that normalization has anything to do with "disk space" is like saying, well, good web design has anything to do with disk space - its a complete non sequitur (Yes, I know its been mentioned thirty times in this thread...but thats the core of why everyone here is so agitated :) )
You'll have to normalise the commentators first ;)
That's what you get when professional dbas go slumming into Amateur Hour. Most of the neo-dbas around here think of SQL as insert/update/delete/select only because that's all MySQL can do. ;)
But it's good to see that sub-selects finally hit production grade in MySQL. ;)
And that stored procedures, views, pseudo-triggers, etc, for MySQL are planned for late 2005. ;)
Managing data, particularly ensuring that it is captured appropriately and then KEPT consistent, is often ignored until things are horribly broken. Nothing causes confusion like someone using out-of date data/classifications/structures etc. because they weren't concerned with data integrity/use.
The university example is great, and many of us can tell you similar horror stories...
In my opinion, the two are meant for different audiences, and should be designed differently (for instance, you'd want to denormalize when building your fulltext indices)
Btw, I like boobies.
Normally, you should try to keep all data non-redundant (what is called "third normal form" in database theory). However, do not be afraid to duplicate information or create summary tables if necessary to gain more speed.
Plus as Dave Fitch remarks, removing redundant data is only one advantage of a normalised database; duplicated data increases the chances that you're not pulling out consistent or integrally assured data.
I think for the sort of massive systems being mooted, a far better approach would be to abandon relational databases completely and instead use serialised objects. Then you're really talking about massive speed increases.
We would like to ship you a lifetime supply of lighters, however we have 17 different address for you in our database. Can you please verify your shipping address with us?
Sincerely,
Denormalized Lighters, Inc.
As many others have already pointed out, what happens when each location of the "same" data is no longer the same? For one reason or another there will be discrepancies and when that happens how much of your database can you continue to trust?
Best of luck in your recovery, Dr. Kottke, but keep in mind that the dbas speaking out here are not adherent to a faith so much as they dislike flawed reasoning. (We could discuss how all science is religion, but that's somewhat beyond the scope of this thread.)
"Most of the meltdown was due to a monolithic application addressing everything in a single, monolithic database. One application held all the applications of eBay, with the exception of search.
But it didn't work."
This thread is closed to new comments. Thanks to everyone who responded.

