BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Data normalization, is it really that good?

Data normalization, is it really that good?

Normalization is one of the corner-stones of database design. Recently some discussion emerged on the need for normalization suggesting denormalization as a more scalable solution.

A couple of weeks ago Pat Helland blogged about a short presentation he gave named "Normalization is for Sissies". Pat started by stating (the obvious) that we normalize to eliminate update anomalies. but then proceeded to make a couple interesting observations:
  • Many types of computing problems are append only in nature or in his words "Accountant don't use erasers". Hence there is no need to normalize immutable data (unless you really need the space)
  • Developers work with "complete" business objects. These Business Objects are denormalized complete views anyway
It's easy to dismiss Pat as just another blogger - until you remember that Pat has a long history with database systems including for example servicing as the chief architect for SQL Server's Service Broker and as one of the co-founders of the COM+ team. Pat also spent a couple of years or so working for Amazon - and indeed Werner Vogels (Amazon's CTO) mentioned that you can consider denormalizing entities to get better scalability and performance in his QCon presentation on Availability and Consistency (recently published here).

Andres Aguiar commented that in addition to the "no need to normalize immutable data", you don't even need to delete:
Another [idea] is that you actually don't need to delete/update the database. 'Deleting' a row means setting a InvalidationTimestamp = now(), and updating a row means setting InvalidationTimestamp = now() and inserting a new row with SinceTimestamp = now() and InvalidationTimestamp = null (you actually need two sets of dates, but that's for another post). Now, if you put the two ideas together, all the data is immutable, so you don't need to normalize anything.
Analyzing this, one barrier Andres mentioned was that databases are not ready for the large number of columns per table such a scenario will generate, which is solvable if you apply archiving to older data.

Dare Obasanjo thinks that
"Database denormalization is the kind of performance optimization that should be carried out as a last resort after trying things like creating database indexes, using SQL views and implementing application specific in-memory caching. "
However Dare agrees that when you need massive scale you probably need to denormalize and give Flickr as an example.

As Cal Henderson (Flickr's web development lead) mentioned in a presentation he gave in 2004 on Flickr architecture that Joins are slow (slide 27) and as a result:
* Normalised data is for sissies
* Keep multiple copies of data around
* Makes searching faster
* Have to ensure consistency in the application logic
A 2004 blog post by Jason Kottke, which bares the same title like Pat's post and Cal's first point discusses Cal's presentations. Jason explains that in Flickr's case there is a ration of 13 Selects to each Insert/Delete or Update which is why they chose to denormalize.

To sum up as Pat finished his presentation:
    "people normalize 'cuz their professor said to"
Is it time yet to rethink the mantra of database normalization? Some of the larger web-sites seem to think so, what's your take?
BT