Normalization | Smoak Signals | Data Analytics Blog

The practice of normalization is widely regarded as the standard methodology for logically organizing data to reduce anomalies in database management systems. Normalization involves deconstructing information into various sub-parts that are linked together in a logical way. Malaika and Nicola (2011) state, “..data normalization represents business records in computers by deconstructing the record into many parts, sometimes hundreds of parts, and reconstructing them again as necessary. Artificial keys and associated indexes are required to link the parts of a single record together.“ Although there are successively stringent forms of normalization, best practice involves decomposing information into the 3^rd normal form (3NF). Subsequent higher normal forms provide protection from anomalies that most practitioners will rarely ever encounter.

Background

The normalization methodology was the brainchild of mathematician and IBM researcher Dr. Edgar Frank Codd. Dr. Codd developed the technique while working at IBM’s San Jose Research Laboratory in 1970 (IBM Archives, 2003). Early databases employed either inflexible hierarchical designs or a collection of pointers to data on magnetic tapes. “While such databases could be efficient in handling the specific data and queries they were designed for, they were absolutely inflexible. New types of queries required complex reprogramming, and adding new types of data forced a total redesign of the database itself.” (IBM Archives, 2003). In addition, disk space in the early days of computing was limited and highly expensive. Dr. Codd’s seminal paper “A Relational Model of Data for Large Shared Data Banks” proposed a flexible structure of rows and columns that would help reduce the amount of disk space necessary to store information. Furthermore, this revolutionary new methodology provided the benefit of significantly reducing data anomalies. These aforementioned benefits are achieved by ensuring that data is stored on disk exactly once.

Normal Forms (1NF to 3NF):

Normalization is widely regarded as the best practice when developing a coherent flexible database structure. Adams & Beckett (1997) state that designing a normalized database structure should be the first step taken when building a database that is meant to last. There are seven different forms of normalization; each lower form is a subset of the next higher form. Thus a database in 2^nd normal form (2NF) is also in 1^st normal form (1NF), although with additional satisfying conditions. Normalization best practice holds that databases in 3^rd normal form (3NF) should suffice for the widest range of solutions. Adams & Beckett (1997) called 3NF “adequate for most practical needs.” When Dr. Codd initially proposed the concept of normalization, 3NF was the highest form introduced (Oppel, 2011).

A database table has achieved 1NF if does not contain any repeating groups and its attributes cannot be decomposed into smaller portions (atomicity). Most importantly, all of the data must relate to a primary key that uniquely indentifies a respective row. “When you have more than one field storing the same kind of information in a single table, you have a repeating group.” (Adams & Beckett, 1997). A higher level of normalization is often needed for tables in 1NF. Tables in 1NF are often subjected to “data duplication, update performance degradation, and update integrity problems..” (Teorey, Lightstone, Nadeau & Jagadish, 2011).

A database table has achieved 2NF if it meets the conditions of 1NF and if all of the non-key fields depend on ALL of the key fields (Stephens, 2009). It is important to note that tables with only 1 primary key that satisfy 1NF conditions are automatically in 2NF. In essence, 2NF helps data modelers determine if 2 tables have been combined into one table.

A database has achieved 3NF if it meets the conditions of 2NF and it contains no transitive dependencies. “A transitive dependency is when one non‐key field’s value depends on another non‐key field’s value” (Stephens, 2009). If any of the fields in the database table are dependent on any other fields, then the dependent field should be placed into another table.

If for example field B is functionally dependent on field A, (e.g. A->B), then add field A and B to a new table, with field A designated as a key which will provide linkage to the original table.

In short, 2NF and 3NF help determine the relationship between key and non-key attributes. Williams (1983) states, “Under second and third normal forms, a non-key field must provide a fact about the key, just the whole key, and nothing but the key.” A variant of this definition is typically supplemented with the remark “so help me Codd”.

Benefits:

Adams & Beckett (1997) assert that the normalization method provides benefits such as database efficiency & flexibility, the avoidance of redundant fields, and an easier to maintain database structure. Hoberman, (2009) adds that normalization provides a modeler with a stronger understanding of the business. The normalization process ensures that many questions are asked regarding data elements so these elements may be assigned to entities correctly. Hoberman also agrees that data quality is improved as redundancy is reduced.
Assessment

Although engaging in normalization is considered best practice, many sources advocate that normalization to 3NF is sufficient for the majority of data modeling solutions. Third normal form is deemed sufficient because the anomalies covered in higher forms occur with much less frequency. Beckett & Adams (1997) describe 3NF as “adequate for most practical needs.” Stephens (2009) affirms that many designers view 3NF as the form that combines adequate protection from recurrent data anomalies with relative modeling ease. Levels of normalization beyond 3NF can yield data models that are overly engineered, overly complicated and hard to maintain. The risk inherent in higher form constructs is that the performance can degrade to a level that is worse than less normalized designs. Hoberman (2009) asserts that, “Even though there are higher levels of normalization than 3NF, many interpret the term ‘normalized’ to mean 3NF.”

There are examples in data modeling literature where strict adherence to normalization is not advised. Fotache (2006) posited that normalization was initially a highly rigorous, theoretical methodology that was of not much practical use to real world development. Fotache provides the example of an attribute named ADDRESS, which is typically stored by many companies as an atomic string per 1NF requirements. The ADDRESS data could be stored in one field (violating 1NF) if the data is only needed for better person identification or mailing purposes. Teorey, Lightstone, Nadeau, & Jagadish (2011) advise that denormalization should be considered when performance considerations are in play. Denormalization introduces a trade off of increased update cost versus lower read cost depending upon the levels of data redundancy. Date (1990) downplays strict adherence to normalization and sets a minimum requirement of 1NF. “Normalization theory is a useful aid in the process, but it is not a panacea; anyone designing a database is certainly advised to be familiar with the basic techniques of normalization…but we do not mean to suggest that the design should necessarily be based on normalization principles alone” (Date, 1990).
Conclusion

Normalization is the best practice when designing a flexible and efficient database structure. The first three normal forms can be remembered by recalling a simple mnemonic. All attributes should depend upon a key (1NF), the whole key (2NF) and nothing but the key (3NF).

The advantages of normalization are many. Normalization ensures that modelers have a strong understanding of the business, it greatly reduces data redundancies and it improves data quality. When there is less data to store on disk, updating and inserting becomes a faster process. In addition, insert, delete and update anomalies disappear when adhering to normalization techniques. “The mantra of the skilled database designer is, for each attribute, capture it once, store it once, and use that one copy everywhere” (Stephens 2009).

It is important to remember that normalization to 3NF is sufficient for the majority of data modeling solutions. Higher levels of normalization can overcomplicate a database design and have the potential to provide worse performance.

In conclusion, begin the database design process by using normalization techniques. For implementation purposes, normalize data to 3NF compliance and then consider if data retrieval performance reasons necessitate denormalizing to a lower form. Denormalization introduces a trade off of increased update cost versus lower read cost depending upon the levels of data redundancy.

From the DAMA Guide to the Data Management Body of Knowledge:

First normal form (1NF): Ensures each entity has a valid primary key , every data element depends on the primary key, and removes repeating groups , and ensuring each data element is atomic (not multi-valued).
Second normal form (2NF): Ensures each entity has the minimal primary key and that every data element depends on the complete primary key.
Third normal form (3NF): Ensures each entity has no hidden primary keys and that each data element depends on no data element outside the key (―the key, the whole key and nothing but the key).

Glossary:

Delete Anomaly: “A delete anomaly is a situation where a deletion of data about one particular entity causes unintended loss of data that characterizes another entity.” (Stephens, 2009)

Denormalization: Denormalization involves reversing the process of normalization to gain faster read performance.

Insert Anomaly: “An insert anomaly is a situation where you cannot insert a new tuple into a relation because of an artificial dependency on another relation.” (Stephens, 2009)

Normalization: “Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.” (Microsoft Knowledge Base)

Primary Key: “Even though an entity may contain more than one candidate key, we can only select one candidate key to be the primary key for an entity. A primary key is a candidate key that has been chosen to be the unique identifier for an entity.” (Hoberman, 2009)

Update Anomaly: “An update anomaly is a situation where an update of a single data value requires multiple tuples (rows) of data to be updated.” (Stephens, 2009)

Bibliography

Adams, D., & Beckett, D. (1997). Normalization Is a Nice Theory. Foresight Technology Inc. Retrieved from http://www.4dcompanion.com/downloads/papers/normalization.pdf

DAMA International (2009). The DAMA Guide to the Data Management Body of Knowledge (1st Edition).

Fotache, M. (2006, May 1) Why Normalization Failed to Become the Ultimate Guide for Database Designers? Available at SSRN: http://ssrn.com/abstract=905060 or http://dx.doi.org/10.2139/ssrn.905060

Hoberman, S. (2009). Data modeling made simple: a practical guide for business and it professionals, second edition. [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=34408.

IBM Archives (2003): Edgar F. Codd. Retrieved from http://www-03.ibm.com/ibm/history/exhibits/builders/builders_codd.html

Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory. Communications of the ACM 26(2). Retrieved from http://www.bkent.net/Doc/simple5.htm

Malaika, S., & Nicola, M. (2011, December 15). Data normalization reconsidered, Part 1: The history of business records. Retrieved from http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/

Microsoft Knowledge Base. Article ID: 283878. Description of the database normalization basics. Retrieved from http://support.microsoft.com/kb/283878

Oppel, A. (2011). Databases demystified, 2nd edition. [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=72521.

Stephens, Rod. (2009). Beginning database design solutions. [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=29584.

Teorey, T. & Lightstone, S. & Nadeau, T. & Jagadish, H.V.. ( © 2011). Database modeling and design: logical design, fifth edition. [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=41847.

Image courtesy of David Castillo Dominici at FreeDigitalPhotos.net

Smoak Signals | Data Analytics Blog

Sharing Knowledge: Data, Visualization & Business

Tag / Normalization

Normalization: A Database Best Practice