How to Dynamically Pivot Data in SQL Server

 

SQL is the lifeblood of any data professional. If you can’t leverage SQL and you work with data, your life will be more difficult than it needs to be.

In this video I am using SQL Server Express to turn a simple normalized dataset into a pivoted dataset. This is not a beginner video as I assume you are familiar with basic SQL concepts.

T-SQL is Microsoft’s SQL language that contains additional functions and capabilities over and above ANSI standards. We’ll use some of these functions to turn the following data set that displays average rents in major American cities into a pivoted denormalized dataset.

The City values in the City column will become individual columns in a new pivoted dataset with their respective Average Rent values appearing underneath.

We’re going to transform this:

Normalized Data

Into this:

Pivoted Data

Notice how the city values are now column heads and the respective Average Rent values are underneath.

Make sure you watch the video but here is the code used in the example.

IF OBJECT_ID('tempdb..##TBL_TEMP') IS NOT NULL
DROP TABLE ##TBL_TEMP

--This parameter will hold the dynamically created SQL script
DECLARE   @SQLQuery AS NVARCHAR(MAX)

--This parameter will hold the Pivoted Column values
DECLARE   @PivotColumns AS NVARCHAR(MAX)

SELECT   @PivotColumns= COALESCE(@PivotColumns + ',','') + QUOTENAME([City])
FROM [dbo].[tbl_Rent]

/* UNCOMMENT TO SEE THE NEW COLUMN NAMES THAT WILL BE CREATED */
--SELECT   @PivotColumns

--Create the dynamic query with all the values for
--pivot column at runtime
--LIST ALL FILEDS EXCEPT PIVOT COLUMN

SET   @SQLQuery =
   N'SELECT [City Code],[Metro],[County],[State],[Population Rank],' +   @PivotColumns + '
   INTO ##TBL_TEMP
   FROM [dbo].[tbl_Rent]
   
   PIVOT( MAX([Average Rent])
      FOR [City] IN (' + @PivotColumns + ')) AS Q'

/* UNCOMMENT TO SEE THE DYNAMICALLY CREATED SQL STATEMENT */
--SELECT   @SQLQuery
--Execute dynamic query
EXEC sp_executesql @SQLQuery

/* VIEW PIVOTED TABLE RESULTS */
Select * from ##TBL_TEMP

 

Big shoutout to StackOverflow for help with this example.

 

Advertisements

How to Fix an Import Specification Error in Microsoft Access

There are certain aspects of Microsoft Access that can be downright frustrating and puzzling to debug. I want to share a tip with you that will hopefully save you hours of frustration. There is nothing more foundational than importing data into Microsoft Access so most likely you’ll appreciate the fix for this run-time error if you are attempting to use VBA.

If you encounter the following Microsoft Access Error:

“Run-Time error ‘3625’: The text file specification ‘My Saved Access Import Spec’ does not exist. You cannot import, export, or link using that specification.”

Most likely you have confused a saved set of “import steps” with a saved “Import/Export specification” while trying to use the Docmd.TransferText command; or at least I did.

Consider the following sample VBA code that uses the Docmd.TransferText command to import a delimited file (from a path stored in string variable strInputFileName) into a table named “tbl_Access_Import_Data” using an import specification.

Private Sub cmd_Import_Table_Click()

Dim strInputFileName As String
'Set Path to Local CSV File. This file will be imported into an Access Table.
strInputFileName = "C:\Users\Desktop\Access Data\Access_Import_Data"

' Use a Macro to Import a delimited file
' "My Saved Access Import Spec" = Import Spec
' "tbl_Access_Import_Data" = Destination Access Table
' strInputFileName = hardcoded path to source csv file

DoCmd.TransferText acImportDelim, "My Saved Access Import Spec", "tbl_Access_Import_Data", strInputFileName

End Sub

4. Error 3625 Edited 2

Let me show you where I went off track. I saved “import steps” and then tried to reference the saved “import steps” with the Docmd.TransferText method. You cannot reference “import steps” with this method, only “Import/Export specifications”.

1. Import Text Wizard Edited

I used the Import Text Wizard to define and delimit the columns in a specified .csv file and indicated the table I desired to have that data imported into. Afterwards, I pressed the finish button.

2. Import Text Wizard Blurred

Once I hit “Finish”, on the very next screen I saved the “import steps” that I previously defined. Notice the verbiage next to step 1 (i.e. “Save import steps”).

3. Saved Fake Spec Blurred

As you can see above, I created a saved “import step” erroneously named “My Saved Access Import Spec”. This name was the value that I erroneously passed to the Docmd.TransferText method in code.

4. Error 3625 Edited 2

These actions result in ‘Run-time error 3625’ that we will fix.

5. Import data Secification Edited

In order to save a legitimate Import/Export specification that can be successfully referenced with the Docmd.TransferText method, make sure to hit the “Advanced” button before you hit “Finish” when you come to the last window of the Import Text Wizard.

Make sure to hit “Save As” (Step 2 above) on the right hand side of the window.

6. Capture Edited

At this point, name and then save your true Import/Export specification name and hit “OK”.

Now when you come to the same window again you can hit the “Specs…” button to observe the names of all of the saved Import/Export specifications.

7. Specs Button Edited

In the pic above I only have 1 Import/Export specification named “My Real Saved Access Import Spec”.

7.5 Import Complete Edited 2

Observe, once the true Import/Export specification is referenced in VBA code, the code executes as intended.

Additional Tips

I am not aware of how to edit Import/Export specifications. The best advice that I have is to recreate and then overwrite the existing specification or save the new revised specification with a different name.

If you place the following SQL code in a blank Select Query, you can view all the true specification names along with field names and respective field widths.

SELECT
MSysIMEXSpecs.SpecName,
MSysIMEXColumns.FieldName,
MSysIMEXColumns.Start,
MSysIMEXColumns.Width,
MSysIMEXColumns.SkipColumn
FROM MSysIMEXColumns
INNER JOIN MSysIMEXSpecs
ON MSysIMEXColumns.SpecID = MSysIMEXSpecs.SpecID

ORDER BY MSysIMEXSpecs.SpecName,
MSysIMEXColumns.Start,
MSysIMEXColumns.Width;

8. SQL Results Edited

The results of that query from my example database are shown above. All due credit goes to stackoverflow for this SQL tip.

https://stackoverflow.com/questions/34295360/the-text-file-specification-does-not-exist-when-importing-into-access

3. Saved Fake Spec Blurred

Furthermore, if you are intent on referencing saved import steps in VBA code (not to be confused with the aforementioned Import/Export specification), then use the Docmd.RunSavedImportExport method.

To execute the “import step” shown in the picture above using VBA, I would use the following command:

DoCmd.RunSavedImportExport "My Saved Access Import Spec"

I hope this helps solve your “how to fix Run-Time error 3625 in Microsoft Access” question. Good luck!

Return Unmatched Records with SQL and Microsoft Access

Over the course of many years of building SQL scripts, I’ve tended to help SQL novices perform the set difference operation on their data. This post will not provide in-depth coverage on SQL run plans and tuning minutiae, but I do want to provide a high level overview for the novice.

If we define set A as the three numbers {1, 2, 3} and set B as the numbers {2, 3, 4} then the set difference, denoted as A \ B, is {1}. Notice that the element 1 is only a member of set A.

A picture is worth a thousand words as they say. A Venn diagram will be effective at illustrating what we’re trying to accomplish in this post.

Venn Diagram Difference

This blog post will cover using SQL and Microsoft Access to address capturing the shaded records in set A. If you have a database table named A and wanted to determine all of the rows in this table that DO NOT reside in another table named B, then you would apply the set difference principle.

LEFT OUTER JOIN & IS NULL SYNTAX

There are multiple ways to implement the set difference principle. It helps if there is a common join key between both sets of data when performing this analysis.

If I were working with two tables, one containing inventory data and one containing order data. I could write the following SQL script to return all the inventory rows that do not reside in the orders table.

SELECT table_inventories.* 
 FROM   table_inventories 
        LEFT OUTER JOIN table_orders 
                     ON table_inventories.id = table_orders.id 
 WHERE  table_orders.id IS NULL  

MICROSOFT ACCESS EXAMPLE

Consider the following tables in Microsoft Access. Observe that table_orders has fewer records than table_inventories.

Access Example Inventory Access Example Orders

We can construct a set difference select query using these tables to return all of the products in table_inventories that have not been ordered. Create a query in Microsoft Access in a similar fashion as shown below.

Access SQL Difference Join

The result of this query would produce the following two products that are not in table_orders.

Access Example Query Result

The Microsoft Access Query & View Designer would automatically generate the following SQL if you cared to open the Access SQL editor.

SELECT table_inventories.*
FROM   table_inventories
LEFT JOIN table_orders
ON table_inventories.id = table_orders.id
WHERE  (( ( table_orders.id ) IS NULL ));

Notice that LEFT JOIN is automatically created instead of LEFT OUTER JOIN. In Microsoft Access, the OUTER operation is optional. Also notice that Access loves to add additional parentheses for reasons known only to Microsoft.

Per Microsoft Access SQL Reference:

“Use a LEFT JOIN operation to create a left outer join. Left outer joins include all of the records from the first (left) of two tables, even if there are no matching values for records in the second (right) table [1].”

NOT EXISTS SYNTAX

Let’s step away from Microsoft Access for the remainder of this post. The NOT EXISTS approach provides similar functionality in a more performance friendly manner as compared to the LEFT OUTER JOIN & IS NULL syntax.

SELECT table_inventories.*
FROM   table_inventories
WHERE  table_inventories.id NOT EXISTS (SELECT table_orders.id
FROM   table_orders);

EXCEPT SYNTAX (T-SQL)

Alternatively, we could use the SQL EXCEPT operator which would also accomplish the task of returning inventory ids that do not reside in the orders table (i.e. inventory items that were never ordered). This syntax would be appropriate when using SQL Server.

SELECT table_inventories.id
FROM   table_inventories
EXCEPT
SELECT table_orders.id
FROM   table_orders

Per Microsoft:

“EXCEPT
Returns any distinct values from the query to the left of the EXCEPT operator that are not also returned from the right query [2].”

MINUS SYNTAX (ORACLE)

The following script will yield the same result as the T-SQL syntax. When using Oracle, make sure to incorporate the MINUS operator.

SELECT table_inventories.id
FROM   inventories
MINUS
SELECT table_orders.id
FROM   table_orders

Now take this tip and get out there and do some good things with your data.

Anthony Smoak

References:

[1] Access 2007 Developer Reference. https://msdn.microsoft.com/en-us/library/bb208894(v=office.12).aspx

[2] Microsoft T-SQL Docs. Set Operators – EXCEPT and INTERSECT (Transact-SQL). https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql

[3] Oracle Help Center. The UNION [ALL], INTERSECT, MINUS Operators. http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm

Venn diagram courtesy of http://math.cmu.edu/~bkell/21110-2010s/sets.html

The Need For Speed: Improve SQL Query Performance with Indexing

This article is also published on LinkedIn.

How many times have you executed a SQL query against a million plus row table and then engaged in a protracted waiting game for your results? Unfortunately, a poor database table indexing strategy can counteract the gains of the best hardware and server architectures. The positive impact that strategically applied indexes can provide to query performance should not be ignored just because one isn’t wearing a DBA hat. “You can obtain the greatest improvement in database application performance by looking first at the area of data access, including logical/physical database design, query design, and index design” (Fritchey, 2014). Understanding the basics of index application should not be eschewed and treated as an esoteric art best left to DBAs.

Make use of the Covering Index

It is important that regularly used, resource intensive queries be subjected to “covering indexes”. The aim of a covering index is to “cover” the query by including all of the fields that are referenced in WHERE or SELECT statements. Babbar, Bjeletich, Mackman, Meier and Vasireddy (2004) state, “The index ‘covers’ the query, and can completely service the query without going to the base data. This is in effect a materialized view of the query. The covering index performs well because the data is in one place and in the required order.” The benefit of a properly constructed covering index is clear; the RDBMS can find all the data columns it needs in the index without the need to refer back to the base table which drastically improves performance. Kriegel (2011) asserts, “Not all indices are created equal — If the column for which you’ve created an index is not part of your search criteria, the index will be useless at best and detrimental at worst.”

Apply a Clustered Index

More often than not, a table should have a clustered index applied so as to avoid expensive table scans by the query optimizer. It is advisable to create one clustered index per table preferably on the PRIMARY KEY column. In theory, since the primary key is the unique identifier for a row, query writers will employ the primary key in order to aid with record search performance.

“When no clustered index is present to establish a storage order for the data, the storage engine will simply read through the entire table to find what it needs. A table without a clustered index is called a heap table. A heap is just an unordered stack of data with a row identifier as a pointer to the storage location. This data is not ordered or searchable except by walking through the data, row by row, in a process called a scan” (Fritchey, 2014).

However, the caveat to applying clustered indexes on a transactional table is that the index must be reordered after every INSERT or UPDATE to the key which can add substantial overhead to those processes. Dimensional or static tables which are only accessed for join purposes are optimal for this indexing strategy.

Apply a Non-Clustered Index

Another consideration in regard to SQL performance tuning is to apply non-clustered indexes on foreign keys within frequently accessed tables. Babbar et al. (2004) advise, “Be sure to create an index on any foreign key. Because foreign keys are used in joins, foreign keys almost always benefit from having an index.”

Indexing is an Art not a Science

Always remember that indexing is considered an art and not a science. Diverse real world scenarios often call for different indexing strategies. In some instances, indexing a table may not be required. If a table is small (on a per data page basis), then a full table scan will be more efficient than processing an index and then subsequently accessing the base table to locate the rest of the row data.

Conclusion

One of the biggest detriments to SQL query performance is an insufficient indexing strategy. On one hand, under-indexing can potentially cause queries to run longer than necessary due to the costly nature of table scans against unordered heaps. This scenario must be counterbalanced by the tendency to over-index, which will negatively impact insert and update performance.

When possible, SQL practitioners and DBAs should collaborate to understand query performance as a whole; especially in a production environment. DBAs left to their own devices have the potential to create indexes without any knowledge of the queries that will utilize those indexes. This uncoordinated approach has the potential to render indexes inefficient on arrival. Conversely, it is equally important that SQL practitioners have a basic understanding of indexing as well. Placing “SELECT *” in every SQL query will negate the effectiveness of covering indexes and add additional processing overhead as compared to specifically listing the subset of fields desired.

Even if you do not have administrative access to the tables that constitute your queries, approaching your DBA with a basic understanding of indexing strategies will lead to a more effective conversation.

References

Babbar, A., Bjeletich, S., Mackman, A., Meier, J., & Vasireddy, S. (May, 2004). Improving .NET Application Performance and Scalability. Retrieved from https://msdn.microsoft.com/en-us/library/ff647793.aspx

Fritchey, Grant. ( © 2014). Sql server query performance tuning (4th ed.).

Kriegel, Alex. ( © 2011). Discovering sql: a hands-on guide for beginners.

SQL: Think in Sets not Rows

This article is also posted on LinkedIn.

Structured Query Language, better known as SQL, is regarded as the working language of relational database management systems (RDBMS). As was the case with the relational model and the concepts of normalization, the language developed as result of IBM research in the nineteen seventies.

Left to their own devices, the early RDBMSs (sic) implemented a number of languages, including SEQUEL, developed by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s while working at IBM; and QUEL, the original language of Ingres. Eventually these efforts converged into a workable SQL, the Structured Query Language” (Kriegel, 2001).

For information professionals and database practitioners, SQL is regarded as a foundational skill that enables raw data to be manipulated within a RDBMS. “This is a declarative type of language. It instructs the database about what you want to do, and leaves details of implementation (how to do it) to the RDBMS itself” (Kriegel, 2001).

Before the advent of commercially accessible databases, data was typically stored in a proprietary file format manner. Each vendor had detailed specific access mechanisms, which could not be easily configured and customized for access by alternate applications. As databases began to adopt the relational model, the arrival and eventual standardization of SQL by ANSI (American National Standards Institute) and ISO (International Standards Institute) helped foster access, manipulation and retrieval consistency across many products.

Think in Sets not Rows!

SQL provides users the ability to query and manipulate data within the RDBMS without having to solely rely on a graphical user interface. There are powerful extensions in the many variant structured query languages (e.g. T-SQL, DB2, PL/SQL, etc.) that provide functionality above and beyond ISO and ANSI standards. However, SQL practitioners must first and foremost remember that SQL is a SET BASED construct. The most efficient SQL code regards table data as a whole and refrains from manipulating individual row elements one at a time unless absolutely necessary.

“Thinking in sets, or more precisely, in relational terms, is probably the most important best practice when writing T-SQL code. Many people start coding in T-SQL after having some background in procedural programming. Often, at least at the early stages of coding in this new environment, you don’t really think in relational terms, but rather in procedural terms. That’s because it’s easier to think of a new language as an extension to what you already know as opposed to thinking of it as a different thing, which requires adopting the correct mindset” (Ben-Gan, 2012).

Working with a relational language based upon the relational data model demands a set based mindset. Iterative cursor based processing, if used, should be used sparingly.

“By preferring a cursor-based (row-at-a-time) result set—or as Jeff Moden has so aptly termed it, Row By Agonizing Row (RBAR; pronounced ‘ree-bar’)—instead of a regular set-based SQL query, you add a large amount of overhead to SQL Server” (Fritchey, 2014).

If all other set based options have been exhausted and a row-by-row cursor must be employed, then make sure to use an “efficient” (relatively speaking) cursor type. The fast-forward only cursor type provides some performance advantages with respect to other cursor types in a SQL server environment. Fast forward cursors are read only and they only move forward within a data set (i.e. they do not support multi-direction iteration). Furthermore, according to Microsoft Technet (2015), fast forward only cursors automatically close when they reach the end of the data. The application driver does not have to send a close request to the server, which saves a roundtrip across the network.

References:

Ben-Gan, I.  (Apr, 2012). T-SQL Foundations: Thinking in Sets. Why this line of thought is important when addressing querying tasks. Retrieved from http://sqlmag.com/t-sql/t-sql-foundations-thinking-sets

Fritchey, Grant. ( © 2014). Sql server query performance tuning (4th ed.).

Kriegel, Alex. ( © 2011). Discovering sql: a hands-on guide for beginners.

Microsoft Technet. Fast Forward-Only Cursors (ODBC). Retrieved April 23, 2015, from https://technet.microsoft.com/en-us/library/aa177106(v=sql.80).aspx

Normalization: A Database Best Practice

The practice of normalization is widely regarded as the standard methodology for logically organizing data to reduce anomalies in database management systems. Normalization involves deconstructing information into various sub-parts that are linked together in a logical way. Malaika and Nicola (2011) state, “..data normalization represents business records in computers by deconstructing the record into many parts, sometimes hundreds of parts, and reconstructing them again as necessary. Artificial keys and associated indexes are required to link the parts of a single record together.“ Although there are successively stringent forms of normalization, best practice involves decomposing information into the 3rd normal form (3NF). Subsequent higher normal forms provide protection from anomalies that most practitioners will rarely ever encounter.

Background

The normalization methodology was the brainchild of mathematician and IBM researcher Dr. Edgar Frank Codd. Dr. Codd developed the technique while working at IBM’s San Jose Research Laboratory in 1970 (IBM Archives, 2003). Early databases employed either inflexible hierarchical designs or a collection of pointers to data on magnetic tapes. “While such databases could be efficient in handling the specific data and queries they were designed for, they were absolutely inflexible. New types of queries required complex reprogramming, and adding new types of data forced a total redesign of the database itself.” (IBM Archives, 2003). In addition, disk space in the early days of computing was limited and highly expensive. Dr. Codd’s seminal paper “A Relational Model of Data for Large Shared Data Banks” proposed a flexible structure of rows and columns that would help reduce the amount of disk space necessary to store information. Furthermore, this revolutionary new methodology provided the benefit of significantly reducing data anomalies. These aforementioned benefits are achieved by ensuring that data is stored on disk exactly once.

Normal Forms (1NF to 3NF):

Normalization is widely regarded as the best practice when developing a coherent flexible database structure. Adams & Beckett (1997) state that designing a normalized database structure should be the first step taken when building a database that is meant to last. There are seven different forms of normalization; each lower form is a subset of the next higher form. Thus a database in 2nd normal form (2NF) is also in 1st normal form (1NF), although with additional satisfying conditions. Normalization best practice holds that databases in 3rd normal form (3NF) should suffice for the widest range of solutions. Adams & Beckett (1997) called 3NF “adequate for most practical needs.” When Dr. Codd initially proposed the concept of normalization, 3NF was the highest form introduced (Oppel, 2011).

A database table has achieved 1NF if does not contain any repeating groups and its attributes cannot be decomposed into smaller portions (atomicity). Most importantly, all of the data must relate to a primary key that uniquely indentifies a respective row. “When you have more than one field storing the same kind of information in a single table, you have a repeating group.” (Adams & Beckett, 1997). A higher level of normalization is often needed for tables in 1NF. Tables in 1NF are often subjected to “data duplication, update performance degradation, and update integrity problems..” (Teorey, Lightstone, Nadeau & Jagadish, 2011).

A database table has achieved 2NF if it meets the conditions of 1NF and if all of the non-key fields depend on ALL of the key fields (Stephens, 2009). It is important to note that tables with only 1 primary key that satisfy 1NF conditions are automatically in 2NF. In essence, 2NF helps data modelers determine if 2 tables have been combined into one table.

A database has achieved 3NF if it meets the conditions of 2NF and it contains no transitive dependencies. “A transitive dependency is when one non‐key field’s value depends on another non‐key field’s value” (Stephens, 2009). If any of the fields in the database table are dependent on any other fields, then the dependent field should be placed into another table.

If for example field B is functionally dependent on field A, (e.g. A->B), then add field A and B to a new table, with field A designated as a key which will provide linkage to the original table.

In short, 2NF and 3NF help determine the relationship between key and non-key attributes. Williams (1983) states, “Under second and third normal forms, a non-key field must provide a fact about the key, just the whole key, and nothing but the key.” A variant of this definition is typically supplemented with the remark “so help me Codd”.

Benefits:

Adams & Beckett (1997) assert that the normalization method provides benefits such as database efficiency & flexibility, the avoidance of redundant fields, and an easier to maintain database structure. Hoberman, (2009) adds that normalization provides a modeler with a stronger understanding of the business. The normalization process ensures that many questions are asked regarding data elements so these elements may be assigned to entities correctly. Hoberman also agrees that data quality is improved as redundancy is reduced.
Assessment

Although engaging in normalization is considered best practice, many sources advocate that normalization to 3NF is sufficient for the majority of data modeling solutions. Third normal form is deemed sufficient because the anomalies covered in higher forms occur with much less frequency. Beckett & Adams (1997) describe 3NF as “adequate for most practical needs.” Stephens (2009) affirms that many designers view 3NF as the form that combines adequate protection from recurrent data anomalies with relative modeling ease. Levels of normalization beyond 3NF can yield data models that are overly engineered, overly complicated and hard to maintain. The risk inherent in higher form constructs is that the performance can degrade to a level that is worse than less normalized designs. Hoberman (2009) asserts that, “Even though there are higher levels of normalization than 3NF, many interpret the term ‘normalized’ to mean 3NF.”

There are examples in data modeling literature where strict adherence to normalization is not advised. Fotache (2006) posited that normalization was initially a highly rigorous, theoretical methodology that was of not much practical use to real world development. Fotache provides the example of an attribute named ADDRESS, which is typically stored by many companies as an atomic string per 1NF requirements. The ADDRESS data could be stored in one field (violating 1NF) if the data is only needed for better person identification or mailing purposes. Teorey, Lightstone, Nadeau, & Jagadish (2011) advise that denormalization should be considered when performance considerations are in play. Denormalization introduces a trade off of increased update cost versus lower read cost depending upon the levels of data redundancy. Date (1990) downplays strict adherence to normalization and sets a minimum requirement of 1NF. “Normalization theory is a useful aid in the process, but it is not a panacea; anyone designing a database is certainly advised to be familiar with the basic techniques of normalization…but we do not mean to suggest that the design should necessarily be based on normalization principles alone” (Date, 1990).
Conclusion

Normalization is the best practice when designing a flexible and efficient database structure. The first three normal forms can be remembered by recalling a simple mnemonic. All attributes should depend upon a key (1NF), the whole key (2NF) and nothing but the key (3NF).

The advantages of normalization are many. Normalization ensures that modelers have a strong understanding of the business, it greatly reduces data redundancies and it improves data quality. When there is less data to store on disk, updating and inserting becomes a faster process. In addition, insert, delete and update anomalies disappear when adhering to normalization techniques. “The mantra of the skilled database designer is, for each attribute, capture it once, store it once, and use that one copy everywhere” (Stephens 2009).

It is important to remember that normalization to 3NF is sufficient for the majority of data modeling solutions. Higher levels of normalization can overcomplicate a database design and have the potential to provide worse performance.

In conclusion, begin the database design process by using normalization techniques. For implementation purposes, normalize data to 3NF compliance and then consider if data retrieval performance reasons necessitate denormalizing to a lower form. Denormalization introduces a trade off of increased update cost versus lower read cost depending upon the levels of data redundancy.

From the DAMA Guide to the Data Management Body of Knowledge:

  • First normal form (1NF): Ensures each entity has a valid primary key , every data element depends on the primary key, and removes repeating groups , and ensuring each data element is atomic (not multi-valued).
  • Second normal form (2NF): Ensures each entity has the minimal primary key and that every data element depends on the complete primary key.
  • Third normal form (3NF): Ensures each entity has no hidden primary keys and that each data element depends on no data element outside the key (―the key, the whole key and nothing but the key).

Glossary:

Delete Anomaly: “A delete anomaly is a situation where a deletion of data about one particular entity causes unintended loss of data that characterizes another entity.” (Stephens, 2009)

Denormalization: Denormalization involves reversing the process of normalization to gain faster read performance.

Insert Anomaly: “An insert anomaly is a situation where you cannot insert a new tuple into a relation because of an artificial dependency on another relation.” (Stephens, 2009)

Normalization: “Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.” (Microsoft Knowledge Base)

Primary Key: “Even though an entity may contain more than one candidate key, we can only select one candidate key to be the primary key for an entity. A primary key is a candidate key that has been chosen to be the unique identifier for an entity.” (Hoberman, 2009)

Update Anomaly: “An update anomaly is a situation where an update of a single data value requires multiple tuples (rows) of data to be updated.” (Stephens, 2009)

Bibliography

Adams, D., & Beckett, D. (1997). Normalization Is a Nice Theory. Foresight Technology Inc. Retrieved from http://www.4dcompanion.com/downloads/papers/normalization.pdf

DAMA International (2009). The DAMA Guide to the Data Management Body of Knowledge (1st Edition).

Fotache, M. (2006, May 1) Why Normalization Failed to Become the Ultimate Guide for Database Designers? Available at SSRN: http://ssrn.com/abstract=905060 or http://dx.doi.org/10.2139/ssrn.905060

Hoberman, S. (2009). Data modeling made simple: a practical guide for business and it professionals, second edition. [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=34408.

IBM Archives (2003): Edgar F. Codd. Retrieved from http://www-03.ibm.com/ibm/history/exhibits/builders/builders_codd.html

Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory. Communications of the ACM 26(2). Retrieved from http://www.bkent.net/Doc/simple5.htm

Malaika, S., & Nicola, M. (2011, December 15). Data normalization reconsidered, Part 1: The history of business records. Retrieved from http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/

Microsoft Knowledge Base. Article ID: 283878. Description of the database normalization basics. Retrieved from http://support.microsoft.com/kb/283878

Oppel, A. (2011). Databases demystified, 2nd edition. [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=72521.

Stephens, Rod. (2009). Beginning database design solutions. [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=29584.

Teorey, T. & Lightstone, S. & Nadeau, T. & Jagadish, H.V.. ( © 2011). Database modeling and design: logical design, fifth edition. [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=41847.

Image courtesy of David Castillo Dominici at FreeDigitalPhotos.net