How to Import a BAK File into SQL Server

If you’ve ever asked “How do I import a .BAK file into SQL Server” or “What is a BAK file and how do I open it” then this is your video. Additionally, I also demonstrate how to create a .BAK file to backup your database. If you work with data then you need to know this tip!

Do Great Things with Your Data

Anthony B. Smoak

Please like and subscribe on the Anthony B. Smoak YouTube channel!
Definitely pick up some merch if you’ve enjoyed this blog and YouTube channel over the years.

All views and opinions are solely my own and do not necessarily reflect those my employer.

Free SQL Data Profiling Tool: Idera SQL Data Profiler

In this tip I will provide an overview of a completely free data profiling tool (at least as of the time of this post) that is easy to use if you are in the Microsoft database stack. You can download the IDERA SQL Profiling tool and immediately put it to work to perform basic column profiling and analyses. The tool will display a summary of the data contained in a selected table and each of its columns. 

Use this tool on the following systems:

  • Microsoft SQL Server: 2008 R2, 2012, 2014, 2016; 2017 Windows & Linux (provisional); Express, Standard, Enterprise editions
  • Microsoft Azure SQL Database
  • Amazon Relational Database Service (RDS)

IDERA SQL Data Profiler has some minor quirks but you can’t beat the price. Check out my review in the video above.

Just remember that data profiling should always be done initially before you start analyzing a new dataset or designing a new visualization. Always start with the basics.

If you find this type of instruction valuable make sure to subscribe to my Youtube channel.

T-SQL Tips and Quick Reference

Whenever I have to fire up SQL Server to perform some analyses there are a few functions, keywords and capabilities that I always find myself referring to in order to analyze my data. As is the case with most T-SQL users, even those of us that have been using T-SQL for over a decade, in our heads we always know what we want to do but will refer to our favorite syntax reference sources in order to progress. I decided to make a handy reference sheet for myself and then decided to post it here for anyone else.

How to Create a Temporary Table in T-SQL / SQL Server

Temporary (i.e., temp) tables enable the storage of result sets from SQL scripts yet require less record locking overhead and thus increase performance. They remain in effect until they are explicitly dropped, or until the connection that created them is discontinued.

As I see it, their main benefit is that they preclude me from writing difficult to comprehend nested queries since I can place a result set inside a temp table and then join it back to a normal table at-will.

In this example, the results of permanent table ‘TABLE1’ will be placed into global temporary table ##TEMPTABLE:


Temp tables are stored in the tempdb system database.


“The tempdb system database is a global resource that is available to all users connected to the instance of SQL Server or connected to SQL Database.”

Additional Reference:

What do the Hashtags Mean in T-SQL Temp Table Creation?

 The number of hash signs “#” preceding the name of the temp table affects whether the scope of the table is local or global.

  • If you precede the temp table name with “#”, then the table will be treated as a local temp table.
  • If you precede the temp table with “##”, then the table will be treated as a global temp table.

“You can create local and global temporary tables. Local temporary tables are visible only in the current session, and global temporary tables are visible to all sessions. Temporary tables cannot be partitioned. Prefix local temporary table names with single number sign (#table_name), and prefix global temporary table names with a double number sign (##table_name).”

Additional References:

How to Drop a Temp Table
in T-SQL / SQL Server

 There are times when you will need to rerun code that creates a temp table. If the temp table has already been created, you will encounter an error.

“There is already an object named ‘##TEMP_TABLE_NAME’ in the database.”

Place the following code above the creation of your temp tables to force SQL Server to drop the temp table if it already exists. Change ##TEMP_TABLE_NAME to your table name and use the correct number of hashtags as applicable to a local (#) or global (##) temp table.


How to Add a New Field to a Temp Table in T-SQL / SQL Server (ALTER TABLE)

Here is example T-SQL that illustrates how to add a new field to a global temp table. The code below adds a simple bit field (holds either 1 or 0) named FIELD1 to the temp table, declares it as NOT NULL (i.e., it must have a value) and then defaults the value to 0.


The following code changes the data type of an existing field in a global temp table. FIELD1 has its data type changed to NVARCHAR(2) and is declared as NOT NULL.


Additional References:

How to Use a CASE Statement in T-SQL / SQL Server

The following information on the CASE statement is direct from Microsoft:

The CASE expression evaluates a list of conditions and returns one of multiple possible result expressions. The CASE expression has two formats:

  • The simple CASE expression compares an expression to a set of simple expressions to determine the result.
  • The searched CASE expression evaluates a set of Boolean expressions to determine the result.

Both formats support an optional ELSE argument.

CASE can be used in any statement or clause that allows a valid expression. For example, you can use CASE in statements such as SELECT, UPDATE, DELETE and SET, and in clauses such as select_list, IN, WHERE, ORDER BY, and HAVING.

Examples from Microsoft:

Category =
CASE ProductLine
WHEN 'R' THEN 'Road'
WHEN 'M' THEN 'Mountain'
WHEN 'T' THEN 'Touring'
WHEN 'S' THEN 'Other sale items'
ELSE 'Not for sale'
FROM Production.Product
ORDER BY ProductNumber;

"Price Range" =
WHEN ListPrice =  0 THEN 'Mfg item - not for resale'
WHEN ListPrice = 50 and ListPrice = 250 and ListPrice < 1000 THEN 'Under $1000'
ELSE 'Over $1000'
FROM Production.Product
ORDER BY ProductNumber ;

Here is a link to great post that highlights some of the unexpected results when using the CASE statement.

How to Use the Cast Function in T-SQL / SQL Server

When you need to convert a data field or expression to another data type then the cast function can be helpful. I typically have the need to take imported text fields and evaluate them as a datetime. The cast statement below helps me resolve this issue.

Select cast(txtOrder_Date as datetime) as Order_Date

This statement can also be used in a WHERE clause to filter the text as if it were a true datetime field/.

Where cast(txtOrder_Date as datetime)) between '20170101' and '20181231'

Furthermore, you can cast a literal string to an integer or decimal as needed.

Select cast(‘12345’ as int) as Integer_Field
Select cast(‘12345.12’’ as decimal (9,2)) as Decimal_Field

When your FIELDNAME is a text value, you can use the cast function to change its data type to an integer or decimal, and then sum the results. Here are a few examples I have had to use in the past with the sum function.

sum(cast(FIELDNAME as int)) as Sum_Overall_Qty

sum(cast(ltrim(rtrim(FIELDNAME2)) as decimal(38,2))) as Sum_Sales_Price

Additional Reference:

Using the REPLACE Function in T-SQL / SQL Server

The Replace function is useful when you need to replace all occurrences of one character or substring with another character or substring. The following select will replace the string ‘Anthony’ with ‘Tony’.

Select REPLACE(‘My name is Anthony’, ‘Anthony’, ‘Tony’);

REPLACE Function

Additional Reference:

How to Convert a Negative Text Number in Parenthesis Format to a Numeric Data Type (T-SQL / SQL Server)

I’ve found this particular expression useful when trying to convert a negative number in text format to a decimal value when the text is enclosed in parentheses; i.e., changing (123.45) to -123.45

It makes use of the REPLACE function to find the leading parenthesis and replace it with a negative sign. This first REPLACE is nested inside another REPLACE function in order to find the trailing parenthesis and replace

Select cast(replace(replace('(123.45)','(','-'),')','') as money);

REPLACE function2

You can also use the convert function to accomplish the same result. Below I used this line of code to sum the negative formatted text (represented by FIELD_NAME) by converting it to the money data type after replacing the parenthesis.

sum(convert(money,replace(replace(FIELD_NAME,'(','-'),')',''))) as Sum_Domestic_Price

COALESCE Function in T-SQL / SQL Server

The COALESCE function is very useful when replacing NULL field values with a substitute value. Per Microsoft, the COALESCE function evaluates in order a comma delimited list of expressions and returns the current value of the first expression that initially does not evaluate to NULL.

For example,

SELECT COALESCE(NULL, NULL, 'third_value', 'fourth_value');

returns the third value because the third value is the first value that is not null. I will use the COALESCE function at times to replace NULL values with 0 for use in calculations.


How to Dynamically Pivot Data in SQL Server


SQL is the lifeblood of any data professional. If you can’t leverage SQL and you work with data, your life will be more difficult than it needs to be.

In this video I am using SQL Server Express to turn a simple normalized dataset into a pivoted dataset. This is not a beginner video as I assume you are familiar with basic SQL concepts.

T-SQL is Microsoft’s SQL language that contains additional functions and capabilities over and above ANSI standards. We’ll use some of these functions to turn the following data set that displays average rents in major American cities into a pivoted denormalized dataset.

The City values in the City column will become individual columns in a new pivoted dataset with their respective Average Rent values appearing underneath.

We’re going to transform this:

Normalized Data

Into this:

Pivoted Data

Notice how the city values are now column heads and the respective Average Rent values are underneath.

Make sure you watch the video but here is the code used in the example.


--This parameter will hold the dynamically created SQL script

--This parameter will hold the Pivoted Column values

SELECT   @PivotColumns= COALESCE(@PivotColumns + ',','') + QUOTENAME([City])
FROM [dbo].[tbl_Rent]

--SELECT   @PivotColumns

--Create the dynamic query with all the values for
--pivot column at runtime

SET   @SQLQuery =
   N'SELECT [City Code],[Metro],[County],[State],[Population Rank],' +   @PivotColumns + '
   FROM [dbo].[tbl_Rent]
   PIVOT( MAX([Average Rent])
      FOR [City] IN (' + @PivotColumns + ')) AS Q'

--SELECT   @SQLQuery
--Execute dynamic query
EXEC sp_executesql @SQLQuery

Select * from ##TBL_TEMP


Big shoutout to StackOverflow for help with this example.


Return Unmatched Records with SQL and Microsoft Access

Over the course of many years of building SQL scripts, I’ve tended to help SQL novices perform the set difference operation on their data. This post will not provide in-depth coverage on SQL run plans and tuning minutiae, but I do want to provide a high level overview for the novice.

If we define set A as the three numbers {1, 2, 3} and set B as the numbers {2, 3, 4} then the set difference, denoted as A \ B, is {1}. Notice that the element 1 is only a member of set A.

A picture is worth a thousand words as they say. A Venn diagram will be effective at illustrating what we’re trying to accomplish in this post.

Venn Diagram Difference

This blog post will cover using SQL and Microsoft Access to address capturing the shaded records in set A. If you have a database table named A and wanted to determine all of the rows in this table that DO NOT reside in another table named B, then you would apply the set difference principle.


There are multiple ways to implement the set difference principle. It helps if there is a common join key between both sets of data when performing this analysis.

If I were working with two tables, one containing inventory data and one containing order data. I could write the following SQL script to return all the inventory rows that do not reside in the orders table.

SELECT table_inventories.* 
 FROM   table_inventories 
        LEFT OUTER JOIN table_orders 
                     ON = 


Consider the following tables in Microsoft Access. Observe that table_orders has fewer records than table_inventories.

Access Example Inventory Access Example Orders

We can construct a set difference select query using these tables to return all of the products in table_inventories that have not been ordered. Create a query in Microsoft Access in a similar fashion as shown below.

Access SQL Difference Join

The result of this query would produce the following two products that are not in table_orders.

Access Example Query Result

The Microsoft Access Query & View Designer would automatically generate the following SQL if you cared to open the Access SQL editor.

SELECT table_inventories.*
FROM   table_inventories
LEFT JOIN table_orders
ON =
WHERE  (( ( ) IS NULL ));

Notice that LEFT JOIN is automatically created instead of LEFT OUTER JOIN. In Microsoft Access, the OUTER operation is optional. Also notice that Access loves to add additional parentheses for reasons known only to Microsoft.

Per Microsoft Access SQL Reference:

“Use a LEFT JOIN operation to create a left outer join. Left outer joins include all of the records from the first (left) of two tables, even if there are no matching values for records in the second (right) table [1].”


Let’s step away from Microsoft Access for the remainder of this post. The NOT EXISTS approach provides similar functionality in a more performance friendly manner as compared to the LEFT OUTER JOIN & IS NULL syntax.

SELECT table_inventories.*
FROM   table_inventories
FROM   table_orders);


Alternatively, we could use the SQL EXCEPT operator which would also accomplish the task of returning inventory ids that do not reside in the orders table (i.e. inventory items that were never ordered). This syntax would be appropriate when using SQL Server.

FROM   table_inventories
FROM   table_orders

Per Microsoft:

Returns any distinct values from the query to the left of the EXCEPT operator that are not also returned from the right query [2].”


The following script will yield the same result as the T-SQL syntax. When using Oracle, make sure to incorporate the MINUS operator.

FROM   inventories
FROM   table_orders

Now take this tip and get out there and do some good things with your data.

Anthony Smoak


[1] Access 2007 Developer Reference.

[2] Microsoft T-SQL Docs. Set Operators – EXCEPT and INTERSECT (Transact-SQL).

[3] Oracle Help Center. The UNION [ALL], INTERSECT, MINUS Operators.

Venn diagram courtesy of

The Need For Speed: Improve SQL Query Performance with Indexing

This article is also published on LinkedIn.

How many times have you executed a SQL query against a million plus row table and then engaged in a protracted waiting game for your results? Unfortunately, a poor database table indexing strategy can counteract the gains of the best hardware and server architectures. The positive impact that strategically applied indexes can provide to query performance should not be ignored just because one isn’t wearing a DBA hat. “You can obtain the greatest improvement in database application performance by looking first at the area of data access, including logical/physical database design, query design, and index design” (Fritchey, 2014). Understanding the basics of index application should not be eschewed and treated as an esoteric art best left to DBAs.

Make use of the Covering Index

It is important that regularly used, resource intensive queries be subjected to “covering indexes”. The aim of a covering index is to “cover” the query by including all of the fields that are referenced in WHERE or SELECT statements. Babbar, Bjeletich, Mackman, Meier and Vasireddy (2004) state, “The index ‘covers’ the query, and can completely service the query without going to the base data. This is in effect a materialized view of the query. The covering index performs well because the data is in one place and in the required order.” The benefit of a properly constructed covering index is clear; the RDBMS can find all the data columns it needs in the index without the need to refer back to the base table which drastically improves performance. Kriegel (2011) asserts, “Not all indices are created equal — If the column for which you’ve created an index is not part of your search criteria, the index will be useless at best and detrimental at worst.”

Apply a Clustered Index

More often than not, a table should have a clustered index applied so as to avoid expensive table scans by the query optimizer. It is advisable to create one clustered index per table preferably on the PRIMARY KEY column. In theory, since the primary key is the unique identifier for a row, query writers will employ the primary key in order to aid with record search performance.

“When no clustered index is present to establish a storage order for the data, the storage engine will simply read through the entire table to find what it needs. A table without a clustered index is called a heap table. A heap is just an unordered stack of data with a row identifier as a pointer to the storage location. This data is not ordered or searchable except by walking through the data, row by row, in a process called a scan” (Fritchey, 2014).

However, the caveat to applying clustered indexes on a transactional table is that the index must be reordered after every INSERT or UPDATE to the key which can add substantial overhead to those processes. Dimensional or static tables which are only accessed for join purposes are optimal for this indexing strategy.

Apply a Non-Clustered Index

Another consideration in regard to SQL performance tuning is to apply non-clustered indexes on foreign keys within frequently accessed tables. Babbar et al. (2004) advise, “Be sure to create an index on any foreign key. Because foreign keys are used in joins, foreign keys almost always benefit from having an index.”

Indexing is an Art not a Science

Always remember that indexing is considered an art and not a science. Diverse real world scenarios often call for different indexing strategies. In some instances, indexing a table may not be required. If a table is small (on a per data page basis), then a full table scan will be more efficient than processing an index and then subsequently accessing the base table to locate the rest of the row data.


One of the biggest detriments to SQL query performance is an insufficient indexing strategy. On one hand, under-indexing can potentially cause queries to run longer than necessary due to the costly nature of table scans against unordered heaps. This scenario must be counterbalanced by the tendency to over-index, which will negatively impact insert and update performance.

When possible, SQL practitioners and DBAs should collaborate to understand query performance as a whole; especially in a production environment. DBAs left to their own devices have the potential to create indexes without any knowledge of the queries that will utilize those indexes. This uncoordinated approach has the potential to render indexes inefficient on arrival. Conversely, it is equally important that SQL practitioners have a basic understanding of indexing as well. Placing “SELECT *” in every SQL query will negate the effectiveness of covering indexes and add additional processing overhead as compared to specifically listing the subset of fields desired.

Even if you do not have administrative access to the tables that constitute your queries, approaching your DBA with a basic understanding of indexing strategies will lead to a more effective conversation.


Babbar, A., Bjeletich, S., Mackman, A., Meier, J., & Vasireddy, S. (May, 2004). Improving .NET Application Performance and Scalability. Retrieved from

Fritchey, Grant. ( © 2014). Sql server query performance tuning (4th ed.).

Kriegel, Alex. ( © 2011). Discovering sql: a hands-on guide for beginners.

Protection Against Injection: The SQL Injection Attack

As we are all well aware, data is everywhere. Every organization generates and stores data and unfortunately too many bad apples are willing to exploit application weaknesses.  A very popular technique used by hackers of all hats to compromise data confidentiality and integrity is the SQL injection attack. “In terms of the attack methods used by hackers, SQL injection remains the number one exploit, accounting for nearly one-fifth of all vulnerabilities used by hackers” (Howarth, 2010). Don’t believe the hype? Visit the SQL Injection Hall of Fame.

Not everyone is a DBA or a security expert but if you care about data, you need to have a basic understanding of how this attack can be used to potentially compromise your web exposed data. In 2009 infamous hacker Albert Gonzalez was indicted by grand juries in Massachusetts and New York for stealing data from companies such as Dave & Buster’s Holdings, TJ Maxx, BJ’s Wholesale Club, OfficeMax, Barnes & Noble and The Sports Authority by using SQL injection attacks. All of these attacks were enabled due to poorly coded web application software (Vijayan, 2009). He masterminded “the combined credit card theft and subsequent reselling of more than 170 million card and ATM numbers from 2005 through 2007—the biggest such fraud in history” (Wikipedia, Albert Gonzalez). As an aside, Mr. Gonzalez is serving 20 years in prison for his crimes.

In short, a SQL injection is a malicious hacking method used to compromise the security of a SQL database. Invalid parameters are entered into a user input field on a website and that user input is submitted to a web application database server for execution. A successful exploit will allow the hacker to remotely shell into the server and take control or simply obtain sensitive information from a hacked SQL SELECT statement. The exploiter may be able to further exploit SQL commands and escalate privileges to read, modify or even delete information at will.

A popular method to test the vulnerability of a site is to place a single quote character, ‘, into the query string of a URL (Krutz, R. L. & Vines, R. D., 2008). The desired response is to see an error message that contains an ODBC (Open Database Connectivity) reference. ODBC is a standard database access protocol used to interact with applications regardless of the underlying database management system. Krutz et. al (2008) offer the example of typical ODBC error message:

Microsoft OLE DB Provider for ODBC Drivers error ‘80040e14’
[Microsoft][ODBC SQL Server Driver][SQL Server]Incorrect syntax near the
keyword ‘and’. /wasc.asp, line 68

An error message like this contains a wealth of information that an ill-intentioned hacker can use to exploit an insecure system. It would be in the best interests of a secure application to return a custom generic error response. Furthermore, it is not necessary to be an experienced hacker to take advantage of this exploit; there are automated SQL injection tools available that can make carrying out this attack fairly simple for someone with a script kiddie level of understanding.

There are ways to protect against SQL injection attacks; the most obvious way is to apply input validation. Rejecting unreasonably long inputs may prevent exploitation of a buffer overflow scenario. Programmers due to the extra work involved, sometimes skip validation steps, however the extra safety margin may be worth the cost. Encrypting the database contents and limiting privileges on those accounts which execute user input queries is also ideal (Daswani, N., Kern, C., & Kesavan, A., 2007)

From a SQL Server perspective, here are a few best practice tips shared from Microsoft TechNet to consider for input validation:

    • You should review all code that calls EXECUTE, EXEC, or sp_executesql
    • Test the size and data type of input and enforce appropriate limits. This can help prevent deliberate buffer overruns.
    • Test the content of string variables and accept only expected values. Reject entries that contain binary data, escape sequences, and comment characters. This can help prevent script injection and can protect against some buffer overrun exploits.
    • Never build Transact-SQL statements directly from user input.
    • Use stored procedures to validate user input.
    • In multitiered environments, all data should be validated before admission to the trusted zone. Data that does not pass the validation process should be rejected and an error should be returned to the previous tier.
    • Implement multiple layers of validation. Validate input in the user interface and at all subsequent points where it crosses a trust boundary. For example, data validation in a client-side application can prevent simple script injection. However, if the next tier assumes that its input has already been validated, any malicious user who can bypass a client can have unrestricted access to a system.
    • Never concatenate user input that is not validated. String concatenation is the primary point of entry for script injection.

Albert Gonzalez. In Wikipedia.

Howarth.F. (2010). Emerging Hacker Attacks. Faulkner Information Services.

Krutz, R. L. & Vines, R. D., ( © 2008). The CEH Prep Guide: The Comprehensive Guide to Certified Ethical Hacking.

Microsoft TechNet. SQL Injection.

Vijayan, J. (2009). “U.S. says SQL injection caused major breaches.” Computerworld, 43(26), 4-4.

SQL: Think in Sets not Rows

This article is also posted on LinkedIn.

Structured Query Language, better known as SQL, is regarded as the working language of relational database management systems (RDBMS). As was the case with the relational model and the concepts of normalization, the language developed as result of IBM research in the nineteen seventies.

Left to their own devices, the early RDBMSs (sic) implemented a number of languages, including SEQUEL, developed by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s while working at IBM; and QUEL, the original language of Ingres. Eventually these efforts converged into a workable SQL, the Structured Query Language” (Kriegel, 2001).

For information professionals and database practitioners, SQL is regarded as a foundational skill that enables raw data to be manipulated within a RDBMS. “This is a declarative type of language. It instructs the database about what you want to do, and leaves details of implementation (how to do it) to the RDBMS itself” (Kriegel, 2001).

Before the advent of commercially accessible databases, data was typically stored in a proprietary file format manner. Each vendor had detailed specific access mechanisms, which could not be easily configured and customized for access by alternate applications. As databases began to adopt the relational model, the arrival and eventual standardization of SQL by ANSI (American National Standards Institute) and ISO (International Standards Institute) helped foster access, manipulation and retrieval consistency across many products.

Think in Sets not Rows!

SQL provides users the ability to query and manipulate data within the RDBMS without having to solely rely on a graphical user interface. There are powerful extensions in the many variant structured query languages (e.g. T-SQL, DB2, PL/SQL, etc.) that provide functionality above and beyond ISO and ANSI standards. However, SQL practitioners must first and foremost remember that SQL is a SET BASED construct. The most efficient SQL code regards table data as a whole and refrains from manipulating individual row elements one at a time unless absolutely necessary.

“Thinking in sets, or more precisely, in relational terms, is probably the most important best practice when writing T-SQL code. Many people start coding in T-SQL after having some background in procedural programming. Often, at least at the early stages of coding in this new environment, you don’t really think in relational terms, but rather in procedural terms. That’s because it’s easier to think of a new language as an extension to what you already know as opposed to thinking of it as a different thing, which requires adopting the correct mindset” (Ben-Gan, 2012).

Working with a relational language based upon the relational data model demands a set based mindset. Iterative cursor based processing, if used, should be used sparingly.

“By preferring a cursor-based (row-at-a-time) result set—or as Jeff Moden has so aptly termed it, Row By Agonizing Row (RBAR; pronounced ‘ree-bar’)—instead of a regular set-based SQL query, you add a large amount of overhead to SQL Server” (Fritchey, 2014).

If all other set based options have been exhausted and a row-by-row cursor must be employed, then make sure to use an “efficient” (relatively speaking) cursor type. The fast-forward only cursor type provides some performance advantages with respect to other cursor types in a SQL server environment. Fast forward cursors are read only and they only move forward within a data set (i.e. they do not support multi-direction iteration). Furthermore, according to Microsoft Technet (2015), fast forward only cursors automatically close when they reach the end of the data. The application driver does not have to send a close request to the server, which saves a roundtrip across the network.


Ben-Gan, I.  (Apr, 2012). T-SQL Foundations: Thinking in Sets. Why this line of thought is important when addressing querying tasks. Retrieved from

Fritchey, Grant. ( © 2014). Sql server query performance tuning (4th ed.).

Kriegel, Alex. ( © 2011). Discovering sql: a hands-on guide for beginners.

Microsoft Technet. Fast Forward-Only Cursors (ODBC). Retrieved April 23, 2015, from

Normalization: A Database Best Practice

The practice of normalization is widely regarded as the standard methodology for logically organizing data to reduce anomalies in database management systems. Normalization involves deconstructing information into various sub-parts that are linked together in a logical way. Malaika and Nicola (2011) state, “ normalization represents business records in computers by deconstructing the record into many parts, sometimes hundreds of parts, and reconstructing them again as necessary. Artificial keys and associated indexes are required to link the parts of a single record together.“ Although there are successively stringent forms of normalization, best practice involves decomposing information into the 3rd normal form (3NF). Subsequent higher normal forms provide protection from anomalies that most practitioners will rarely ever encounter.


The normalization methodology was the brainchild of mathematician and IBM researcher Dr. Edgar Frank Codd. Dr. Codd developed the technique while working at IBM’s San Jose Research Laboratory in 1970 (IBM Archives, 2003). Early databases employed either inflexible hierarchical designs or a collection of pointers to data on magnetic tapes. “While such databases could be efficient in handling the specific data and queries they were designed for, they were absolutely inflexible. New types of queries required complex reprogramming, and adding new types of data forced a total redesign of the database itself.” (IBM Archives, 2003). In addition, disk space in the early days of computing was limited and highly expensive. Dr. Codd’s seminal paper “A Relational Model of Data for Large Shared Data Banks” proposed a flexible structure of rows and columns that would help reduce the amount of disk space necessary to store information. Furthermore, this revolutionary new methodology provided the benefit of significantly reducing data anomalies. These aforementioned benefits are achieved by ensuring that data is stored on disk exactly once.

Normal Forms (1NF to 3NF):

Normalization is widely regarded as the best practice when developing a coherent flexible database structure. Adams & Beckett (1997) state that designing a normalized database structure should be the first step taken when building a database that is meant to last. There are seven different forms of normalization; each lower form is a subset of the next higher form. Thus a database in 2nd normal form (2NF) is also in 1st normal form (1NF), although with additional satisfying conditions. Normalization best practice holds that databases in 3rd normal form (3NF) should suffice for the widest range of solutions. Adams & Beckett (1997) called 3NF “adequate for most practical needs.” When Dr. Codd initially proposed the concept of normalization, 3NF was the highest form introduced (Oppel, 2011).

A database table has achieved 1NF if does not contain any repeating groups and its attributes cannot be decomposed into smaller portions (atomicity). Most importantly, all of the data must relate to a primary key that uniquely indentifies a respective row. “When you have more than one field storing the same kind of information in a single table, you have a repeating group.” (Adams & Beckett, 1997). A higher level of normalization is often needed for tables in 1NF. Tables in 1NF are often subjected to “data duplication, update performance degradation, and update integrity problems..” (Teorey, Lightstone, Nadeau & Jagadish, 2011).

A database table has achieved 2NF if it meets the conditions of 1NF and if all of the non-key fields depend on ALL of the key fields (Stephens, 2009). It is important to note that tables with only 1 primary key that satisfy 1NF conditions are automatically in 2NF. In essence, 2NF helps data modelers determine if 2 tables have been combined into one table.

A database has achieved 3NF if it meets the conditions of 2NF and it contains no transitive dependencies. “A transitive dependency is when one non‐key field’s value depends on another non‐key field’s value” (Stephens, 2009). If any of the fields in the database table are dependent on any other fields, then the dependent field should be placed into another table.

If for example field B is functionally dependent on field A, (e.g. A->B), then add field A and B to a new table, with field A designated as a key which will provide linkage to the original table.

In short, 2NF and 3NF help determine the relationship between key and non-key attributes. Williams (1983) states, “Under second and third normal forms, a non-key field must provide a fact about the key, just the whole key, and nothing but the key.” A variant of this definition is typically supplemented with the remark “so help me Codd”.


Adams & Beckett (1997) assert that the normalization method provides benefits such as database efficiency & flexibility, the avoidance of redundant fields, and an easier to maintain database structure. Hoberman, (2009) adds that normalization provides a modeler with a stronger understanding of the business. The normalization process ensures that many questions are asked regarding data elements so these elements may be assigned to entities correctly. Hoberman also agrees that data quality is improved as redundancy is reduced.

Although engaging in normalization is considered best practice, many sources advocate that normalization to 3NF is sufficient for the majority of data modeling solutions. Third normal form is deemed sufficient because the anomalies covered in higher forms occur with much less frequency. Beckett & Adams (1997) describe 3NF as “adequate for most practical needs.” Stephens (2009) affirms that many designers view 3NF as the form that combines adequate protection from recurrent data anomalies with relative modeling ease. Levels of normalization beyond 3NF can yield data models that are overly engineered, overly complicated and hard to maintain. The risk inherent in higher form constructs is that the performance can degrade to a level that is worse than less normalized designs. Hoberman (2009) asserts that, “Even though there are higher levels of normalization than 3NF, many interpret the term ‘normalized’ to mean 3NF.”

There are examples in data modeling literature where strict adherence to normalization is not advised. Fotache (2006) posited that normalization was initially a highly rigorous, theoretical methodology that was of not much practical use to real world development. Fotache provides the example of an attribute named ADDRESS, which is typically stored by many companies as an atomic string per 1NF requirements. The ADDRESS data could be stored in one field (violating 1NF) if the data is only needed for better person identification or mailing purposes. Teorey, Lightstone, Nadeau, & Jagadish (2011) advise that denormalization should be considered when performance considerations are in play. Denormalization introduces a trade off of increased update cost versus lower read cost depending upon the levels of data redundancy. Date (1990) downplays strict adherence to normalization and sets a minimum requirement of 1NF. “Normalization theory is a useful aid in the process, but it is not a panacea; anyone designing a database is certainly advised to be familiar with the basic techniques of normalization…but we do not mean to suggest that the design should necessarily be based on normalization principles alone” (Date, 1990).

Normalization is the best practice when designing a flexible and efficient database structure. The first three normal forms can be remembered by recalling a simple mnemonic. All attributes should depend upon a key (1NF), the whole key (2NF) and nothing but the key (3NF).

The advantages of normalization are many. Normalization ensures that modelers have a strong understanding of the business, it greatly reduces data redundancies and it improves data quality. When there is less data to store on disk, updating and inserting becomes a faster process. In addition, insert, delete and update anomalies disappear when adhering to normalization techniques. “The mantra of the skilled database designer is, for each attribute, capture it once, store it once, and use that one copy everywhere” (Stephens 2009).

It is important to remember that normalization to 3NF is sufficient for the majority of data modeling solutions. Higher levels of normalization can overcomplicate a database design and have the potential to provide worse performance.

In conclusion, begin the database design process by using normalization techniques. For implementation purposes, normalize data to 3NF compliance and then consider if data retrieval performance reasons necessitate denormalizing to a lower form. Denormalization introduces a trade off of increased update cost versus lower read cost depending upon the levels of data redundancy.

From the DAMA Guide to the Data Management Body of Knowledge:

  • First normal form (1NF): Ensures each entity has a valid primary key , every data element depends on the primary key, and removes repeating groups , and ensuring each data element is atomic (not multi-valued).
  • Second normal form (2NF): Ensures each entity has the minimal primary key and that every data element depends on the complete primary key.
  • Third normal form (3NF): Ensures each entity has no hidden primary keys and that each data element depends on no data element outside the key (―the key, the whole key and nothing but the key).


Delete Anomaly: “A delete anomaly is a situation where a deletion of data about one particular entity causes unintended loss of data that characterizes another entity.” (Stephens, 2009)

Denormalization: Denormalization involves reversing the process of normalization to gain faster read performance.

Insert Anomaly: “An insert anomaly is a situation where you cannot insert a new tuple into a relation because of an artificial dependency on another relation.” (Stephens, 2009)

Normalization: “Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.” (Microsoft Knowledge Base)

Primary Key: “Even though an entity may contain more than one candidate key, we can only select one candidate key to be the primary key for an entity. A primary key is a candidate key that has been chosen to be the unique identifier for an entity.” (Hoberman, 2009)

Update Anomaly: “An update anomaly is a situation where an update of a single data value requires multiple tuples (rows) of data to be updated.” (Stephens, 2009)


Adams, D., & Beckett, D. (1997). Normalization Is a Nice Theory. Foresight Technology Inc. Retrieved from

DAMA International (2009). The DAMA Guide to the Data Management Body of Knowledge (1st Edition).

Fotache, M. (2006, May 1) Why Normalization Failed to Become the Ultimate Guide for Database Designers? Available at SSRN: or

Hoberman, S. (2009). Data modeling made simple: a practical guide for business and it professionals, second edition. [Books24x7 version] Available from

IBM Archives (2003): Edgar F. Codd. Retrieved from

Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory. Communications of the ACM 26(2). Retrieved from

Malaika, S., & Nicola, M. (2011, December 15). Data normalization reconsidered, Part 1: The history of business records. Retrieved from

Microsoft Knowledge Base. Article ID: 283878. Description of the database normalization basics. Retrieved from

Oppel, A. (2011). Databases demystified, 2nd edition. [Books24x7 version] Available from

Stephens, Rod. (2009). Beginning database design solutions. [Books24x7 version] Available from

Teorey, T. & Lightstone, S. & Nadeau, T. & Jagadish, H.V.. ( © 2011). Database modeling and design: logical design, fifth edition. [Books24x7 version] Available from

Image courtesy of David Castillo Dominici at