Database

B.I. Basics: Create an SSIS Data Profiling Task In SQL Server

Data profiling is necessary when trying to gain an understanding of a given data set, and a data profiling assessment should be completed before any reporting or application development work begins. My video demonstrates how to create a basic SSIS Data Profiling Task using SQL Server Data Tools.

According to the DAMA Guide to the Data Management Body of Knowledge:

“Before making any improvements to data, one must be able to distinguish between good and bad data…. A data analyst may not necessarily be able to pinpoint all instances of flawed data. However, the ability to document situations where data values look like they do not belong provides a means to communicate these instances with subject matter experts, whose business knowledge can confirm the existences of data problems.”

Here is additional information direct from Bill Gates’s former startup outfit regarding the types of data profiling tasks available in SSIS: https://msdn.microsoft.com/en-us/library/bb895263.aspx

If you’re interested in Business Intelligence & Tableau, please subscribe and check out my videos either here on this site or on my YouTube channel.

The Need For Speed: Improve SQL Query Performance with Indexing

This article is also published on LinkedIn.

How many times have you executed a SQL query against a million-plus row table and then engaged in a protracted waiting game for your results? Unfortunately, a poor table indexing strategy can counteract the gains of the best hardware and server architectures. The positive impact that strategically applied indexes have on query performance should not be ignored just because one isn’t wearing a DBA hat. “You can obtain the greatest improvement in database application performance by looking first at the area of data access, including logical/physical database design, query design, and index design” (Fritchey, 2014). The basics of index design should not be treated as an esoteric art best left to DBAs.

Make use of the Covering Index

Regularly used, resource-intensive queries should be supported by “covering indexes”. The aim of a covering index is to “cover” the query by including all of the columns referenced in its SELECT list and WHERE clause. Babbar, Bjeletich, Mackman, Meier and Vasireddy (2004) state, “The index ‘covers’ the query, and can completely service the query without going to the base data. This is in effect a materialized view of the query. The covering index performs well because the data is in one place and in the required order.” The benefit of a properly constructed covering index is clear: the RDBMS can find all the columns it needs in the index itself, without referring back to the base table, which drastically improves performance. Kriegel (2011) asserts, “Not all indices are created equal — If the column for which you’ve created an index is not part of your search criteria, the index will be useless at best and detrimental at worst.”
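As an illustration, here is a minimal T-SQL sketch, assuming a hypothetical dbo.Orders table and a report query that filters on CustomerID and returns only OrderDate and TotalDue:

    -- Hypothetical report query:
    --   SELECT OrderDate, TotalDue FROM dbo.Orders WHERE CustomerID = 42;

    -- A covering index: CustomerID supports the WHERE clause, while the INCLUDE
    -- columns let the engine satisfy the SELECT list without touching the base table.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Covering
        ON dbo.Orders (CustomerID)
        INCLUDE (OrderDate, TotalDue);

The INCLUDE clause keeps the non-key columns out of the index key itself, which keeps the index narrow while still covering the query.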

Apply a Clustered Index

More often than not, a table should have a clustered index so that the query optimizer can avoid expensive table scans. Only one clustered index can exist per table, and it is usually best placed on the PRIMARY KEY column. Since the primary key is the unique identifier for a row, query writers will naturally use it in their search predicates, which aids record search performance.

“When no clustered index is present to establish a storage order for the data, the storage engine will simply read through the entire table to find what it needs. A table without a clustered index is called a heap table. A heap is just an unordered stack of data with a row identifier as a pointer to the storage location. This data is not ordered or searchable except by walking through the data, row by row, in a process called a scan” (Fritchey, 2014).

However, the caveat to applying a clustered index to a transactional table is that the rows must be maintained in key order, so every INSERT, and every UPDATE to the key, can add substantial overhead to those processes. Dimensional or static tables that are accessed mostly for join purposes are ideal candidates for this indexing strategy.
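For example, a minimal sketch using the same hypothetical dbo.Orders table, with the clustered index created through the primary key definition:

    -- Declaring the PRIMARY KEY as CLUSTERED stores the rows in OrderID order,
    -- so the table is no longer an unordered heap.
    -- (CLUSTERED is the default for a primary key in SQL Server; it is spelled
    -- out here for clarity.)
    CREATE TABLE dbo.Orders
    (
        OrderID    INT           NOT NULL,
        CustomerID INT           NOT NULL,
        OrderDate  DATE          NOT NULL,
        TotalDue   DECIMAL(10,2) NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
    );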

Apply a Non-Clustered Index

Another consideration in regard to SQL performance tuning is to apply non-clustered indexes on foreign keys within frequently accessed tables. Babbar et al. (2004) advise, “Be sure to create an index on any foreign key. Because foreign keys are used in joins, foreign keys almost always benefit from having an index.”
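Continuing with the hypothetical schema above, a sketch of a foreign key index on an order-lines table (the table and column names are illustrative only):

    -- dbo.OrderLines.OrderID references dbo.Orders and appears in most joins,
    -- so the foreign key gets its own non-clustered index; SQL Server does not
    -- create one automatically when the FOREIGN KEY constraint is defined.
    CREATE NONCLUSTERED INDEX IX_OrderLines_OrderID
        ON dbo.OrderLines (OrderID);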

Indexing is an Art not a Science

Always remember that indexing is considered an art, not a science. Diverse real-world scenarios call for different indexing strategies, and in some instances a table may not need an index at all. If a table is small (only a few data pages), a full table scan will be more efficient than reading an index and then accessing the base table to locate the rest of the row data.

Conclusion

One of the biggest detriments to SQL query performance is an ill-considered indexing strategy. On the one hand, under-indexing can cause queries to run longer than necessary because of costly table scans against unordered heaps. On the other hand, this must be balanced against the tendency to over-index, which degrades insert and update performance.

When possible, SQL practitioners and DBAs should collaborate to understand query performance as a whole, especially in a production environment. DBAs left to their own devices may create indexes without any knowledge of the queries that will use them, an uncoordinated approach that can render the indexes inefficient on arrival. Conversely, it is equally important that SQL practitioners have a basic understanding of indexing. Placing “SELECT *” in every query negates the effectiveness of covering indexes and adds processing overhead compared to listing only the subset of fields actually needed.
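To make the point concrete, a hedged example against the hypothetical dbo.Orders table from the indexing sketches above:

    -- Listing only the needed columns lets a covering index on (CustomerID)
    -- INCLUDE (OrderDate, TotalDue) satisfy the query entirely;
    -- SELECT * would force a lookup of every column in each qualifying row.
    SELECT OrderDate, TotalDue
    FROM dbo.Orders
    WHERE CustomerID = 42;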

Even if you do not have administrative access to the tables referenced in your queries, approaching your DBA with a basic understanding of indexing strategies will lead to a more effective conversation.

References

Babbar, A., Bjeletich, S., Mackman, A., Meier, J., & Vasireddy, S. (2004, May). Improving .NET Application Performance and Scalability. Retrieved from https://msdn.microsoft.com/en-us/library/ff647793.aspx

Fritchey, G. (2014). SQL Server Query Performance Tuning (4th ed.).

Kriegel, A. (2011). Discovering SQL: A Hands-On Guide for Beginners.

SQL: Think in Sets not Rows

This article is also posted on LinkedIn.

Structured Query Language, better known as SQL, is regarded as the working language of relational database management systems (RDBMS). As was the case with the relational model and the concepts of normalization, the language developed as a result of IBM research in the 1970s.

“Left to their own devices, the early RDBMSs (sic) implemented a number of languages, including SEQUEL, developed by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s while working at IBM; and QUEL, the original language of Ingres. Eventually these efforts converged into a workable SQL, the Structured Query Language” (Kriegel, 2011).

For information professionals and database practitioners, SQL is regarded as a foundational skill that enables raw data to be manipulated within an RDBMS. “This is a declarative type of language. It instructs the database about what you want to do, and leaves details of implementation (how to do it) to the RDBMS itself” (Kriegel, 2011).

Before the advent of commercially accessible databases, data was typically stored in proprietary file formats. Each vendor implemented its own access mechanisms, which could not easily be configured or customized for access by other applications. As databases began to adopt the relational model, the arrival and eventual standardization of SQL by ANSI (American National Standards Institute) and ISO (International Organization for Standardization) helped foster consistency of access, manipulation and retrieval across many products.

Think in Sets not Rows!

SQL provides users the ability to query and manipulate data within the RDBMS without having to rely solely on a graphical user interface. The many SQL dialects (e.g. T-SQL, PL/SQL, DB2 SQL) offer powerful extensions that go above and beyond the ISO and ANSI standards. However, SQL practitioners must first and foremost remember that SQL is a SET-BASED construct. The most efficient SQL code treats table data as a whole and refrains from manipulating individual rows one at a time unless absolutely necessary.

“Thinking in sets, or more precisely, in relational terms, is probably the most important best practice when writing T-SQL code. Many people start coding in T-SQL after having some background in procedural programming. Often, at least at the early stages of coding in this new environment, you don’t really think in relational terms, but rather in procedural terms. That’s because it’s easier to think of a new language as an extension to what you already know as opposed to thinking of it as a different thing, which requires adopting the correct mindset” (Ben-Gan, 2012).

Working with a relational language based upon the relational data model demands a set-based mindset. Iterative, cursor-based processing should be used sparingly, if at all.

“By preferring a cursor-based (row-at-a-time) result set—or as Jeff Moden has so aptly termed it, Row By Agonizing Row (RBAR; pronounced ‘ree-bar’)—instead of a regular set-based SQL query, you add a large amount of overhead to SQL Server” (Fritchey, 2014).
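The contrast is easiest to see side by side. The following is a minimal T-SQL sketch, assuming a hypothetical dbo.Orders table and an equally hypothetical requirement to discount older orders by 10 percent:

    -- Row By Agonizing Row: a cursor updates one order at a time.
    DECLARE @OrderID INT;
    DECLARE order_cursor CURSOR FOR
        SELECT OrderID FROM dbo.Orders WHERE OrderDate < '20140101';
    OPEN order_cursor;
    FETCH NEXT FROM order_cursor INTO @OrderID;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE dbo.Orders SET TotalDue = TotalDue * 0.90 WHERE OrderID = @OrderID;
        FETCH NEXT FROM order_cursor INTO @OrderID;
    END;
    CLOSE order_cursor;
    DEALLOCATE order_cursor;

    -- Set-based: one statement handles the same rows as a single operation.
    UPDATE dbo.Orders
    SET TotalDue = TotalDue * 0.90
    WHERE OrderDate < '20140101';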

If all set-based options have been exhausted and a row-by-row cursor must be employed, then make sure to use a relatively efficient cursor type. In a SQL Server environment, the fast forward-only cursor type provides some performance advantages over other cursor types. Fast forward-only cursors are read-only and move only forward through the data set (i.e. they do not support scrolling backward). Furthermore, according to Microsoft TechNet (2015), fast forward-only cursors automatically close when they reach the end of the data; the application driver does not have to send a close request to the server, which saves a round trip across the network.
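For instance, the cursor in the previous sketch could be declared with the FAST_FORWARD option; the declaration below is illustrative only:

    -- FAST_FORWARD makes the cursor forward-only and read-only,
    -- the least expensive choice when a cursor truly cannot be avoided.
    DECLARE order_cursor CURSOR FAST_FORWARD FOR
        SELECT OrderID, TotalDue
        FROM dbo.Orders
        WHERE OrderDate < '20140101';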

References:

Ben-Gan, I. (2012, April). T-SQL Foundations: Thinking in Sets. Why this line of thought is important when addressing querying tasks. Retrieved from http://sqlmag.com/t-sql/t-sql-foundations-thinking-sets

Fritchey, G. (2014). SQL Server Query Performance Tuning (4th ed.).

Kriegel, A. (2011). Discovering SQL: A Hands-On Guide for Beginners.

Microsoft TechNet. Fast Forward-Only Cursors (ODBC). Retrieved April 23, 2015, from https://technet.microsoft.com/en-us/library/aa177106(v=sql.80).aspx

Normalization: A Database Best Practice

The practice of normalization is widely regarded as the standard methodology for logically organizing data to reduce anomalies in database management systems. Normalization involves deconstructing information into various sub-parts that are linked together in a logical way. Malaika and Nicola (2011) state, “…data normalization represents business records in computers by deconstructing the record into many parts, sometimes hundreds of parts, and reconstructing them again as necessary. Artificial keys and associated indexes are required to link the parts of a single record together.” Although there are successively more stringent forms of normalization, best practice involves decomposing information into third normal form (3NF). The higher normal forms protect against anomalies that most practitioners will rarely encounter.

Background

The normalization methodology was the brainchild of mathematician and IBM researcher Dr. Edgar Frank Codd, who developed the technique while working at IBM’s San Jose Research Laboratory in 1970 (IBM Archives, 2003). Early databases employed either inflexible hierarchical designs or collections of pointers to data on magnetic tapes. “While such databases could be efficient in handling the specific data and queries they were designed for, they were absolutely inflexible. New types of queries required complex reprogramming, and adding new types of data forced a total redesign of the database itself” (IBM Archives, 2003). In addition, disk space in the early days of computing was limited and highly expensive. Dr. Codd’s seminal paper “A Relational Model of Data for Large Shared Data Banks” proposed a flexible structure of rows and columns that would help reduce the amount of disk space necessary to store information. Furthermore, this revolutionary new methodology significantly reduced data anomalies. These benefits are achieved by ensuring that each piece of data is stored on disk exactly once.

Normal Forms (1NF to 3NF): 

Normalization is widely regarded as the best practice when developing a coherent, flexible database structure. Adams & Beckett (1997) state that designing a normalized database structure should be the first step taken when building a database that is meant to last. There are seven normal forms, and each higher form includes the requirements of the forms below it: a table in 2nd normal form (2NF) is also in 1st normal form (1NF), but satisfies additional conditions. Normalization best practice holds that a database in 3rd normal form (3NF) will suffice for the widest range of solutions; Adams & Beckett (1997) call 3NF “adequate for most practical needs.” When Dr. Codd initially proposed the concept of normalization, 3NF was the highest form introduced (Oppel, 2011).

A database table has achieved 1NF if it does not contain any repeating groups and its attributes cannot be decomposed into smaller portions (atomicity). Most importantly, all of the data must relate to a primary key that uniquely identifies a respective row. “When you have more than one field storing the same kind of information in a single table, you have a repeating group” (Adams & Beckett, 1997). A higher level of normalization is often needed for tables in 1NF, which are frequently subject to “data duplication, update performance degradation, and update integrity problems” (Teorey, Lightstone, Nadeau & Jagadish, 2011).
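A minimal sketch of a repeating-group fix, using a hypothetical customer phone list:

    -- Violates 1NF: Phone1, Phone2 and Phone3 store the same kind of fact.
    --   dbo.Customer (CustomerID, CustomerName, Phone1, Phone2, Phone3)

    -- 1NF: the repeating group moves to its own table, one phone number per row.
    CREATE TABLE dbo.CustomerPhone
    (
        CustomerID  INT         NOT NULL,
        PhoneNumber VARCHAR(20) NOT NULL,
        CONSTRAINT PK_CustomerPhone PRIMARY KEY (CustomerID, PhoneNumber)
    );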

A database table has achieved 2NF if it meets the conditions of 1NF and all of the non-key fields depend on ALL of the key fields (Stephens, 2009). It is important to note that a 1NF table whose primary key consists of a single column is automatically in 2NF. In essence, 2NF helps data modelers determine whether two tables have been combined into one.
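For example, a sketch of a 2NF violation and its fix, with hypothetical order-line and product tables:

    -- Violates 2NF: ProductName depends only on ProductID, one part of the
    -- composite key (OrderID, ProductID).
    --   dbo.OrderLine (OrderID, ProductID, ProductName, Quantity)

    -- 2NF: ProductName moves to a table keyed on ProductID alone.
    CREATE TABLE dbo.Product
    (
        ProductID   INT         NOT NULL PRIMARY KEY,
        ProductName VARCHAR(50) NOT NULL
    );

    CREATE TABLE dbo.OrderLine
    (
        OrderID   INT NOT NULL,
        ProductID INT NOT NULL,
        Quantity  INT NOT NULL,
        CONSTRAINT PK_OrderLine PRIMARY KEY (OrderID, ProductID)
    );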

A database table has achieved 3NF if it meets the conditions of 2NF and contains no transitive dependencies. “A transitive dependency is when one non‐key field’s value depends on another non‐key field’s value” (Stephens, 2009). If any field in the table depends on another non-key field, the dependent field should be moved into another table.

If, for example, non-key field B is functionally dependent on non-key field A (A → B), then move fields A and B to a new table, with field A designated as that table’s key and retained in the original table to provide the linkage.
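As a sketch, suppose a hypothetical employee table carries both DepartmentID and DepartmentName:

    -- Violates 3NF: DepartmentName depends on DepartmentID, a non-key field (A -> B).
    --   dbo.Employee (EmployeeID, EmployeeName, DepartmentID, DepartmentName)

    -- 3NF: the dependent field moves to a new table keyed on DepartmentID,
    -- which stays in Employee as the link back to the original row.
    CREATE TABLE dbo.Department
    (
        DepartmentID   INT         NOT NULL PRIMARY KEY,
        DepartmentName VARCHAR(50) NOT NULL
    );

    CREATE TABLE dbo.Employee
    (
        EmployeeID   INT         NOT NULL PRIMARY KEY,
        EmployeeName VARCHAR(50) NOT NULL,
        DepartmentID INT         NOT NULL
            REFERENCES dbo.Department (DepartmentID)
    );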

In short, 2NF and 3NF help define the relationship between key and non-key attributes. Kent (1983) states, “Under second and third normal forms, a non-key field must provide a fact about the key, just the whole key, and nothing but the key.” A variant of this definition is typically supplemented with the remark “so help me Codd”.

Benefits:

Adams & Beckett (1997) assert that the normalization method provides benefits such as database efficiency and flexibility, the avoidance of redundant fields, and an easier-to-maintain database structure. Hoberman (2009) adds that normalization gives a modeler a stronger understanding of the business: the normalization process forces many questions to be asked about the data elements so that those elements can be assigned to entities correctly. Hoberman also agrees that data quality improves as redundancy is reduced.

Assessment

Although normalization is considered best practice, many sources advocate that normalization to 3NF is sufficient for the majority of data modeling solutions. Third normal form is deemed sufficient because the anomalies covered by higher forms occur far less frequently. Adams & Beckett (1997) describe 3NF as “adequate for most practical needs,” and Stephens (2009) affirms that many designers view 3NF as the form that combines adequate protection from recurrent data anomalies with relative modeling ease. Levels of normalization beyond 3NF can yield data models that are over-engineered, overly complicated and hard to maintain, and their performance can degrade to a level worse than that of less normalized designs. Hoberman (2009) asserts that, “Even though there are higher levels of normalization than 3NF, many interpret the term ‘normalized’ to mean 3NF.”

There are examples in the data modeling literature where strict adherence to normalization is not advised. Fotache (2006) posited that normalization began as a highly rigorous, theoretical methodology of little practical use to real-world development. Fotache provides the example of an attribute named ADDRESS, which 1NF atomicity would require to be decomposed into its component parts; the ADDRESS data could instead be stored in a single field (violating 1NF) if it is needed only for better person identification or mailing purposes. Teorey, Lightstone, Nadeau, & Jagadish (2011) advise that denormalization should be considered when performance is at stake; it trades increased update cost for lower read cost, depending upon the level of data redundancy introduced. Date (1990) likewise downplays strict adherence to normalization and sets a minimum requirement of 1NF: “Normalization theory is a useful aid in the process, but it is not a panacea; anyone designing a database is certainly advised to be familiar with the basic techniques of normalization…but we do not mean to suggest that the design should necessarily be based on normalization principles alone” (Date, 1990).

Conclusion

Normalization is the best practice when designing a flexible and efficient database structure. The first three normal forms can be remembered by recalling a simple mnemonic. All attributes should depend upon a key (1NF), the whole key (2NF) and nothing but the key (3NF).

The advantages of normalization are many. Normalization ensures that modelers have a strong understanding of the business, it greatly reduces data redundancies and it improves data quality. When there is less data to store on disk, updating and inserting becomes a faster process. In addition, insert, delete and update anomalies disappear when adhering to normalization techniques. “The mantra of the skilled database designer is, for each attribute, capture it once, store it once, and use that one copy everywhere” (Stephens 2009).

It is important to remember that normalization to 3NF is sufficient for the majority of data modeling solutions. Higher levels of normalization can overcomplicate a database design and have the potential to provide worse performance.

In conclusion, begin the database design process by using normalization techniques. For implementation purposes, normalize data to 3NF compliance and then consider whether data retrieval performance necessitates denormalizing to a lower form, accepting the trade-off of increased update cost for lower read cost that denormalization brings.

Glossary:

Delete Anomaly: “A delete anomaly is a situation where a deletion of data about one particular entity causes unintended loss of data that characterizes another entity.” (Stephens, 2009)

Denormalization: Denormalization involves reversing the process of normalization to gain faster read performance.

Insert Anomaly: “An insert anomaly is a situation where you cannot insert a new tuple into a relation because of an artificial dependency on another relation.” (Stephens, 2009)

Normalization: “Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.” (Microsoft Knowledge Base)

Primary Key: “Even though an entity may contain more than one candidate key, we can only select one candidate key to be the primary key for an entity. A primary key is a candidate key that has been chosen to be the unique identifier for an entity.” (Hoberman, 2009)

Update Anomaly: “An update anomaly is a situation where an update of a single data value requires multiple tuples (rows) of data to be updated.” (Stephens, 2009)

Bibliography

Adams, D., & Beckett, D. (1997). Normalization Is a Nice Theory. Foresight Technology Inc. Retrieved from http://www.4dcompanion.com/downloads/papers/normalization.pdf

Fotache, M. (2006, May 1). Why Normalization Failed to Become the Ultimate Guide for Database Designers? Available at SSRN: http://ssrn.com/abstract=905060 or http://dx.doi.org/10.2139/ssrn.905060

Hoberman, S. (2009). Data Modeling Made Simple: A Practical Guide for Business and IT Professionals (2nd ed.). [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=34408.

IBM Archives (2003): Edgar F. Codd. Retrieved from http://www-03.ibm.com/ibm/history/exhibits/builders/builders_codd.html

Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory. Communications of the ACM 26(2). Retrieved from http://www.bkent.net/Doc/simple5.htm

Malaika, S., & Nicola, M. (2011, December 15). Data normalization reconsidered, Part 1: The history of business records. Retrieved from http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/

Microsoft Knowledge Base. Article ID: 283878. Description of the database normalization basics. Retrieved from http://support.microsoft.com/kb/283878

Oppel, A. (2011). Databases Demystified (2nd ed.). [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=72521.

Stephens, R. (2009). Beginning Database Design Solutions. [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=29584.

Teorey, T., Lightstone, S., Nadeau, T., & Jagadish, H. V. (2011). Database Modeling and Design: Logical Design (5th ed.). [Books24x7 version] Available from http://common.books24x7.com.libezproxy2.syr.edu/toc.aspx?bookid=41847.
