Free SQL Data Profiling Tool: Idera SQL Data Profiler

In this tip I will provide an overview of a completely free data profiling tool (at least as of the time of this post) that is easy to use if you work in the Microsoft database stack. You can download IDERA SQL Data Profiler and immediately put it to work performing basic column profiling and analysis. The tool displays a summary of the data contained in a selected table and each of its columns.
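To make the idea of column profiling concrete, here is a minimal T-SQL sketch of the kind of per-column summary such a tool automates; the dbo.Customer table and Email column are hypothetical placeholders.

```sql
-- Minimal column-profile sketch against a hypothetical dbo.Customer table.
-- A profiling tool surfaces this kind of summary for every column automatically.
SELECT
    COUNT(*)                AS row_count,
    COUNT(Email)            AS non_null_count,
    COUNT(*) - COUNT(Email) AS null_count,
    COUNT(DISTINCT Email)   AS distinct_count,
    MIN(LEN(Email))         AS min_length,
    MAX(LEN(Email))         AS max_length
FROM dbo.Customer;
```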

Use this tool on the following systems:

  • Microsoft SQL Server 2008 R2, 2012, 2014, and 2016; 2017 on Windows and Linux (provisional); Express, Standard, and Enterprise editions
  • Microsoft Azure SQL Database
  • Amazon Relational Database Service (RDS)

IDERA SQL Data Profiler has some minor quirks but you can’t beat the price. Check out my review in the video above.

Just remember that data profiling should always be done before you start analyzing a new dataset or designing a new visualization. Always start with the basics.

If you find this type of instruction valuable, make sure to subscribe to my YouTube channel.

Data Profiling with Experian Pandora


Experian Pandora is a robust data profiling and data quality tool that enables users to quickly obtain useful statistics related to their data. Once you load your data into the interface, you can identify statistical anomalies and outliers in the data within seconds. To gain these types of insights, normally I have to write SQL scripts or use the SSIS Data Profiling Tool against SQL Server data. Experian Pandora is much easier to use against data in .csv files or Excel spreadsheets since you can simply drag and drop those items into the interface.
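For comparison, below is a rough sketch of the sort of outlier check I would otherwise have to script by hand in T-SQL; the dbo.Orders table, the OrderAmount column, and the three-standard-deviation threshold are all hypothetical choices for illustration.

```sql
-- Flag order amounts more than three standard deviations from the mean,
-- the kind of outlier check a profiling tool performs for you on load.
WITH stats AS (
    SELECT AVG(OrderAmount * 1.0) AS mean_amt,
           STDEV(OrderAmount)     AS sd_amt
    FROM dbo.Orders
)
SELECT o.OrderID, o.OrderAmount
FROM dbo.Orders AS o
CROSS JOIN stats AS s
WHERE ABS(o.OrderAmount - s.mean_amt) > 3 * s.sd_amt;
```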

A lack of data profiling leads to poor data quality, which in turn leads to inaccurate information and poor business performance. I believe you will find this tool a worthy addition to your data toolbox.

Download the Free Data Profiler: https://www.edq.com/experian-pandora/data-profiling/

If you’re interested in Business Intelligence & Tableau, subscribe and check out all my videos either here on this site or on my YouTube channel.

Anthony B. Smoak

Easily Unpivot Your Data in Excel Using Power Query

Use the unpivot functionality in Power Query (a free Excel add-in) to easily turn your cross-tabbed data into a more normalized structure. The normalized data structure gives you the flexibility to build additional analyses more efficiently.
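If it helps to see the same reshaping in SQL Server terms, here is a hedged T-SQL sketch of what Power Query’s Unpivot Columns command accomplishes through the UI; the dbo.SalesByQuarter table and its quarter columns are hypothetical.

```sql
-- Hypothetical cross-tabbed table: dbo.SalesByQuarter(Region, Q1, Q2, Q3, Q4).
-- UNPIVOT folds the four quarter columns into (QuarterName, SalesAmount) rows,
-- the same normalized shape Power Query's Unpivot Columns command produces.
SELECT Region, QuarterName, SalesAmount
FROM dbo.SalesByQuarter
UNPIVOT (
    SalesAmount FOR QuarterName IN (Q1, Q2, Q3, Q4)
) AS unpvt;
```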

See also:

Download: Power Query Excel Add-In

If you’re interested in Business Intelligence & Tableau, please subscribe and check out my videos either here on this site or on my YouTube channel.

Perform Fuzzy Lookups / Approximate String Matching in Excel

Most likely you have a love/hate relationship with spreadsheets. This highly useful tip will make you fall back in love with Excel. Microsoft does a great job of providing a solid add-in that makes fuzzy lookups relatively easy to perform.

Are there more powerful approximate string matching tools out there? Of course. But if you’re using Excel, this tool should be used before applying more complicated methods.
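As a point of reference, approximate matching can also be roughed out on the database side. The sketch below uses T-SQL’s built-in SOUNDEX-based DIFFERENCE function against two hypothetical name lists; the Excel add-in’s similarity scoring is considerably more sophisticated, but the idea of ranking near-matches instead of demanding exact equality is the same.

```sql
-- DIFFERENCE returns a phonetic similarity score from 0 (no match) to 4 (strong match).
-- Table and column names here are hypothetical placeholders.
SELECT a.CustomerName,
       b.VendorName,
       DIFFERENCE(a.CustomerName, b.VendorName) AS phonetic_score
FROM dbo.CustomerList AS a
JOIN dbo.VendorList   AS b
  ON DIFFERENCE(a.CustomerName, b.VendorName) >= 3
ORDER BY phonetic_score DESC;
```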

Everyone loves visualizations but sometimes you have to roll up your sleeves and clean up the data.

Enjoy!

See also: Get Out of the Spreadsheet Abyss

Download: Fuzzy Lookup Add-In for Excel

If you’re interested in Business Intelligence & Tableau, please subscribe and check out my videos either here on this site or on my YouTube channel.


Get Out of the Spreadsheet Abyss

When an organization turns a blind eye to the proliferation of spreadsheet-based processes, it subjects itself to substantial risks. Have you ever encountered (or enabled) the following scenario?

  • Enterprise reporting is lacking, so a “power user” (i.e., an analyst) is conscripted into cobbling together an ad-hoc, spreadsheet-based process to address a management request;
  • The power user exports data from various business applications and then manipulates the output, typically with macros and formulas;
  • This initial spreadsheet is manually integrated with business unit data from various other unofficial spreadsheets, further distancing the data from the source business application;
  • Multiple tabs are added, charts are generated and data is pivoted (all manually);
  • Management finds value in the report and elevates it to a repeatable process;
  • The original request increases in complexity over time as it requires more manipulations, calculations and rogue data sources to meet management needs;
  • Management doubles down with a new request and the process is repeated;
  • IT proper is NEVER consulted on any of the requests.

The business unit is now supporting a “spreadmart”. The term is considered derogatory in data circles.

“A spreadmart (spreadsheet data mart) is a business data analysis system running on spreadsheets or other desktop databases that is created and maintained by individuals or groups to perform the tasks normally done by a data mart or data warehouse. Typically a spreadmart is created by individuals at different times using different data sources and rules for defining metrics in an organization, creating a fractured view of the enterprise.” [1]

Although the initial intentions of these requests may be reasonable, the business never bothers to approach IT to propose building out a proper data store. Additionally, the conscripted analysts are unhappy with their additional manual responsibilities. Spreadsheet wrangling and manual integration activities shift precious time away from more value-added pursuits such as data analysis and formulating recommendations.

From management’s perspective, why should they pay IT to build out an officially sanctioned solution that will deliver the same information that an internal team of analysts can provide? After all, the spreadmart is responsive (changes can be made quickly) and it’s inexpensive (as opposed to new investments in IT). Eventually, the manual processes are baked into the job description and new hires are enlisted to expand and maintain this system. The business sinks deeper and deeper into the spreadsheet abyss.

The short term rewards of the spreadmart are generally not worth the longer term risks.

Risks:

“It’s not an enterprise tool. The error rates in spreadsheets are huge. Excel will dutifully average the wrong data right down the line. There’s no protection around that.” [2]

The spreadmart can also be characterized as a “data shadow” system, to borrow a term from The Business Intelligence Guidebook, authored by Rick Sherman. Here are the problems associated with “data shadow” systems, as paraphrased from the book [3]:

  • Productivity is severely diminished as analysts spend their time creating and maintaining an assortment of manual data processes;
    • I would add that team morale suffers as well;
  • Business units have daggers drawn as they try to reconcile and validate whose numbers are “more correct”;
    • As a result of a new silo, the organization has compounded its data governance issues;
  • Data errors can (and will) occur as a result of manual querying, integrating and calculating;
  • Data sources can change without notice and the data shadow process is not on IT’s radar for source change notifications;
  • Embedded business logic becomes stagnant in various complex macros or code modules because they are hidden or simply not understood by inheritors;
  • The solution doesn’t scale with increasing data volume or number of users;
  • An audit trail to ensure control and compliance does not exist;
    • “It is often ironic that a finance group can pass an audit because the IT processes it uses are auditable, but the data shadow systems that they use to make decisions are not, and are ignored in an internal audit”;
  • Process and technical documentation does not exist, which impairs the ability to update the solution.

Additionally, these processes are not backed up with any regularity, multiple versions may exist on multiple users’ desktops, and anyone can make changes to the embedded business logic. The bottom line is that the business is potentially making decisions based upon erroneous data, which can have serious financial and reputational impacts.

“F1F9 estimated that 88 percent of all spreadsheets have errors in them, while 50 percent of spreadsheets used by large companies have material defects. The company said the mistakes are not just costly in terms of time and money – but also lead to damaged reputations, lost jobs and disrupted careers.” [4]

Mitigation:

There is nothing wrong with the business responding to an emerging issue by requesting a one-time, ad-hoc solution. The highest risks emerge when the ad-hoc process is systematized, repeatable ad-hoc processes proliferate unchecked, and IT is never involved in any of the discussions.

IT proper is highly effective when it is allowed to merge, integrate and validate data. Business unit analysts and spreadsheets should be out of the collection and integration game for repeatable management reporting. Analysts should focus on analysis, trending and interpretation. Too often analysts get tossed into productivity traps involving hours of cutting, pasting and linking to someone else’s spreadsheet for data integration in order to meet management demands.

When IT is brought into the discussion, it must not point fingers but rather seek to understand why the shadow system was established in the first place. Likewise, the business unit should not point fingers at IT for being unresponsive or constrained by budget. Once the peace treaty has been established, IT should analyze and reverse-engineer the cobbled-together integration processes and data sources (which admittedly is a time-consuming effort) and deliver more controlled and scalable processes.

The new data integration processes should culminate in loading data to a business-specific, validated, central data mart. The central mart doesn’t try to impose an unfamiliar tool upon the business but rather automates integration activities and references more “trustworthy” data sources. Spreadsheets can still be used by analysts to access the data, but the analysts are no longer expected to be manual aggregators using a sub-standard ETL tool.

“Go with the flow. Some users will never give up their spreadsheets, regardless of how robust an analytic environment you provide. Let them keep their spreadsheets, but configure them as front ends to the data warehouse. This way, they can use their spreadsheets to access corporate-approved data, metrics and reports. If they insist on creating new reports, provide an incentive for them to upload their reports to the data warehouse instead of distributing them via e-mail.” [5]

Have I ever had to “get out” of a situation where data governance was lacking and burned-out, morale-depleted analysts spent all of their time collecting and integrating spreadsheets to maintain an inefficient spreadmart?

I’ll never tell!

References:

[1] https://en.wikipedia.org/wiki/Spreadmart

[2] http://ww2.cfo.com/analytics/2012/01/imagine-theres-no-excel/

[3] Sherman, R. (2015). Business intelligence guidebook: from data integration to analytics.

[4] http://www.cnbc.com/id/100923538

[5] https://www.information-management.com/news/the-rise-and-fall-of-spreadmarts

B.I. Basics: Create an SSIS Data Profiling Task In SQL Server

Data profiling is necessary when trying to gain an understanding of a given data set. A data profiling assessment should begin before any reporting or application development work starts. My video demonstrates how to create a basic SSIS Data Profiling Task using SQL Server Data Tools.
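For a sense of what one of these profiles boils down to, here is a minimal hand-rolled T-SQL equivalent of a candidate key check, one of the profile types the task offers; the dbo.Customer table and CustomerNumber column are hypothetical.

```sql
-- If the distinct count of a column equals the table's row count, the column
-- uniquely identifies each row and is therefore a candidate key.
SELECT COUNT(*)                       AS row_count,
       COUNT(DISTINCT CustomerNumber) AS distinct_values,
       CASE WHEN COUNT(*) = COUNT(DISTINCT CustomerNumber)
            THEN 'Candidate key'
            ELSE 'Not a candidate key'
       END AS assessment
FROM dbo.Customer;
```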

According to the DAMA Guide to the Data Management Body of Knowledge:

“Before making any improvements to data, one must be able to distinguish between good and bad data…. A data analyst may not necessarily be able to pinpoint all instances of flawed data. However, the ability to document situations where data values look like they do not belong provides a means to communicate these instances with subject matter experts, whose business knowledge can confirm the existences of data problems.”

Here is additional information direct from Bill Gates’s former startup outfit regarding the types of data profiling tasks available in SSIS: https://msdn.microsoft.com/en-us/library/bb895263.aspx

If you’re interested in Business Intelligence & Tableau, please subscribe and check out my videos either here on this site or on my YouTube channel.

The Data Quality Face-Off: Who Owns Data Quality, the Business or IT?

This article is also posted on LinkedIn.

Data is the lifeblood of organizations. By now you’ve probably heard the comparative slogans “data is the new oil,” “data is the new water,” or “data is the new currency.” Type “data is the new” into the Google search bar and the first result delivered is “data is the new bacon.” I’m not sure how apt this slogan is, except to say that both can be highly fulfilling.

With the exception of a well-known Seattle-based retailer, most enterprises experience substantial data quality issues, as data quality work is typically an exercise in “pass the buck.” An Information Week article shrewdly commented on the risks associated with a lack of data quality:

“There are two core risks: making decisions based on ‘information fantasy,’ and compliance. If you’re not representing the real world, you can be fined and your CFO can be imprisoned. It all comes down to that one point: If your systems don’t represent the real world, then how can you make accurate decisions?” [3]

The Handbook of Data Quality: Research and Practice has reported that 60% of enterprises suffer from data quality issues, 70% of manufacturing orders contained poor quality data and that poor data quality management costs companies roughly $1.4 billion every year [4].

Organizational Hamilton/Burr-style face-offs occur in which IT and the business are at loggerheads over the state of data quality and its ownership. The business typically believes that since data is co-mingled with the systems IT already manages, IT should own any data issues. Given the high costs of poor data quality just cited, and the risk of nimble disrupters using data more efficiently to attack incumbents’ market share, both IT and the business need to be on the same team with regard to data quality, for the organization’s sake.

“The relationship between IT and the business is a source of tension in many organizations, especially in relation to data management. This tension often manifests itself in the definition of data quality, as well as the question of who is responsible for data quality.” [5]

Anecdotally, IT units do not have the desire to be held responsible for “mysterious” data and/or systems that they had no hand in standing up. In my opinion, the enterprise IT mindset is to make sure the data arrives in the consolidated Enterprise Data Warehouse or other centralized data repository; and if downstream users don’t raise concerns about data quality issues, all the better for IT. Garbage-In, Garbage-Out. If the checksums or record counts from source to target match, then it’s time to call it a day.

The developer or analyst mindset is to immediately dive in and start building applications or reports, with the potential to deliver sub-optimal results because the data was misunderstood or misinterpreted as the “golden copy.” Up-front data profiling isn’t in the equation.

Gartner has suggested that the rise of the Chief Data Officer (particularly in the banking, government and insurance industries) has been beneficial in helping both IT and the business manage data [2]. The strategic use of a CDO has the potential to free up the CIO and the enterprise IT organization so they can carry on with managing infrastructure and maintaining systems.

However, most experts will agree that the business needs to define what constitutes high-quality, acceptable data and that the business should “own the data.” Yet IT is typically the “owner” of the systems that house such data. Thus, a mutually beneficial organizational relationship would involve IT having a better understanding of data content so as to ensure a higher level of data quality [5].

From a working-together perspective, I find this matrix from Allen & Cervo (2015) helpful in depicting the risks that arise from one-sided data profiling activities conducted without business context, and vice versa. It illustrates how both “business understanding and data profiling are necessary to minimize any risk of incorrect assumptions about the data’s fitness for use or about how the data serves the business” [1]. Although originally offered in a Master Data Management context, I find the example fitting for illustrating how business and IT expertise should work together.

From: Allen, M. & Cervo, D. (2015). Multi-Domain Master Data Management: Advanced MDM and Data Governance in Practice. Morgan Kaufmann Publishers. Chapter 8 – Data Integration.
  • From the bottom left quadrant, low business knowledge and inadequate data profiling activities leave the enterprise in a less than optimal position. This is not the quadrant an organization wants to languish within.
  • The top left quadrant illustrates that business context is high but “knowledge” is unsubstantiated because of the lack of understanding of data quality via profiling exercises. Cue quality guru W. Edwards Deming stating, “Without data, you’re just a person with an opinion.”
  • The bottom right quadrant illustrates the opposite problem where data profiling without business knowledge doesn’t yield enough context for meaningful analyses. Cue a “bizarro” W. Edwards Deming stating, “Without an opinion, you’re just a person with data.”
  • The “Goldilocks” quadrant in the upper right yields appropriate understanding of data quality and the necessary context in which to conduct meaningful analyses.

The more engaged the business and IT are in finding common ground with respect to understanding existing data and solving data quality issues, the better positioned the organization is to avoid regulatory fines, departmental strife, threats from upstart firms and overall bad decision making. Refuse to become complacent in the data space and give data the attention it deserves; your organization’s survival just may depend upon it.

References:

[1] Allen, M. & Cervo, D (2015) Multi-Domain Master Data Management: Advanced MDM and Data Governance in Practice. Morgan Kaufmann Publishers. Chapter 8 – Data Integration.

[2] Gartner, (January 30, 2014). By 2015, 25 Percent of Large Global Organizations Will Have Appointed Chief Data Officers
http://www.gartner.com/newsroom/id/2659215

[3] Morgan, Lisa. (October 14, 2015). Information Week, 8 Ways To Ensure Data Quality. Information Week.
http://www.informationweek.com/big-data/big-data-analytics/8-ways-to-ensure-data-quality/d/d-id/1322239

[4] Sadiq, Shazia (Ed.) (2013). Handbook of Data Quality: Research and Practice: Cost and Value Management for Data Quality.

[5] Sebastian-Coleman, L. (2013). Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework. Morgan Kaufmann Publishers. Chapter 2 – Data, People, and Systems.

Picture Copyright: wavebreakmediamicro / 123RF Stock Photo