10 Items to Know When Starting A New Data Project

Are you a data professional looking to start a new data project?

Then you need to review my 10-point checklist to make sure you’re on the right track. Starting a new data project can be overwhelming. But don’t worry, with my 20 years of experience, I’m here to guide you through it.

Typically when I start to perform a new data related task or analysis for a project, I have to make sure that I meet the expected objectives, which often include identifying patterns, trends, and insights that can be used to drive business decisions.

10 Point Checklist

Point 1: First things first, you need to understand the nature of the deliverable that’s being asked for. Is it a new report, database table, column, data visualization, calculation, or a change to any of the above? In a similar fashion you also need to understand the technologies in play that you have to work with. This could be anything from Tableau, Power BI, SQL Server, Oracle, Teradata or even Microsoft Access (yes, people still use this tool).

Point 2: It’s crucial to know the desired delivery time frame for your project. You don’t want to end up with a longer timeline than what the project manager or client had in mind. Communication is key in this situation.

Point 3: Who is the intended audience for this deliverable? If it’s for an executive audience, you may need to roll the numbers up and take out some detail. If it’s for an analyst or operational audience, you may want to leave in more detail.

Point 4: How much historical data is required? What is the anticipated volume of data that your deliverable is going to generate? Don’t get caught in a situation where your solution can’t handle the trending analyses for a 2 year time frame when you only pulled data for the last 6 months.

Point 5: Understand the volume of data that your solution will generate. For example, a 5 million row output is not conducive to a 100% Excel approach. You will definitely be in the land of database analyses. However you may later present the data at an aggregated level (see point 3) via Excel but hopefully using a real data visualization tool .

Point 6: You need to understand if there’s any Personally Identifiable Information (PII) or sensitive data that you need to access in order to carry out the request. This could include social security numbers, passport numbers, driver’s license numbers, or credit card numbers.

Point 7: It’s important to understand the business processes behind the request. As data people, we tend to focus only on the data piece of the puzzle, but understanding more about the relevant business process can help you deliver the better results for your end users.

Point 8: Try to find and understand any relevant KPIs associated with the business processes on which your data project/task is affecting.

Point 9: Perform data profiling on your datasets! This can’t be stated enough. Profiling leads to understanding data quality issues and can help lead you to the source of the issues so they can be stopped.

Here are a few data profiling videos I’ve created over the years to give you a sense of data profiling in action.

Point 10: Understand how your solution will impact existing business process. By changing a column or calculation, how does this impact upstream or downstream processes? Keep your email inbox clean of those headache emails that are going to ask why the data looks different than it did last week. Most likely there was not a clear communication strategy to inform everyone of the impact of your changes.

Bonus Considerations:

Here are a few bonus considerations since you had the good fortune of reading this blog post and not just stopping at the video.

Bonus Point 1: Consider any external factors that could impact your data project. For example, changes in regulations can impact the data that you collect, analyze, and use. If the government imposes stricter regulations on data privacy (see point 6 above), you may need to change your data sources or analysis methods to comply with these regulations.

Bonus Point 2: Consider internal organizational politics when starting on a project. If you work in a toxic or siloed organization (it happens), access to data can be a challenge. For example, if the marketing department controls customer data, accessing that data for a sales analysis project may be challenging due to internal strife and/or unnecessary burdensome roadblocks.

Internal politics can also lead to potential conflicts of interest, such as when stakeholders have different goals or agendas. For example, if your data analyses could impact a department’s budget, that department may have an incentive to influence your work outcome to their advantage (or try to discredit you or your work by any means necessary).

Bonus Point 3: Finally, make sure to document everything. This includes the business requirements, technical requirements, saved emails, and any changes that were made along the way.

When I started my first office position as an intern at a well known Fortune 500 company, my mentor told me the first rule of corporate life was to C.Y.A. I’m sure you know what that means to cover. Having solid documentation of your work and an email trail for decisions made along the way can keep you out of hot water.

Conclusion

And there you have it, my 10-point checklist for starting a new data project. By following these steps, you’ll be well on your way to delivering high-quality results. Don’t forget to like and subscribe for more data-related content!

I appreciate everyone who has supported this blog and my YouTube channel via merch. Please check out the logo shop here.

Stay in contact with me through my various social media presences.

Keep doing great things with your data!

Anthony B. Smoak

Free SQL Data Profiling Tool: Idera SQL Data Profiler

In this tip I will provide an overview of a completely free data profiling tool (at least as of the time of this post) that is easy to use if you are in the Microsoft database stack. You can download the IDERA SQL Profiling tool and immediately put it to work to perform basic column profiling and analyses. The tool will display a summary of the data contained in a selected table and each of its columns. 

Use this tool on the following systems:

  • Microsoft SQL Server: 2008 R2, 2012, 2014, 2016; 2017 Windows & Linux (provisional); Express, Standard, Enterprise editions
  • Microsoft Azure SQL Database
  • Amazon Relational Database Service (RDS)

IDERA SQL Data Profiler has some minor quirks but you can’t beat the price. Check out my review in the video above.

Just remember that data profiling should always be done initially before you start analyzing a new dataset or designing a new visualization. Always start with the basics.

If you find this type of instruction valuable make sure to subscribe to my Youtube channel.

Data Profiling with Experian Pandora

 

Experian Pandora is a robust data profiling and data quality tool that enables users to quickly obtain useful statistics related to their data. Once you load your data into the interface, you can identify statistical anomalies and outliers in the data within seconds. To gain these types of insights, normally I have to write SQL scripts or use the SSIS Data Profiling Tool against SQL Server data. Experian Pandora is much easier to use against data in .csv files or Excel spreadsheets since you can simply drag and drop those items into the interface.

A lack of data profiling leads to poor data quality which leads to inaccurate information and poor business performance. I believe you will find this tool a worthy add to your data toolbox.

Download the Free Data Profiler: https://www.edq.com/experian-pandora/data-profiling/

If you’re interested in Business Intelligence & Tableau subscribe and check out all my videos either here on this site or on my Youtube channel.

Anthony B. Smoak

Perform Fuzzy Lookups / Approximate String Matching in Excel

Most likely you have a love/hate attitude towards spreadsheets. This highly useful tip will make you fall back in love with Excel. Microsoft does a great job at providing a solid add-in that makes fuzzy lookups relatively easy to perform.

Are there more powerful approximate string matching tools out there? Of course. But if you’re using Excel, this tool should be used before applying more complicated methods.

Everyone loves visualizations but sometimes you have to roll up your sleeves and clean up the data.

Enjoy!

See also: Get Out of the Spreadsheet Abyss

Download: Fuzzy Lookup Add-In for Excel

If you’re interested in Business Intelligence & Tableau please subscribe and check out my videos either here on this site or on my Youtube channel.

 

B.I. Basics: Create an SSIS Data Profiling Task In SQL Server

Data Profiling is necessary when trying to gain an understanding of a given data set. A data profiling assessment should begin before any reporting or application development work begins. My video will demonstrate how to create a basic SSIS Data Profiling Task using SQL Server Data Tools.

According to the DAMA Guide to the Data Management Body of Knowledge:

“Before making any improvements to data, one must be able to distinguish between good and bad data…. A data analyst may not necessarily be able to pinpoint all instances of flawed data. However, the ability to document situations where data values look like they do not belong provides a means to communicate these instances with subject matter experts, whose business knowledge can confirm the existences of data problems.”

Here is additional information direct from Bill Gates’s former startup outfit regarding the types of data profiling tasks available in SSIS: https://msdn.microsoft.com/en-us/library/bb895263.aspx

If you’re interested in Business Intelligence & Tableau please subscribe and check out my videos either here on this site or on my Youtube channel.