As a reader of my blog, no doubt you have wandered over to my YouTube Channel at some point over these many years. For that, I really want to thank you.
This is a thank-you and appreciation post for everyone who has subscribed, shared a video with a colleague, or left an encouraging comment on my YouTube channel. It’s certainly been a marathon, but the channel has finally reached 25,000 subscribers and is approaching 4 million total views!
When I first started, even 250 subscribers seemed out of reach, as that milestone took over a year to achieve.
I originally started the channel as a means to sharpen my own data analytics skills; my videos weren’t intended for anyone other than myself to view for reference purposes. Yet the channel has evolved into a platform for teaching data concepts to viewers around the globe.
My workflow involves finding a public dataset, experimenting with a new SQL / Excel / Power BI technique or reverse engineering a complex Tableau dashboard that intrigues me (definitely not easy), meticulously documenting each step, teaching those steps clearly, and then recording and editing the final product.
Conventional wisdom says the channel should niche down on one tool, but I have more fun learning and teaching the data tools most widely in use.
The channel at times requires a significant amount of my weekend time, and I’m grateful that so many of you appreciate this work.
You’ve helped transform what began as a personal learning project into something far more meaningful, and for that I thank you!
As always, get out there and do some great things with your data!
I had been toying around with writing SQL code to redraft one of the most talent-rich drafts in NBA history (my apologies to the 1984 and 2003 NBA drafts) and sharing the results on my YouTube channel. Fortunately, I had a serendipitous coding session with my good buddy Terence down in Austin, who introduced me to a cutting-edge AI tool: Windsurf AI. That initial session fundamentally changed how I think about analytics and AI-assisted coding.
If you’re an NBA, data and SQL enthusiast, then the 2018 NBA draft represents a fascinating case study for retrospective analysis.
We’re talking about a class that produced MVP-caliber talents like Luka Dončić (originally 3rd), Trae Young (5th), and Shai Gilgeous-Alexander (11th). Not to mention Ironman style players like Mikal Bridges (who has never missed an NBA game in his career thus far) and the unexpected emergence of undrafted gems like Duncan Robinson.
Seven years of career data later, we have enough statistical evidence to ask the crucial question: If NBA front offices knew then what we know now, how would this draft unfold?
What I loved about having Windsurf AI as my co-pilot (no relation to Microsoft) is that it handled the heavy lifting while I focused on the strategic analysis. The term “vibe coding” has entered the lexicon recently; in this case it meant letting AI instantly translate my analytical vision into executable T-SQL code.
When using any generative AI tool, the key is feeding detailed, comprehensive prompts to the models (the more context you give these models, the better your output becomes).
The components of my comprehensive prompt included the following (you can watch the full video if you want to hear me recite the entire prompt):
-- 1. Role definition (NBA GM + Statistical Expert + T-SQL Specialist)
-- 2. Clear objectives (redraft based on career performance)
-- 3. Data specifications (column names, data types, constraints)
-- 4. Business rules (handle undrafted players as position 0)
-- 5. Output requirements (normalized scores, impact analysis)
I settled on a redraft approach based on a Career Impact Scoring methodology (a rough T-SQL sketch of the scoring logic follows the breakdown below):
Impact Score Formula:
Points: 1.0x multiplier
Rebounds: 1.2x (slightly more valuable)
Assists: 1.5x (playmaking premium)
Steals/Blocks: 3.0x (defensive impact recognition)
Field Goal %: 100x multiplier
Free Throw %: 50x multiplier
Three-Point %: 100x multiplier
Games Played Bonus: Durability factor
Turnover Penalty: Subtracted from total score (my apologies Trae Young)
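For the curious, here is a minimal sketch of how that scoring might translate into T-SQL. The table name, column names, and the durability weight below are placeholders I’m using for illustration, not the exact schema or weights from the video.

-- Hypothetical sketch of the Career Impact Score; names and the durability weight are placeholders
SELECT
    player_name,
    original_pick,
    (total_points    * 1.0)
  + (total_rebounds  * 1.2)
  + (total_assists   * 1.5)
  + (total_steals    * 3.0)
  + (total_blocks    * 3.0)
  + (fg_pct          * 100)
  + (ft_pct          * 50)
  + (three_pt_pct    * 100)
  + (games_played    * 0.5)   -- durability bonus (illustrative weight)
  - (total_turnovers * 1.0)   -- turnover penalty
    AS impact_score
FROM dbo.player_career_stats
ORDER BY impact_score DESC;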
Re-Draft Results:
Without spoiling the complete breakdown, I’ll share just enough to demonstrate the methodology’s power:
The most stunning result? Jalen Brunson emerged as the 5th overall pick in our AI-driven redraft, a player who originally went 33rd overall to Dallas.
If Brunson jumped from 33rd to 5th, imagine what happened to the other lottery picks. Some players who went in the top 10 dropped dramatically, while others who were considered “reaches” at the time proved to be incredible value picks.
Player Categorizations:
Although the AI initially suggested several player categories based on their redrafted positions, I prompted the model to incorporate the specific logic and code I desired into the overall script (because I know how to write SQL). A rough sketch of that CASE logic appears after the list below.
Category Breakdowns:
Massive Undrafted Steals: Undrafted players who earned a first-round redraft position
Undrafted Steals: Undrafted players who earned a second-round redraft position
Huge Steals: Late picks who jumped into the top 10
Steals: Later picks who jumped into the lottery
Busts: Top 5 picks who fell to the 21st spot or later
Major Busts: Top 10 picks who fell out of the first round entirely
Expected Value: Players who performed roughly as anticipated
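Here is a rough sketch of what that CASE logic might look like in T-SQL. The pick-number thresholds and the table/column names are my own placeholder assumptions rather than the exact cutoffs from the final script (undrafted players are position 0, per the business rules above).

-- Hypothetical categorization sketch; thresholds and names are placeholders
SELECT
    player_name,
    original_pick,   -- 0 = undrafted
    redraft_pick,
    CASE
        WHEN original_pick = 0   AND redraft_pick <= 30 THEN 'Massive Undrafted Steal'
        WHEN original_pick = 0   AND redraft_pick <= 60 THEN 'Undrafted Steal'
        WHEN original_pick <= 10 AND redraft_pick > 30  THEN 'Major Bust'
        WHEN original_pick <= 5  AND redraft_pick >= 21 THEN 'Bust'
        WHEN original_pick > 20  AND redraft_pick <= 10 THEN 'Huge Steal'
        WHEN original_pick > 14  AND redraft_pick <= 14 THEN 'Steal'
        ELSE 'Expected Value'
    END AS draft_category
FROM dbo.redraft_results;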
The Broader Implications for Data Professionals
For me, Windsurf served as a force multiplier, not a replacement for my SQL knowledge; it amplified skills I already possess. I still had to understand basic database concepts and a bit of query optimization in order to craft effective prompts and validate the model’s outputs. The difference between mediocre and exceptional AI-generated code lies in prompt quality, so put in the upfront time and effort to craft a thorough initial prompt.
The Future of Analytics: Collaboration, Not Competition
In my opinion, this project represented something larger than a fun NBA analysis; it’s a glimpse into the future of data analysis work. We’re entering an era where analysts spend less time wrestling with syntax and more time on strategic thinking, hypothesis formation, domain knowledge, storytelling, and insight communication.
We’re in this sweet spot right now where these types of agentic AI tools are assisting data analyses as opposed to fully completing data analyses. The key (right now, before the singularity) is embracing this transition rather than resisting it so you can remain ahead on the knowledge curve. Start experimenting with AI-assisted coding tools now, while the learning curve is manageable and the competitive advantage is still emerging.
Your Next Steps:
If you’re intrigued by the possibilities of AI-enhanced analytics, I encourage you to:
Download Windsurf AI and experiment with your own datasets
Practice prompt engineering with progressively complex analytical requests
Combine your domain expertise with AI capabilities for unique insights
Share your experiments to bolster your reputation and build your data portfolio
The full video walkthrough demonstrates every step of this process, from initial prompt crafting through final GitHub deployment, and reveals which players claimed the top 4 spots in our redraft.
As always, get out there and do some great things with your data!
The Wait is Over. Watch this Tableau Dashboard Container Layout Video!
This video was a long time coming. If you’ve ever dealt with layout containers in Tableau (which are about as intuitive as a Thomas Pynchon novel), you know they can leave you scratching your head, especially if you’re new to the tool. And yes, I do own a copy of Gravity’s Rainbow.
Luckily for you, I was up at 2:30 a.m. on a Sunday recording the second part of my popular Tableau dashboard tutorial, because that’s how passionate (and stubborn) I am about providing you with educational videos of value. Quality over quantity!
Watch the preview video above and you’ll see a step-by-step breakdown of how I laid out the visual components for an advanced Tableau dashboard using layout containers. This is part two of the series, and while you can absolutely follow along here, the full experience (with all the dashboard elements, charts, and finishing touches) is available to members of my YouTube channel.
I demonstrate a floating container technique that I learned from Curtis Harris’s excellent video on the subject (look up “Things I Know About Tableau Layout Containers” on YouTube).
Before I forget, I want to give a shout out to Dmitry Shirikov, whose original design I reverse engineered and rebuilt (with his permission), and to Murilo Cremon for the overall inspiration.
Let’s get into what I actually did.
What You’ll Miss If You Don’t Watch the Full Video
There’s much that you won’t fully grasp until you see it in action in the video. Things like:
How I embedded a YouTube link in an image element
The way I sequence container stacking to prevent the annoying TILE factor
How to work around Tableau’s quirks when resizing and aligning objects
What I do when Tableau auto-adds objects I don’t want
And most importantly, how all the complex pieces come together to form a professionally polished dashboard
If you’ve been struggling with dashboard layouts, or you’ve watched my first Tableau dashboard build and you’ve been waiting forever for part 2, here’s what I recommend:
Rewatch Part 1 of this tutorial series (it’s free on my channel).
Then consider becoming a member. Membership supports the channel and helps me keep making high-quality Tableau content. Tutorials like this take hours of research, recording, editing, and production. I don’t mind putting in the work because I know how much value it provides (thank you for all your wonderful comments; I read them all), but your support goes a long way.
Thanks for rocking with me all these years; now get out there and do some great things with your data!
-Anthony Smoak
I appreciate everyone who has supported this blog and my YouTube channel.
So I recently had the opportunity to speak about building your online presence during a webinar. I firmly believe that an online presence can open doors for you, especially if you are looking to break into the data world. I consider myself lucky in that I started my data journey many years ago, before it was cool and before it became so difficult to land a starting opportunity.
If I were starting from scratch today, I would definitely use LinkedIn to raise my visibility and showcase my strengths to potential employers.
Step 1: You Need a LinkedIn Profile: Full Stop
At its core, LinkedIn is the digital equivalent of a modern résumé, but with a far broader reach. If you don’t have a profile, you’re invisible in the professional world. An optimized LinkedIn profile is a must, especially in a competitive field like data analytics. When recruiters or potential employers search for candidates with your skill set, your profile will be a key factor in whether or not they reach out to you.
Actionable Tip: Make sure your profile is complete, with a compelling headline, clear summary, and a detailed work history that includes relevant keywords (think “SQL queries,” “data modeling,” “performance tuning”) to increase discoverability. Don’t underestimate the power of a well-written “About” section; this is your opportunity to tell your professional story in a way that resonates with both humans and algorithms.
For example, if you look at my LinkedIn profile, you’ll see that my about section is full of verbiage describing my skills, education, online presence and current workplace.
Real-World Insight: I had a recruiter contact me for a previous job opportunity simply because my profile contained the right keywords and was well organized. I can’t stress this point enough: the right visibility can lead to unexpected opportunities.
Step 2: Don’t Leave Your Profile Alone: Engage with Content
Once you’ve set up your profile, you need to stay active. It’s tempting to think that just having a polished profile is enough, but to really stand out, you must engage. Start by sharing informative articles, insights, and news relevant to your industry. Sharing curated content is valuable, but creating your own posts and/or commentary will increase your visibility.
Actionable Tip: Instead of just hitting “reshare” on a post you found useful, add your thoughts. Comment on what you found interesting about the article and how it connects to your work. Reshares without lead-in commentary don’t get as much traction on LinkedIn.
Relatable Anecdote: The idea of putting yourself out there can feel daunting, especially if you’re more introverted. But think of it as sharing your knowledge and expertise with others in a format that is helpful and informative. It doesn’t need to be an elaborate blog post; even a quick tip or a link with a few insightful lines can go a long way.
Step 3: Create Your Own Content: Yes, You Can Do It
I believe this is where you can begin to truly differentiate yourself. If you follow my LinkedIn profile, you’ll see multiple videos of me sharing knowledge on different data analytics tools. I guarantee that you have knowledge others in your field can benefit from. Remember that your experiences and expertise are unique and valuable.
We seem to take for granted that our personal knowledge is obvious or widely known by others, but believe me it is not. Even if someone is sharing about a similar topic, they can’t share information with your unique point of view!
Actionable Tip: Start small. Post a brief tip, or share an interesting challenge you faced during a recent project and how you solved it. This doesn’t have to be a huge production; just a couple of lines can spark engagement and show that you’re actively contributing to the field.
Overcoming the Introvert’s Challenge
For those of you who consider yourselves introverts, don’t worry, creating content doesn’t require you to be an extroverted social butterfly. In fact, many professionals, myself included, are introverted (yes it’s true). The key is to focus on sharing knowledge, not on putting on a performance. A great number of my YouTube videos don’t even show my face. I just have good screen capture software and a quality XLR microphone.
The Bottom Line: Don’t just have a LinkedIn profile; use it as a tool for professional growth. Start by optimizing your profile, sharing relevant content, and eventually creating your own posts to showcase your expertise.
Remember that visibility brings opportunity. Start small, stay consistent, and watch as your professional presence grows, benefiting both you and your employer in the process.
Until next time.
I appreciate everyone who has supported this blog and my YouTube channel.
I decided to break down some quick tips to help you write faster, more efficient SQL queries.
1. Limit Your Use of SELECT *
While SELECT * seems like a quick and easy solution, it’s a performance killer, especially when you’re dealing with large tables. By requesting all columns, the database has to load unnecessary data, which slows down execution as your dataset grows.
Tip: Be intentional about the columns you select. Only retrieve the data you need. This not only improves performance but also minimizes the load on your system.
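A quick before-and-after, using hypothetical table and column names:

-- Pulls every column, whether the analysis needs it or not
SELECT * FROM dbo.orders;

-- Pulls only the columns the analysis actually uses
SELECT order_id, customer_id, order_total
FROM dbo.orders;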
2. Avoid DISTINCT and UNION When Unnecessary
Checking for uniqueness often comes at a cost. Both DISTINCT and UNION typically rely on sorting (or hashing) to remove duplicates, which is inherently time-consuming. If you don’t explicitly need unique records, avoid them.
Instead of UNION, use UNION ALL. While UNION filters out duplicates, UNION ALL keeps all your records and combines the result sets faster because it bypasses the duplicate-removal step.
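For example (the table names are hypothetical), the second query below returns the same combined rows without the duplicate-removal overhead:

-- UNION de-duplicates the combined result (extra sorting/hashing work)
SELECT customer_id FROM dbo.online_sales
UNION
SELECT customer_id FROM dbo.store_sales;

-- UNION ALL simply appends the result sets (faster when duplicates are acceptable)
SELECT customer_id FROM dbo.online_sales
UNION ALL
SELECT customer_id FROM dbo.store_sales;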
3. Use Sargable Operators in the WHERE Clause
A sargable query (short for “search argument able”) is one that allows the SQL optimizer to take advantage of indexes, leading to faster data retrieval. When you use operators that are sargable (e.g., =, >, <=, BETWEEN), the query can utilize an index to find data quickly instead of performing a full table scan, which can be slow and resource-intensive.
For example, a query like WHERE customer_id = 12345 is sargable, whereas WHERE YEAR(order_date) = 2023 is not, because the database can’t use an index efficiently for that condition.
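Here is a sketch of that same filter written both ways (hypothetical table and columns); the date-range version lets the optimizer seek an index on order_date:

-- Non-sargable: the function wrapped around order_date blocks index usage
SELECT order_id FROM dbo.orders
WHERE YEAR(order_date) = 2023;

-- Sargable: an open-ended date range can use an index on order_date
SELECT order_id FROM dbo.orders
WHERE order_date >= '2023-01-01'
  AND order_date <  '2024-01-01';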
4. Use Indexes Appropriately
Indexes are extremely useful when it comes to speeding up query execution. Without them, the database is forced to scan the entire table to retrieve the data you need.
However, it’s essential to use indexes judiciously. Too many indexes can slow down insert and update operations.
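As a simple illustration (hypothetical table), a single nonclustered index on a column you constantly filter by can turn a full table scan into an index seek:

-- Hypothetical: supports frequent filtering by sale_date
CREATE NONCLUSTERED INDEX IX_sales_data_sale_date
ON dbo.sales_data (sale_date);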
Real-World Example: Sales Data Query Optimization
Imagine you’re a data analyst working with sales data, trying to identify the top-selling products for the last quarter. A poorly optimized query might look like this:
SELECT * FROM sales_data WHERE YEAR(sale_date) = 2023;
This query retrieves all columns (SELECT *), performs a non-sargable operation on sale_date, and might even require a DISTINCT if there are duplicates. All this can be painfully slow.
Instead, a more optimized query would look like this:
SELECT product_id, SUM(sales_amount)
FROM sales_data
WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY product_id
ORDER BY SUM(sales_amount) DESC;
In this version:
We select only the necessary columns (product_id, sales_amount).
We use a sargable BETWEEN operator on sale_date.
We aggregate data and order it efficiently to find the top products.
The Takeaway: Stay Sharp, Stay Efficient
Writing efficient SQL queries isn’t just about speed; it’s about making your data retrieval processes more scalable and reliable.
Be intentional about your query design: use SELECT * sparingly, avoid unnecessary operations like DISTINCT and UNION, and always keep indexing and sargable operations in mind.
Here is my hot take: it can be very difficult to write SQL.
Conventional data punditry suggests that SQL isn’t difficult to learn. Of course it’s easy to write SQL that returns data; however, it can be very difficult to write SQL that returns the CORRECT data.
Writing effective SQL requires more than the ability to write basic syntax. You need a sufficient understanding of the data itself, among other factors. And believe me, this is not always an easy endeavor for a single person!
I’ve been placed into situations where I did not have any understanding of the relationships between tables, or possess sufficient metadata/documentation to determine which columns or fields were relevant to the specific question I was tasked to answer. There were no data subject matter experts on hand to collaborate with because they left the company a few months prior.
I’ve literally been handed a SQL Server backup file containing hundreds of tables with no accompanying entity relationship diagram (ERD) or documentation, yet I was expected to single-handedly replicate an outcome under a tight deadline because I’m the “data guy”.
I was eventually able to do so, but not without relying on significant experience, lots of coffee and a little bit of luck.
Obviously, SQL practitioners on technical teams have to know more than the business does about the technical aspects of a data request.
But they also need to have the relevant business domain knowledge and/or the ability to translate and document the business logic from a domain expert and turn it into accurate working code.
Where Are the Requirements?
Additionally, over the years I’ve seen the practice of providing well thought out and formalized data requirements devolve into data request “vibes”.
By data request “vibes” I mean requests for data submitted without formalized upfront documentation: vague, ambiguous asks handed to the data professional without specific criteria for evaluating success. The data professional has to figure out both what is needed and how best to accomplish it, in addition to coding the final solution.
In my opinion, communication and documentation skills are crucial. They enable the divination of accurate requirements from stakeholders. This ability is the “killer app” that separates commodity SQL coders from next-level SQL coders. Commodity “order-taker” data resources are often outsourced.
Captain Save a Project
In addition, most organizations struggle with data quality due to poor data governance. If the SQL analytics that you generate do not “feel” correct, guess who’s on the hook to figure out why?
You’re not only the SQL subject matter expert and the business requirements person, you’re also the data detective because a VIP wants a revised report and an explanation by the end of the day. However, you only signed up to be a consumer and messenger of this data source that you do not own.
Some SQL looks easy on the surface. However, don’t underestimate the effort it takes to understand table structures, data values, table relationships and data granularity. Additionally, domain knowledge is equally essential to write effective SQL.
Also be prepared to undertake requirements documentation, data quality sleuthing, coding, and of course, navigating the pressure of being the “reports person” who should know all things about the data.
Do You Have These Skills to Complete a SQL Analysis?
In a way, SQL is like chess: it is easy to learn the basic moves, but very difficult to master well enough to play a worthwhile game. To really excel at SQL within an organization, you have to face challenging situations repeatedly and manage a multitude of factors beyond pure technical syntax.
You can learn “SELECT * FROM TABLE_X” fairly quickly, but do you have the following complementary skills to deliver a SQL analysis?
Sufficient domain knowledge
Deciphering jargon and documenting requirements from stakeholders outside your area of expertise
Ability to work around potential data inconsistencies (e.g., the numbers from these different sources don’t match)
Savviness to navigate organizational data silos (e.g., groups that hoard their data and expertise for political reasons)
Managing up the chain with respect to expectations and tight timelines (we have a life outside of work)
Ability to simplify complex data findings
And last but not least, actual SQL coding proficiency.
These are the valuable skills that develop after years of work experience.
Conclusion
Yes, SQL can be easy to learn in a pressure-free vacuum with perfect data that’s easy to comprehend.
Just remember that at some point in your data career you will be called upon to be “Captain Save a Project” and the basic SQL you learned in a vacuum will not be enough to get the job done. You’ll also need to call upon the skills I listed above.
In practice, generating effective SQL requires proper context, technical and communication skills, organizational savvy, and teamwork. The end results are not always as easy as your friendly SQL data pro makes it look.
Keep your queries sharp, and your data clean.
Until next time.
Anthony Smoak
I appreciate everyone who has supported this blog and my YouTube channel.
I have always recommended that if you develop SQL based analytics, you should have some understanding of indexes when it comes to speeding up queries.
Sometimes we do not have access to create indexes on the physical tables (especially in a production environment), but approaching your DBA with some basic knowledge of indexing will put you ahead of the game.
If your queries are running longer than they need to, it’s time to speed them up with some strategic indexing. Very briefly, here’s how you can fix your slow-running query issues by indexing your tables.
Step 1: Implement a Covering Index Strategy
A covering index includes all the columns referenced in your SELECT or WHERE clauses. This means the database engine can fulfill your query using just the index, without having to access the base table data, which can drastically reduce query execution time and speed up any downstream data visualization processes. The following is an example of T-SQL code for MS SQL Server demonstrating a covering index using the INCLUDE keyword:
-- Creates a nonclustered index on the Person.Address table with four included (non-key) columns.
-- The index key column is PostalCode and the non-key columns are
-- AddressLine1, AddressLine2, City, and StateProvinceID.
CREATE NONCLUSTERED INDEX IX_Address_PostalCode
ON Person.Address (PostalCode)
INCLUDE (AddressLine1, AddressLine2, City, StateProvinceID);
Step 2: Utilize Clustered Indexes
Clustered indexes should typically be applied to your primary key column. A clustered index sorts and stores the data rows in the table based upon the key values. This organization is particularly efficient for queries that retrieve a range of values.
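A minimal sketch, assuming a hypothetical sales table keyed on sale_id:

-- Hypothetical: physically orders the table's rows by sale_id
CREATE CLUSTERED INDEX IX_sales_data_sale_id
ON dbo.sales_data (sale_id);

In SQL Server, defining a primary key constraint typically creates a clustered index for you automatically, so you often get this one for free.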
Step 3: Apply Non-Clustered Indexes
Non-clustered indexes are typically applied to foreign keys. These indexes create a separate structure that points back to the base data rows in the table. They are ideal for speeding up queries that involve joins, which are common in business intelligence operations.
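For example (hypothetical names), indexing the foreign key used in a common join:

-- Hypothetical: supports joins from the sales fact table back to the customer table
CREATE NONCLUSTERED INDEX IX_sales_data_customer_id
ON dbo.sales_data (customer_id);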
Find the Sweet Spot
With indexing, balance is key. Under-indexing can cause your queries to run longer, while over-indexing can negatively impact data insertion and updates. Striking the right balance ensures your queries remain efficient without compromising the performance of other operations.
Indexing Rule #4080: If you join on it, filter by it, or sort by it… index it.
(I saw that rule on a reddit forum once, and thought it made perfect sense. The 4080 reference is for my ATCQ fans.)
If you want more detail on indexing, you can read this blog post here. I wrote it a few years ago, but it is old gold, still valuable!
Over the years I have conducted a number of interviews with candidates who claimed to know SQL but could not clearly verbalize their understanding of some basic concepts. I want you to understand the following concepts at a minimum, so your next SQL interview isn’t cut short.
First, distinguish between a left outer join and an inner join. A left outer join retrieves all records from the left table and returns only the matched records from the right table. On the other hand, an inner join retrieves only the matched records from both tables.
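A quick side-by-side with hypothetical tables:

-- LEFT OUTER JOIN: every customer returns; order columns are NULL when there is no match
SELECT c.customer_id, o.order_id
FROM dbo.customers AS c
LEFT OUTER JOIN dbo.orders AS o
    ON o.customer_id = c.customer_id;

-- INNER JOIN: only customers with at least one matching order return
SELECT c.customer_id, o.order_id
FROM dbo.customers AS c
INNER JOIN dbo.orders AS o
    ON o.customer_id = c.customer_id;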
Next, familiarize yourself with the WHERE clause versus the HAVING clause. Use the WHERE clause to filter records before any groupings are made, while the HAVING clause filters records after grouping. This understanding will enable you to filter your data correctly.
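For instance (hypothetical table and columns):

-- WHERE filters rows before grouping; HAVING filters the groups afterward
SELECT customer_id, SUM(order_total) AS total_spend
FROM dbo.orders
WHERE order_status = 'Shipped'      -- row-level filter
GROUP BY customer_id
HAVING SUM(order_total) > 1000;     -- group-level filter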
Also, understand the difference between the UNION and UNION ALL statements. UNION removes duplicate records, whereas UNION ALL includes all records, duplicates and all.
Be ready to discuss Common Table Expressions (CTEs) versus temporary tables for ease of query writing and to forgo complicated sub-queries. CTEs are best used for readability, while temp tables are physically materialized in tempdb (allowing you to add indexes if necessary to optimize performance).
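A minimal sketch of both approaches, using hypothetical tables:

-- CTE: improves readability and exists only for the duration of the statement
WITH customer_spend AS (
    SELECT customer_id, SUM(order_total) AS total_spend
    FROM dbo.orders
    GROUP BY customer_id
)
SELECT * FROM customer_spend WHERE total_spend > 1000;

-- Temp table: materialized in tempdb, reusable across statements, and indexable
SELECT customer_id, SUM(order_total) AS total_spend
INTO #customer_spend
FROM dbo.orders
GROUP BY customer_id;

CREATE NONCLUSTERED INDEX IX_customer_spend ON #customer_spend (total_spend);

SELECT * FROM #customer_spend WHERE total_spend > 1000;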
Understanding clustered indexes versus non-clustered indexes is vital for query performance. Clustered indexes sort and store data rows in the table based upon key values, while non-clustered indexes create a separate structure with efficient pointers back to the base table.
Lastly, ensure your queries are sargable (Search ARGument ABLE). Sargable queries use indexes efficiently, improving performance, unlike non-sargable queries, which slow down data retrieval.
Since Tableau Public was recently upgraded to enable local saving (it’s about time), I decided to look and see what new dashboards were out there that I could learn from and subsequently teach others to rebuild. My mantra is that you learn the most by teaching others.
I did find the perfect dashboard built by Dmitry Shirikov on his Tableau Public page! There are also design elements from a dashboard created by Murilo Cremon so I tip my cap to him as well.
I reached out to Dmitry via Linkedin and asked for his permission to use this dashboard as a teaching tool for video lessons and he graciously gave me permission. This further demonstrates the collaborative spirit that thrives on Tableau Public. I offer a big thank you to Dmitry!
What I like about his dashboard is the use of the standard Tableau Superstore dataset combined with interactive, dynamic year-over-year metrics and thoughtful design.
It contains deceptively simple KPIs, intermediate-level metric swapping, and a button selector process that relies on parameter actions. Parameter actions update the value of a parameter based on a user interacting with the marks in the view (in this case, the Sales, Profit, or Orders selections).
I spent a great deal of after-work and weekend time researching how to put this dashboard back together and rebuilt it piece by piece so I could understand what was done.
It also took some time to record and edit this video as it is coming in at over an hour in length! I can’t believe I put this much time and effort into a free product!
My hope is that by dissecting the dashboard and teaching you how to build the chart elements, you can gain some valuable dashboard building insights to add to your repertoire.
If you think a part 2 video regarding the actual dashboard layout will be valuable, please leave a comment on my YouTube channel!
Let’s learn how to build a histogram in Excel with some interesting NBA data. Personally, I use the histogram in my data analyses to help me understand how my data is distributed and to identify any outliers or extreme values that warrant further investigation. In this blog post, I’ll use Kobe Bryant’s playoff career scoring game log as our data source to create a histogram in Excel. Sound fun? Of course, it does. Let’s go!
Inserting a Histogram Chart in Excel
I advocate that you watch the video above for more detail, but you’ll find this blog post equally informative. To create a histogram in Excel, we’ll need a column of numerical data to analyze. In this case, I have Kobe Bryant’s playoff career scoring game log, which shows how many points he scored in each of his playoff games during his career. I sourced this data from Basketball-Reference.com.
The first step is to highlight the data column and press [Ctrl + Shift + Down] to select the entire range. Then, go to the Insert tab and click on the Histogram icon as shown below. This will insert a histogram chart based on your data.
Formatting the Histogram Chart in Excel
The default histogram chart may not look very attractive or informative. Thus, we’ll use my personal favorite technique to kick-start the formatting (the “easy” way): selecting “Layout 2” as the chart style.
Now this option may be easy (if you’re using Excel 2016 or greater), but you must know which additional options to change in order to make the histogram look more presentable.
The Layout 2 style does an excellent job of removing the horizontal grid lines and vertical axis (so we can keep the Edward Tufte style “chart junk” to a minimum). It also adds data labels to our bars for interpretation clarity.
Further Formatting Options for the Histogram in Excel
We can further improve our histogram’s appearance by applying some formatting options. For example, we can:
Delete the grid lines and the vertical axis that we don’t need (already performed by Layout 2)
Increase the font size of the data labels
Change the fill color of the bars
Add a chart title
To access the formatting options, right-click on any element of the chart and select “Format”. Alternatively, we can hit [Ctrl + 1] to open the Format pane.
Adjusting the Bin Size and the Overflow/Underflow Options in Excel
One of the most important aspects of any histogram is the bin size, which determines how the data is grouped into intervals. The bin size affects the shape and the interpretation of our histogram. We can adjust the bin size by selecting the horizontal axis and changing the “Bin width” option in the Format pane.
The default bin size for this data set is 5.7, which means that the data is grouped into intervals of 5.7 points. For example, the first bin includes the values from 0 to 5.7, the second bin includes the values from 5.7 to 11.4, and so on.
However, this bin size may not be very intuitive or meaningful. A better option for our histogram is to use a bin size of 5, which means that the data is grouped into intervals of 5 points. For example, the first bin includes the values from 0 to 5, the second bin includes the values from 6 to 10, and so on. This makes the histogram easier to read and understand.
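As an aside for my SQL readers: if you ever want to sanity-check the same 5-point bins outside Excel, a rough T-SQL sketch might look like the following. The table and column names are hypothetical, the points column is assumed to be an integer, and the bins are simple floor-style groupings rather than Excel’s exact bracket notation.

-- Hypothetical: count playoff games per 5-point scoring bin
SELECT
    (points / 5) * 5     AS bin_start,   -- integer division floors to the bin's lower edge
    (points / 5) * 5 + 4 AS bin_end,
    COUNT(*)             AS games
FROM dbo.kobe_playoff_games
GROUP BY (points / 5) * 5
ORDER BY bin_start;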
Another option that we can adjust in the histogram chart is the overflow and the underflow bins. These are special bins that capture the values that are above or below a certain threshold.
For example, we may want to create an overflow bin that includes all the games where Kobe scored more than 40 points, and an underflow bin that includes all the games where he scored less than 10 points. To do this, we can select the horizontal axis and change the Overflow bin and the Underflow bin options in the Format pane.
After applying the overflow and the underflow options, the histogram chart looks like this:
Histogram Axis Notation
You’ll notice the histogram’s horizontal axis includes both brackets “[” and parentheses “(“. I will quote the Microsoft blog to explain this notation.
“In our design, we follow best practices for labeling the Histogram axis and adopt notation that is commonly used in math and statistics. For example, a parenthesis, ‘(‘ or ‘)’, connotes the value is excluded whereas a bracket, ‘[‘ or ‘]’, means the value is included. “
In our histogram for example, the notation (20, 25] indicates that the respective bar includes any value greater than 20 but less than or equal to 25.
Interpreting the Histogram Chart and Finding Outliers in Excel
Our new histogram isn’t just pretty, it’s equally informative allowing us to answer questions that we couldn’t easily determine from a wall of numbers in spreadsheet form. The histogram easily helps us understand Kobe’s playoff scoring distribution. We also gain an understanding of his outlier games. For example, we can now:
…locate the mode, which is the most frequent value or interval. In this case, the mode is the interval from 20 to 25, meaning more of Kobe Bryant’s playoff games fell in that scoring range than in any other.
…find the range, which is the difference between the maximum and the minimum values. In this case, the range is 50, which means that Kobe Bryant’s playoff scoring varied from 0 to 50 points.
…find the skewness, which is the asymmetry of the distribution. In this case, the distribution is right-skewed, meaning the longer tail of values is on the right side of the distribution. This indicates that Kobe Bryant’s playoff scoring was concentrated in the lower values, which makes perfect sense: it’s much harder to post higher point totals than lower ones.
…find the outliers, which are the values that are far away from the rest of the data. In this case, the outliers are the point score values greater than 40.
I hope you enjoyed this blog post and learned how to create a histogram in Excel to analyze data distribution and outliers!!!
Let me leave you with this highlight package of Kobe’s best dunks: