Address Matching in Excel Using Levenshtein Distance

For my Data Analysts, in this video I will demonstrate how to perform a column comparison between two address fields so you don’t have to manually review every row. We’ll use a VBA function from Stack Overflow to provide the comparison results.

I should point out that Excel is NOT the preferred method for address matching, but sometimes it is your only option due to lack of time or better tools. Ideally, you should use address correction software that “fixes spelling errors, corrects abbreviations, and standardizes capitalization so each address in your list complies with the USPS official format” – (per the USPS). Once your addresses are standardized, THEN you should perform a comparison, but this rarely happens.

What typically happens is that some poor analyst like you is conscripted into performing address matching manually using some combination of SQL Server and manual Excel processes. That’s why a Google search led you to this page!

If you ever have to perform address matching in Excel, this could be you!

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

For example:

  • The string “HAT” as compared to “hat” would have a Levenshtein Distance of 3
    • Since the function is case sensitive all three characters are different
  • The string “HAT” as compared to “BAT” would have a Levenshtein Distance of 1
    • To turn the first string into the second string it would take 1 substitution of characters (H changed to B or vice-versa)
  • The lower the number, the more the strings are similar
  • The higher the number, the more the strings are dissimilar.

Activate the Developer Tab in Excel

The Developer tab is the place to go when you want to do or use the following:

  • Write macros
  • Run macros that you previously recorded
  • Create VBA Modules and User Defined Functions <– This is our sweet spot
  1. On the File tab, go to Options > Customize Ribbon
  2. Under Customize the Ribbon and under Main Tabs, select the Developer check box

Create a Module in Excel

  1. On the Developer tab select Visual Basic
  2. In the VBA interface select Insert > Module

Insert Levenshtein Distance Function VBA Code

  1. Go to this link at Stack Overflow to view the code as originally referenced
  2. Or, simply copy the code below as developed by user “smirkingman” which is the first answer.
    • Big shoutout to “smirkingman” for this great resource!
Option Explicit
Public Function Levenshtein(s1 As String, s2 As String)

Dim i As Integer
Dim j As Integer
Dim l1 As Integer
Dim l2 As Integer
Dim d() As Integer
Dim min1 As Integer
Dim min2 As Integer

l1 = Len(s1)
l2 = Len(s2)
ReDim d(l1, l2)
For i = 0 To l1
    d(i, 0) = i
Next
For j = 0 To l2
    d(0, j) = j
Next
For i = 1 To l1
    For j = 1 To l2
        If Mid(s1, i, 1) = Mid(s2, j, 1) Then
            d(i, j) = d(i - 1, j - 1)
        Else
            min1 = d(i - 1, j) + 1
            min2 = d(i, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            min2 = d(i - 1, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            d(i, j) = min1
        End If
    Next
Next
Levenshtein = d(l1, l2)
End Function
  1. Paste the code into your newly created Excel module
  2. Debug > Compile VBAProject

You should not experience any errors after compiling the code.

Watch the Video to Use the Function

Using this function in a judicious manner can help you cut down on the mental energy required to manually review the address columns on each row. It is much better to mentally focus on 25% of the rows than 100%. The fewer rows you have to manually review in Excel, the less the chance of you making an error.


Please like and subscribe on the Anthony B. Smoak YouTube channel.

All views and opinions are solely my own and do not necessarily reflect those my employer

Do Great Things with Your Data!



★☆★ Support this Channel: ★☆★

Merch ► shop.spreadshirt.com/AnthonySmoak

★☆★ FOLLOW ME BELOW: ★☆★

This image has an empty alt attribute; its file name is anthony-smoak-twitter.jpg

Twitter ► https://twitter.com/AnthonySmoak

Facebook ► https://www.facebook.com/AnthonyBSmoak/

Tableau Public ►Search for “Anthony B. Smoak”

Photo by Oladimeji Ajegbile from Pexels

How to Change the Text Delimiter in Excel

No more COMMA DRAMA! In this video I will show you how to change the default comma delimiter in Excel when making .csv (comma separated value) files. Sometimes text fields in your file have comments that you need to preserve in a flat file. This means you have to change the delimiter. Watch the video for a demonstration and keep reading for a quick reference.

I should note that as of the time of this post I am using Windows 10 Home.

Step 1

Search for “Set Regional Format”

Step 2

Select Additional date, time & regional settings

Step 3

Select Region

Step 4

Select Additional Settings

Then in the List Separator box, change the comma to a pipe.

When you save your .csv file in Excel, it will now be pipe delimited instead of comma delimited.

Your flat file will transition from this:

to this

Do Great Things with Your Data

Anthony B. Smoak

Please like and subscribe on the Anthony B. Smoak YouTube channel!
Definitely pick up some merch if you’ve enjoyed this blog and YouTube channel over the years.

All views and opinions are solely my own and do not necessarily reflect those my employer.

Add Total Values for Stacked Column and Stacked Bar Charts in Excel

This is the only video you need to learn how to add add totals to stacked bar charts and stacked column charts in Excel. Make sure to watch the video because I have you covered both ways.

I’m more of a video explanation person, so make sure to watch the video so the steps are clearer. Then refer to the bulleted steps as reference once you have the initial basic understanding.

The Keys to Adding Totals to the Stacked Column Chart Above:

  • Add a “Grand Total” column to your data
  • Highlight your data not including the “Grand Total” column
  • On the “Insert” menu select a “2-D Stacked Column” chart
  • Select “Switch Row/Column” as necessary so your first data column is listed on the X axis
  • Select the chart and then expand the data range to include the “Grand Total” column
    • Add “Data Labels” to the Grand Total series on the chart
  • Right click on the Grand Total and “Change Series Chart Type”, the viz becomes a “Combo” chart
  • Change the “Grand Total” series chart type to a “Line”, while leaving all others as a “Stacked Column”
  • Format the line’s data label, changing the Label Position to “Above”
  • Select the line, format the data series and change the “Line” option to “No Line” in order to leave only the totals and hide the line.

The Keys to Adding Totals to the Stacked Bar Chart Above:

  • Add both “Grand Total” and “Spacing” columns to your data
    • Highlight your data including the “Spacing” column but not including the “Grand Total” column
      • The “Spacing” column should have a value of 0 at this point
    • On the “Insert” menu select a “2-D Stacked Bar Chart”
    • Select “Switch Row/Column” as necessary so the “Spacing” values are not listed as an option on the Y axis
    • Change the “Spacing” column values to a number (e.g., 1000) big enough to make a new category visible on the stacked bar chart
    • Right click to “Format Data Labels” and change the “Label Options” to “Value from Cells”
      • In the “Select Data Label Range” pop up box, highlight the values from the “Grand Total” column
      • Change the “Label Position” to “Inside Base”
    • On the chart select the Grand Total Series and right click so the “Fill” option appears
      • Change the “Fill” to “No Fill”
    • Change the values in the “Spacing” column to 0
    • Delete the “Grand Total” entry from the Legend

You can thank me by liking and subscribing to my YouTube Channel

All views and opinions are solely my own and do NOT necessarily reflect those my employer.

Do Great Things with Your Data

– Anthony B. Smoak

shop.spreadshirt.com/AnthonySmoak

INCREASE YOUR FOCUS TRY BRAIN.FM

When I am focusing hard during the day at work or developing visualizations on the weekend. I use Brain.fm to help me focus when it matters. It’s Science-driven and research-backed functional music designed from the ground up to help you you focus, relax, meditate and sleep. If you’d like to try a free trial check out https://brain.fm/anthonyb

Please use coupon code anthonyb for a 20% discount upon checkout. It helps support this blog and my YouTube channel as I receive a small commission for purchases made through links in this post.

Select Random Sample Values and Rows using Excel

In this video I help you solve the dual problems of selecting a random value from an Excel list and selecting a number of random rows from a range of data in Excel. At times when I am generating a data-set to use in my video tutorials, I want to select a random selection of rows. Typically, because my data values are clumped together and are too similar to the data on preceding and subsequent rows.

SELECTING A RANDOM VALUE FROM A LIST

Enter three Excel formulas to save the day for selecting a random value from a list:

  • ROWS()
    • Returns the number of rows in a reference or array.
  • RANDBETWEEN()
    • Returns a random integer number between the numbers you specify. A new random integer number is returned every time the worksheet is calculated.
  • INDEX()
    • The INDEX function returns a value or the reference to a value from within a table or range.

Randow Row in Excel Blog Screenshot

  1. In the screenshot above notice that the ROWS() function returns the value of 20, which corresponds to the number of names listed in the cell range of A6 to A25.
  2. The RANDBETWEEN() function generated a random number between 1 and the value returned from ROWS() (i.e., 20). In this case, RANDBETWEEN() combined with ROWS() returned a value of 3.
  3. By combining the results from the first 2 functions, the INDEX() function searches our list and returns the value of the 3rd cell in the list  (i.e., Flor McCard) because the RANDBETWEEN() function returned a value of 3.

When we put it all together it looks like the following:

=INDEX($A$6:$A$25,RANDBETWEEN(1,ROWS($A$6:$A$25)))

I choose to use the absolute cell reference notation with dollar signs although in this case it is not necessary since we are not copying our results to other cells.

SELECTING RANDOM ROWS FROM A LIST

We’ll only use 1 Excel formula to save the day for selecting random rows from a range:

  • RAND()
    • RAND() returns an evenly distributed random real number greater than or equal to 0 and less than 1. A new random real number is returned every time the worksheet is calculated.

By placing the RAND() function in a column co-located with your data, you will assign a random number to each row in your data-set or range.

Once that is done, all you have to do is sort your data by the RAND() column and then select however many rows you need. It’s that simple!

If you are like me, you probably need to see it in action to get a better understanding. Check out the video above and if you learned something, please go ahead and like it on my Youtube channel!

Thanks for your support!

All views and opinions are solely my own and do NOT necessarily reflect those my employer.

References:

All Excel function definitions are from https://support.office.com