Define It: Probabilistic (Fuzzy) Matching

You may have never heard of the term before, but you benefit from Probabilistic Matching in nearly every aspect of your daily life.

No longer apples & oranges – matching data today requires a sophisticated and flexible solution.

What if we were to tell you that the binary, exacting world of data has a flexible, reasoning side? It’s called Probabilistic Matching, and it is one of the most common ways to get computer programs to think a bit more like humans.

Let’s talk matching. Assume you have two different pieces of data. The first is a list of your customers and their locations. The second is a list of tradeshow registrants. Naturally, you want to see which of your customers will be attending the show. To do this, you just need to match your customers to the show attendees. How would you do this?

You could manually search each record in your system, but there are thousands of names and it could take weeks. You could also use Excel to match records with VLOOKUP, which looks for matches in a secondary data set. But you quickly find a huge problem with this: these tools only match identical records. This means “Dr. James R. Smith DDS” won’t match “Dr. James Smith DDS” because the second lacks the middle initial. In the formula’s eyes, it’s a different record.

At issue here is that the absolute reasoning of a computer doesn’t play well with the human notion of “close enough.” It’s like telling a child it’s time for bed because it’s 8 o’clock, only to watch them melt down because it’s 7:58, not, in fact, 8:00. The computer has absolute standards; we don’t.

Enter Probabilistic Matching – also called “Fuzzy” Matching. The idea here is to take data from different sources and break it down into components, compare, and measure the similarity of each. This process takes place behind the scenes in many applications. It’s the reason you can Google search with typos and still get the right response. Google has programmed it to examine not just your input, but your meaning as well.

The most basic fuzzy matching method calculates the “distance” between the two values. In our earlier example, there would be a very small distance between Dr. James R. Smith DDS and Dr. James Smith DDS: the two strings differ only by the middle initial, so just three characters (“R. ”) would need to change for a perfect match. In essence, the program can see this and say (in a human voice): “That’s close enough, it’s probably the same doctor.”
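The distance measure described above can be sketched in a few lines of Python. This is a minimal illustration of the classic Levenshtein (edit) distance, not any particular vendor’s implementation; the function name and example values are ours.

```python
# Minimal sketch of Levenshtein (edit) distance: the number of
# single-character insertions, deletions, or substitutions needed
# to turn one string into the other.

def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ch_a != ch_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

name_a = "Dr. James R. Smith DDS"
name_b = "Dr. James Smith DDS"
distance = levenshtein(name_a, name_b)  # the strings differ only by "R. "
print(distance)  # 3 edits out of 22 characters: "close enough"
```

A low distance relative to the length of the strings is what lets the program shrug and call the records a probable match.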

But in the real world, that is too restrictive. Values can be deceiving in their length, such as “Super Smiles” vs. “Super Smiles of San Bernardino”. Surely they are the same customer, but the distance between them is massive thanks to the extra words. To counter this, fuzzy matching algorithms will test many different variations of the two values and compare specific parts together, looking for the best match.

Another challenge is text order: comparing “Super Smiles of San Bernardino” vs. “San Bernardino – Super Smiles” would be problematic for standard distance calculations because, while the strings contain the same characters, the order needs to change dramatically for the records to match. To solve this, we use a process called “tokenization” to separate the words and order them alphabetically before comparing. This, along with other data prep work, normalizes the text and makes matching more reliable.
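The tokenization step above can be sketched as follows. This is an illustrative approach, assuming lowercasing, splitting on anything that isn’t a letter or digit, and an overlap score on the resulting token sets (a Jaccard similarity); real matching engines layer on more normalization than this.

```python
# Sketch of tokenization: split each value into words, normalize case,
# and sort alphabetically so word order no longer matters. Similarity
# is then the overlap between the two token sets (Jaccard similarity).
import re

def tokenize(value: str) -> list[str]:
    """Lowercase, split on non-letters/digits, sort alphabetically."""
    return sorted(re.findall(r"[a-z0-9]+", value.lower()))

def token_similarity(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens (0.0 to 1.0)."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    return len(set_a & set_b) / len(set_a | set_b)

print(tokenize("San Bernardino - Super Smiles"))
# ['bernardino', 'san', 'smiles', 'super']

print(token_similarity("Super Smiles of San Bernardino",
                       "San Bernardino - Super Smiles"))
# 0.8 -- only the filler word "of" differs
```

After tokenization, the reordered and extra-word variants of the same practice name score as near-identical, exactly the behavior a raw character-distance comparison misses.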

But that’s just the start: to really get into the realm of Probabilistic Matching, you need to blend match assessments across other fields too. If the customer name match is mild but the address match is strong, the two records are more likely to be the same than if you had compared just one of those fields. You can also restrict comparisons to records in the same geographic region to further limit false positives. This is clear in the difference in Siri’s response when you say “Staples” vs. “Staples near me.”

When you start adding the different layers of matching techniques, it becomes a bit like Mad Science. Each engineering firm may have their go-to algorithms and techniques, and it’s important to get one that is right for your application. At Elevation Data Group, for example, we have tuned our algorithms to data commonly found in medical and dental data sets. This allows us to account for common data irregularities such as group abbreviations, doctor suffixes, and generic practice names that occur when data comes from many different sources like customers, distributors, websites, paid lists, etc. This niche configuration produces both higher match rates and more appropriate matches.

When you have a program that works well, you can match millions of records together and fully integrate every angle of your business, customer base, or supply chain. With data flowing in from everywhere these days, a good probabilistic matching program is as important as the data itself.  

The Define It! series provides short explanations of common data practices, written for general business understanding, with the IT-lish edited out.

More from the Define It! Series:

Elevation Data Group is a managed data services and consulting firm helping businesses put their data to good use.