A Brief Introduction to Fuzzy Matching

4 min read · Oct 29, 2020

Google’s search engine is so powerful because it lets us search for information without an exhaustive, perfectly accurate query.

I recently realized I had taken this for granted when I looked into how one would build an algorithm that takes the results of a text recognition program and calls an API based on those results.

I found that I had fallen into the hole of premature optimization when I discovered that the APIs in question already implemented fuzzy matching in their search queries. So, for example, if I were looking for “A Study in Scarlet” by Arthur Conan Doyle in the Google Books API, I could query ‘stdy scrlet conan’, typos and missing words included, and still receive an accurate result.
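
To see this in action yourself, here is a rough sketch of that kind of sloppy query against the Google Books API, using Python’s requests library (the query string and result handling are illustrative, and depending on your quota an API key may be required):

```python
import requests

# A deliberately sloppy query: typos and a missing word, as described above.
response = requests.get(
    "https://www.googleapis.com/books/v1/volumes",
    params={"q": "stdy scrlet conan"},
)

# Print the top few titles; "A Study in Scarlet" should appear among them.
for item in response.json().get("items", [])[:3]:
    print(item["volumeInfo"].get("title"))
```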

Indeed, I was thinking of Kermit the Frog when I tested this.

This idea of fuzzy matching, wherein one does not need a word-for-word query, relies on approximate string matching to quantify the difference between words. The difference is measured with the notion of edit distance, which varies depending on the context but essentially counts the minimum number of edits required to change one string into another.

Levenshtein Distance

This “distance metric” is one of the simpler ones with only 3 edit operations.

Insertion: moo -> moon
Deletion: seat -> sea
Substitution: fuzzy -> wuzzy

For example, going from dog to frog requires only 2 edits: dog -> drog -> frog. The fewer the edits, the more similar the words are.

Furthermore, the Levenshtein Distance is symmetric: the distance from string1 to string2 is the same as the distance from string2 to string1. Each operation has an opposite: any insertion can be undone by a deletion, any deletion by an insertion, and any substitution by the reverse substitution. Thus, it makes sense to speak of the Levenshtein Distance of a pair of strings, rather than a distance in one direction.

To see the Levenshtein Distance in action, a diagram would be helpful.

A matrix for the Levenshtein Distance of substrings

Each entry in the matrix at row i and column j can be described as the Levenshtein Distance between the first i characters of Sunday and the first j characters of Saturday.

This diagram is helpful because it captures the idea of “flood filling” the matrix: each cell is computed from its neighbors, and the minimum edit distance emerges along a path from the upper left to the lower right that connects the smallest edit distances. The following code implements an iterative algorithm that computes the Levenshtein Distance by building this matrix and following said minimum path.
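
The original embedded gist is not reproduced here, so below is a minimal Python sketch of that iterative dynamic-programming approach (the function name levenshtein is my own; the matrix is named editMatrix to match the discussion that follows):

```python
def levenshtein(string1, string2):
    """Fill a (len(string1)+1) x (len(string2)+1) matrix of edit distances iteratively."""
    rows, cols = len(string1) + 1, len(string2) + 1
    editMatrix = [[0] * cols for _ in range(rows)]

    # The distance from an empty prefix is just the length of the other prefix.
    for i in range(rows):
        editMatrix[i][0] = i
    for j in range(cols):
        editMatrix[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            # Substitution costs nothing if the current characters already match.
            cost = 0 if string1[i - 1] == string2[j - 1] else 1
            editMatrix[i][j] = min(
                editMatrix[i - 1][j - 1] + cost,  # substitution (or match)
                editMatrix[i - 1][j] + 1,         # deletion from string1
                editMatrix[i][j - 1] + 1,         # insertion into string1
            )

    # The bottom-right cell holds the distance between the full strings.
    return editMatrix[-1][-1]
```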

In the inner loop, note that each cell’s value is calculated as the minimum over the edit distances for the three possible operations:

editMatrix[i-1][j - 1]: This corresponds to a substitution of the i-th character of string1 with the j-th character of string2 (at no cost if the two characters already match).

editMatrix[i-1][j]: This corresponds to a deletion of the i-th character of string1 (equivalently, an insertion of that character into string2).

editMatrix[i][j-1]: This corresponds to an insertion of the j-th character of string2 into string1 (equivalently, a deletion of that character from string2).

All in all, the algorithm takes the minimum at each cell, so the bottom-right entry holds the overall minimum edit distance between the two strings.
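
A quick sanity check of the sketch above (the Sunday/Saturday value matches the bottom-right cell of the matrix in the diagram):

```python
print(levenshtein("dog", "frog"))         # 2: dog -> drog -> frog
print(levenshtein("Sunday", "Saturday"))  # 3
print(levenshtein("frog", "dog"))         # 2: same as above, since the distance is symmetric
```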

There are several more “approximate string matching” algorithms with their respective distance metrics, but these are all limited in scope to raw letter comparison. The Levenshtein Distance would not help when we compare two names written in different formats: John Smith vs. Smith, John.

To scale this to a fuzzy matching process that takes into account both language and its often inconsistent patterns, we can look towards TF-IDF.

Term Frequency — Inverse Document Frequency

Term Frequency: Say we have a collection of documents. Term frequency refers to how often a word appears within a given document. If a word appears more often in a document, it is likely to be a strong indicator of that document’s topic. For example, this link on orcas mentions orcas quite frequently at 27 instances (…obviously). But term frequency alone does not paint a complete picture. It is calculated as:

TF(w) = (# of times word w appears in document)/(# of words in document)

Inverse Document Frequency: In said collection of documents, if a word appears regularly in all the documents, then it is less likely to represent the topic of any particular document compared to a word that appears regularly in just one document. This is calculated as:

IDF(w) = ln(# of documents / # of documents with word w)
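
To make both formulas concrete, here is a small self-contained Python sketch; the toy corpus and the function names are made up purely for illustration:

```python
import math

# A toy corpus: each "document" is just a list of words.
documents = [
    "orcas are apex predators and orcas hunt in pods".split(),
    "pandas eat bamboo and pandas sleep a lot".split(),
    "orcas and pandas are both mammals".split(),
]

def term_frequency(word, document):
    # (# of times word w appears in document) / (# of words in document)
    return document.count(word) / len(document)

def inverse_document_frequency(word, documents):
    # ln(# of documents / # of documents with word w)
    # (assumes the word appears in at least one document)
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

def tf_idf(word, document, documents):
    return term_frequency(word, document) * inverse_document_frequency(word, documents)

# "orcas" is frequent in the first document but rare across the corpus,
# so it scores higher there than a word like "and" that appears everywhere.
print(tf_idf("orcas", documents[0], documents))  # relatively high
print(tf_idf("and", documents[0], documents))    # 0.0, since "and" is in every document
```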

This is incredibly helpful in situations such as comparing books. Paul: A Biography, about the apostle Paul, should be more similar to The Apostle: A Life of Paul than to Franco: A Biography, about a Spanish dictator.
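
This also points at one common way to turn TF-IDF into a fuzzy matcher for strings themselves (an approach along the lines of the “fuzzy string matching at scale” article linked below): treat character n-grams as the “words” and compare the resulting vectors with cosine similarity. Assuming scikit-learn is available, a sketch might look like this; the parameters are illustrative, not tuned:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["John Smith", "Smith, John", "Jane Smyth", "Arthur Conan Doyle"]

# Break each string into character trigrams and weight them with TF-IDF,
# so distinctive trigrams count for more than common ones.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
vectors = vectorizer.fit_transform(names)

# Pairwise cosine similarity between the TF-IDF vectors (1.0 = identical n-gram profile).
similarity = cosine_similarity(vectors)

# "John Smith" and "Smith, John" share almost all of their trigrams,
# so they score far higher with each other than with "Arthur Conan Doyle".
print(round(similarity[0, 1], 2))
print(round(similarity[0, 3], 2))
```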

Links

https://devopedia.org/levenshtein-distance

https://medium.com/tim-black/fuzzy-string-matching-at-scale-41ae6ac452c2
