Using the Levenshtein distance as a name pairing tool
We commonly find different data sources with a shared character variable as a key (for example, city names). Often, these character columns do not match due to typos. In this post I will use the Levenshtein distance as a tool to pair strings from two different data sources.
This post was motivated by a request I received last week, when I was given two datasets and a simple task: “Merge them!”. When I opened the datasets, I saw that the key columns were people’s names.
I am going to use a fake dataset that I created, containing problems similar to the ones I observed in my task. For the Levenshtein distance I used the stringdist package, and stringr for the cleanup, but of course you can do it your own way.
library(dplyr)
library(stringr)
library(stringdist)
library(DT)
options_table <- list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#38a7bb', 'color': '#fff'});",
"}"),
dom = 'Bt',
scrollX = TRUE,
pageLength = 10
)
Creating a fake dataset
To exemplify some issues I found in my data, I created a fake dataset with Brazilian names; some of them will have typos, each with probability 0.5. The names are divided into two datasets (to mimic my problem): the first one is a dictionary with the correct names and a variable (I also added the incorrect name, just for easy comparison after pairing), and the second one has another variable and names that are sometimes incorrect. The data can be seen below:
set.seed(123456)
first_name <- data.frame(a = c("Augusto", "Bárbara", "Douglas", "Guilherme",
"Juliana", "Juliane", "Larissa", "Lívia", "Lucas", "Luís", "Luís",
"Luisa", "Luisa"),
b = c("August", "Barbara", "Dougla", "Guilherme ",
"Julianna", "Julliane", "Larisa", "Livia", "Lucas ", "Luis", "Luíz",
"Luísa", "Luiza"),
stringsAsFactors = FALSE)
ml_name <- data.frame(a = c("Castro", "Da Cunha", "Da Silva", "Dos Santos", "Farias", "Ferreira",
"Gomes", "Rodrigues", "Santos", "Silva", "Souza"),
b = c("Castr", "Cunha", "DaSilva", "Do Santos", "Faria", "Fereira",
"Gomez", "Rodriguez", "Santos ", "Siva", "Sousa"),
stringsAsFactors = FALSE)
create_names <- function(first_names, family_names, n_names = 50, prob_typo = 0.5, prob_third = 0.5){
##-- Sampling names ----
first <- sample(x = 1:nrow(first_names), size = n_names, replace = TRUE)
second <- sample(x = 1:nrow(family_names), size = n_names, replace = TRUE)
third_name <- rbinom(n = n_names, size = 1, prob = prob_third)
third <- ifelse(third_name, sample(x = 1:nrow(family_names), size = n_names, replace = TRUE), NA)
n_third <- sum(!is.na(third))
##-- Names on column a ----
names_a <- ifelse(!is.na(third),
paste(first_names$a[first], family_names$a[second], family_names$a[third]),
paste(first_names$a[first], family_names$a[second]))
##-- Names on column b ----
binom1 <- rbinom(n = n_names, size = 1, prob = prob_typo)
binom2 <- rbinom(n = n_names, size = 1, prob = prob_typo)
binom3 <- rbinom(n = n_names, size = 1, prob = prob_typo)
names_b <- ifelse(!is.na(third),
paste(ifelse(binom1, first_names$a[first], first_names$b[first]),
ifelse(binom2, family_names$a[second], family_names$b[second]),
ifelse(binom3, family_names$a[third], family_names$b[third])),
paste(ifelse(binom1, first_names$a[first], first_names$b[first]),
ifelse(binom2, family_names$a[second], family_names$b[second])))
##-- Returning a dataset with some names ----
res <- data.frame(names_a = names_a, names_b = names_b, stringsAsFactors = FALSE)
return(res)
}
##-- Sampling names ----
n_names <- 50
df_names <- create_names(first_names = first_name, family_names = ml_name, n_names = n_names) %>%
arrange(names_a) %>%
distinct(names_a, .keep_all = TRUE) ## To avoid homonymous names
##-- Splitting it in two data.frames ----
df1 <- data.frame(name_correct = df_names$names_a, name_incorrect = df_names$names_b, var_a = rnorm(nrow(df_names)), stringsAsFactors = FALSE)
names_b <- sort(sample(df_names$names_b, size = round(0.8*nrow(df_names)), replace = FALSE))
df2 <- data.frame(name = names_b, var_b = rnorm(length(names_b)), stringsAsFactors = FALSE)
Now df1 and df2 are two datasets with different variables and our aim is to merge them using the name column (name_correct for df1 and name for df2).
datatable(head(df1), options = options_table, rownames = FALSE) %>%
formatRound(columns = "var_a", digits = 4)
datatable(head(df2), options = options_table, rownames = FALSE) %>%
formatRound(columns = "var_b", digits = 4)
Levenshtein distance
The Levenshtein distance (or edit distance) measures how different two strings are. It corresponds to the minimum number of single-character operations (insertions, deletions, or replacements) required to transform one string into the other.
The Levenshtein distance between two strings \(\displaystyle a,b\) (of length \(\displaystyle |a|\) and \(\displaystyle |b|\) respectively) is given by \(\displaystyle \operatorname{lev}_{a,b}(|a|,|b|)\) where
\[\begin{equation} \qquad \operatorname {lev} _{a,b}(i,j) = { \begin{cases} \max(i,j) & {\text{ if }} \min(i,j)=0,\\ \min { \begin{cases} \operatorname {lev} _{a,b}(i-1,j)+1\\ \operatorname {lev} _{a,b}(i,j-1)+1\\ \operatorname {lev} _{a,b}(i-1,j-1)+1_{(a_{i}\neq b_{j})} \end{cases} } & {\text{ otherwise.}} \end{cases} } \label{eq:lev}\tag{1} \end{equation}\]
where \(\displaystyle 1_{(a_{i}\neq b_{j})}\) is the indicator function equal to 0 when \(\displaystyle a_{i}=b_{j}\) and equal to 1 otherwise, and \(\displaystyle \operatorname {lev} _{a,b}(i,j)\) is the distance between the first \(\displaystyle i\) characters of \(\displaystyle a\) and the first \(\displaystyle j\) characters of \(\displaystyle b\) (Wikipedia).
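To make the definition concrete, here is a naive recursive translation of Equation (\(\ref{eq:lev}\)) into R (a minimal sketch: lev_rec is a name I made up, and it runs in exponential time, so it is only meant to mirror the formula, not to replace stringdist):
lev_rec <- function(a, b){
  i <- nchar(a); j <- nchar(b)
  ##-- One of the strings is empty: max(i, j) ----
  if(min(i, j) == 0) return(max(i, j))
  ##-- Indicator 1_(a_i != b_j) ----
  cost <- as.integer(substr(a, i, i) != substr(b, j, j))
  ##-- Minimum over deletion, insertion and replacement ----
  min(lev_rec(substr(a, 1, i - 1), b) + 1,
      lev_rec(a, substr(b, 1, j - 1)) + 1,
      lev_rec(substr(a, 1, i - 1), substr(b, 1, j - 1)) + cost)
}
lev_rec("CARS", "PAIRS")
## [1] 2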
Equation (\(\ref{eq:lev}\)) is the math behind the algorithm, but let’s understand the idea. I found this amazing video that helped me understand it.
Let’s define three operations (each corresponds to a distance of 1, as the quick check after this list confirms):
Delete: To turn “trance” into “trace” we need to delete the letter n.
Insert: To turn “exit” into “exist” we need to insert the letter s after the letter i.
Replace: To turn “confirmation” into “conformation” we need to replace the first letter i with an o.
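Each of these single operations has a Levenshtein distance of 1, which we can quickly check with stringdist (the "lv" method is the plain Levenshtein distance):
stringdist(a = "trance", b = "trace", method = "lv") ## one deletion
## [1] 1
stringdist(a = "exit", b = "exist", method = "lv") ## one insertion
## [1] 1
stringdist(a = "confirmation", b = "conformation", method = "lv") ## one replacement
## [1] 1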
The somewhat intimidating formulation in (\(\ref{eq:lev}\)) searches for the minimum number of operations needed to turn one string into another. It is a recursive formulation, starting from the comparison of the entire strings and going down to comparisons of substrings. Let me define the substring notation/operator that will be important soon. Let’s think about the word STRING, which has length 6.
| 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| S | T | R | I | N | G |
So the substring from position \(0\) to position \(-1\) is the empty string “” (the empty string is also important). The substring starting and ending at \(0\) is the first letter S, and so on (a quick check with R’s substr() follows the list):
STRING[0, -1] = “”
STRING[0, 0] = “S”
STRING[0, 1] = “ST”
STRING[0, 2] = “STR”
STRING[0, 3] = “STRI”
STRING[0, 4] = “STRIN”
STRING[0, 5] = “STRING”
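As a side note, the same prefixes can be reproduced in R with substr(), keeping in mind that R is 1-indexed (so every index above is shifted by one):
substr("STRING", 1, 1) ## STRING[0, 0]
## [1] "S"
substr("STRING", 1, 3) ## STRING[0, 2]
## [1] "STR"
substr("STRING", 1, 0) ## STRING[0, -1], the empty string
## [1] ""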
This notation is important for comparing strings and is closely related to the formulation given in (\(\ref{eq:lev}\)): each piece of (\(\ref{eq:lev}\)) is related to one operation. Let’s see how!
Suppose that I want to turn the string “CARS” into “PAIRS”. The letter “S” at the end is the same in both words, which means that no operation is needed for it. So the Levenshtein distance between “CARS” and “PAIRS” is exactly the distance between “CAR” and “PAIR”. Using our substring notation we have:
CARS[0, 3] -> PAIRS[0, 4] = CARS[0, 2] -> PAIRS[0, 3]
Now we have the strings “CAR” and “PAIR”, and again the last letters match, so:
CARS[0, 2] -> PAIRS[0, 3] = CARS[0, 1] -> PAIRS[0, 2]
Now we have the strings “CA” and “PAI”, and their last letters do not match. In this case we should choose the operation that gives the minimum number of operations until the end: we can delete “A”, insert “I”, or replace “A” with “I”. Each decision creates a new subproblem, which in turn requires another decision. Let’s follow one optimal path for the words “CARS” and “PAIRS”, keeping track of the cumulative number of operations (there are many possible paths):
- CARS[0, 3] -> PAIRS[0, 4]
- (CARS, PAIRS) 0 operations
- CARS[0, 2] -> PAIRS[0, 3]
- (CAR, PAIR) 0 operations
- CARS[0, 1] -> PAIRS[0, 2]
- (CA, PAI) 0 operations
- Insert “I” into “CA”
- (CAI, PAI) 1 operation
- CAIRS[0, 1] -> PAIRS[0, 1]
- (CA, PA) 1 operation
- CAIRS[0, 0] -> PAIRS[0, 0]
- (C, P) 1 operation
- Replace “C” with “P”
- (P, P) 2 operations
Of course this is not an easy way to calculate the distance, because we would need to enumerate all possible paths manually. A nicer way to visualize the problem is through a matrix of local distances, as in the following example:
|    | "" | P | A | I | R | S |
|----|----|---|---|---|---|---|
| "" | \(c_{-1, -1}\) | \(c_{-1, 0}\) | \(c_{-1, 1}\) | \(c_{-1, 2}\) | \(c_{-1, 3}\) | \(c_{-1, 4}\) |
| C  | \(c_{0, -1}\) | \(c_{0, 0}\) | \(c_{0, 1}\) | \(c_{0, 2}\) | \(c_{0, 3}\) | \(c_{0, 4}\) |
| A  | \(c_{1, -1}\) | \(c_{1, 0}\) | \(c_{1, 1}\) | \(c_{1, 2}\) | \(c_{1, 3}\) | \(c_{1, 4}\) |
| R  | \(c_{2, -1}\) | \(c_{2, 0}\) | \(c_{2, 1}\) | \(c_{2, 2}\) | \(c_{2, 3}\) | \(c_{2, 4}\) |
| S  | \(c_{3, -1}\) | \(c_{3, 0}\) | \(c_{3, 1}\) | \(c_{3, 2}\) | \(c_{3, 3}\) | \(c_{3, 4}\) |
Here \(c_{i, j}\) is the comparison between CARS[0, i] and PAIRS[0, j] and corresponds to the local distance. For example, we can fill \(c_{-1, -1}\) with \(0\) because the distance between CARS[0, -1] = “” and PAIRS[0, -1] = “” is 0. Analogously, we can fill \(c_{-1, j}\) with \(j + 1\) because to go from the empty string to a substring of PAIRS we need to insert \(j + 1\) letters. The same holds for \(c_{i, -1} = i + 1\) (now deleting letters), and these border cases correspond to the first entry of Equation (\(\ref{eq:lev}\)) (\(\max(i, j) {\text{ if }} \min(i, j) = 0\)). Then we have:
|    | "" | P | A | I | R | S |
|----|----|---|---|---|---|---|
| "" | 0 | 1 | 2 | 3 | 4 | 5 |
| C  | 1 | \(c_{0, 0}\) | \(c_{0, 1}\) | \(c_{0, 2}\) | \(c_{0, 3}\) | \(c_{0, 4}\) |
| A  | 2 | \(c_{1, 0}\) | \(c_{1, 1}\) | \(c_{1, 2}\) | \(c_{1, 3}\) | \(c_{1, 4}\) |
| R  | 3 | \(c_{2, 0}\) | \(c_{2, 1}\) | \(c_{2, 2}\) | \(c_{2, 3}\) | \(c_{2, 4}\) |
| S  | 4 | \(c_{3, 0}\) | \(c_{3, 1}\) | \(c_{3, 2}\) | \(c_{3, 3}\) | \(c_{3, 4}\) |
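As a quick sanity check of these border values, the distance from the empty string to each prefix of “PAIRS” is just the prefix length:
stringdist(a = "", b = c("P", "PA", "PAI", "PAIR", "PAIRS"), method = "lv")
## [1] 1 2 3 4 5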
Now let’s take \(c_{0, 0}\). This is the comparison between “C” and “P”, which is not a match, so we need a replacement and \(c_{0, 0} = 1\). Another way to think about this entry is that if we replace “C” with “P”, the new problem is to turn “P” into “P” (the same distance as from the empty string to the empty string), which corresponds to \(c_{-1, -1} + 1 = 0 + 1 = 1\). This is the term \({lev} _{a,b}(i-1,j-1)+1_{(a_{i}\neq b_{j})}\) when \(a_i \neq b_j\) in Equation (\(\ref{eq:lev}\)).
The entry \(c_{0, 1}\) is the comparison between “C” and “PA”, and we can remove “C”, insert “A”, or replace “C” with “A”. If we remove “C”, the problem becomes transforming “” into “PA”, which is already solved (\(c_{-1, 1} = 2\)); so if we remove “C” the distance is \(c_{-1, 1} + 1 = 3\), which corresponds to \({lev} _{a,b}(i-1,j)+1\) in Equation (\(\ref{eq:lev}\)).
If we insert “A”, the problem becomes transforming “CA” into “PA”, which is the same as transforming “C” into “P” and is already solved (\(c_{0, 0} = 1\)); so if we insert “A” the distance is \(c_{0, 0} + 1 = 2\), which corresponds to \({lev} _{a,b}(i,j-1)+1\) in Equation (\(\ref{eq:lev}\)).
However, if we replace “C” with “A”, the problem becomes transforming “” into “P”, which is already solved (\(c_{-1, 0} = 1\)); so if we replace “C” with “A” the distance is \(c_{-1, 0} + 1 = 2\), which corresponds to \({lev} _{a,b}(i-1,j-1)+1_{(a_{i}\neq b_{j})}\) when \(a_i \neq b_j\) in (\(\ref{eq:lev}\)).
So the minimum over all of these options is \(c_{0, 1} = 2\).
|    | "" | P | A | I | R | S |
|----|----|---|---|---|---|---|
| "" | 0 | 1 | 2 | 3 | 4 | 5 |
| C  | 1 | 1 | 2 | \(c_{0, 2}\) | \(c_{0, 3}\) | \(c_{0, 4}\) |
| A  | 2 | \(c_{1, 0}\) | \(c_{1, 1}\) | \(c_{1, 2}\) | \(c_{1, 3}\) | \(c_{1, 4}\) |
| R  | 3 | \(c_{2, 0}\) | \(c_{2, 1}\) | \(c_{2, 2}\) | \(c_{2, 3}\) | \(c_{2, 4}\) |
| S  | 4 | \(c_{3, 0}\) | \(c_{3, 1}\) | \(c_{3, 2}\) | \(c_{3, 3}\) | \(c_{3, 4}\) |
Using the same logic we can fill \(c_{0, 2}\), \(c_{0, 3}\), \(c_{0, 4}\) and \(c_{1, 0}\):
|    | "" | P | A | I | R | S |
|----|----|---|---|---|---|---|
| "" | 0 | 1 | 2 | 3 | 4 | 5 |
| C  | 1 | 1 | 2 | 3 | 4 | 5 |
| A  | 2 | 2 | \(c_{1, 1}\) | \(c_{1, 2}\) | \(c_{1, 3}\) | \(c_{1, 4}\) |
| R  | 3 | \(c_{2, 0}\) | \(c_{2, 1}\) | \(c_{2, 2}\) | \(c_{2, 3}\) | \(c_{2, 4}\) |
| S  | 4 | \(c_{3, 0}\) | \(c_{3, 1}\) | \(c_{3, 2}\) | \(c_{3, 3}\) | \(c_{3, 4}\) |
To fill \(c_{1, 1}\) we need extra care. In the comparison between “CA” and “PA” the last letters match, so the distance is the same as from “C” to “P”, which is already solved (\(c_{0, 0}\)). In the case of a match we do not add one (no operation was performed). This corresponds to \({lev} _{a,b}(i-1,j-1)+1_{(a_{i}\neq b_{j})}\) when \(a_i = b_j\) in (\(\ref{eq:lev}\)).
In summary we have:
- \(\max(i,j) {\text{ if }} \min(i,j) = 0\): insert (or delete) all remaining letters when one of the strings is empty
- \({lev} _{a,b}(i-1,j)+1\): delete
- \({lev} _{a,b}(i,j-1)+1\): insert
- \({lev} _{a,b}(i-1,j-1)+1_{(a_{i}\neq b_{j})}\): replace (costing nothing when \(a_i = b_j\), i.e., a match)
Following the strategy we have:
|    | "" | P | A | I | R | S |
|----|----|---|---|---|---|---|
| "" | 0 | 1 | 2 | 3 | 4 | 5 |
| C  | 1 | 1 | 2 | 3 | 4 | 5 |
| A  | 2 | 2 | 1 | 2 | 3 | 4 |
| R  | 3 | 3 | 2 | 2 | 2 | 3 |
| S  | 4 | 4 | 3 | 3 | 3 | 2 |
The Levenshtein distance is given by the number in position \([|a|, |b|]\), the bottom-right cell, which here equals \(2\); an optimal path can be traced back from this cell through the neighbors that attain each minimum.
Another way to think about this matrix is in terms of its subproblems.
\(c_{2, 1}\) corresponds to the distance between CARS[0, 2] = “CAR” and “PA” = PAIRS[0, 1]. If you insert “A” into “CAR”, the problem turns into the distance between “CARA” and “PA”, which is the same as between “CAR” and “P” and corresponds to the entry \(c_{2, 0}\) (left). If you delete “R” from “CAR”, the problem turns into the distance between “CA” and “PA”, which corresponds to the entry \(c_{1, 1}\) (top). If you replace “R” with “A”, the problem turns into the distance between “CAA” and “PA”, which is the same as between “CA” and “P” and corresponds to the entry \(c_{1, 0}\) (diagonal).
In summary, the local distance is \(\displaystyle c_{i, j} = \min\{c_{i, j-1} + 1,\; c_{i-1, j} + 1,\; c_{i-1, j-1} + 1_{(a_{i}\neq b_{j})}\}\).
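As a sketch of how this recurrence fills the whole matrix, here is a small dynamic-programming implementation (lev_matrix is a name I made up; the first row and column hold the empty-string border, so the matrix indices are shifted by one with respect to \(c_{i, j}\)):
lev_matrix <- function(a, b){
  a_chr <- strsplit(a, "")[[1]]
  b_chr <- strsplit(b, "")[[1]]
  C <- matrix(0L, nrow = length(a_chr) + 1, ncol = length(b_chr) + 1,
              dimnames = list(c("", a_chr), c("", b_chr)))
  ##-- Border: distances to/from the empty string ----
  C[1, ] <- 0:length(b_chr)
  C[, 1] <- 0:length(a_chr)
  ##-- Fill each cell with the minimum over insertion, deletion and replacement ----
  for(i in seq_along(a_chr)){
    for(j in seq_along(b_chr)){
      C[i + 1, j + 1] <- min(C[i + 1, j] + 1,                  ## insertion
                             C[i, j + 1] + 1,                  ## deletion
                             C[i, j] + (a_chr[i] != b_chr[j])) ## replacement (free if match)
    }
  }
  C
}
lev_matrix("CARS", "PAIRS")
The returned matrix reproduces the table above, with the distance 2 in the bottom-right cell.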
Let’s read the operations off as an alignment between the two words. To transform “CARS” into “PAIRS” we first replace “C” with “P”, then “A” is a match in the second position; after this we insert “I” (the gap in the first row below), and then “R” and “S” are both matches:
| C | A |   | R | S |
|---|---|---|---|---|
| P | A | I | R | S |
If we go in the opposite direction and compute the distance between “PAIRS” and “CARS”, we have:
"" | C | A | R | S | |
---|---|---|---|---|---|
"" | 0 | 1 | 2 | 3 | 4 |
P | 1 | 1 | 2 | 3 | 4 |
A | 2 | 2 | 1 | 2 | 3 |
I | 3 | 3 | 2 | 2 | 3 |
R | 4 | 4 | 3 | 2 | 3 |
S | 5 | 5 | 4 | 3 | 2 |
and, reading it the same way, the alignment now deletes “I” (the gap in the second row):
| P | A | I | R | S |
|---|---|---|---|---|
| C | A |   | R | S |
Pairing names
Now that we know how the Levenshtein distance works, we can use it for name pairing. I am going to use the stringdist::stringdist function, which is much more general than the examples I gave earlier (adist from the utils package is also an alternative).
Just to check what we just did:
adist(x = "CARS", y = "PAIRS")
## [,1]
## [1,] 2
stringdist(a = "CARS", b = "PAIRS", method = "dl") ## dl = Damerau-Levenshtein
## [1] 2
First, some cleanup is necessary, because upper- and lowercase letters are considered different characters, and so are accented letters:
stringdist(a = "A", b = "a", method = "dl")
## [1] 1
stringdist(a = "a", b = "á", method = "dl")
## [1] 1
So I’ll replace accented characters, turn all letters to lowercase, and remove extra spacing in both datasets.
df1 <- df1 %>%
mutate(pad_name = iconv(x = name_correct, from = "UTF-8", to = "ASCII//TRANSLIT"),
pad_name = str_to_lower(string = pad_name),
pad_name = str_trim(string = pad_name, side = "both"))
df2 <- df2 %>%
mutate(pad_name = iconv(x = name, from = "UTF-8", to = "ASCII//TRANSLIT"),
pad_name = str_to_lower(string = pad_name),
pad_name = str_trim(string = pad_name, side = "both"))
Well, now we can compute the distance between each name in df2 and all the names in df1 and identify the most similar one (the one that minimizes the distance). After that, we can create a new variable in df2 with the matched names. Let’s also create a measure of similarity, \(1 - \frac{\operatorname {lev} _{a,b}}{\max\{|a|, |b|\}}\), which is \(1\) when the distance between the strings is \(0\), and \(0\) in the extreme case of comparing the empty string with a string of length \(\max\{|a|, |b|\}\).
##-- Distance matrix ----
mat_dist <- stringdistmatrix(a = df2$pad_name, b = df1$pad_name, method = "dl")
##-- Positions that minimizes the distance ----
pos_min <- apply(mat_dist, 1, which.min)
dist_min <- apply(mat_dist, 1, min)
max_length <- pmax(nchar(df2$pad_name), nchar(df1$pad_name[pos_min]))
df2$name_match <- df1$name_correct[pos_min]
df2$pad_name_match <- df1$pad_name[pos_min]
df2$score <- 1 - dist_min/max_length
Now for each name in df2 we have a matching name from df1, given by the name_match and pad_name_match columns, so we can just merge the datasets. Of course we should inspect the df2$name_match column, mainly in the cases with a low score, which means that even the most similar correct name is quite different from the name with typos.
data_merge <- merge(x = df1, y = df2, by.x = "pad_name", by.y = "pad_name_match", all.x = TRUE)
datatable(head(data_merge), options = options_table, rownames = FALSE) %>%
formatRound(columns = c("var_a", "var_b", "score"), digits = 4)
And finally we can verify if we matched all names correctly:
all(data_merge$name == data_merge$name_incorrect, na.rm = TRUE)
## [1] TRUE
Of course, extra caution is required for homonymous and very similar names. In the first case it is not possible to merge based only on the names, and in the second case a manual check is required. Still, on a large dataset the Levenshtein distance can save a lot of time, since only the names whose scores are less than one need to be checked.
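For instance, a simple way to list the matches that deserve a manual look, using the score column we built above, is:
##-- Candidates for manual review: non-exact matches ----
df2 %>%
  filter(score < 1) %>%
  select(name, name_match, score) %>%
  arrange(score)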