stringdist

R
intermediate
NLP
Published

September 23, 2024

No feedback found for this session

Introduction

This is an introduction to stringdist which is designed to allow fuzzy matching of text. Install and attach in the standard way:

# install.packages("stringdist") # likely needs compiling
library(stringdist) 
library(ggplot2) # for something fun later

While there are several families of functions in the package, we’ll concentrate on those that focus on the core idea: strings differ, and you can quantify those differences for pleasure and profit. A simple example:

stringdist("thing", "think")
[1] 1
stringdist("thing", "string")
[1] 2
stringdist("thing", "oxpecker")
[1] 8

Once you’ve got an idea about how different you’re okay with your strings being, you can use e.g. a max distance as a cut-off to do work in your code. A simple example of implementing this can be found in the amatch function, which is designed to work like a fuzzy matching version of base-R’s match. We’ll use a and b as variable names to keep aligned with the stringdist documentation. Quick refresher on match:

a <- "the" # take a value
b <- c("this", "is", "the", "thing") # and a vector of possible values
match(a, b) # match returns the index location of your a in your b
[1] 3

Match lets you do useful work with strings - e.g. remove everything before your key word:

b[match(a, b):length(b)]
[1] "the"   "thing"

exemplar: amatch

And now for the amatch version:

a <- "thee" # take a value
amatch(a, b, maxDist = 1) # maxDist = allowable number of typos
[1] 3
amatch(a, b, maxDist = 10) # where several possible values exist, amatch returns the closest match
[1] 3

You can use that in a base-R way to e.g. correct wonky input values:

b[amatch(a, b, maxDist = 1)] 
[1] "the"

Or you might prefer the ain helper:

ain(a, b, maxDist = 1)
[1] TRUE
a %in% b # like a fuzzy version of %in%
[1] FALSE
"the" %in% b
[1] TRUE

Or in a slightly horrid but tidyish way if you wanted to tidy up several different values:

dplyr::tibble(raw_data = c("thi", "ss", "thee", "thung")) |>
  dplyr::rowwise() |>
  dplyr::mutate(is_match = b[amatch(raw_data, b, maxDist = 5)])
# A tibble: 4 × 2
# Rowwise: 
  raw_data is_match
  <chr>    <chr>   
1 thi      this    
2 ss       is      
3 thee     the     
4 thung    thing   

other functions

stringdist, stringsim, and stringdistmatrix

stringdist has a nice toolkit of related functions. stringdist gives you the distance between strings:

stringdist("thing", "think") # gives you the raw distance.
[1] 1
stringdist(a, b) # vectorised
[1] 2 4 1 3

stringsim does the opposite, giving a similarity score between 0 and 1:

stringsim(a, b)
[1] 0.50 0.00 0.75 0.40
stringsim("cat", "dog")
[1] 0
stringsim("cat", "cat")
[1] 1

stringdistmatrix does what you might imagine, returning a matrix of differences. This is largely useful if you want e.g. to pass these scores into something like a clustering algorithm that expects a matrix:

stringdistmatrix(a, b) # same idea, different output: a matrix of differences
     [,1] [,2] [,3] [,4]
[1,]    2    4    1    3
hclust(stringdistmatrix(b))

Call:
hclust(d = stringdistmatrix(b))

Cluster method   : complete 
Distance         : osa 
Number of objects: 4 
plot(hclust(stringdistmatrix(b, useNames = "strings")))

It’s possible-but-messy to take this matrix and tidy it into a tibble:

stringdistmatrix(c("thee", "che"), b, useNames = "strings") |>
  as.data.frame() |>
  tibble::as_tibble(rownames = "a") |> # slightly roundabout conversion to tibble via df to preserve rownames
  tidyr::pivot_longer(!a, names_to = "b", values_to = "stringdist") # useful 
# A tibble: 8 × 3
  a     b     stringdist
  <chr> <chr>      <dbl>
1 thee  this           2
2 thee  is             4
3 thee  the            1
4 thee  thing          3
5 che   this           3
6 che   is             3
7 che   the            1
8 che   thing          4

But probably better/more idiomatic to go:

tidyr::expand_grid(a = c("thee", "che"), b) |>
  dplyr::mutate(stringdist = stringdist(a, b))
# A tibble: 8 × 3
  a     b     stringdist
  <chr> <chr>      <dbl>
1 thee  this           2
2 thee  is             4
3 thee  the            1
4 thee  thing          3
5 che   this           3
6 che   is             3
7 che   the            1
8 che   thing          4

afind, grab, grabl, and extract

afind(a, b)
$location
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1

$distance
     [,1] [,2] [,3] [,4]
[1,]    2    2    0    3

$match
     [,1]   [,2] [,3]  [,4]  
[1,] "thee" "th" "the" "thee"

afind returns a lot of stuff:

afind(a, b)$location # where in substrings do the search terms start?
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
afind(a, b)$distance # how far from the search term is each substring?
     [,1] [,2] [,3] [,4]
[1,]    2    2    0    3
afind(a, b)$match # what's the best matching part of the search term against each substring
     [,1]   [,2] [,3]  [,4]  
[1,] "thee" "th" "the" "thee"

If that’s a bit confusing, grab and grabl are basically replacements for grep(l). grab == grep and grabl == grepl:

c <- c("wig", "win", "banana", "ring")
c[grab(c,  "wig", maxDist = 1)]
[1] "wig" "win"
c[grab(c,  "wig", maxDist = 2)]
[1] "wig"  "win"  "ring"
grabl(c,  "wig", maxDist = 2)
[1]  TRUE  TRUE FALSE  TRUE
c[grab(c,  "wig", maxDist = 3)]
[1] "wig"    "win"    "banana" "ring"  
extract(c, "wii", maxDist=1)
     [,1] 
[1,] "wig"
[2,] "win"
[3,] NA   
[4,] NA   

What do you mean closest?

The really interesting bit: you can play with the method used to calculate distance which will give different results:

hamming

Hamming distance. Number of character substitutions required to change a into b.

amatch("thign", b, maxDist = 1, method = "hamming") # not found
[1] NA
amatch("thign", b, maxDist = 2, method = "hamming") # n subs g, g subs n = 2
[1] 4

lv

Levenshtein distance. Number of del/ins/subs required to change a into b.

amatch("thinng", b, maxDist = 2, method = "hamming")
[1] NA
amatch("thinng", b, maxDist = 2, method = "lv") # as we can now delete
[1] 4

osa

Optimal string aligment. Number of del/ins/subs/swaps required to change a into b:

amatch("think", b, maxDist = 1)
[1] 4
amatch("thign", b, maxDist = 1, method = "osa") # n swaps with g
[1] 4

dl

Full Damerau-Levenshtein distance. As osa, but calculated slightly differently.

amatch("thign", b, maxDist = 1, method = "dl") # can't find a difference with these simple ones from osa
[1] 4

lcs

Longest common substring distance. Comparing longest identical parts of a and b:

amatch("thign", b, maxDist = 2, method = "lcs") # I think roughly = length - common substring
[1] 4

soundex

Distance based on soundex encoding. Used in the phonetic function (see below), but finds homophones:

amatch("thign", b, maxDist = 1, method = "soundex")
[1] 1
stringdist("too", c("two", "to", "oto"), method = "soundex")
[1] 0 0 1

There are also several methods designed to look at q-grams = substring chunks of q characters long, and a group of methods based on the number of similar and different q-grams between a and b:

qgram

This is the q-gram distance:

stringdist("thign", b, method = "qgram")
[1] 3 5 4 0

cosine

The cosine distance between q-gram profiles, which gives you (usefully) a scaled distance of between 0 (identical) and 1 (utterly different)

stringdist("thing", b, method = "cosine")
[1] 0.3291796 0.6837722 0.4836022 0.0000000
stringdist("thing", "fox", method = "cosine")
[1] 1

jaccard

The Jaccard distance between q-gram profiles

stringdist("thing", b, method = "cosine") 
[1] 0.3291796 0.6837722 0.4836022 0.0000000
stringdist("thing", b, method = "jaccard") 
[1] 0.5000000 0.8333333 0.6666667 0.0000000

jw

The Jaro distance (and Jaro-Winkler distance):

stringdist("thing", b, method = "jw", p = 0.2) # correction term
[1] 0.08666667 1.00000000 0.18666667 0.00000000

phonetic

There’s also tooling to deal with the similarity of phonetic spellings. Excellent for e.g. Teams transcripts, if you’re trying to tidy those up:

phonetic("ring")
[1] "R520"
phonetic("wing") # similar sounding strings get similar values
[1] "W520"
phonetic("string")
[1] "S365"
phonetic("strung")
[1] "S365"

a play example: GGC

The KIND network has lots of members from the main NHS body in Glasgow, which is officially known as “NHS Greater Glasgow and Clyde”. That’s a highly variable and abbreviatable name though, leading to over 70 different permutations in common use. Here are some real-ish examples

ggc <- readr::read_lines(here::here("r_training/data/ggc.txt")) # real data

sample(ggc, 10)
 [1] "GGC"                                         
 [2] "NHS GGC"                                     
 [3] "NHS GGC - Public Health - Health Improvement"
 [4] "NHS GGC"                                     
 [5] "Greater Glasgow & Clyde"                     
 [6] "NHSGGC"                                      
 [7] "NHS Greater Glasgow & Clyde"                 
 [8] "NHS GG&C"                                    
 [9] "NHS GGC"                                     
[10] "NHS GG&C"                                    

There’s lots here to play with using stringdist. My two quick examples are to plot a couple of different sensible variants of the name (“NHS GGC”, “NHS Greater Glasgow and Clyde”, “NHSGGC”) methods, and see how many distant results we get:

tidyr::expand_grid(a = c("NHS GGC", "NHS Greater Glasgow and Clyde", "NHSGGC"), b = ggc) |>
  dplyr::mutate(stringdist = stringdist(a, b)) |>
  ggplot() +
  geom_density(aes(x = stringdist, fill = a)) +
  facet_wrap(~a, ncol = 1) +
  theme(legend.position = "none")

Or to try a family tree (more swishily, a cluster dendrogram) of those terms:

sort(table(ggc), decreasing = T)[1:7] |>
  stringdistmatrix(useNames = "names") |>
  hclust() |>
  plot()