No feedback found for this session
stringdist
Introduction
This is an introduction to stringdist which is designed to allow fuzzy matching of text. Install and attach in the standard way:
# install.packages("stringdist") # likely needs compiling
library(stringdist)
library(ggplot2) # for something fun later
While there are several families of functions in the package, we’ll concentrate on those that focus on the core idea: strings differ, and you can quantify those differences for pleasure and profit. A simple example:
stringdist("thing", "think")
[1] 1
stringdist("thing", "string")
[1] 2
stringdist("thing", "oxpecker")
[1] 8
Once you’ve got an idea about how different you’re okay with your strings being, you can use e.g. a max distance as a cut-off to do work in your code. A simple example of implementing this can be found in the amatch
function, which is designed to work like a fuzzy matching version of base-R’s match
. We’ll use a
and b
as variable names to keep aligned with the stringdist documentation. Quick refresher on match
:
<- "the" # take a value
a <- c("this", "is", "the", "thing") # and a vector of possible values
b match(a, b) # match returns the index location of your a in your b
[1] 3
Match lets you do useful work with strings - e.g. remove everything before your key word:
match(a, b):length(b)] b[
[1] "the" "thing"
exemplar: amatch
And now for the amatch
version:
<- "thee" # take a value
a amatch(a, b, maxDist = 1) # maxDist = allowable number of typos
[1] 3
amatch(a, b, maxDist = 10) # where several possible values exist, amatch returns the closest match
[1] 3
You can use that in a base-R way to e.g. correct wonky input values:
amatch(a, b, maxDist = 1)] b[
[1] "the"
Or you might prefer the ain helper:
ain(a, b, maxDist = 1)
[1] TRUE
%in% b # like a fuzzy version of %in% a
[1] FALSE
"the" %in% b
[1] TRUE
Or in a slightly horrid but tidyish way if you wanted to tidy up several different values:
::tibble(raw_data = c("thi", "ss", "thee", "thung")) |>
dplyr::rowwise() |>
dplyr::mutate(is_match = b[amatch(raw_data, b, maxDist = 5)]) dplyr
# A tibble: 4 × 2
# Rowwise:
raw_data is_match
<chr> <chr>
1 thi this
2 ss is
3 thee the
4 thung thing
other functions
stringdist
, stringsim
, and stringdistmatrix
stringdist
has a nice toolkit of related functions. stringdist
gives you the distance between strings:
stringdist("thing", "think") # gives you the raw distance.
[1] 1
stringdist(a, b) # vectorised
[1] 2 4 1 3
stringsim
does the opposite, giving a similarity score between 0 and 1:
stringsim(a, b)
[1] 0.50 0.00 0.75 0.40
stringsim("cat", "dog")
[1] 0
stringsim("cat", "cat")
[1] 1
stringdistmatrix
does what you might imagine, returning a matrix of differences. This is largely useful if you want e.g. to pass these scores into something like a clustering algorithm that expects a matrix:
stringdistmatrix(a, b) # same idea, different output: a matrix of differences
[,1] [,2] [,3] [,4]
[1,] 2 4 1 3
hclust(stringdistmatrix(b))
Call:
hclust(d = stringdistmatrix(b))
Cluster method : complete
Distance : osa
Number of objects: 4
plot(hclust(stringdistmatrix(b, useNames = "strings")))
It’s possible-but-messy to take this matrix and tidy it into a tibble:
stringdistmatrix(c("thee", "che"), b, useNames = "strings") |>
as.data.frame() |>
::as_tibble(rownames = "a") |> # slightly roundabout conversion to tibble via df to preserve rownames
tibble::pivot_longer(!a, names_to = "b", values_to = "stringdist") # useful tidyr
# A tibble: 8 × 3
a b stringdist
<chr> <chr> <dbl>
1 thee this 2
2 thee is 4
3 thee the 1
4 thee thing 3
5 che this 3
6 che is 3
7 che the 1
8 che thing 4
But probably better/more idiomatic to go:
::expand_grid(a = c("thee", "che"), b) |>
tidyr::mutate(stringdist = stringdist(a, b)) dplyr
# A tibble: 8 × 3
a b stringdist
<chr> <chr> <dbl>
1 thee this 2
2 thee is 4
3 thee the 1
4 thee thing 3
5 che this 3
6 che is 3
7 che the 1
8 che thing 4
afind
, grab
, grabl
, and extract
afind(a, b)
$location
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
$distance
[,1] [,2] [,3] [,4]
[1,] 2 2 0 3
$match
[,1] [,2] [,3] [,4]
[1,] "thee" "th" "the" "thee"
afind
returns a lot of stuff:
afind(a, b)$location # where in substrings do the search terms start?
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
afind(a, b)$distance # how far from the search term is each substring?
[,1] [,2] [,3] [,4]
[1,] 2 2 0 3
afind(a, b)$match # what's the best matching part of the search term against each substring
[,1] [,2] [,3] [,4]
[1,] "thee" "th" "the" "thee"
If that’s a bit confusing, grab
and grabl
are basically replacements for grep(l)
. grab
== grep
and grabl
== grepl
:
<- c("wig", "win", "banana", "ring")
c grab(c, "wig", maxDist = 1)] c[
[1] "wig" "win"
grab(c, "wig", maxDist = 2)] c[
[1] "wig" "win" "ring"
grabl(c, "wig", maxDist = 2)
[1] TRUE TRUE FALSE TRUE
grab(c, "wig", maxDist = 3)] c[
[1] "wig" "win" "banana" "ring"
extract(c, "wii", maxDist=1)
[,1]
[1,] "wig"
[2,] "win"
[3,] NA
[4,] NA
What do you mean closest?
The really interesting bit: you can play with the method used to calculate distance which will give different results:
hamming
Hamming distance. Number of character substitutions required to change a into b.
amatch("thign", b, maxDist = 1, method = "hamming") # not found
[1] NA
amatch("thign", b, maxDist = 2, method = "hamming") # n subs g, g subs n = 2
[1] 4
lv
Levenshtein distance. Number of del/ins/subs required to change a into b.
amatch("thinng", b, maxDist = 2, method = "hamming")
[1] NA
amatch("thinng", b, maxDist = 2, method = "lv") # as we can now delete
[1] 4
osa
Optimal string aligment. Number of del/ins/subs/swaps required to change a into b:
amatch("think", b, maxDist = 1)
[1] 4
amatch("thign", b, maxDist = 1, method = "osa") # n swaps with g
[1] 4
dl
Full Damerau-Levenshtein distance. As osa, but calculated slightly differently.
amatch("thign", b, maxDist = 1, method = "dl") # can't find a difference with these simple ones from osa
[1] 4
lcs
Longest common substring distance. Comparing longest identical parts of a and b:
amatch("thign", b, maxDist = 2, method = "lcs") # I think roughly = length - common substring
[1] 4
soundex
Distance based on soundex encoding. Used in the phonetic
function (see below), but finds homophones:
amatch("thign", b, maxDist = 1, method = "soundex")
[1] 1
stringdist("too", c("two", "to", "oto"), method = "soundex")
[1] 0 0 1
There are also several methods designed to look at q-grams = substring chunks of q characters long, and a group of methods based on the number of similar and different q-grams between a and b:
qgram
This is the q-gram distance:
stringdist("thign", b, method = "qgram")
[1] 3 5 4 0
cosine
The cosine distance between q-gram profiles, which gives you (usefully) a scaled distance of between 0 (identical) and 1 (utterly different)
stringdist("thing", b, method = "cosine")
[1] 0.3291796 0.6837722 0.4836022 0.0000000
stringdist("thing", "fox", method = "cosine")
[1] 1
jaccard
The Jaccard distance between q-gram profiles
stringdist("thing", b, method = "cosine")
[1] 0.3291796 0.6837722 0.4836022 0.0000000
stringdist("thing", b, method = "jaccard")
[1] 0.5000000 0.8333333 0.6666667 0.0000000
jw
The Jaro distance (and Jaro-Winkler distance):
stringdist("thing", b, method = "jw", p = 0.2) # correction term
[1] 0.08666667 1.00000000 0.18666667 0.00000000
phonetic
There’s also tooling to deal with the similarity of phonetic spellings. Excellent for e.g. Teams transcripts, if you’re trying to tidy those up:
phonetic("ring")
[1] "R520"
phonetic("wing") # similar sounding strings get similar values
[1] "W520"
phonetic("string")
[1] "S365"
phonetic("strung")
[1] "S365"
a play example: GGC
The KIND network has lots of members from the main NHS body in Glasgow, which is officially known as “NHS Greater Glasgow and Clyde”. That’s a highly variable and abbreviatable name though, leading to over 70 different permutations in common use. Here are some real-ish examples
<- readr::read_lines(here::here("r_training/data/ggc.txt")) # real data
ggc
sample(ggc, 10)
[1] "NHS GG&C West of Scotland Breast Screening"
[2] "NHSGGC"
[3] "Greater Glasgow and Clyde"
[4] "NHS GG&C"
[5] "NHSGGC"
[6] "NHSGGC"
[7] "NHS GGC - Public Health - Health Improvement"
[8] "Greater Glasgow & Clyde"
[9] "NHS GGC"
[10] "NHS GG&C"
There’s lots here to play with using stringdist
. My two quick examples are to plot a couple of different sensible variants of the name (“NHS GGC”, “NHS Greater Glasgow and Clyde”, “NHSGGC”) methods, and see how many distant results we get:
::expand_grid(a = c("NHS GGC", "NHS Greater Glasgow and Clyde", "NHSGGC"), b = ggc) |>
tidyr::mutate(stringdist = stringdist(a, b)) |>
dplyrggplot() +
geom_density(aes(x = stringdist, fill = a)) +
facet_wrap(~a, ncol = 1) +
theme(legend.position = "none")
Or to try a family tree (more swishily, a cluster dendrogram) of those terms:
sort(table(ggc), decreasing = T)[1:7] |>
stringdistmatrix(useNames = "names") |>
hclust() |>
plot()