Joining data with dplyr

R
intermediate
Published

May 19, 2025

No feedback found for this session

Session outline

This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to do three things with dplyr:

  • explain the difference between binds and joins
  • to demonstrate a range of different joining strategies
  • to explain dplyr’s new tools for controlling joins (.by, by(), multiple(), unmatched())

You might also like some of the other dplyr-themed practical sessions:

Joins

Joins are where we merge two tibbles together in some way, while broadly preserving the structure of each. There are a couple of variants that we’ll introduce in turn:

  • binding joins, which are joins where data are added based on position. For example, we merge the each of the first columns in the two tibbles into one new first column.
  • mutating joins, which are joins where data are added by values to match existing data. For example, we might match two tibbles based on CHI number, and connect two otherwise dissimilar data sets together.

Let’s start with binding joins - after a bit of basic package loading:

Binding

To demonstrate how joins work in dplyr, we need to start with some suitable data.

We’ll set up some extremely simple data using tribble() to demonstrate how the joins work:

x <- tibble::tribble(
  ~category, ~value, ~key,
  "x",5,3,
  "x",2,7,
  "x",4,2,
  "x",7,1,
  "x",9,1
  )



y <- tibble::tribble(
  ~category, ~value, ~key,
  "y",2,5,
  "y",7,2,
  "y",3,2,
  "y",2,8,
  "y",2,14
  )

bind_rows()

bind_rows() does exactly what the name describes. It adds the rows of one tibble to another, resulting in one joined tibble containing all the rows:

x |>
  bind_rows(y) 
# A tibble: 10 × 3
   category value   key
   <chr>    <dbl> <dbl>
 1 x            5     3
 2 x            2     7
 3 x            4     2
 4 x            7     1
 5 x            9     1
 6 y            2     5
 7 y            7     2
 8 y            3     2
 9 y            2     8
10 y            2    14

bind_rows() will create new columns and fill with NAs if the names of the columns don’t match:

x |>
  rename("wrong_name" = "value") |>
  bind_rows(y)
# A tibble: 10 × 4
   category wrong_name   key value
   <chr>         <dbl> <dbl> <dbl>
 1 x                 5     3    NA
 2 x                 2     7    NA
 3 x                 4     2    NA
 4 x                 7     1    NA
 5 x                 9     1    NA
 6 y                NA     5     2
 7 y                NA     2     7
 8 y                NA     2     3
 9 y                NA     8     2
10 y                NA    14     2

And you can create an optional ID column to show where each row originated, which is helpful for debugging:

x |>
  bind_rows(y, .id="ID") 
# A tibble: 10 × 4
   ID    category value   key
   <chr> <chr>    <dbl> <dbl>
 1 1     x            5     3
 2 1     x            2     7
 3 1     x            4     2
 4 1     x            7     1
 5 1     x            9     1
 6 2     y            2     5
 7 2     y            7     2
 8 2     y            3     2
 9 2     y            2     8
10 2     y            2    14

You can also bind_rows() by list:

bind_rows(list(x, y)) 
# A tibble: 10 × 3
   category value   key
   <chr>    <dbl> <dbl>
 1 x            5     3
 2 x            2     7
 3 x            4     2
 4 x            7     1
 5 x            9     1
 6 y            2     5
 7 y            7     2
 8 y            3     2
 9 y            2     8
10 y            2    14

bind_cols()

Another function, another descriptive name. bind_cols() adds columns to the starting tibble, but otherwise works as bind_rows():

x |>
  bind_cols(y, .name_repair="universal") 
# A tibble: 5 × 6
  category...1 value...2 key...3 category...4 value...5 key...6
  <chr>            <dbl>   <dbl> <chr>            <dbl>   <dbl>
1 x                    5       3 y                    2       5
2 x                    2       7 y                    7       2
3 x                    4       2 y                    3       2
4 x                    7       1 y                    2       8
5 x                    9       1 y                    2      14

Mutating joins

To get started with mutating joins in dplyr, we could do much worse than the first part of the manual page, which is nice and clear:


The mutating joins add columns from y to x, matching rows based on the keys:

If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x.


inner_join()

Inner join

inner_join() joins two tibbles by whichever key column you supply using by=:

x |>
  inner_join(y, by="key") 
# A tibble: 2 × 5
  category.x value.x   key category.y value.y
  <chr>        <dbl> <dbl> <chr>        <dbl>
1 x                4     2 y                7
2 x                4     2 y                3

This returns all the rows that exist in both tibbles.

left_join()

Left join

left_join() includes all the rows in the first tibble:

x |>
  left_join(y, by="key", suffix = c(".a", ".b")) 
# A tibble: 6 × 5
  category.a value.a   key category.b value.b
  <chr>        <dbl> <dbl> <chr>        <dbl>
1 x                5     3 <NA>            NA
2 x                2     7 <NA>            NA
3 x                4     2 y                7
4 x                4     2 y                3
5 x                7     1 <NA>            NA
6 x                9     1 <NA>            NA

We get returned the rows that exist in x, and any that match keys from y. Any ‘extra’ rows in simple_a are filled in with NAs. We’ve also tweaked the column names in this example, using the suffix argument.

right_join()

Right join

right_join() includes all the rows in the second tibble:

x |>
  right_join(y, by="key", keep=T) 
# A tibble: 5 × 6
  category.x value.x key.x category.y value.y key.y
  <chr>        <dbl> <dbl> <chr>        <dbl> <dbl>
1 x                4     2 y                7     2
2 x                4     2 y                3     2
3 <NA>            NA    NA y                2     5
4 <NA>            NA    NA y                2     8
5 <NA>            NA    NA y                2    14

We get all the rows that exist in y, and any that match keys from simple_a, again, filling with NAs where appropriate. Here, we’re demonstrating the keep argument, which will repeat the column on which we’re joining for each of the tibbles.

full_join()

Full join

full_join() gives us all the rows from all the tibbles:

x |>
  full_join(y, by="key") 
# A tibble: 9 × 5
  category.x value.x   key category.y value.y
  <chr>        <dbl> <dbl> <chr>        <dbl>
1 x                5     3 <NA>            NA
2 x                2     7 <NA>            NA
3 x                4     2 y                7
4 x                4     2 y                3
5 x                7     1 <NA>            NA
6 x                9     1 <NA>            NA
7 <NA>            NA     5 y                2
8 <NA>            NA     8 y                2
9 <NA>            NA    14 y                2

All rows in both tibbles, filling with NAs where needed.

anti_join()

Anti join

Anti-join is a bit of an outlier, because it returns a smaller tibble than it is supplied with. It returns all the rows of our first tibble that are not present in the second tibble:

x |>
  anti_join(y, by="key") 
# A tibble: 4 × 3
  category value   key
  <chr>    <dbl> <dbl>
1 x            5     3
2 x            2     7
3 x            7     1
4 x            9     1

Only those rows in x that are not present in y.