No feedback found for this session
Joining data with dplyr
Session outline
This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to do three things with dplyr:
- explain the difference between binds and joins
- to demonstrate a range of different joining strategies
- to explain dplyr’s new tools for controlling joins (
.by
,by()
,multiple()
,unmatched()
)
You might also like some of the other dplyr-themed practical sessions:
Joins
Joins are where we merge two tibbles together in some way, while broadly preserving the structure of each. There are a couple of variants that we’ll introduce in turn:
- binding joins, which are joins where data are added based on position. For example, we merge the each of the first columns in the two tibbles into one new first column.
- mutating joins, which are joins where data are added by values to match existing data. For example, we might match two tibbles based on CHI number, and connect two otherwise dissimilar data sets together.
Let’s start with binding joins - after a bit of basic package loading:
Binding
To demonstrate how joins work in dplyr, we need to start with some suitable data.
We’ll set up some extremely simple data using tribble()
to demonstrate how the joins work:
bind_rows()
bind_rows()
does exactly what the name describes. It adds the rows of one tibble to another, resulting in one joined tibble containing all the rows:
x |>
bind_rows(y)
# A tibble: 10 × 3
category value key
<chr> <dbl> <dbl>
1 x 5 3
2 x 2 7
3 x 4 2
4 x 7 1
5 x 9 1
6 y 2 5
7 y 7 2
8 y 3 2
9 y 2 8
10 y 2 14
bind_rows()
will create new columns and fill with NAs if the names of the columns don’t match:
# A tibble: 10 × 4
category wrong_name key value
<chr> <dbl> <dbl> <dbl>
1 x 5 3 NA
2 x 2 7 NA
3 x 4 2 NA
4 x 7 1 NA
5 x 9 1 NA
6 y NA 5 2
7 y NA 2 7
8 y NA 2 3
9 y NA 8 2
10 y NA 14 2
And you can create an optional ID column to show where each row originated, which is helpful for debugging:
x |>
bind_rows(y, .id="ID")
# A tibble: 10 × 4
ID category value key
<chr> <chr> <dbl> <dbl>
1 1 x 5 3
2 1 x 2 7
3 1 x 4 2
4 1 x 7 1
5 1 x 9 1
6 2 y 2 5
7 2 y 7 2
8 2 y 3 2
9 2 y 2 8
10 2 y 2 14
You can also bind_rows()
by list:
bind_cols()
Another function, another descriptive name. bind_cols()
adds columns to the starting tibble, but otherwise works as bind_rows()
:
x |>
bind_cols(y, .name_repair="universal")
# A tibble: 5 × 6
category...1 value...2 key...3 category...4 value...5 key...6
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 x 5 3 y 2 5
2 x 2 7 y 7 2
3 x 4 2 y 3 2
4 x 7 1 y 2 8
5 x 9 1 y 2 14
Mutating joins
To get started with mutating joins in dplyr, we could do much worse than the first part of the manual page, which is nice and clear:
The mutating joins add columns from y to x, matching rows based on the keys:
-
inner_join()
: includes all rows in x and y. -
left_join()
: includes all rows in x. -
right_join()
: includes all rows in y. -
full_join()
: includes all rows in x or y. -
anti_join()
: includes all rows in x that are not in y.
If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x.
inner_join()
inner_join()
joins two tibbles by whichever key column you supply using by=
:
x |>
inner_join(y, by="key")
# A tibble: 2 × 5
category.x value.x key category.y value.y
<chr> <dbl> <dbl> <chr> <dbl>
1 x 4 2 y 7
2 x 4 2 y 3
This returns all the rows that exist in both tibbles.
left_join()
left_join()
includes all the rows in the first tibble:
# A tibble: 6 × 5
category.a value.a key category.b value.b
<chr> <dbl> <dbl> <chr> <dbl>
1 x 5 3 <NA> NA
2 x 2 7 <NA> NA
3 x 4 2 y 7
4 x 4 2 y 3
5 x 7 1 <NA> NA
6 x 9 1 <NA> NA
We get returned the rows that exist in x
, and any that match keys from y
. Any ‘extra’ rows in simple_a are filled in with NAs
. We’ve also tweaked the column names in this example, using the suffix
argument.
right_join()
right_join()
includes all the rows in the second tibble:
x |>
right_join(y, by="key", keep=T)
# A tibble: 5 × 6
category.x value.x key.x category.y value.y key.y
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 x 4 2 y 7 2
2 x 4 2 y 3 2
3 <NA> NA NA y 2 5
4 <NA> NA NA y 2 8
5 <NA> NA NA y 2 14
We get all the rows that exist in y
, and any that match keys from simple_a
, again, filling with NAs where appropriate. Here, we’re demonstrating the keep
argument, which will repeat the column on which we’re joining for each of the tibbles.
full_join()
full_join()
gives us all the rows from all the tibbles:
x |>
full_join(y, by="key")
# A tibble: 9 × 5
category.x value.x key category.y value.y
<chr> <dbl> <dbl> <chr> <dbl>
1 x 5 3 <NA> NA
2 x 2 7 <NA> NA
3 x 4 2 y 7
4 x 4 2 y 3
5 x 7 1 <NA> NA
6 x 9 1 <NA> NA
7 <NA> NA 5 y 2
8 <NA> NA 8 y 2
9 <NA> NA 14 y 2
All rows in both tibbles, filling with NAs
where needed.
anti_join()
Anti-join is a bit of an outlier, because it returns a smaller tibble than it is supplied with. It returns all the rows of our first tibble that are not present in the second tibble:
x |>
anti_join(y, by="key")
# A tibble: 4 × 3
category value key
<chr> <dbl> <dbl>
1 x 5 3
2 x 2 7
3 x 7 1
4 x 9 1
Only those rows in x
that are not present in y
.