No feedback found for this session
Tidyselect
Introduction
Lots of the power in dplyr
comes from the many ways that you can select columns. In this section, we’ll look at some examples of tidyselect, which is a series of functions that allow you to specify columns by various patterns. Not only does this give you lots of ways of simplifying and streamlining your code, but it’s also a great way of making your code more portable - so that it works well inside functions, or across different data sets.
In the tidyevaluation training session, we look across this material in much more depth. There’s quite a lot to absorb, so I would definitely recommend using the manual pages (using ??tidyselect
) as a reference for this section.
Package loading is just:
library(NHSRdatasets) # for sample data
library(dplyr)
contains()
contains()
is a good example of tidyselect functions that are based on pattern matching. There are several related functions that work in the same way, including:
Here, we’ll concentrate on contains()
as an exemplar for this family, because the other functions in the family work in a very similar way. contains()
allows you to pick columns based on their names. So to select()
all the columns in stranded_data
whose names contain the string “care”:
# A tibble: 768 × 3
care.home.referral mental_health_care periods_of_previous_care
<int> <int> <int>
1 0 0 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 1
6 1 1 1
7 1 0 1
8 0 0 1
9 0 0 5
10 1 1 1
# ℹ 758 more rows
Note that by default this string matching is case insensitive, so “care” will match columns called “CARE”, “Care”, and “cArE”. You can also combine the contains()
statement with other select()
criteria:
# A tibble: 768 × 4
age care.home.referral mental_health_care periods_of_previous_care
<int> <int> <int> <int>
1 50 0 0 1
2 31 1 0 1
3 32 0 1 1
4 69 1 1 1
5 33 0 1 1
6 75 1 1 1
7 26 1 0 1
8 64 0 0 1
9 53 0 0 5
10 63 1 1 1
# ℹ 758 more rows
And you can match over a vector of strings:
# A tibble: 768 × 4
care.home.referral mental_health_care periods_of_previous_care age
<int> <int> <int> <int>
1 0 0 1 50
2 1 0 1 31
3 0 1 1 32
4 1 1 1 69
5 0 1 1 33
6 1 1 1 75
7 1 0 1 26
8 0 0 1 64
9 0 0 5 53
10 1 1 1 63
# ℹ 758 more rows
The order of strings in your vector will determine the order of columns returned:
# A tibble: 768 × 4
age care.home.referral mental_health_care periods_of_previous_care
<int> <int> <int> <int>
1 50 0 0 1
2 31 1 0 1
3 32 0 1 1
4 69 1 1 1
5 33 0 1 1
6 75 1 1 1
7 26 1 0 1
8 64 0 0 1
9 53 0 0 5
10 63 1 1 1
# ℹ 758 more rows
And you can negate the contains()
to give the complement:
# A tibble: 768 × 5
stranded.label medicallysafe hcop admit_date frailty_index
<chr> <int> <int> <chr> <chr>
1 Not Stranded 0 0 29/12/2020 No index item
2 Not Stranded 0 1 11/12/2020 No index item
3 Not Stranded 1 0 19/01/2021 No index item
4 Not Stranded 1 0 07/01/2021 Mobility problems
5 Not Stranded 0 1 15/01/2021 No index item
6 Stranded 1 0 02/01/2021 Mobility problems
7 Not Stranded 0 1 02/01/2021 No index item
8 Not Stranded 1 1 29/12/2020 Fall patient history
9 Not Stranded 1 0 04/01/2021 No index item
10 Not Stranded 0 0 30/12/2020 Activity Limitation
# ℹ 758 more rows
starts_with()
and ends_with()
work in exactly the same way - but will only match strings at the beginning and end of the column name respectively:
stranded_data |>
select(starts_with("care"))
# A tibble: 768 × 1
care.home.referral
<int>
1 0
2 1
3 0
4 1
5 0
6 1
7 1
8 0
9 0
10 1
# ℹ 758 more rows
# A tibble: 768 × 2
mental_health_care periods_of_previous_care
<int> <int>
1 0 1
2 0 1
3 1 1
4 1 1
5 1 1
6 1 1
7 0 1
8 0 1
9 0 5
10 1 1
# ℹ 758 more rows
matches()
allows use of a regular expression:
# A tibble: 768 × 2
medicallysafe mental_health_care
<int> <int>
1 0 0
2 0 0
3 1 1
4 1 1
5 0 1
6 1 1
7 0 0
8 1 0
9 1 0
10 0 1
# ℹ 758 more rows
num_range()
matches columns with numeric ranges. For example, in the billboard dataset, there are a large number of columns named wk1
, wk2
… representing weeks. To select wk10
to wk15
:
# A tibble: 317 × 6
wk10 wk11 wk12 wk13 wk14 wk15
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 51 51 51 47 44 38
4 61 61 59 61 66 72
5 57 64 70 75 76 78
6 6 7 22 29 36 47
7 NA NA NA NA NA NA
8 36 37 37 38 49 61
9 10 9 8 6 1 2
10 59 66 68 61 67 59
# ℹ 307 more rows
any_of()
any_of()
matches full column names from a vector of names. Let’s set up some names:
my_columns <- names(stranded_data)[1:4]
my_columns
[1] "stranded.label" "age" "care.home.referral"
[4] "medicallysafe"
If we supply any_of()
with this vector of names, any_of()
will match any column names that appear in the vector, and ignore those that do not:
# A tibble: 768 × 4
stranded.label age care.home.referral medicallysafe
<chr> <int> <int> <int>
1 Not Stranded 50 0 0
2 Not Stranded 31 1 0
3 Not Stranded 32 0 1
4 Not Stranded 69 1 1
5 Not Stranded 33 0 0
6 Stranded 75 1 1
7 Not Stranded 26 1 0
8 Not Stranded 64 0 1
9 Not Stranded 53 0 1
10 Not Stranded 63 1 0
# ℹ 758 more rows
Compared to contains()
, any_of()
will match the entire name, while contains()
will match any part of the name.
There’s a great tip on the all_of()
man page - any_of()
is an especially clever way to drop columns from a tibble because calling it more than once does not cause an error:
# A tibble: 768 × 5
hcop mental_health_care periods_of_previous_care admit_date frailty_index
<int> <int> <int> <chr> <chr>
1 0 0 1 29/12/2020 No index item
2 1 0 1 11/12/2020 No index item
3 0 1 1 19/01/2021 No index item
4 0 1 1 07/01/2021 Mobility proble…
5 1 1 1 15/01/2021 No index item
6 0 1 1 02/01/2021 Mobility proble…
7 1 0 1 02/01/2021 No index item
8 1 0 1 29/12/2020 Fall patient hi…
9 0 0 5 04/01/2021 No index item
10 0 1 1 30/12/2020 Activity Limita…
# ℹ 758 more rows
Trying to do something similar using select()
will cause an error:
if_any()
and if_all()
stranded_data |>
filter(if_any(contains("care"), ~ .x == 1)) # that'll find rows with a 1 in any "care" column
# A tibble: 723 × 9
stranded.label age care.home.referral medicallysafe hcop
<chr> <int> <int> <int> <int>
1 Not Stranded 50 0 0 0
2 Not Stranded 31 1 0 1
3 Not Stranded 32 0 1 0
4 Not Stranded 69 1 1 0
5 Not Stranded 33 0 0 1
6 Stranded 75 1 1 0
7 Not Stranded 26 1 0 1
8 Not Stranded 64 0 1 1
9 Not Stranded 63 1 0 0
10 Not Stranded 30 1 1 0
# ℹ 713 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
# admit_date <chr>, frailty_index <chr>
stranded_data |>
mutate(care_found = case_when(if_all(contains("care"), ~ .x >= 1) ~ "Yup",
TRUE ~ "Nope")) |>
select(contains("care"), care_found)
# A tibble: 768 × 4
care.home.referral mental_health_care periods_of_previous_care care_found
<int> <int> <int> <chr>
1 0 0 1 Nope
2 1 0 1 Nope
3 0 1 1 Nope
4 1 1 1 Yup
5 0 1 1 Nope
6 1 1 1 Yup
7 1 0 1 Nope
8 0 0 1 Nope
9 0 0 5 Nope
10 1 1 1 Yup
# ℹ 758 more rows
everything()
everything()
selects all columns. This is less useful in combination with select()
itself, but simplifies some other functions well - particularly with pivot_longer()
.
stranded_data |>
select(2:5) |>
tidyr::pivot_longer(everything())
# A tibble: 3,072 × 2
name value
<chr> <int>
1 age 50
2 care.home.referral 0
3 medicallysafe 0
4 hcop 0
5 age 31
6 care.home.referral 1
7 medicallysafe 0
8 hcop 1
9 age 32
10 care.home.referral 0
# ℹ 3,062 more rows
See also last_col()
, which is another example of a tidyselect helper for specific columns.