No feedback found for this session
Tidyselect
Introduction
This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to explore dplyr’s built-in language for selecting columns: tidyselect.
Package loading is just:
library(NHSRdatasets) # for sample data
library(dplyr)
Logical operators
The four operators that we’ll introduce here are found all over R, and aren’t in any way specific to the tidyverse, so we’ll review them very briefly before moving on to the more idiosyncratic aspects of tidyselect.
:
allows you to select a range of columns. So to select the second to the fifth columns in the stranded_data
set by name:
stranded_data |>
select(age:hcop)
# A tibble: 768 × 4
age care.home.referral medicallysafe hcop
<int> <int> <int> <int>
1 50 0 0 0
2 31 1 0 1
3 32 0 1 0
4 69 1 1 0
5 33 0 0 1
6 75 1 1 0
7 26 1 0 1
8 64 0 1 1
9 53 0 1 0
10 63 1 0 0
# ℹ 758 more rows
Similarly, by index:
stranded_data |>
select(2:5)
# A tibble: 768 × 4
age care.home.referral medicallysafe hcop
<int> <int> <int> <int>
1 50 0 0 0
2 31 1 0 1
3 32 0 1 0
4 69 1 1 0
5 33 0 0 1
6 75 1 1 0
7 26 1 0 1
8 64 0 1 1
9 53 0 1 0
10 63 1 0 0
# ℹ 758 more rows
!
allows you to select the complement - in other words, everything but the specified column:
stranded_data |>
select(!stranded.label)
# A tibble: 768 × 8
age care.home.referral medicallysafe hcop mental_health_care
<int> <int> <int> <int> <int>
1 50 0 0 0 0
2 31 1 0 1 0
3 32 0 1 0 1
4 69 1 1 0 1
5 33 0 0 1 1
6 75 1 1 0 1
7 26 1 0 1 0
8 64 0 1 1 0
9 53 0 1 0 0
10 63 1 0 0 1
# ℹ 758 more rows
# ℹ 3 more variables: periods_of_previous_care <int>, admit_date <chr>,
# frailty_index <chr>
Use c()
if you want to drop multiple columns using !
:
# A tibble: 768 × 7
care.home.referral medicallysafe hcop mental_health_care
<int> <int> <int> <int>
1 0 0 0 0
2 1 0 1 0
3 0 1 0 1
4 1 1 0 1
5 0 0 1 1
6 1 1 0 1
7 1 0 1 0
8 0 1 1 0
9 0 1 0 0
10 1 0 0 1
# ℹ 758 more rows
# ℹ 3 more variables: periods_of_previous_care <int>, admit_date <chr>,
# frailty_index <chr>
&
lets you group together selection helpers. If you’re selecting many columns by name, &
isn’t needed. Instead, you can just specify what you need to keep:
stranded_data |>
select(stranded.label, hcop, age)
# A tibble: 768 × 3
stranded.label hcop age
<chr> <int> <int>
1 Not Stranded 0 50
2 Not Stranded 1 31
3 Not Stranded 0 32
4 Not Stranded 0 69
5 Not Stranded 1 33
6 Stranded 0 75
7 Not Stranded 1 26
8 Not Stranded 1 64
9 Not Stranded 0 53
10 Not Stranded 0 63
# ℹ 758 more rows
Where &
is really useful is in combination with pattern matching helpers, which we’ll look at in more detail in the next subsection:
stranded_data |>
select(starts_with("m") & ends_with("e"))
# A tibble: 768 × 2
medicallysafe mental_health_care
<int> <int>
1 0 0
2 0 0
3 1 1
4 1 1
5 0 1
6 1 1
7 0 0
8 1 0
9 1 0
10 0 1
# ℹ 758 more rows
&
returns the columns where both of the patterns are matched. Compare and contrast the |
operator, which gives the union of both helpers:
stranded_data |>
select(starts_with("m") | ends_with("e"))
# A tibble: 768 × 5
medicallysafe mental_health_care age periods_of_previous_care admit_date
<int> <int> <int> <int> <chr>
1 0 0 50 1 29/12/2020
2 0 0 31 1 11/12/2020
3 1 1 32 1 19/01/2021
4 1 1 69 1 07/01/2021
5 0 1 33 1 15/01/2021
6 1 1 75 1 02/01/2021
7 0 0 26 1 02/01/2021
8 1 0 64 1 29/12/2020
9 1 0 53 5 04/01/2021
10 0 1 63 1 30/12/2020
# ℹ 758 more rows
contains()
contains()
is a good example of tidyselect functions that are based on pattern matching. There are several related functions that work in the same way, including:
Here, we’ll concentrate on contains()
as an exemplar for this family, because the other functions in the family work in a very similar way. contains()
allows you to pick columns based on their names. So to select()
all the columns in stranded_data
whose names contain the string “care”:
# A tibble: 768 × 3
care.home.referral mental_health_care periods_of_previous_care
<int> <int> <int>
1 0 0 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 1
6 1 1 1
7 1 0 1
8 0 0 1
9 0 0 5
10 1 1 1
# ℹ 758 more rows
Note that by default this string matching is case insensitive, so “care” will match columns called “CARE”, “Care”, and “cArE”. You can also combine the contains()
statement with other select()
criteria:
# A tibble: 768 × 4
age care.home.referral mental_health_care periods_of_previous_care
<int> <int> <int> <int>
1 50 0 0 1
2 31 1 0 1
3 32 0 1 1
4 69 1 1 1
5 33 0 1 1
6 75 1 1 1
7 26 1 0 1
8 64 0 0 1
9 53 0 0 5
10 63 1 1 1
# ℹ 758 more rows
And you can match over a vector of strings:
# A tibble: 768 × 4
care.home.referral mental_health_care periods_of_previous_care age
<int> <int> <int> <int>
1 0 0 1 50
2 1 0 1 31
3 0 1 1 32
4 1 1 1 69
5 0 1 1 33
6 1 1 1 75
7 1 0 1 26
8 0 0 1 64
9 0 0 5 53
10 1 1 1 63
# ℹ 758 more rows
The order of strings in your vector will determine the order of columns returned:
# A tibble: 768 × 4
age care.home.referral mental_health_care periods_of_previous_care
<int> <int> <int> <int>
1 50 0 0 1
2 31 1 0 1
3 32 0 1 1
4 69 1 1 1
5 33 0 1 1
6 75 1 1 1
7 26 1 0 1
8 64 0 0 1
9 53 0 0 5
10 63 1 1 1
# ℹ 758 more rows
And you can negate the contains()
to give the complement:
# A tibble: 768 × 5
stranded.label medicallysafe hcop admit_date frailty_index
<chr> <int> <int> <chr> <chr>
1 Not Stranded 0 0 29/12/2020 No index item
2 Not Stranded 0 1 11/12/2020 No index item
3 Not Stranded 1 0 19/01/2021 No index item
4 Not Stranded 1 0 07/01/2021 Mobility problems
5 Not Stranded 0 1 15/01/2021 No index item
6 Stranded 1 0 02/01/2021 Mobility problems
7 Not Stranded 0 1 02/01/2021 No index item
8 Not Stranded 1 1 29/12/2020 Fall patient history
9 Not Stranded 1 0 04/01/2021 No index item
10 Not Stranded 0 0 30/12/2020 Activity Limitation
# ℹ 758 more rows
starts_with()
and ends_with()
work in exactly the same way - but will only match strings at the beginning and end of the column name respectively:
stranded_data |>
select(starts_with("care"))
# A tibble: 768 × 1
care.home.referral
<int>
1 0
2 1
3 0
4 1
5 0
6 1
7 1
8 0
9 0
10 1
# ℹ 758 more rows
# A tibble: 768 × 2
mental_health_care periods_of_previous_care
<int> <int>
1 0 1
2 0 1
3 1 1
4 1 1
5 1 1
6 1 1
7 0 1
8 0 1
9 0 5
10 1 1
# ℹ 758 more rows
num_range()
matches columns with numeric ranges. For example, in the billboard dataset, there are a large number of columns named wk1
, wk2
… representing weeks. To select wk10
to wk15
:
# A tibble: 317 × 6
wk10 wk11 wk12 wk13 wk14 wk15
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 51 51 51 47 44 38
4 61 61 59 61 66 72
5 57 64 70 75 76 78
6 6 7 22 29 36 47
7 NA NA NA NA NA NA
8 36 37 37 38 49 61
9 10 9 8 6 1 2
10 59 66 68 61 67 59
# ℹ 307 more rows
matches()
While starts_with()
and ends_with()
match supplied strings, matches()
uses regular expressions (regex). Regular expressions are a way of matching all kinds of strings by specifying a pattern. For example, you can use regex to search through a piece of text for all the email addresses that it contains. In pseudocode, the regex would work like this:
- look for an @
- look for the start and ends of the word that contains that @
- make sure that the word doesn’t contain any forbidden characters
Getting to grips with regex is a bit forbidding, because the syntax used is very concise and impenetrable. For instance, the above pseudocode translates into:
[[:alnum:].-_]+@[[:alnum:].-]+
Luckily, there are some great tools to help the beginner - such as the Rstudio cheatsheet, or the interactive tool regular expressions 101. There are also a few very simple examples below that cover some of the common cases:
Match one of a group of letters using []
:
# A tibble: 768 × 2
medicallysafe mental_health_care
<int> <int>
1 0 0
2 0 0
3 1 1
4 1 1
5 0 1
6 1 1
7 0 0
8 1 0
9 1 0
10 0 1
# ℹ 758 more rows
This returns columns containing either “med” or “men”.
To match characters at the start of a word use ^
:
# A tibble: 768 × 4
age medicallysafe mental_health_care admit_date
<int> <int> <int> <chr>
1 50 0 0 29/12/2020
2 31 0 0 11/12/2020
3 32 1 1 19/01/2021
4 69 1 1 07/01/2021
5 33 0 1 15/01/2021
6 75 1 1 02/01/2021
7 26 0 0 02/01/2021
8 64 1 0 29/12/2020
9 53 1 0 04/01/2021
10 63 0 1 30/12/2020
# ℹ 758 more rows
.
matches any character:
# A tibble: 768 × 3
care.home.referral medicallysafe mental_health_care
<int> <int> <int>
1 0 0 0
2 1 0 0
3 0 1 1
4 1 1 1
5 0 0 1
6 1 1 1
7 1 0 0
8 0 1 0
9 0 1 0
10 1 0 1
# ℹ 758 more rows
To match columns containing a literal full stop use \\
to escape the character:
where()
There’s a bit more to explain with where()
than the other tidyselect helpers. Broadly, where()
lets us use a function to match columns. Let’s start simply. We’ll use the base R function is.character()
. This is simple - it returns TRUE
when a column (or vector, but not whole tibble) contains character data. Let’s look at the stranded.label column in stranded_data:
is.character(stranded_data$stranded.label)
[1] TRUE
This should return true. So we can see how we might use is.character as the basis of selecting all the character columns from this data using where()
.
# A tibble: 768 × 3
stranded.label admit_date frailty_index
<chr> <chr> <chr>
1 Not Stranded 29/12/2020 No index item
2 Not Stranded 11/12/2020 No index item
3 Not Stranded 19/01/2021 No index item
4 Not Stranded 07/01/2021 Mobility problems
5 Not Stranded 15/01/2021 No index item
6 Stranded 02/01/2021 Mobility problems
7 Not Stranded 02/01/2021 No index item
8 Not Stranded 29/12/2020 Fall patient history
9 Not Stranded 04/01/2021 No index item
10 Not Stranded 30/12/2020 Activity Limitation
# ℹ 758 more rows
any_of()
any_of()
matches full column names from a vector of names. Let’s set up some names:
my_columns <- names(stranded_data)[1:4]
my_columns
[1] "stranded.label" "age" "care.home.referral"
[4] "medicallysafe"
If we supply any_of()
with this vector of names, any_of()
will match any column names that appear in the vector, and ignore those that do not:
# A tibble: 768 × 4
stranded.label age care.home.referral medicallysafe
<chr> <int> <int> <int>
1 Not Stranded 50 0 0
2 Not Stranded 31 1 0
3 Not Stranded 32 0 1
4 Not Stranded 69 1 1
5 Not Stranded 33 0 0
6 Stranded 75 1 1
7 Not Stranded 26 1 0
8 Not Stranded 64 0 1
9 Not Stranded 53 0 1
10 Not Stranded 63 1 0
# ℹ 758 more rows
Compared to contains()
, any_of()
will match the entire name, while contains()
will match any part of the name.
There’s a great tip on the all_of()
man page - any_of()
is an especially clever way to drop columns from a tibble because calling it more than once does not cause an error:
# A tibble: 768 × 5
hcop mental_health_care periods_of_previous_care admit_date frailty_index
<int> <int> <int> <chr> <chr>
1 0 0 1 29/12/2020 No index item
2 1 0 1 11/12/2020 No index item
3 0 1 1 19/01/2021 No index item
4 0 1 1 07/01/2021 Mobility proble…
5 1 1 1 15/01/2021 No index item
6 0 1 1 02/01/2021 Mobility proble…
7 1 0 1 02/01/2021 No index item
8 1 0 1 29/12/2020 Fall patient hi…
9 0 0 5 04/01/2021 No index item
10 0 1 1 30/12/2020 Activity Limita…
# ℹ 758 more rows
Trying to do something similar using select()
will cause an error:
everything()
everything()
selects all columns. This is less useful in combination with select()
itself, but simplifies some other functions well - particularly with pivot_longer()
.
stranded_data |>
select(2:5) |>
tidyr::pivot_longer(everything())
# A tibble: 3,072 × 2
name value
<chr> <int>
1 age 50
2 care.home.referral 0
3 medicallysafe 0
4 hcop 0
5 age 31
6 care.home.referral 1
7 medicallysafe 0
8 hcop 1
9 age 32
10 care.home.referral 0
# ℹ 3,062 more rows
last_col()
Last col selects the right-most column in your data. That’s especially useful in concert with mutate()
, as by default new columns are created at the right-hand edge of your tibble. This gives you an easy way of previewing changes in new columns:
# A tibble: 768 × 10
decade stranded.label age care.home.referral medicallysafe hcop
<dbl> <chr> <int> <int> <int> <int>
1 50 Not Stranded 50 0 0 0
2 30 Not Stranded 31 1 0 1
3 30 Not Stranded 32 0 1 0
4 70 Not Stranded 69 1 1 0
5 30 Not Stranded 33 0 0 1
6 80 Stranded 75 1 1 0
7 30 Not Stranded 26 1 0 1
8 60 Not Stranded 64 0 1 1
9 50 Not Stranded 53 0 1 0
10 60 Not Stranded 63 1 0 0
# ℹ 758 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
# admit_date <chr>, frailty_index <chr>
if_any()
, if_all()
, across()
, and pick()
Although not technically part of the tidyselect family, the final topic for this session is to explore a group of functions that work similarly: if_any()
/ if_all()
, which allow you to effectively integrate basic filtering with selection, and across()
/ pick()
, which allows you to apply functions across many columns.
stranded_data |>
filter(if_any(contains("care"), ~ .x == 1)) # that'll find rows with a 1 in any "care" column
# A tibble: 723 × 9
stranded.label age care.home.referral medicallysafe hcop
<chr> <int> <int> <int> <int>
1 Not Stranded 50 0 0 0
2 Not Stranded 31 1 0 1
3 Not Stranded 32 0 1 0
4 Not Stranded 69 1 1 0
5 Not Stranded 33 0 0 1
6 Stranded 75 1 1 0
7 Not Stranded 26 1 0 1
8 Not Stranded 64 0 1 1
9 Not Stranded 63 1 0 0
10 Not Stranded 30 1 1 0
# ℹ 713 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
# admit_date <chr>, frailty_index <chr>
stranded_data |>
mutate(care_found = case_when(if_all(contains("care"), ~ .x >= 1) ~ "Yup",
TRUE ~ "Nope")) |>
select(contains("care"), care_found)
# A tibble: 768 × 4
care.home.referral mental_health_care periods_of_previous_care care_found
<int> <int> <int> <chr>
1 0 0 1 Nope
2 1 0 1 Nope
3 0 1 1 Nope
4 1 1 1 Yup
5 0 1 1 Nope
6 1 1 1 Yup
7 1 0 1 Nope
8 0 0 1 Nope
9 0 0 5 Nope
10 1 1 1 Yup
# ℹ 758 more rows
across()
is used inside summarise()
or mutate()
:
# A tibble: 768 × 9
stranded.label age care.home.referral medicallysafe hcop
<chr> <int> <int> <int> <int>
1 NOT STRANDED 50 0 0 0
2 NOT STRANDED 31 1 0 1
3 NOT STRANDED 32 0 1 0
4 NOT STRANDED 69 1 1 0
5 NOT STRANDED 33 0 0 1
6 STRANDED 75 1 1 0
7 NOT STRANDED 26 1 0 1
8 NOT STRANDED 64 0 1 1
9 NOT STRANDED 53 0 1 0
10 NOT STRANDED 63 1 0 0
# ℹ 758 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
# admit_date <chr>, frailty_index <chr>
# A tibble: 768 × 9
stranded.label age care.home.referral medicallysafe hcop
<chr> <int> <int> <dbl> <dbl>
1 Not Stranded 50 0 0 0
2 Not Stranded 31 1 0 20
3 Not Stranded 32 0 20 0
4 Not Stranded 69 1 20 0
5 Not Stranded 33 0 0 20
6 Stranded 75 1 20 0
7 Not Stranded 26 1 0 20
8 Not Stranded 64 0 20 20
9 Not Stranded 53 0 20 0
10 Not Stranded 63 1 0 0
# ℹ 758 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
# admit_date <chr>, frailty_index <chr>
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160.0 110 4 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21 6 160.0 110 4 2.875 17.02 0 1 4 4
Datsun 710 23 4 108.0 93 4 2.320 18.61 1 1 4 1
Hornet 4 Drive 21 6 258.0 110 3 3.215 19.44 1 0 3 1
Hornet Sportabout 19 8 360.0 175 3 3.440 17.02 0 0 3 2
Valiant 18 6 225.0 105 3 3.460 20.22 1 0 3 1
Duster 360 14 8 360.0 245 3 3.570 15.84 0 0 3 4
Merc 240D 24 4 146.7 62 4 3.190 20.00 1 0 4 2
Merc 230 23 4 140.8 95 4 3.150 22.90 1 0 4 2
Merc 280 19 6 167.6 123 4 3.440 18.30 1 0 4 4
Merc 280C 18 6 167.6 123 4 3.440 18.90 1 0 4 4
Merc 450SE 16 8 275.8 180 3 4.070 17.40 0 0 3 3
Merc 450SL 17 8 275.8 180 3 3.730 17.60 0 0 3 3
Merc 450SLC 15 8 275.8 180 3 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10 8 472.0 205 3 5.250 17.98 0 0 3 4
Lincoln Continental 10 8 460.0 215 3 5.424 17.82 0 0 3 4
Chrysler Imperial 15 8 440.0 230 3 5.345 17.42 0 0 3 4
Fiat 128 32 4 78.7 66 4 2.200 19.47 1 1 4 1
Honda Civic 30 4 75.7 52 5 1.615 18.52 1 1 4 2
Toyota Corolla 34 4 71.1 65 4 1.835 19.90 1 1 4 1
Toyota Corona 22 4 120.1 97 4 2.465 20.01 1 0 3 1
Dodge Challenger 16 8 318.0 150 3 3.520 16.87 0 0 3 2
AMC Javelin 15 8 304.0 150 3 3.435 17.30 0 0 3 2
Camaro Z28 13 8 350.0 245 4 3.840 15.41 0 0 3 4
Pontiac Firebird 19 8 400.0 175 3 3.845 17.05 0 0 3 2
Fiat X1-9 27 4 79.0 66 4 1.935 18.90 1 1 4 1
Porsche 914-2 26 4 120.3 91 4 2.140 16.70 0 1 5 2
Lotus Europa 30 4 95.1 113 4 1.513 16.90 1 1 5 2
Ford Pantera L 16 8 351.0 264 4 3.170 14.50 0 1 5 4
Ferrari Dino 20 6 145.0 175 4 2.770 15.50 0 1 5 6
Maserati Bora 15 8 301.0 335 4 3.570 14.60 0 1 5 8
Volvo 142E 21 4 121.0 109 4 2.780 18.60 1 1 4 2
stranded_data |>
group_by(care.home.referral) |>
summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
# A tibble: 2 × 11
care.home.referral age_mean age_sd medicallysafe_mean medicallysafe_sd
<int> <dbl> <dbl> <dbl> <dbl>
1 0 53.1 20.7 0.475 0.500
2 1 52.3 20.5 0.471 0.500
# ℹ 6 more variables: hcop_mean <dbl>, hcop_sd <dbl>,
# mental_health_care_mean <dbl>, mental_health_care_sd <dbl>,
# periods_of_previous_care_mean <dbl>, periods_of_previous_care_sd <dbl>
stranded_data |>
group_by(care.home.referral) |>
summarise(across(where(is.numeric), mean, .names = "average_{.col}"))
# A tibble: 2 × 6
care.home.referral average_age average_medicallysafe average_hcop
<int> <dbl> <dbl> <dbl>
1 0 53.1 0.475 0.496
2 1 52.3 0.471 0.506
# ℹ 2 more variables: average_mental_health_care <dbl>,
# average_periods_of_previous_care <dbl>
pick()
effectively allows you to sub-set a tibble inside a data masking function. Effectively an integrated select()
:
# A tibble: 37 × 3
care.home.referral mental_health_care periods_of_previous_care
<int> <int> <int>
1 0 0 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 0 5
6 0 0 4
7 1 1 2
8 1 1 4
9 0 1 3
10 1 0 2
# ℹ 27 more rows
# A tibble: 19 × 3
mental_health_care periods_of_previous_care n
<int> <int> <int>
1 0 1 293
2 1 1 286
3 1 NA 42
4 0 NA 27
5 1 2 18
6 0 2 17
7 0 3 17
8 1 3 16
9 1 4 8
10 0 10 7
11 1 10 7
12 0 8 5
13 1 7 5
14 0 4 4
15 0 5 4
16 0 7 4
17 0 6 3
18 1 8 3
19 1 5 2
Especially useful when you’re effectively working with a tibble sub-set of your data:
stranded_data |>
mutate(rank = dense_rank(pick(periods_of_previous_care, age))) |>
arrange(-rank) |>
select(rank, age, periods_of_previous_care) # something like a rank by previous episodes, with age as a tie-breaker
# A tibble: 768 × 3
rank age periods_of_previous_care
<int> <int> <int>
1 152 75 10
2 151 71 10
3 150 68 10
4 149 66 10
5 148 65 10
6 148 65 10
7 147 63 10
8 146 61 10
9 145 51 10
10 144 42 10
# ℹ 758 more rows