Tidyselect

R
intermediate
Published

May 19, 2025

No feedback found for this session

Introduction

Lots of the power in dplyr comes from the many ways that you can select columns. In this section, we’ll look at some examples of tidyselect, which is a series of functions that allow you to specify columns by various patterns. Not only does this give you lots of ways of simplifying and streamlining your code, but it’s also a great way of making your code more portable - so that it works well inside functions, or across different data sets.

In the tidyevaluation training session, we look across this material in much more depth. There’s quite a lot to absorb, so I would definitely recommend using the manual pages (using ??tidyselect) as a reference for this section.

Package loading is just:

library(NHSRdatasets) # for sample data
library(dplyr)

contains()

contains() is a good example of tidyselect functions that are based on pattern matching. There are several related functions that work in the same way, including:

Here, we’ll concentrate on contains() as an exemplar for this family, because the other functions in the family work in a very similar way. contains() allows you to pick columns based on their names. So to select() all the columns in stranded_data whose names contain the string “care”:

stranded_data |>
  select(contains("care"))
# A tibble: 768 × 3
   care.home.referral mental_health_care periods_of_previous_care
                <int>              <int>                    <int>
 1                  0                  0                        1
 2                  1                  0                        1
 3                  0                  1                        1
 4                  1                  1                        1
 5                  0                  1                        1
 6                  1                  1                        1
 7                  1                  0                        1
 8                  0                  0                        1
 9                  0                  0                        5
10                  1                  1                        1
# ℹ 758 more rows

Note that by default this string matching is case insensitive, so “care” will match columns called “CARE”, “Care”, and “cArE”. You can also combine the contains() statement with other select() criteria:

stranded_data |>
  select(age, contains("care"))
# A tibble: 768 × 4
     age care.home.referral mental_health_care periods_of_previous_care
   <int>              <int>              <int>                    <int>
 1    50                  0                  0                        1
 2    31                  1                  0                        1
 3    32                  0                  1                        1
 4    69                  1                  1                        1
 5    33                  0                  1                        1
 6    75                  1                  1                        1
 7    26                  1                  0                        1
 8    64                  0                  0                        1
 9    53                  0                  0                        5
10    63                  1                  1                        1
# ℹ 758 more rows

And you can match over a vector of strings:

stranded_data |>
  select(contains(c("care", "age")))
# A tibble: 768 × 4
   care.home.referral mental_health_care periods_of_previous_care   age
                <int>              <int>                    <int> <int>
 1                  0                  0                        1    50
 2                  1                  0                        1    31
 3                  0                  1                        1    32
 4                  1                  1                        1    69
 5                  0                  1                        1    33
 6                  1                  1                        1    75
 7                  1                  0                        1    26
 8                  0                  0                        1    64
 9                  0                  0                        5    53
10                  1                  1                        1    63
# ℹ 758 more rows

The order of strings in your vector will determine the order of columns returned:

stranded_data |>
  select(contains(c("age", "care")))
# A tibble: 768 × 4
     age care.home.referral mental_health_care periods_of_previous_care
   <int>              <int>              <int>                    <int>
 1    50                  0                  0                        1
 2    31                  1                  0                        1
 3    32                  0                  1                        1
 4    69                  1                  1                        1
 5    33                  0                  1                        1
 6    75                  1                  1                        1
 7    26                  1                  0                        1
 8    64                  0                  0                        1
 9    53                  0                  0                        5
10    63                  1                  1                        1
# ℹ 758 more rows

And you can negate the contains() to give the complement:

stranded_data |>
  select(-contains(c("age", "care")))
# A tibble: 768 × 5
   stranded.label medicallysafe  hcop admit_date frailty_index       
   <chr>                  <int> <int> <chr>      <chr>               
 1 Not Stranded               0     0 29/12/2020 No index item       
 2 Not Stranded               0     1 11/12/2020 No index item       
 3 Not Stranded               1     0 19/01/2021 No index item       
 4 Not Stranded               1     0 07/01/2021 Mobility problems   
 5 Not Stranded               0     1 15/01/2021 No index item       
 6 Stranded                   1     0 02/01/2021 Mobility problems   
 7 Not Stranded               0     1 02/01/2021 No index item       
 8 Not Stranded               1     1 29/12/2020 Fall patient history
 9 Not Stranded               1     0 04/01/2021 No index item       
10 Not Stranded               0     0 30/12/2020 Activity Limitation 
# ℹ 758 more rows

starts_with() and ends_with() work in exactly the same way - but will only match strings at the beginning and end of the column name respectively:

stranded_data |>
  select(starts_with("care"))
# A tibble: 768 × 1
   care.home.referral
                <int>
 1                  0
 2                  1
 3                  0
 4                  1
 5                  0
 6                  1
 7                  1
 8                  0
 9                  0
10                  1
# ℹ 758 more rows
stranded_data |>
  select(ends_with("care"))
# A tibble: 768 × 2
   mental_health_care periods_of_previous_care
                <int>                    <int>
 1                  0                        1
 2                  0                        1
 3                  1                        1
 4                  1                        1
 5                  1                        1
 6                  1                        1
 7                  0                        1
 8                  0                        1
 9                  0                        5
10                  1                        1
# ℹ 758 more rows

matches() allows use of a regular expression:

stranded_data |>
  select(matches("me[dn]"))
# A tibble: 768 × 2
   medicallysafe mental_health_care
           <int>              <int>
 1             0                  0
 2             0                  0
 3             1                  1
 4             1                  1
 5             0                  1
 6             1                  1
 7             0                  0
 8             1                  0
 9             1                  0
10             0                  1
# ℹ 758 more rows

num_range() matches columns with numeric ranges. For example, in the billboard dataset, there are a large number of columns named wk1, wk2… representing weeks. To select wk10 to wk15:

tidyr::billboard |> 
  dplyr::select(num_range("wk", 10:15))
# A tibble: 317 × 6
    wk10  wk11  wk12  wk13  wk14  wk15
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    NA    NA    NA    NA    NA    NA
 2    NA    NA    NA    NA    NA    NA
 3    51    51    51    47    44    38
 4    61    61    59    61    66    72
 5    57    64    70    75    76    78
 6     6     7    22    29    36    47
 7    NA    NA    NA    NA    NA    NA
 8    36    37    37    38    49    61
 9    10     9     8     6     1     2
10    59    66    68    61    67    59
# ℹ 307 more rows

any_of()

any_of() matches full column names from a vector of names. Let’s set up some names:

my_columns <- names(stranded_data)[1:4]
my_columns
[1] "stranded.label"     "age"                "care.home.referral"
[4] "medicallysafe"     

If we supply any_of() with this vector of names, any_of() will match any column names that appear in the vector, and ignore those that do not:

stranded_data |>
  select(any_of(my_columns))
# A tibble: 768 × 4
   stranded.label   age care.home.referral medicallysafe
   <chr>          <int>              <int>         <int>
 1 Not Stranded      50                  0             0
 2 Not Stranded      31                  1             0
 3 Not Stranded      32                  0             1
 4 Not Stranded      69                  1             1
 5 Not Stranded      33                  0             0
 6 Stranded          75                  1             1
 7 Not Stranded      26                  1             0
 8 Not Stranded      64                  0             1
 9 Not Stranded      53                  0             1
10 Not Stranded      63                  1             0
# ℹ 758 more rows

Compared to contains(), any_of() will match the entire name, while contains() will match any part of the name.

There’s a great tip on the all_of() man page - any_of() is an especially clever way to drop columns from a tibble because calling it more than once does not cause an error:

stranded_data |>
  select(-any_of(my_columns)) |>
  select(-any_of(my_columns))
# A tibble: 768 × 5
    hcop mental_health_care periods_of_previous_care admit_date frailty_index   
   <int>              <int>                    <int> <chr>      <chr>           
 1     0                  0                        1 29/12/2020 No index item   
 2     1                  0                        1 11/12/2020 No index item   
 3     0                  1                        1 19/01/2021 No index item   
 4     0                  1                        1 07/01/2021 Mobility proble…
 5     1                  1                        1 15/01/2021 No index item   
 6     0                  1                        1 02/01/2021 Mobility proble…
 7     1                  0                        1 02/01/2021 No index item   
 8     1                  0                        1 29/12/2020 Fall patient hi…
 9     0                  0                        5 04/01/2021 No index item   
10     0                  1                        1 30/12/2020 Activity Limita…
# ℹ 758 more rows

Trying to do something similar using select() will cause an error:

try(stranded_data |>
  select(-age) |>
  select(-age))
Error in select(select(stranded_data, -age), -age) : 
  Can't select columns that don't exist.
✖ Column `age` doesn't exist.

if_any() and if_all()

stranded_data |>
  filter(if_any(contains("care"), ~ .x == 1)) # that'll find rows with a 1 in any "care" column
# A tibble: 723 × 9
   stranded.label   age care.home.referral medicallysafe  hcop
   <chr>          <int>              <int>         <int> <int>
 1 Not Stranded      50                  0             0     0
 2 Not Stranded      31                  1             0     1
 3 Not Stranded      32                  0             1     0
 4 Not Stranded      69                  1             1     0
 5 Not Stranded      33                  0             0     1
 6 Stranded          75                  1             1     0
 7 Not Stranded      26                  1             0     1
 8 Not Stranded      64                  0             1     1
 9 Not Stranded      63                  1             0     0
10 Not Stranded      30                  1             1     0
# ℹ 713 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
#   admit_date <chr>, frailty_index <chr>
stranded_data |>
  mutate(care_found = case_when(if_all(contains("care"), ~ .x >= 1) ~ "Yup",
                                TRUE ~ "Nope")) |>
  select(contains("care"), care_found)
# A tibble: 768 × 4
   care.home.referral mental_health_care periods_of_previous_care care_found
                <int>              <int>                    <int> <chr>     
 1                  0                  0                        1 Nope      
 2                  1                  0                        1 Nope      
 3                  0                  1                        1 Nope      
 4                  1                  1                        1 Yup       
 5                  0                  1                        1 Nope      
 6                  1                  1                        1 Yup       
 7                  1                  0                        1 Nope      
 8                  0                  0                        1 Nope      
 9                  0                  0                        5 Nope      
10                  1                  1                        1 Yup       
# ℹ 758 more rows

everything()

everything() selects all columns. This is less useful in combination with select() itself, but simplifies some other functions well - particularly with pivot_longer().

stranded_data |> 
  select(2:5) |>
  tidyr::pivot_longer(everything())
# A tibble: 3,072 × 2
   name               value
   <chr>              <int>
 1 age                   50
 2 care.home.referral     0
 3 medicallysafe          0
 4 hcop                   0
 5 age                   31
 6 care.home.referral     1
 7 medicallysafe          0
 8 hcop                   1
 9 age                   32
10 care.home.referral     0
# ℹ 3,062 more rows

See also last_col(), which is another example of a tidyselect helper for specific columns.