Tidyselect

intermediate

Published

July 1, 2025

No feedback found for this session

Introduction

This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to explore dplyr’s built-in language for selecting columns: tidyselect.

Package loading is just:

library(NHSRdatasets) # for sample data
library(dplyr)

Logical operators

The four operators that we’ll introduce here are found all over R, and aren’t in any way specific to the tidyverse, so we’ll review them very briefly before moving on to the more idiosyncratic aspects of tidyselect.

: allows you to select a range of columns. So to select the second to the fifth columns in the stranded_data set by name:

stranded_data |>
  select(age:hcop)

# A tibble: 768 × 4
     age care.home.referral medicallysafe  hcop
   <int>              <int>         <int> <int>
 1    50                  0             0     0
 2    31                  1             0     1
 3    32                  0             1     0
 4    69                  1             1     0
 5    33                  0             0     1
 6    75                  1             1     0
 7    26                  1             0     1
 8    64                  0             1     1
 9    53                  0             1     0
10    63                  1             0     0
# ℹ 758 more rows

Similarly, by index:

stranded_data |>
  select(2:5)

# A tibble: 768 × 4
     age care.home.referral medicallysafe  hcop
   <int>              <int>         <int> <int>
 1    50                  0             0     0
 2    31                  1             0     1
 3    32                  0             1     0
 4    69                  1             1     0
 5    33                  0             0     1
 6    75                  1             1     0
 7    26                  1             0     1
 8    64                  0             1     1
 9    53                  0             1     0
10    63                  1             0     0
# ℹ 758 more rows

! allows you to select the complement - in other words, everything but the specified column:

stranded_data |>
  select(!stranded.label)

# A tibble: 768 × 8
     age care.home.referral medicallysafe  hcop mental_health_care
   <int>              <int>         <int> <int>              <int>
 1    50                  0             0     0                  0
 2    31                  1             0     1                  0
 3    32                  0             1     0                  1
 4    69                  1             1     0                  1
 5    33                  0             0     1                  1
 6    75                  1             1     0                  1
 7    26                  1             0     1                  0
 8    64                  0             1     1                  0
 9    53                  0             1     0                  0
10    63                  1             0     0                  1
# ℹ 758 more rows
# ℹ 3 more variables: periods_of_previous_care <int>, admit_date <chr>,
#   frailty_index <chr>

Use c() if you want to drop multiple columns using !:

stranded_data |>
  select(!c(age, stranded.label))

# A tibble: 768 × 7
   care.home.referral medicallysafe  hcop mental_health_care
                <int>         <int> <int>              <int>
 1                  0             0     0                  0
 2                  1             0     1                  0
 3                  0             1     0                  1
 4                  1             1     0                  1
 5                  0             0     1                  1
 6                  1             1     0                  1
 7                  1             0     1                  0
 8                  0             1     1                  0
 9                  0             1     0                  0
10                  1             0     0                  1
# ℹ 758 more rows
# ℹ 3 more variables: periods_of_previous_care <int>, admit_date <chr>,
#   frailty_index <chr>

& lets you group together selection helpers. If you’re selecting many columns by name, & isn’t needed. Instead, you can just specify what you need to keep:

stranded_data |>
  select(stranded.label, hcop, age)

# A tibble: 768 × 3
   stranded.label  hcop   age
   <chr>          <int> <int>
 1 Not Stranded       0    50
 2 Not Stranded       1    31
 3 Not Stranded       0    32
 4 Not Stranded       0    69
 5 Not Stranded       1    33
 6 Stranded           0    75
 7 Not Stranded       1    26
 8 Not Stranded       1    64
 9 Not Stranded       0    53
10 Not Stranded       0    63
# ℹ 758 more rows

Where & is really useful is in combination with pattern matching helpers, which we’ll look at in more detail in the next subsection:

stranded_data |>
  select(starts_with("m") & ends_with("e"))

# A tibble: 768 × 2
   medicallysafe mental_health_care
           <int>              <int>
 1             0                  0
 2             0                  0
 3             1                  1
 4             1                  1
 5             0                  1
 6             1                  1
 7             0                  0
 8             1                  0
 9             1                  0
10             0                  1
# ℹ 758 more rows

& returns the columns where both of the patterns are matched. Compare and contrast the | operator, which gives the union of both helpers:

stranded_data |>
  select(starts_with("m") | ends_with("e"))

# A tibble: 768 × 5
   medicallysafe mental_health_care   age periods_of_previous_care admit_date
           <int>              <int> <int>                    <int> <chr>     
 1             0                  0    50                        1 29/12/2020
 2             0                  0    31                        1 11/12/2020
 3             1                  1    32                        1 19/01/2021
 4             1                  1    69                        1 07/01/2021
 5             0                  1    33                        1 15/01/2021
 6             1                  1    75                        1 02/01/2021
 7             0                  0    26                        1 02/01/2021
 8             1                  0    64                        1 29/12/2020
 9             1                  0    53                        5 04/01/2021
10             0                  1    63                        1 30/12/2020
# ℹ 758 more rows

`contains()`

contains() is a good example of tidyselect functions that are based on pattern matching. There are several related functions that work in the same way, including:

Here, we’ll concentrate on contains() as an exemplar for this family, because the other functions in the family work in a very similar way. contains() allows you to pick columns based on their names. So to select() all the columns in stranded_data whose names contain the string “care”:

stranded_data |>
  select(contains("care"))

# A tibble: 768 × 3
   care.home.referral mental_health_care periods_of_previous_care
                <int>              <int>                    <int>
 1                  0                  0                        1
 2                  1                  0                        1
 3                  0                  1                        1
 4                  1                  1                        1
 5                  0                  1                        1
 6                  1                  1                        1
 7                  1                  0                        1
 8                  0                  0                        1
 9                  0                  0                        5
10                  1                  1                        1
# ℹ 758 more rows

Note that by default this string matching is case insensitive, so “care” will match columns called “CARE”, “Care”, and “cArE”. You can also combine the contains() statement with other select() criteria:

stranded_data |>
  select(age, contains("care"))

# A tibble: 768 × 4
     age care.home.referral mental_health_care periods_of_previous_care
   <int>              <int>              <int>                    <int>
 1    50                  0                  0                        1
 2    31                  1                  0                        1
 3    32                  0                  1                        1
 4    69                  1                  1                        1
 5    33                  0                  1                        1
 6    75                  1                  1                        1
 7    26                  1                  0                        1
 8    64                  0                  0                        1
 9    53                  0                  0                        5
10    63                  1                  1                        1
# ℹ 758 more rows

And you can match over a vector of strings:

stranded_data |>
  select(contains(c("care", "age")))

# A tibble: 768 × 4
   care.home.referral mental_health_care periods_of_previous_care   age
                <int>              <int>                    <int> <int>
 1                  0                  0                        1    50
 2                  1                  0                        1    31
 3                  0                  1                        1    32
 4                  1                  1                        1    69
 5                  0                  1                        1    33
 6                  1                  1                        1    75
 7                  1                  0                        1    26
 8                  0                  0                        1    64
 9                  0                  0                        5    53
10                  1                  1                        1    63
# ℹ 758 more rows

The order of strings in your vector will determine the order of columns returned:

stranded_data |>
  select(contains(c("age", "care")))

# A tibble: 768 × 4
     age care.home.referral mental_health_care periods_of_previous_care
   <int>              <int>              <int>                    <int>
 1    50                  0                  0                        1
 2    31                  1                  0                        1
 3    32                  0                  1                        1
 4    69                  1                  1                        1
 5    33                  0                  1                        1
 6    75                  1                  1                        1
 7    26                  1                  0                        1
 8    64                  0                  0                        1
 9    53                  0                  0                        5
10    63                  1                  1                        1
# ℹ 758 more rows

And you can negate the contains() to give the complement:

stranded_data |>
  select(-contains(c("age", "care")))

# A tibble: 768 × 5
   stranded.label medicallysafe  hcop admit_date frailty_index       
   <chr>                  <int> <int> <chr>      <chr>               
 1 Not Stranded               0     0 29/12/2020 No index item       
 2 Not Stranded               0     1 11/12/2020 No index item       
 3 Not Stranded               1     0 19/01/2021 No index item       
 4 Not Stranded               1     0 07/01/2021 Mobility problems   
 5 Not Stranded               0     1 15/01/2021 No index item       
 6 Stranded                   1     0 02/01/2021 Mobility problems   
 7 Not Stranded               0     1 02/01/2021 No index item       
 8 Not Stranded               1     1 29/12/2020 Fall patient history
 9 Not Stranded               1     0 04/01/2021 No index item       
10 Not Stranded               0     0 30/12/2020 Activity Limitation 
# ℹ 758 more rows

starts_with() and ends_with() work in exactly the same way - but will only match strings at the beginning and end of the column name respectively:

stranded_data |>
  select(starts_with("care"))

# A tibble: 768 × 1
   care.home.referral
                <int>
 1                  0
 2                  1
 3                  0
 4                  1
 5                  0
 6                  1
 7                  1
 8                  0
 9                  0
10                  1
# ℹ 758 more rows

stranded_data |>
  select(ends_with("care"))

# A tibble: 768 × 2
   mental_health_care periods_of_previous_care
                <int>                    <int>
 1                  0                        1
 2                  0                        1
 3                  1                        1
 4                  1                        1
 5                  1                        1
 6                  1                        1
 7                  0                        1
 8                  0                        1
 9                  0                        5
10                  1                        1
# ℹ 758 more rows

num_range() matches columns with numeric ranges. For example, in the billboard dataset, there are a large number of columns named wk1, wk2… representing weeks. To select wk10 to wk15:

tidyr::billboard |> 
  dplyr::select(num_range("wk", 10:15))

# A tibble: 317 × 6
    wk10  wk11  wk12  wk13  wk14  wk15
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    NA    NA    NA    NA    NA    NA
 2    NA    NA    NA    NA    NA    NA
 3    51    51    51    47    44    38
 4    61    61    59    61    66    72
 5    57    64    70    75    76    78
 6     6     7    22    29    36    47
 7    NA    NA    NA    NA    NA    NA
 8    36    37    37    38    49    61
 9    10     9     8     6     1     2
10    59    66    68    61    67    59
# ℹ 307 more rows

`matches()`

While starts_with() and ends_with() match supplied strings, matches() uses regular expressions (regex). Regular expressions are a way of matching all kinds of strings by specifying a pattern. For example, you can use regex to search through a piece of text for all the email addresses that it contains. In pseudocode, the regex would work like this:

look for an @
look for the start and ends of the word that contains that @
make sure that the word doesn’t contain any forbidden characters

Getting to grips with regex is a bit forbidding, because the syntax used is very concise and impenetrable. For instance, the above pseudocode translates into:

[[:alnum:].-_]+@[[:alnum:].-]+

Luckily, there are some great tools to help the beginner - such as the Rstudio cheatsheet, or the interactive tool regular expressions 101. There are also a few very simple examples below that cover some of the common cases:

Match one of a group of letters using []:

stranded_data |>
  select(matches("me[dn]"))

# A tibble: 768 × 2
   medicallysafe mental_health_care
           <int>              <int>
 1             0                  0
 2             0                  0
 3             1                  1
 4             1                  1
 5             0                  1
 6             1                  1
 7             0                  0
 8             1                  0
 9             1                  0
10             0                  1
# ℹ 758 more rows

This returns columns containing either “med” or “men”.

To match characters at the start of a word use ^:

stranded_data |>
  select(matches("^[am]"))

# A tibble: 768 × 4
     age medicallysafe mental_health_care admit_date
   <int>         <int>              <int> <chr>     
 1    50             0                  0 29/12/2020
 2    31             0                  0 11/12/2020
 3    32             1                  1 19/01/2021
 4    69             1                  1 07/01/2021
 5    33             0                  1 15/01/2021
 6    75             1                  1 02/01/2021
 7    26             0                  0 02/01/2021
 8    64             1                  0 29/12/2020
 9    53             1                  0 04/01/2021
10    63             0                  1 30/12/2020
# ℹ 758 more rows

. matches any character:

stranded_data |>
  select(matches("me."))

# A tibble: 768 × 3
   care.home.referral medicallysafe mental_health_care
                <int>         <int>              <int>
 1                  0             0                  0
 2                  1             0                  0
 3                  0             1                  1
 4                  1             1                  1
 5                  0             0                  1
 6                  1             1                  1
 7                  1             0                  0
 8                  0             1                  0
 9                  0             1                  0
10                  1             0                  1
# ℹ 758 more rows

To match columns containing a literal full stop use \\ to escape the character:

stranded_data |>
  select(matches("\\."))

# A tibble: 768 × 2
   stranded.label care.home.referral
   <chr>                       <int>
 1 Not Stranded                    0
 2 Not Stranded                    1
 3 Not Stranded                    0
 4 Not Stranded                    1
 5 Not Stranded                    0
 6 Stranded                        1
 7 Not Stranded                    1
 8 Not Stranded                    0
 9 Not Stranded                    0
10 Not Stranded                    1
# ℹ 758 more rows

`where()`

There’s a bit more to explain with where() than the other tidyselect helpers. Broadly, where() lets us use a function to match columns. Let’s start simply. We’ll use the base R function is.character(). This is simple - it returns TRUE when a column (or vector, but not whole tibble) contains character data. Let’s look at the stranded.label column in stranded_data:

is.character(stranded_data$stranded.label)

[1] TRUE

This should return true. So we can see how we might use is.character as the basis of selecting all the character columns from this data using where().

# select all the character columns from stranded_data:
stranded_data |> 
  select(where(is.character))

# A tibble: 768 × 3
   stranded.label admit_date frailty_index       
   <chr>          <chr>      <chr>               
 1 Not Stranded   29/12/2020 No index item       
 2 Not Stranded   11/12/2020 No index item       
 3 Not Stranded   19/01/2021 No index item       
 4 Not Stranded   07/01/2021 Mobility problems   
 5 Not Stranded   15/01/2021 No index item       
 6 Stranded       02/01/2021 Mobility problems   
 7 Not Stranded   02/01/2021 No index item       
 8 Not Stranded   29/12/2020 Fall patient history
 9 Not Stranded   04/01/2021 No index item       
10 Not Stranded   30/12/2020 Activity Limitation 
# ℹ 758 more rows

`any_of()`

any_of() matches full column names from a vector of names. Let’s set up some names:

my_columns <- names(stranded_data)[1:4]
my_columns

[1] "stranded.label"     "age"                "care.home.referral"
[4] "medicallysafe"

If we supply any_of() with this vector of names, any_of() will match any column names that appear in the vector, and ignore those that do not:

stranded_data |>
  select(any_of(my_columns))

# A tibble: 768 × 4
   stranded.label   age care.home.referral medicallysafe
   <chr>          <int>              <int>         <int>
 1 Not Stranded      50                  0             0
 2 Not Stranded      31                  1             0
 3 Not Stranded      32                  0             1
 4 Not Stranded      69                  1             1
 5 Not Stranded      33                  0             0
 6 Stranded          75                  1             1
 7 Not Stranded      26                  1             0
 8 Not Stranded      64                  0             1
 9 Not Stranded      53                  0             1
10 Not Stranded      63                  1             0
# ℹ 758 more rows

Compared to contains(), any_of() will match the entire name, while contains() will match any part of the name.

There’s a great tip on the all_of() man page - any_of() is an especially clever way to drop columns from a tibble because calling it more than once does not cause an error:

stranded_data |>
  select(-any_of(my_columns)) |>
  select(-any_of(my_columns))

# A tibble: 768 × 5
    hcop mental_health_care periods_of_previous_care admit_date frailty_index   
   <int>              <int>                    <int> <chr>      <chr>           
 1     0                  0                        1 29/12/2020 No index item   
 2     1                  0                        1 11/12/2020 No index item   
 3     0                  1                        1 19/01/2021 No index item   
 4     0                  1                        1 07/01/2021 Mobility proble…
 5     1                  1                        1 15/01/2021 No index item   
 6     0                  1                        1 02/01/2021 Mobility proble…
 7     1                  0                        1 02/01/2021 No index item   
 8     1                  0                        1 29/12/2020 Fall patient hi…
 9     0                  0                        5 04/01/2021 No index item   
10     0                  1                        1 30/12/2020 Activity Limita…
# ℹ 758 more rows

Trying to do something similar using select() will cause an error:

try(stranded_data |>
  select(-age) |>
  select(-age))

Error in select(select(stranded_data, -age), -age) : 
  Can't select columns that don't exist.
✖ Column `age` doesn't exist.

`everything()`

everything() selects all columns. This is less useful in combination with select() itself, but simplifies some other functions well - particularly with pivot_longer().

stranded_data |> 
  select(2:5) |>
  tidyr::pivot_longer(everything())

# A tibble: 3,072 × 2
   name               value
   <chr>              <int>
 1 age                   50
 2 care.home.referral     0
 3 medicallysafe          0
 4 hcop                   0
 5 age                   31
 6 care.home.referral     1
 7 medicallysafe          0
 8 hcop                   1
 9 age                   32
10 care.home.referral     0
# ℹ 3,062 more rows

`last_col()`

Last col selects the right-most column in your data. That’s especially useful in concert with mutate(), as by default new columns are created at the right-hand edge of your tibble. This gives you an easy way of previewing changes in new columns:

stranded_data |> 
  mutate(decade = round(age, -1)) |>
  relocate(last_col())

# A tibble: 768 × 10
   decade stranded.label   age care.home.referral medicallysafe  hcop
    <dbl> <chr>          <int>              <int>         <int> <int>
 1     50 Not Stranded      50                  0             0     0
 2     30 Not Stranded      31                  1             0     1
 3     30 Not Stranded      32                  0             1     0
 4     70 Not Stranded      69                  1             1     0
 5     30 Not Stranded      33                  0             0     1
 6     80 Stranded          75                  1             1     0
 7     30 Not Stranded      26                  1             0     1
 8     60 Not Stranded      64                  0             1     1
 9     50 Not Stranded      53                  0             1     0
10     60 Not Stranded      63                  1             0     0
# ℹ 758 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
#   admit_date <chr>, frailty_index <chr>

`if_any()`, `if_all()`, `across()`, and `pick()`

Although not technically part of the tidyselect family, the final topic for this session is to explore a group of functions that work similarly: if_any() / if_all(), which allow you to effectively integrate basic filtering with selection, and across() / pick(), which allows you to apply functions across many columns.

stranded_data |>
  filter(if_any(contains("care"), ~ .x == 1)) # that'll find rows with a 1 in any "care" column

# A tibble: 723 × 9
   stranded.label   age care.home.referral medicallysafe  hcop
   <chr>          <int>              <int>         <int> <int>
 1 Not Stranded      50                  0             0     0
 2 Not Stranded      31                  1             0     1
 3 Not Stranded      32                  0             1     0
 4 Not Stranded      69                  1             1     0
 5 Not Stranded      33                  0             0     1
 6 Stranded          75                  1             1     0
 7 Not Stranded      26                  1             0     1
 8 Not Stranded      64                  0             1     1
 9 Not Stranded      63                  1             0     0
10 Not Stranded      30                  1             1     0
# ℹ 713 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
#   admit_date <chr>, frailty_index <chr>

stranded_data |>
  mutate(care_found = case_when(if_all(contains("care"), ~ .x >= 1) ~ "Yup",
                                TRUE ~ "Nope")) |>
  select(contains("care"), care_found)

# A tibble: 768 × 4
   care.home.referral mental_health_care periods_of_previous_care care_found
                <int>              <int>                    <int> <chr>     
 1                  0                  0                        1 Nope      
 2                  1                  0                        1 Nope      
 3                  0                  1                        1 Nope      
 4                  1                  1                        1 Yup       
 5                  0                  1                        1 Nope      
 6                  1                  1                        1 Yup       
 7                  1                  0                        1 Nope      
 8                  0                  0                        1 Nope      
 9                  0                  0                        5 Nope      
10                  1                  1                        1 Yup       
# ℹ 758 more rows

across() is used inside summarise() or mutate():

stranded_data |>
  mutate(across(where(is.character), toupper))

# A tibble: 768 × 9
   stranded.label   age care.home.referral medicallysafe  hcop
   <chr>          <int>              <int>         <int> <int>
 1 NOT STRANDED      50                  0             0     0
 2 NOT STRANDED      31                  1             0     1
 3 NOT STRANDED      32                  0             1     0
 4 NOT STRANDED      69                  1             1     0
 5 NOT STRANDED      33                  0             0     1
 6 STRANDED          75                  1             1     0
 7 NOT STRANDED      26                  1             0     1
 8 NOT STRANDED      64                  0             1     1
 9 NOT STRANDED      53                  0             1     0
10 NOT STRANDED      63                  1             0     0
# ℹ 758 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
#   admit_date <chr>, frailty_index <chr>

stranded_data |>
  mutate(across(medicallysafe:hcop, ~ .x * 20))

# A tibble: 768 × 9
   stranded.label   age care.home.referral medicallysafe  hcop
   <chr>          <int>              <int>         <dbl> <dbl>
 1 Not Stranded      50                  0             0     0
 2 Not Stranded      31                  1             0    20
 3 Not Stranded      32                  0            20     0
 4 Not Stranded      69                  1            20     0
 5 Not Stranded      33                  0             0    20
 6 Stranded          75                  1            20     0
 7 Not Stranded      26                  1             0    20
 8 Not Stranded      64                  0            20    20
 9 Not Stranded      53                  0            20     0
10 Not Stranded      63                  1             0     0
# ℹ 758 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
#   admit_date <chr>, frailty_index <chr>

cols <- c("mpg", "drat")

mtcars |>
   mutate(across(all_of(cols), round))

                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4            21   6 160.0 110    4 2.620 16.46  0  1    4    4
Mazda RX4 Wag        21   6 160.0 110    4 2.875 17.02  0  1    4    4
Datsun 710           23   4 108.0  93    4 2.320 18.61  1  1    4    1
Hornet 4 Drive       21   6 258.0 110    3 3.215 19.44  1  0    3    1
Hornet Sportabout    19   8 360.0 175    3 3.440 17.02  0  0    3    2
Valiant              18   6 225.0 105    3 3.460 20.22  1  0    3    1
Duster 360           14   8 360.0 245    3 3.570 15.84  0  0    3    4
Merc 240D            24   4 146.7  62    4 3.190 20.00  1  0    4    2
Merc 230             23   4 140.8  95    4 3.150 22.90  1  0    4    2
Merc 280             19   6 167.6 123    4 3.440 18.30  1  0    4    4
Merc 280C            18   6 167.6 123    4 3.440 18.90  1  0    4    4
Merc 450SE           16   8 275.8 180    3 4.070 17.40  0  0    3    3
Merc 450SL           17   8 275.8 180    3 3.730 17.60  0  0    3    3
Merc 450SLC          15   8 275.8 180    3 3.780 18.00  0  0    3    3
Cadillac Fleetwood   10   8 472.0 205    3 5.250 17.98  0  0    3    4
Lincoln Continental  10   8 460.0 215    3 5.424 17.82  0  0    3    4
Chrysler Imperial    15   8 440.0 230    3 5.345 17.42  0  0    3    4
Fiat 128             32   4  78.7  66    4 2.200 19.47  1  1    4    1
Honda Civic          30   4  75.7  52    5 1.615 18.52  1  1    4    2
Toyota Corolla       34   4  71.1  65    4 1.835 19.90  1  1    4    1
Toyota Corona        22   4 120.1  97    4 2.465 20.01  1  0    3    1
Dodge Challenger     16   8 318.0 150    3 3.520 16.87  0  0    3    2
AMC Javelin          15   8 304.0 150    3 3.435 17.30  0  0    3    2
Camaro Z28           13   8 350.0 245    4 3.840 15.41  0  0    3    4
Pontiac Firebird     19   8 400.0 175    3 3.845 17.05  0  0    3    2
Fiat X1-9            27   4  79.0  66    4 1.935 18.90  1  1    4    1
Porsche 914-2        26   4 120.3  91    4 2.140 16.70  0  1    5    2
Lotus Europa         30   4  95.1 113    4 1.513 16.90  1  1    5    2
Ford Pantera L       16   8 351.0 264    4 3.170 14.50  0  1    5    4
Ferrari Dino         20   6 145.0 175    4 2.770 15.50  0  1    5    6
Maserati Bora        15   8 301.0 335    4 3.570 14.60  0  1    5    8
Volvo 142E           21   4 121.0 109    4 2.780 18.60  1  1    4    2

stranded_data |>
  group_by(care.home.referral) |>
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))

# A tibble: 2 × 11
  care.home.referral age_mean age_sd medicallysafe_mean medicallysafe_sd
               <int>    <dbl>  <dbl>              <dbl>            <dbl>
1                  0     53.1   20.7              0.475            0.500
2                  1     52.3   20.5              0.471            0.500
# ℹ 6 more variables: hcop_mean <dbl>, hcop_sd <dbl>,
#   mental_health_care_mean <dbl>, mental_health_care_sd <dbl>,
#   periods_of_previous_care_mean <dbl>, periods_of_previous_care_sd <dbl>

stranded_data |>
  group_by(care.home.referral) |>
  summarise(across(where(is.numeric), mean, .names = "average_{.col}"))

# A tibble: 2 × 6
  care.home.referral average_age average_medicallysafe average_hcop
               <int>       <dbl>                 <dbl>        <dbl>
1                  0        53.1                 0.475        0.496
2                  1        52.3                 0.471        0.506
# ℹ 2 more variables: average_mental_health_care <dbl>,
#   average_periods_of_previous_care <dbl>

pick() effectively allows you to sub-set a tibble inside a data masking function. Effectively an integrated select():

stranded_data |>
  distinct(pick(contains("care")))

# A tibble: 37 × 3
   care.home.referral mental_health_care periods_of_previous_care
                <int>              <int>                    <int>
 1                  0                  0                        1
 2                  1                  0                        1
 3                  0                  1                        1
 4                  1                  1                        1
 5                  0                  0                        5
 6                  0                  0                        4
 7                  1                  1                        2
 8                  1                  1                        4
 9                  0                  1                        3
10                  1                  0                        2
# ℹ 27 more rows

stranded_data |>
  count(pick(ends_with("care")), sort = T)

# A tibble: 19 × 3
   mental_health_care periods_of_previous_care     n
                <int>                    <int> <int>
 1                  0                        1   293
 2                  1                        1   286
 3                  1                       NA    42
 4                  0                       NA    27
 5                  1                        2    18
 6                  0                        2    17
 7                  0                        3    17
 8                  1                        3    16
 9                  1                        4     8
10                  0                       10     7
11                  1                       10     7
12                  0                        8     5
13                  1                        7     5
14                  0                        4     4
15                  0                        5     4
16                  0                        7     4
17                  0                        6     3
18                  1                        8     3
19                  1                        5     2

Especially useful when you’re effectively working with a tibble sub-set of your data:

stranded_data |>
  mutate(rank = dense_rank(pick(periods_of_previous_care, age))) |>
  arrange(-rank) |>
  select(rank, age, periods_of_previous_care) # something like a rank by previous episodes, with age as a tie-breaker

# A tibble: 768 × 3
    rank   age periods_of_previous_care
   <int> <int>                    <int>
 1   152    75                       10
 2   151    71                       10
 3   150    68                       10
 4   149    66                       10
 5   148    65                       10
 6   148    65                       10
 7   147    63                       10
 8   146    61                       10
 9   145    51                       10
10   144    42                       10
# ℹ 758 more rows

Introduction

Logical operators

contains()

matches()

where()

any_of()

everything()

last_col()

if_any(), if_all(), across(), and pick()