Exploring data with dplyr

intermediate

Published

May 19, 2025

No feedback found for this session

Session outline

This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to do three things with dplyr:

build your fluency with basic dplyr functionality
show some of the similarities of approach across the package, again to help you master the syntax
learn some of the more advanced functions to harness the power available beneath the surface

You might also like some of the other dplyr-themed practical sessions:

A starting-point

You should have at least some previous experience working with dplyr before starting this training. Specifically, you should be familiar with at least:

select(), to select columns from data
filter(), to select rows from data
mutate(), to make new columns from existing columns

Investigating data

Dplyr is usually used as a toolkit for manipulating tidy data. This session concentrates on some lesser-used functions in the dplyr package that are especially useful earlier in the analysis lifecycle when exploring new data: slice(), glimpse(), rename(), relocate(), case_when()/case_match(), and consecutive_id()

Package loading as follows:

library(NHSRdatasets)
library(dplyr)

`slice()`

It’s definitely worth pointing you towards the slice() manual page - it’s clear and interesting.

At its simplest, slice() is an alternative to head:

ae_attendances |> 
  slice(1:6)

# A tibble: 6 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2017-03-01 RF4      1           21289     2879       5060
2 2017-03-01 RF4      2             813       22          0
3 2017-03-01 RF4      other        2850        6          0
4 2017-03-01 R1H      1           30210     5902       6943
5 2017-03-01 R1H      2             807       11          0
6 2017-03-01 R1H      other       11352      136          0

The number supplied to slice() returns that row - so slice(3) shows the third row. You can also provide a sequence or vector of rows:

ae_attendances |> 
  slice(1:3)

# A tibble: 3 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2017-03-01 RF4      1           21289     2879       5060
2 2017-03-01 RF4      2             813       22          0
3 2017-03-01 RF4      other        2850        6          0

ae_attendances |> 
  slice(2,5,918)

# A tibble: 3 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2017-03-01 RF4      2             813       22          0
2 2017-03-01 R1H      2             807       11          0
3 2017-01-01 RQ6      other       10253        0          0

More interestingly, you can group_by(), then slice, to see the first row of each group. Here, the data is filtered, grouped by org_code, then we slice the top row of each group:

ae_attendances |>
  filter(attendances >= 22000) |>
  group_by(org_code) |>
  slice(1)

# A tibble: 4 × 6
# Groups:   org_code [4]
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2017-10-01 R0A      1           22392     3628       6836
2 2017-03-01 R1H      1           30210     5902       6943
3 2019-03-01 RRK      1           32017    10670      10850
4 2017-03-01 RW6      1           22930     4997       5328

(we’ll explore group_by() in the summarising data with dplyr session, so don’t worry if this is unfamiliar)

The slice_min() and slice_max() functions are also useful, giving you something like an integrated arrange() for looking at slices of the data:

ae_attendances |>
  filter(type == "1") |>
  slice_max(breaches, n=1)

# A tibble: 1 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2019-01-01 RRK      1           31935    12502      11493

# equivalent to
ae_attendances |>
  arrange(desc(breaches)) |>
  slice(1)

# A tibble: 1 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2019-01-01 RRK      1           31935    12502      11493

(note that you need to explicitly name the number of rows you want using n=...)

slice_min() and slice_max() are particularly useful for groups:

ae_attendances |>
  filter(type == "1") |>
  group_by(org_code) |>
  slice_max(breaches, n=1)

# A tibble: 140 × 6
# Groups:   org_code [140]
   period     org_code type  attendances breaches admissions
   <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
 1 2019-01-01 R0A      1           21980     6206       6985
 2 2018-11-01 R1F      1            3810     1262        967
 3 2017-01-01 R1H      1           28420     7426       6435
 4 2017-01-01 R1K      1           11961     5897       4587
 5 2019-01-01 RA2      1            6242     1071       2513
 6 2017-01-01 RA3      1            4073     1471       1074
 7 2016-06-01 RA4      1            3992      454       1057
 8 2016-11-01 RA7      1            9625     2445       3064
 9 2019-01-01 RA9      1            6056     2053       2557
10 2019-03-01 RAE      1           10698     3438       2854
# ℹ 130 more rows

slice_sample() gives a random sampling of the data.

ae_attendances |>
  slice_sample(n=6)

# A tibble: 6 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2018-10-01 RJR      1            6192     1150       1850
2 2016-11-01 RX1      1           14832     3513       4125
3 2017-10-01 R1F      other         671        0          0
4 2018-06-01 R1H      other       13502      122          0
5 2017-02-01 RY4      other         794        1          0
6 2017-08-01 RXK      other        2417        0          0

There are also slice_head() and slice_tail() functions which work in the same way.

Final tip - slice with a negative index can be used as like filter() to remove the specified row(s):

# returns the last 5 rows only
ae_attendances |> 
  slice(-1:-12760)

# A tibble: 5 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2018-04-01 Y02584   other         471        0          0
2 2018-04-01 RA3      1            3825      476       1016
3 2018-04-01 AXG      other        2980       24          0
4 2018-04-01 NLX24    other        1538        0          0
5 2018-04-01 RA4      1            4388       82       1292

`glimpse()`

glimpse() is a data viewing function similar to print(), except it transposes the data so that each column is displayed as a row. This is particularly useful for wide data with many columns, especially when you are interested in checking the class (date, character, etc.) of your columns.

synthetic_news_data |> 
  glimpse()

Rows: 1,000
Columns: 12
$ male  <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
$ age   <int> 68, 94, 85, 44, 77, 58, 25, 69, 91, 70, 87, 93, 61, 75, 97, 80, …
$ NEWS  <int> 3, 1, 0, 0, 1, 1, 4, 0, 1, 1, 7, 2, 5, 1, 1, 3, 1, 5, 0, 2, 1, 2…
$ syst  <int> 150, 145, 169, 154, 122, 146, 65, 116, 162, 132, 110, 166, 123, …
$ dias  <int> 98, 67, 69, 106, 67, 106, 42, 56, 72, 96, 85, 90, 78, 80, 72, 81…
$ temp  <dbl> 36.8, 35.0, 36.2, 36.9, 36.4, 35.3, 35.6, 37.2, 35.5, 35.3, 37.0…
$ pulse <int> 78, 62, 54, 80, 62, 73, 72, 90, 60, 67, 95, 87, 93, 65, 89, 145,…
$ resp  <int> 26, 18, 18, 17, 20, 20, 12, 16, 16, 16, 24, 16, 26, 12, 16, 16, …
$ sat   <int> 96, 96, 96, 96, 95, 98, 99, 97, 99, 97, 87, 95, 96, 96, 98, 99, …
$ sup   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ died  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Note the useful dimension information in the first two lines, and the class information in angle brackets. Okay, so there are lots of similar ways of displaying the same information, but glimpse() is nice and concise both to write and to read. Compare a couple of base r (or purrr) near-equivalents, which are especially messy when the data is quite wide:

synthetic_news_data |> 
  purrr::map_df(class)

# A tibble: 1 × 12
  male    age     NEWS    syst   dias  temp  pulse resp  sat   sup   alert died 
  <chr>   <chr>   <chr>   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 integer integer integer integ… inte… nume… inte… inte… inte… inte… inte… inte…

# equivalent in base R
lapply(synthetic_news_data, class)

$male
[1] "integer"

$age
[1] "integer"

$NEWS
[1] "integer"

$syst
[1] "integer"

$dias
[1] "integer"

$temp
[1] "numeric"

$pulse
[1] "integer"

$resp
[1] "integer"

$sat
[1] "integer"

$sup
[1] "integer"

$alert
[1] "integer"

$died
[1] "integer"

glimpse() plays nicely with the pipe, meaning that it’s potentially useful while you’re working on a complicated data transformation. If you have lots of stages piped together, you can insert glimpse() in periodically to check that each stage of your transformation is working as expected:

synthetic_news_data |> 
  glimpse() |>
  filter(age == 71 & male == 0) |>
  glimpse() |>
  mutate(pulse_pres = syst-dias) |>
  glimpse()

Rows: 1,000
Columns: 12
$ male  <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
$ age   <int> 68, 94, 85, 44, 77, 58, 25, 69, 91, 70, 87, 93, 61, 75, 97, 80, …
$ NEWS  <int> 3, 1, 0, 0, 1, 1, 4, 0, 1, 1, 7, 2, 5, 1, 1, 3, 1, 5, 0, 2, 1, 2…
$ syst  <int> 150, 145, 169, 154, 122, 146, 65, 116, 162, 132, 110, 166, 123, …
$ dias  <int> 98, 67, 69, 106, 67, 106, 42, 56, 72, 96, 85, 90, 78, 80, 72, 81…
$ temp  <dbl> 36.8, 35.0, 36.2, 36.9, 36.4, 35.3, 35.6, 37.2, 35.5, 35.3, 37.0…
$ pulse <int> 78, 62, 54, 80, 62, 73, 72, 90, 60, 67, 95, 87, 93, 65, 89, 145,…
$ resp  <int> 26, 18, 18, 17, 20, 20, 12, 16, 16, 16, 24, 16, 26, 12, 16, 16, …
$ sat   <int> 96, 96, 96, 96, 95, 98, 99, 97, 99, 97, 87, 95, 96, 96, 98, 99, …
$ sup   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ died  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Rows: 11
Columns: 12
$ male  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ age   <int> 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71
$ NEWS  <int> 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 3
$ syst  <int> 128, 128, 137, 128, 114, 106, 128, 114, 132, 137, 102
$ dias  <int> 67, 81, 61, 81, 62, 56, 82, 66, 96, 78, 51
$ temp  <dbl> 38.2, 37.9, 36.4, 36.2, 35.9, 36.2, 37.9, 38.2, 35.4, 37.0, 35.8
$ pulse <int> 99, 101, 102, 73, 87, 72, 110, 103, 67, 102, 54
$ resp  <int> 17, 18, 17, 18, 16, 15, 17, 14, 17, 19, 18
$ sat   <int> 98, 96, 95, 95, 95, 97, 97, 99, 99, 95, 100
$ sup   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ died  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Rows: 11
Columns: 13
$ male       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ age        <int> 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71
$ NEWS       <int> 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 3
$ syst       <int> 128, 128, 137, 128, 114, 106, 128, 114, 132, 137, 102
$ dias       <int> 67, 81, 61, 81, 62, 56, 82, 66, 96, 78, 51
$ temp       <dbl> 38.2, 37.9, 36.4, 36.2, 35.9, 36.2, 37.9, 38.2, 35.4, 37.0,…
$ pulse      <int> 99, 101, 102, 73, 87, 72, 110, 103, 67, 102, 54
$ resp       <int> 17, 18, 17, 18, 16, 15, 17, 14, 17, 19, 18
$ sat        <int> 98, 96, 95, 95, 95, 97, 97, 99, 99, 95, 100
$ sup        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
$ alert      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ died       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ pulse_pres <int> 61, 47, 76, 47, 52, 50, 46, 48, 36, 59, 51

`rename()` (and `rename_with()`)

rename() renames columns. Like mutate(), the general syntax for rename() is new name = old name.

LOS_model |>
  rename(age = Age)

# A tibble: 300 × 5
      ID Organisation   age   LOS Death
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     2     0
 2     2 Trust2          27     1     0
 3     3 Trust3          93    12     0
 4     4 Trust4          45     3     1
 5     5 Trust5          70    11     0
 6     6 Trust6          60     7     0
 7     7 Trust7          25     4     0
 8     8 Trust8          48     4     0
 9     9 Trust9          51     7     1
10    10 Trust10         81     1     0
# ℹ 290 more rows

There’s not much more to say about the basic rename() function beyond that. However, the scoped variants, rename_with() and rename_all() provide some useful additional tools. For example, we can use tolower() - which converts strings to all lower-case - to rename multiple columns at once:

LOS_model |> 
  rename_with(tolower)

# A tibble: 300 × 5
      id organisation   age   los death
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     2     0
 2     2 Trust2          27     1     0
 3     3 Trust3          93    12     0
 4     4 Trust4          45     3     1
 5     5 Trust5          70    11     0
 6     6 Trust6          60     7     0
 7     7 Trust7          25     4     0
 8     8 Trust8          48     4     0
 9     9 Trust9          51     7     1
10    10 Trust10         81     1     0
# ℹ 290 more rows

rename() works well with tidyselect operators too, as we discuss in that session.

`relocate()`

relocate() re-arranges the order of columns:

LOS_model

# A tibble: 300 × 5
      ID Organisation   Age   LOS Death
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     2     0
 2     2 Trust2          27     1     0
 3     3 Trust3          93    12     0
 4     4 Trust4          45     3     1
 5     5 Trust5          70    11     0
 6     6 Trust6          60     7     0
 7     7 Trust7          25     4     0
 8     8 Trust8          48     4     0
 9     9 Trust9          51     7     1
10    10 Trust10         81     1     0
# ℹ 290 more rows

relocate()’s default behaviour is to move specified columns to the far left:

LOS_model |>
  relocate(LOS)

# A tibble: 300 × 5
     LOS    ID Organisation   Age Death
   <int> <int> <ord>        <int> <int>
 1     2     1 Trust1          55     0
 2     1     2 Trust2          27     0
 3    12     3 Trust3          93     0
 4     3     4 Trust4          45     1
 5    11     5 Trust5          70     0
 6     7     6 Trust6          60     0
 7     4     7 Trust7          25     0
 8     4     8 Trust8          48     0
 9     7     9 Trust9          51     1
10     1    10 Trust10         81     0
# ℹ 290 more rows

You can specify .before and .after if you need finer control over where your column ends up:

LOS_model |>
  relocate(LOS, .after=Death)

# A tibble: 300 × 5
      ID Organisation   Age Death   LOS
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     0     2
 2     2 Trust2          27     0     1
 3     3 Trust3          93     0    12
 4     4 Trust4          45     1     3
 5     5 Trust5          70     0    11
 6     6 Trust6          60     0     7
 7     7 Trust7          25     0     4
 8     8 Trust8          48     0     4
 9     9 Trust9          51     1     7
10    10 Trust10         81     0     1
# ℹ 290 more rows

LOS_model |>
  relocate(Death, .before=last_col()) # little bit of tidyselect

# A tibble: 300 × 5
      ID Organisation   Age   LOS Death
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     2     0
 2     2 Trust2          27     1     0
 3     3 Trust3          93    12     0
 4     4 Trust4          45     3     1
 5     5 Trust5          70    11     0
 6     6 Trust6          60     7     0
 7     7 Trust7          25     4     0
 8     8 Trust8          48     4     0
 9     9 Trust9          51     7     1
10    10 Trust10         81     1     0
# ℹ 290 more rows

`case_when()`

The ONS mortality data in NHSRdatasets has several categories

ons_mortality |>
  count(category_1)

# A tibble: 9 × 2
  category_1                                                                   n
  <chr>                                                                    <int>
1 All respiratory diseases (ICD-10 J00-J99) ICD-10                           536
2 Deaths where COVID-19 was mentioned on the death certificate (ICD-10 U0…    14
3 Females                                                                   3927
4 Males                                                                     3927
5 Persons                                                                   3927
6 Region                                                                    5350
7 Total deaths                                                              1056
8 average of same week over 5 years                                           14
9 <NA>                                                                        52

Winter (metrological winter, for clarity) runs from the start of December to the end of February. Can we class the total deaths into winter, and non-winter categories? In practice, I’d be tempted to use lubridate::quarter() with an appropriate starting offset for this kind of work if possible, especially if I was grouping into regular groups. Winter/non-winter isn’t regular though, so case_when() is the better option. Here, we’ll work in stages. First, as we’ll need the months, we’ll extract those using lubridate::month():

ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date))

# A tibble: 1,056 × 6
   category_1   category_2 counts date       week_no month
   <chr>        <chr>       <dbl> <date>       <int> <dbl>
 1 Total deaths all ages    12968 2010-01-08       1     1
 2 Total deaths all ages    12541 2010-01-15       2     1
 3 Total deaths all ages    11762 2010-01-22       3     1
 4 Total deaths all ages    11056 2010-01-29       4     1
 5 Total deaths all ages    10524 2010-02-05       5     2
 6 Total deaths all ages    10117 2010-02-12       6     2
 7 Total deaths all ages    10102 2010-02-19       7     2
 8 Total deaths all ages    10295 2010-02-26       8     2
 9 Total deaths all ages     9981 2010-03-05       9     3
10 Total deaths all ages     9792 2010-03-12      10     3
# ℹ 1,046 more rows

Our definition of winter is any case where that month is 1, 2, or 12. So use case_when() to catch Jan and Feb:

ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date)) |>
  mutate(thing = case_when(month < 3 ~ "Winter"))

# A tibble: 1,056 × 7
   category_1   category_2 counts date       week_no month thing 
   <chr>        <chr>       <dbl> <date>       <int> <dbl> <chr> 
 1 Total deaths all ages    12968 2010-01-08       1     1 Winter
 2 Total deaths all ages    12541 2010-01-15       2     1 Winter
 3 Total deaths all ages    11762 2010-01-22       3     1 Winter
 4 Total deaths all ages    11056 2010-01-29       4     1 Winter
 5 Total deaths all ages    10524 2010-02-05       5     2 Winter
 6 Total deaths all ages    10117 2010-02-12       6     2 Winter
 7 Total deaths all ages    10102 2010-02-19       7     2 Winter
 8 Total deaths all ages    10295 2010-02-26       8     2 Winter
 9 Total deaths all ages     9981 2010-03-05       9     3 <NA>  
10 Total deaths all ages     9792 2010-03-12      10     3 <NA>  
# ℹ 1,046 more rows

and then add an extra case to deal with December:

ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date)) |>
  mutate(season = case_when(month < 3 ~ "Winter",
                           month == 12 ~ "Winter"))

# A tibble: 1,056 × 7
   category_1   category_2 counts date       week_no month season
   <chr>        <chr>       <dbl> <date>       <int> <dbl> <chr> 
 1 Total deaths all ages    12968 2010-01-08       1     1 Winter
 2 Total deaths all ages    12541 2010-01-15       2     1 Winter
 3 Total deaths all ages    11762 2010-01-22       3     1 Winter
 4 Total deaths all ages    11056 2010-01-29       4     1 Winter
 5 Total deaths all ages    10524 2010-02-05       5     2 Winter
 6 Total deaths all ages    10117 2010-02-12       6     2 Winter
 7 Total deaths all ages    10102 2010-02-19       7     2 Winter
 8 Total deaths all ages    10295 2010-02-26       8     2 Winter
 9 Total deaths all ages     9981 2010-03-05       9     3 <NA>  
10 Total deaths all ages     9792 2010-03-12      10     3 <NA>  
# ℹ 1,046 more rows

And finally a default to catch all the non-Winter values:

seasonal_ons_mortality <- ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date)) |>
  mutate(season = case_when(month < 3 ~ "Winter",
                           month == 12 ~ "Winter",
                           TRUE ~ "Not winter"))
seasonal_ons_mortality

# A tibble: 1,056 × 7
   category_1   category_2 counts date       week_no month season    
   <chr>        <chr>       <dbl> <date>       <int> <dbl> <chr>     
 1 Total deaths all ages    12968 2010-01-08       1     1 Winter    
 2 Total deaths all ages    12541 2010-01-15       2     1 Winter    
 3 Total deaths all ages    11762 2010-01-22       3     1 Winter    
 4 Total deaths all ages    11056 2010-01-29       4     1 Winter    
 5 Total deaths all ages    10524 2010-02-05       5     2 Winter    
 6 Total deaths all ages    10117 2010-02-12       6     2 Winter    
 7 Total deaths all ages    10102 2010-02-19       7     2 Winter    
 8 Total deaths all ages    10295 2010-02-26       8     2 Winter    
 9 Total deaths all ages     9981 2010-03-05       9     3 Not winter
10 Total deaths all ages     9792 2010-03-12      10     3 Not winter
# ℹ 1,046 more rows

seasonal_ons_mortality |>
  count(season)

# A tibble: 2 × 2
  season         n
  <chr>      <int>
1 Not winter   793
2 Winter       263

`case_match()`

Rather than a logical test, case_match() matches against values rather than performing a logical test. This is excellent for the winter classification example:

ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date)) |>
  mutate(season = case_match(month, c(1, 2, 12) ~ "Winter",
                             .default = "Not winter"))

# A tibble: 1,056 × 7
   category_1   category_2 counts date       week_no month season    
   <chr>        <chr>       <dbl> <date>       <int> <dbl> <chr>     
 1 Total deaths all ages    12968 2010-01-08       1     1 Winter    
 2 Total deaths all ages    12541 2010-01-15       2     1 Winter    
 3 Total deaths all ages    11762 2010-01-22       3     1 Winter    
 4 Total deaths all ages    11056 2010-01-29       4     1 Winter    
 5 Total deaths all ages    10524 2010-02-05       5     2 Winter    
 6 Total deaths all ages    10117 2010-02-12       6     2 Winter    
 7 Total deaths all ages    10102 2010-02-19       7     2 Winter    
 8 Total deaths all ages    10295 2010-02-26       8     2 Winter    
 9 Total deaths all ages     9981 2010-03-05       9     3 Not winter
10 Total deaths all ages     9792 2010-03-12      10     3 Not winter
# ℹ 1,046 more rows

`consecutive_id()`

consecutive_id(c(1, 1, 1, 2, 1, 1, 2, 2)) # puts groups together by changes. This would have been very useful for my transcripts

[1] 1 1 1 2 3 3 4 4

repeats <- tibble(speaker = c("steve", "steve", "emma", "steve", "emma", "emma"),
       comment = letters[1:6]) # repeated data

repeats |>
  mutate(group = consecutive_id(speaker)) # add a group that increments with each change of speaker

# A tibble: 6 × 3
  speaker comment group
  <chr>   <chr>   <int>
1 steve   a           1
2 steve   b           1
3 emma    c           2
4 steve   d           3
5 emma    e           4
6 emma    f           4

repeats |>
  mutate(group = consecutive_id(speaker)) |>
  group_by(group) |>
  mutate(comment = paste(comment, collapse = " ")) |>
  distinct()

# A tibble: 4 × 3
# Groups:   group [4]
  speaker comment group
  <chr>   <chr>   <int>
1 steve   a b         1
2 emma    c           2
3 steve   d           3
4 emma    e f         4

Session outline

A starting-point

Investigating data

slice()

glimpse()

rename() (and rename_with())

relocate()

case_when()

case_match()

consecutive_id()