Exploring data with dplyr

R
intermediate
Published

May 19, 2025

No feedback found for this session

Session outline

This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to do three things with dplyr:

  • build your fluency with basic dplyr functionality
  • show some of the similarities of approach across the package, again to help you master the syntax
  • learn some of the more advanced functions to harness the power available beneath the surface

You might also like some of the other dplyr-themed practical sessions:

A starting-point

You should have at least some previous experience working with dplyr before starting this training. Specifically, you should be familiar with at least:

  • select(), to select columns from data
  • filter(), to select rows from data
  • mutate(), to make new columns from existing columns

Investigating data

Dplyr is usually used as a toolkit for manipulating tidy data. This session concentrates on some lesser-used functions in the dplyr package that are especially useful earlier in the analysis lifecycle when exploring new data: slice(), glimpse(), rename(), relocate(), case_when()/case_match(), and consecutive_id()

Package loading as follows:

slice()

It’s definitely worth pointing you towards the slice() manual page - it’s clear and interesting.

At its simplest, slice() is an alternative to head:

ae_attendances |> 
  slice(1:6)
# A tibble: 6 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2017-03-01 RF4      1           21289     2879       5060
2 2017-03-01 RF4      2             813       22          0
3 2017-03-01 RF4      other        2850        6          0
4 2017-03-01 R1H      1           30210     5902       6943
5 2017-03-01 R1H      2             807       11          0
6 2017-03-01 R1H      other       11352      136          0

The number supplied to slice() returns that row - so slice(3) shows the third row. You can also provide a sequence or vector of rows:

ae_attendances |> 
  slice(1:3)
# A tibble: 3 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2017-03-01 RF4      1           21289     2879       5060
2 2017-03-01 RF4      2             813       22          0
3 2017-03-01 RF4      other        2850        6          0
ae_attendances |> 
  slice(2,5,918)
# A tibble: 3 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2017-03-01 RF4      2             813       22          0
2 2017-03-01 R1H      2             807       11          0
3 2017-01-01 RQ6      other       10253        0          0

More interestingly, you can group_by(), then slice, to see the first row of each group. Here, the data is filtered, grouped by org_code, then we slice the top row of each group:

ae_attendances |>
  filter(attendances >= 22000) |>
  group_by(org_code) |>
  slice(1)
# A tibble: 4 × 6
# Groups:   org_code [4]
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2017-10-01 R0A      1           22392     3628       6836
2 2017-03-01 R1H      1           30210     5902       6943
3 2019-03-01 RRK      1           32017    10670      10850
4 2017-03-01 RW6      1           22930     4997       5328

(we’ll explore group_by() in the summarising data with dplyr session, so don’t worry if this is unfamiliar)

The slice_min() and slice_max() functions are also useful, giving you something like an integrated arrange() for looking at slices of the data:

ae_attendances |>
  filter(type == "1") |>
  slice_max(breaches, n=1)
# A tibble: 1 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2019-01-01 RRK      1           31935    12502      11493
# equivalent to
ae_attendances |>
  arrange(desc(breaches)) |>
  slice(1)
# A tibble: 1 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2019-01-01 RRK      1           31935    12502      11493

(note that you need to explicitly name the number of rows you want using n=...)

slice_min() and slice_max() are particularly useful for groups:

ae_attendances |>
  filter(type == "1") |>
  group_by(org_code) |>
  slice_max(breaches, n=1) 
# A tibble: 140 × 6
# Groups:   org_code [140]
   period     org_code type  attendances breaches admissions
   <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
 1 2019-01-01 R0A      1           21980     6206       6985
 2 2018-11-01 R1F      1            3810     1262        967
 3 2017-01-01 R1H      1           28420     7426       6435
 4 2017-01-01 R1K      1           11961     5897       4587
 5 2019-01-01 RA2      1            6242     1071       2513
 6 2017-01-01 RA3      1            4073     1471       1074
 7 2016-06-01 RA4      1            3992      454       1057
 8 2016-11-01 RA7      1            9625     2445       3064
 9 2019-01-01 RA9      1            6056     2053       2557
10 2019-03-01 RAE      1           10698     3438       2854
# ℹ 130 more rows

slice_sample() gives a random sampling of the data.

ae_attendances |>
  slice_sample(n=6) 
# A tibble: 6 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2017-06-01 RMC      1            8793     1491       2056
2 2017-01-01 Y03047   other        2037        0          0
3 2017-03-01 RYJ      other       11156      190          0
4 2018-02-01 REF      1            5140     2232       2204
5 2018-09-01 RVR      2             427        1         12
6 2016-07-01 R1D      2             363        0          0

There are also slice_head() and slice_tail() functions which work in the same way.

Final tip - slice with a negative index can be used as like filter() to remove the specified row(s):

# returns the last 5 rows only
ae_attendances |> 
  slice(-1:-12760)
# A tibble: 5 × 6
  period     org_code type  attendances breaches admissions
  <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>
1 2018-04-01 Y02584   other         471        0          0
2 2018-04-01 RA3      1            3825      476       1016
3 2018-04-01 AXG      other        2980       24          0
4 2018-04-01 NLX24    other        1538        0          0
5 2018-04-01 RA4      1            4388       82       1292

glimpse()

glimpse() is a data viewing function similar to print(), except it transposes the data so that each column is displayed as a row. This is particularly useful for wide data with many columns, especially when you are interested in checking the class (date, character, etc.) of your columns.

synthetic_news_data |> 
  glimpse() 
Rows: 1,000
Columns: 12
$ male  <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
$ age   <int> 68, 94, 85, 44, 77, 58, 25, 69, 91, 70, 87, 93, 61, 75, 97, 80, …
$ NEWS  <int> 3, 1, 0, 0, 1, 1, 4, 0, 1, 1, 7, 2, 5, 1, 1, 3, 1, 5, 0, 2, 1, 2…
$ syst  <int> 150, 145, 169, 154, 122, 146, 65, 116, 162, 132, 110, 166, 123, …
$ dias  <int> 98, 67, 69, 106, 67, 106, 42, 56, 72, 96, 85, 90, 78, 80, 72, 81…
$ temp  <dbl> 36.8, 35.0, 36.2, 36.9, 36.4, 35.3, 35.6, 37.2, 35.5, 35.3, 37.0…
$ pulse <int> 78, 62, 54, 80, 62, 73, 72, 90, 60, 67, 95, 87, 93, 65, 89, 145,…
$ resp  <int> 26, 18, 18, 17, 20, 20, 12, 16, 16, 16, 24, 16, 26, 12, 16, 16, …
$ sat   <int> 96, 96, 96, 96, 95, 98, 99, 97, 99, 97, 87, 95, 96, 96, 98, 99, …
$ sup   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ died  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Note the useful dimension information in the first two lines, and the class information in angle brackets. Okay, so there are lots of similar ways of displaying the same information, but glimpse() is nice and concise both to write and to read. Compare a couple of base r (or purrr) near-equivalents, which are especially messy when the data is quite wide:

synthetic_news_data |> 
  purrr::map_df(class)
# A tibble: 1 × 12
  male    age     NEWS    syst   dias  temp  pulse resp  sat   sup   alert died 
  <chr>   <chr>   <chr>   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 integer integer integer integ… inte… nume… inte… inte… inte… inte… inte… inte…
# equivalent in base R
lapply(synthetic_news_data, class)
$male
[1] "integer"

$age
[1] "integer"

$NEWS
[1] "integer"

$syst
[1] "integer"

$dias
[1] "integer"

$temp
[1] "numeric"

$pulse
[1] "integer"

$resp
[1] "integer"

$sat
[1] "integer"

$sup
[1] "integer"

$alert
[1] "integer"

$died
[1] "integer"

glimpse() plays nicely with the pipe, meaning that it’s potentially useful while you’re working on a complicated data transformation. If you have lots of stages piped together, you can insert glimpse() in periodically to check that each stage of your transformation is working as expected:

synthetic_news_data |> 
  glimpse() |>
  filter(age == 71 & male == 0) |>
  glimpse() |>
  mutate(pulse_pres = syst-dias) |>
  glimpse()
Rows: 1,000
Columns: 12
$ male  <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
$ age   <int> 68, 94, 85, 44, 77, 58, 25, 69, 91, 70, 87, 93, 61, 75, 97, 80, …
$ NEWS  <int> 3, 1, 0, 0, 1, 1, 4, 0, 1, 1, 7, 2, 5, 1, 1, 3, 1, 5, 0, 2, 1, 2…
$ syst  <int> 150, 145, 169, 154, 122, 146, 65, 116, 162, 132, 110, 166, 123, …
$ dias  <int> 98, 67, 69, 106, 67, 106, 42, 56, 72, 96, 85, 90, 78, 80, 72, 81…
$ temp  <dbl> 36.8, 35.0, 36.2, 36.9, 36.4, 35.3, 35.6, 37.2, 35.5, 35.3, 37.0…
$ pulse <int> 78, 62, 54, 80, 62, 73, 72, 90, 60, 67, 95, 87, 93, 65, 89, 145,…
$ resp  <int> 26, 18, 18, 17, 20, 20, 12, 16, 16, 16, 24, 16, 26, 12, 16, 16, …
$ sat   <int> 96, 96, 96, 96, 95, 98, 99, 97, 99, 97, 87, 95, 96, 96, 98, 99, …
$ sup   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ died  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Rows: 11
Columns: 12
$ male  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ age   <int> 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71
$ NEWS  <int> 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 3
$ syst  <int> 128, 128, 137, 128, 114, 106, 128, 114, 132, 137, 102
$ dias  <int> 67, 81, 61, 81, 62, 56, 82, 66, 96, 78, 51
$ temp  <dbl> 38.2, 37.9, 36.4, 36.2, 35.9, 36.2, 37.9, 38.2, 35.4, 37.0, 35.8
$ pulse <int> 99, 101, 102, 73, 87, 72, 110, 103, 67, 102, 54
$ resp  <int> 17, 18, 17, 18, 16, 15, 17, 14, 17, 19, 18
$ sat   <int> 98, 96, 95, 95, 95, 97, 97, 99, 99, 95, 100
$ sup   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ died  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Rows: 11
Columns: 13
$ male       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ age        <int> 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71
$ NEWS       <int> 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 3
$ syst       <int> 128, 128, 137, 128, 114, 106, 128, 114, 132, 137, 102
$ dias       <int> 67, 81, 61, 81, 62, 56, 82, 66, 96, 78, 51
$ temp       <dbl> 38.2, 37.9, 36.4, 36.2, 35.9, 36.2, 37.9, 38.2, 35.4, 37.0,…
$ pulse      <int> 99, 101, 102, 73, 87, 72, 110, 103, 67, 102, 54
$ resp       <int> 17, 18, 17, 18, 16, 15, 17, 14, 17, 19, 18
$ sat        <int> 98, 96, 95, 95, 95, 97, 97, 99, 99, 95, 100
$ sup        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
$ alert      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ died       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ pulse_pres <int> 61, 47, 76, 47, 52, 50, 46, 48, 36, 59, 51

rename() (and rename_with())

rename() renames columns. Like mutate(), the general syntax for rename() is new name = old name.

LOS_model |>
  rename(age = Age) 
# A tibble: 300 × 5
      ID Organisation   age   LOS Death
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     2     0
 2     2 Trust2          27     1     0
 3     3 Trust3          93    12     0
 4     4 Trust4          45     3     1
 5     5 Trust5          70    11     0
 6     6 Trust6          60     7     0
 7     7 Trust7          25     4     0
 8     8 Trust8          48     4     0
 9     9 Trust9          51     7     1
10    10 Trust10         81     1     0
# ℹ 290 more rows

There’s not much more to say about the basic rename() function beyond that. However, the scoped variants, rename_with() and rename_all() provide some useful additional tools. For example, we can use tolower() - which converts strings to all lower-case - to rename multiple columns at once:

LOS_model |> 
  rename_with(tolower) 
# A tibble: 300 × 5
      id organisation   age   los death
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     2     0
 2     2 Trust2          27     1     0
 3     3 Trust3          93    12     0
 4     4 Trust4          45     3     1
 5     5 Trust5          70    11     0
 6     6 Trust6          60     7     0
 7     7 Trust7          25     4     0
 8     8 Trust8          48     4     0
 9     9 Trust9          51     7     1
10    10 Trust10         81     1     0
# ℹ 290 more rows

rename() works well with tidyselect operators too, as we discuss in that session.

relocate()

relocate() re-arranges the order of columns:

LOS_model 
# A tibble: 300 × 5
      ID Organisation   Age   LOS Death
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     2     0
 2     2 Trust2          27     1     0
 3     3 Trust3          93    12     0
 4     4 Trust4          45     3     1
 5     5 Trust5          70    11     0
 6     6 Trust6          60     7     0
 7     7 Trust7          25     4     0
 8     8 Trust8          48     4     0
 9     9 Trust9          51     7     1
10    10 Trust10         81     1     0
# ℹ 290 more rows

relocate()’s default behaviour is to move specified columns to the far left:

LOS_model |>
  relocate(LOS) 
# A tibble: 300 × 5
     LOS    ID Organisation   Age Death
   <int> <int> <ord>        <int> <int>
 1     2     1 Trust1          55     0
 2     1     2 Trust2          27     0
 3    12     3 Trust3          93     0
 4     3     4 Trust4          45     1
 5    11     5 Trust5          70     0
 6     7     6 Trust6          60     0
 7     4     7 Trust7          25     0
 8     4     8 Trust8          48     0
 9     7     9 Trust9          51     1
10     1    10 Trust10         81     0
# ℹ 290 more rows

You can specify .before and .after if you need finer control over where your column ends up:

LOS_model |>
  relocate(LOS, .after=Death)
# A tibble: 300 × 5
      ID Organisation   Age Death   LOS
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     0     2
 2     2 Trust2          27     0     1
 3     3 Trust3          93     0    12
 4     4 Trust4          45     1     3
 5     5 Trust5          70     0    11
 6     6 Trust6          60     0     7
 7     7 Trust7          25     0     4
 8     8 Trust8          48     0     4
 9     9 Trust9          51     1     7
10    10 Trust10         81     0     1
# ℹ 290 more rows
LOS_model |>
  relocate(Death, .before=last_col()) # little bit of tidyselect
# A tibble: 300 × 5
      ID Organisation   Age   LOS Death
   <int> <ord>        <int> <int> <int>
 1     1 Trust1          55     2     0
 2     2 Trust2          27     1     0
 3     3 Trust3          93    12     0
 4     4 Trust4          45     3     1
 5     5 Trust5          70    11     0
 6     6 Trust6          60     7     0
 7     7 Trust7          25     4     0
 8     8 Trust8          48     4     0
 9     9 Trust9          51     7     1
10    10 Trust10         81     1     0
# ℹ 290 more rows

case_when()

The ONS mortality data in NHSRdatasets has several categories

ons_mortality |>
  count(category_1)
# A tibble: 9 × 2
  category_1                                                                   n
  <chr>                                                                    <int>
1 All respiratory diseases (ICD-10 J00-J99) ICD-10                           536
2 Deaths where COVID-19 was mentioned on the death certificate (ICD-10 U0…    14
3 Females                                                                   3927
4 Males                                                                     3927
5 Persons                                                                   3927
6 Region                                                                    5350
7 Total deaths                                                              1056
8 average of same week over 5 years                                           14
9 <NA>                                                                        52

Winter (metrological winter, for clarity) runs from the start of December to the end of February. Can we class the total deaths into winter, and non-winter categories? In practice, I’d be tempted to use lubridate::quarter() with an appropriate starting offset for this kind of work if possible, especially if I was grouping into regular groups. Winter/non-winter isn’t regular though, so case_when() is the better option. Here, we’ll work in stages. First, as we’ll need the months, we’ll extract those using lubridate::month():

ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date))
# A tibble: 1,056 × 6
   category_1   category_2 counts date       week_no month
   <chr>        <chr>       <dbl> <date>       <int> <dbl>
 1 Total deaths all ages    12968 2010-01-08       1     1
 2 Total deaths all ages    12541 2010-01-15       2     1
 3 Total deaths all ages    11762 2010-01-22       3     1
 4 Total deaths all ages    11056 2010-01-29       4     1
 5 Total deaths all ages    10524 2010-02-05       5     2
 6 Total deaths all ages    10117 2010-02-12       6     2
 7 Total deaths all ages    10102 2010-02-19       7     2
 8 Total deaths all ages    10295 2010-02-26       8     2
 9 Total deaths all ages     9981 2010-03-05       9     3
10 Total deaths all ages     9792 2010-03-12      10     3
# ℹ 1,046 more rows

Our definition of winter is any case where that month is 1, 2, or 12. So use case_when() to catch Jan and Feb:

ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date)) |>
  mutate(thing = case_when(month < 3 ~ "Winter"))
# A tibble: 1,056 × 7
   category_1   category_2 counts date       week_no month thing 
   <chr>        <chr>       <dbl> <date>       <int> <dbl> <chr> 
 1 Total deaths all ages    12968 2010-01-08       1     1 Winter
 2 Total deaths all ages    12541 2010-01-15       2     1 Winter
 3 Total deaths all ages    11762 2010-01-22       3     1 Winter
 4 Total deaths all ages    11056 2010-01-29       4     1 Winter
 5 Total deaths all ages    10524 2010-02-05       5     2 Winter
 6 Total deaths all ages    10117 2010-02-12       6     2 Winter
 7 Total deaths all ages    10102 2010-02-19       7     2 Winter
 8 Total deaths all ages    10295 2010-02-26       8     2 Winter
 9 Total deaths all ages     9981 2010-03-05       9     3 <NA>  
10 Total deaths all ages     9792 2010-03-12      10     3 <NA>  
# ℹ 1,046 more rows

and then add an extra case to deal with December:

ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date)) |>
  mutate(season = case_when(month < 3 ~ "Winter",
                           month == 12 ~ "Winter"))
# A tibble: 1,056 × 7
   category_1   category_2 counts date       week_no month season
   <chr>        <chr>       <dbl> <date>       <int> <dbl> <chr> 
 1 Total deaths all ages    12968 2010-01-08       1     1 Winter
 2 Total deaths all ages    12541 2010-01-15       2     1 Winter
 3 Total deaths all ages    11762 2010-01-22       3     1 Winter
 4 Total deaths all ages    11056 2010-01-29       4     1 Winter
 5 Total deaths all ages    10524 2010-02-05       5     2 Winter
 6 Total deaths all ages    10117 2010-02-12       6     2 Winter
 7 Total deaths all ages    10102 2010-02-19       7     2 Winter
 8 Total deaths all ages    10295 2010-02-26       8     2 Winter
 9 Total deaths all ages     9981 2010-03-05       9     3 <NA>  
10 Total deaths all ages     9792 2010-03-12      10     3 <NA>  
# ℹ 1,046 more rows

And finally a default to catch all the non-Winter values:

seasonal_ons_mortality <- ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date)) |>
  mutate(season = case_when(month < 3 ~ "Winter",
                           month == 12 ~ "Winter",
                           TRUE ~ "Not winter"))
seasonal_ons_mortality
# A tibble: 1,056 × 7
   category_1   category_2 counts date       week_no month season    
   <chr>        <chr>       <dbl> <date>       <int> <dbl> <chr>     
 1 Total deaths all ages    12968 2010-01-08       1     1 Winter    
 2 Total deaths all ages    12541 2010-01-15       2     1 Winter    
 3 Total deaths all ages    11762 2010-01-22       3     1 Winter    
 4 Total deaths all ages    11056 2010-01-29       4     1 Winter    
 5 Total deaths all ages    10524 2010-02-05       5     2 Winter    
 6 Total deaths all ages    10117 2010-02-12       6     2 Winter    
 7 Total deaths all ages    10102 2010-02-19       7     2 Winter    
 8 Total deaths all ages    10295 2010-02-26       8     2 Winter    
 9 Total deaths all ages     9981 2010-03-05       9     3 Not winter
10 Total deaths all ages     9792 2010-03-12      10     3 Not winter
# ℹ 1,046 more rows
seasonal_ons_mortality |>
  count(season)
# A tibble: 2 × 2
  season         n
  <chr>      <int>
1 Not winter   793
2 Winter       263

case_match()

Rather than a logical test, case_match() matches against values rather than performing a logical test. This is excellent for the winter classification example:

ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date)) |>
  mutate(season = case_match(month, c(1, 2, 12) ~ "Winter",
                             .default = "Not winter"))
# A tibble: 1,056 × 7
   category_1   category_2 counts date       week_no month season    
   <chr>        <chr>       <dbl> <date>       <int> <dbl> <chr>     
 1 Total deaths all ages    12968 2010-01-08       1     1 Winter    
 2 Total deaths all ages    12541 2010-01-15       2     1 Winter    
 3 Total deaths all ages    11762 2010-01-22       3     1 Winter    
 4 Total deaths all ages    11056 2010-01-29       4     1 Winter    
 5 Total deaths all ages    10524 2010-02-05       5     2 Winter    
 6 Total deaths all ages    10117 2010-02-12       6     2 Winter    
 7 Total deaths all ages    10102 2010-02-19       7     2 Winter    
 8 Total deaths all ages    10295 2010-02-26       8     2 Winter    
 9 Total deaths all ages     9981 2010-03-05       9     3 Not winter
10 Total deaths all ages     9792 2010-03-12      10     3 Not winter
# ℹ 1,046 more rows

consecutive_id()

consecutive_id(c(1, 1, 1, 2, 1, 1, 2, 2)) # puts groups together by changes. This would have been very useful for my transcripts
[1] 1 1 1 2 3 3 4 4
repeats <- tibble(speaker = c("steve", "steve", "emma", "steve", "emma", "emma"),
       comment = letters[1:6]) # repeated data

repeats |>
  mutate(group = consecutive_id(speaker)) # add a group that increments with each change of speaker
# A tibble: 6 × 3
  speaker comment group
  <chr>   <chr>   <int>
1 steve   a           1
2 steve   b           1
3 emma    c           2
4 steve   d           3
5 emma    e           4
6 emma    f           4
repeats |>
  mutate(group = consecutive_id(speaker)) |>
  group_by(group) |>
  mutate(comment = paste(comment, collapse = " ")) |>
  distinct()
# A tibble: 4 × 3
# Groups:   group [4]
  speaker comment group
  <chr>   <chr>   <int>
1 steve   a b         1
2 emma    c           2
3 steve   d           3
4 emma    e f         4