Exploring data with dplyr
Session outline
This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to do three things with dplyr:
- build your fluency with basic dplyr functionality
- show some of the similarities of approach across the package, again to help you master the syntax
- learn some of the more advanced functions to harness the power available beneath the surface
You might also like some of the other dplyr-themed practical sessions:
A starting-point
You should have at least some previous experience working with dplyr before starting this training. Specifically, you should be familiar with at least:
Investigating data
Dplyr is usually used as a toolkit for manipulating tidy data. This session concentrates on some lesser-used functions in the dplyr package that are especially useful earlier in the analysis lifecycle when exploring new data: slice(), glimpse(), rename(), relocate(), case_when()/case_match(), and consecutive_id().
Package loading is as follows:
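# (sketch - the original loading chunk isn't shown on this page; these are the
# packages the session appears to rely on, assuming the datasets come from NHSRdatasets)
library(dplyr)         # the functions covered in this session
library(NHSRdatasets)  # ae_attendances, synthetic_news_data, LOS_model, ons_mortality
library(lubridate)     # month() in the case_when()/case_match() examples
library(purrr)         # map_df() in the glimpse() comparison
library(tibble)        # tibble() in the consecutive_id() example (also re-exported by dplyr)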
slice()
It’s definitely worth pointing you towards the slice() manual page - it’s clear and interesting.
At its simplest, slice() is an alternative to head():
ae_attendances |>
slice(1:6)
# A tibble: 6 × 6
period org_code type attendances breaches admissions
<date> <fct> <fct> <dbl> <dbl> <dbl>
1 2017-03-01 RF4 1 21289 2879 5060
2 2017-03-01 RF4 2 813 22 0
3 2017-03-01 RF4 other 2850 6 0
4 2017-03-01 R1H 1 30210 5902 6943
5 2017-03-01 R1H 2 807 11 0
6 2017-03-01 R1H other 11352 136 0
The number supplied to slice() returns that row - so slice(3) shows the third row. You can also provide a sequence or vector of rows:
ae_attendances |>
slice(1:3)
# A tibble: 3 × 6
period org_code type attendances breaches admissions
<date> <fct> <fct> <dbl> <dbl> <dbl>
1 2017-03-01 RF4 1 21289 2879 5060
2 2017-03-01 RF4 2 813 22 0
3 2017-03-01 RF4 other 2850 6 0
ae_attendances |>
slice(2, 5, 918)
# A tibble: 3 × 6
period org_code type attendances breaches admissions
<date> <fct> <fct> <dbl> <dbl> <dbl>
1 2017-03-01 RF4 2 813 22 0
2 2017-03-01 R1H 2 807 11 0
3 2017-01-01 RQ6 other 10253 0 0
More interestingly, you can group_by(), then slice(), to see the first row of each group. Here, the data is filtered, grouped by org_code, and then we slice the top row of each group.
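The exact code (and filter) isn't shown in the original, so here is a rough sketch that won't reproduce the output below exactly - it simply keeps four org_codes of interest before grouping:
# a sketch: filter to some org_codes, group, then take the first row per group
ae_attendances |>
  filter(org_code %in% c("R0A", "R1H", "RRK", "RW6")) |>
  group_by(org_code) |>
  slice(1)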
# A tibble: 4 × 6
# Groups: org_code [4]
period org_code type attendances breaches admissions
<date> <fct> <fct> <dbl> <dbl> <dbl>
1 2017-10-01 R0A 1 22392 3628 6836
2 2017-03-01 R1H 1 30210 5902 6943
3 2019-03-01 RRK 1 32017 10670 10850
4 2017-03-01 RW6 1 22930 4997 5328
(we’ll explore group_by() in the summarising data with dplyr session, so don’t worry if this is unfamiliar)
The slice_min() and slice_max() functions are also useful, giving you something like an integrated arrange() for looking at slices of the data.
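The original code isn't shown; a minimal sketch that would return a single row like the one below, ordering by breaches:
# a sketch: the single row with the largest number of breaches
ae_attendances |>
  slice_max(breaches, n = 1)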
# A tibble: 1 × 6
period org_code type attendances breaches admissions
<date> <fct> <fct> <dbl> <dbl> <dbl>
1 2019-01-01 RRK 1 31935 12502 11493
(note that you need to explicitly name the number of rows you want using n = ...)
slice_min() and slice_max() are particularly useful for groups.
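Again the original code isn't reproduced; a sketch that would give one row per org_code, ordered by breaches:
# a sketch: the worst month for breaches for each organisation
ae_attendances |>
  group_by(org_code) |>
  slice_max(breaches, n = 1)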
# A tibble: 140 × 6
# Groups: org_code [140]
period org_code type attendances breaches admissions
<date> <fct> <fct> <dbl> <dbl> <dbl>
1 2019-01-01 R0A 1 21980 6206 6985
2 2018-11-01 R1F 1 3810 1262 967
3 2017-01-01 R1H 1 28420 7426 6435
4 2017-01-01 R1K 1 11961 5897 4587
5 2019-01-01 RA2 1 6242 1071 2513
6 2017-01-01 RA3 1 4073 1471 1074
7 2016-06-01 RA4 1 3992 454 1057
8 2016-11-01 RA7 1 9625 2445 3064
9 2019-01-01 RA9 1 6056 2053 2557
10 2019-03-01 RAE 1 10698 3438 2854
# ℹ 130 more rows
slice_sample() gives a random sampling of the data:
ae_attendances |>
slice_sample(n=6)
# A tibble: 6 × 6
period org_code type attendances breaches admissions
<date> <fct> <fct> <dbl> <dbl> <dbl>
1 2017-06-01 RMC 1 8793 1491 2056
2 2017-01-01 Y03047 other 2037 0 0
3 2017-03-01 RYJ other 11156 190 0
4 2018-02-01 REF 1 5140 2232 2204
5 2018-09-01 RVR 2 427 1 12
6 2016-07-01 R1D 2 363 0 0
There are also slice_head() and slice_tail() functions, which work in the same way.
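For instance (a quick sketch, not from the original):
# a sketch: the last three rows of the data
ae_attendances |>
  slice_tail(n = 3)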
Final tip: slice() with a negative index can be used, rather like filter(), to remove the specified row(s):
# returns the last 5 rows only
ae_attendances |>
slice(-1:-12760)
# A tibble: 5 × 6
period org_code type attendances breaches admissions
<date> <fct> <fct> <dbl> <dbl> <dbl>
1 2018-04-01 Y02584 other 471 0 0
2 2018-04-01 RA3 1 3825 476 1016
3 2018-04-01 AXG other 2980 24 0
4 2018-04-01 NLX24 other 1538 0 0
5 2018-04-01 RA4 1 4388 82 1292
glimpse()
glimpse() is a data viewing function similar to print(), except that it transposes the data so that each column is displayed as a row. This is particularly useful for wide data with many columns, especially when you are interested in checking the class (date, character, etc.) of your columns.
synthetic_news_data |>
glimpse()
Rows: 1,000
Columns: 12
$ male <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
$ age <int> 68, 94, 85, 44, 77, 58, 25, 69, 91, 70, 87, 93, 61, 75, 97, 80, …
$ NEWS <int> 3, 1, 0, 0, 1, 1, 4, 0, 1, 1, 7, 2, 5, 1, 1, 3, 1, 5, 0, 2, 1, 2…
$ syst <int> 150, 145, 169, 154, 122, 146, 65, 116, 162, 132, 110, 166, 123, …
$ dias <int> 98, 67, 69, 106, 67, 106, 42, 56, 72, 96, 85, 90, 78, 80, 72, 81…
$ temp <dbl> 36.8, 35.0, 36.2, 36.9, 36.4, 35.3, 35.6, 37.2, 35.5, 35.3, 37.0…
$ pulse <int> 78, 62, 54, 80, 62, 73, 72, 90, 60, 67, 95, 87, 93, 65, 89, 145,…
$ resp <int> 26, 18, 18, 17, 20, 20, 12, 16, 16, 16, 24, 16, 26, 12, 16, 16, …
$ sat <int> 96, 96, 96, 96, 95, 98, 99, 97, 99, 97, 87, 95, 96, 96, 98, 99, …
$ sup <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ died <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Note the useful dimension information in the first two lines, and the class information in angle brackets. There are lots of similar ways of displaying the same information, but glimpse() is nice and concise both to write and to read. Compare a couple of base R (or purrr) near-equivalents, which are especially messy when the data is quite wide:
synthetic_news_data |>
purrr::map_df(class)
# A tibble: 1 × 12
male age NEWS syst dias temp pulse resp sat sup alert died
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 integer integer integer integ… inte… nume… inte… inte… inte… inte… inte… inte…
# equivalent in base R
lapply(synthetic_news_data, class)
$male
[1] "integer"
$age
[1] "integer"
$NEWS
[1] "integer"
$syst
[1] "integer"
$dias
[1] "integer"
$temp
[1] "numeric"
$pulse
[1] "integer"
$resp
[1] "integer"
$sat
[1] "integer"
$sup
[1] "integer"
$alert
[1] "integer"
$died
[1] "integer"
glimpse() plays nicely with the pipe, meaning that it’s potentially useful while you’re working on a complicated data transformation. If you have lots of stages piped together, you can insert glimpse() periodically to check that each stage of your transformation is working as expected:
synthetic_news_data |>
glimpse() |>
filter(age == 71 & male == 0) |>
glimpse() |>
mutate(pulse_pres = syst-dias) |>
glimpse()
Rows: 1,000
Columns: 12
$ male <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
$ age <int> 68, 94, 85, 44, 77, 58, 25, 69, 91, 70, 87, 93, 61, 75, 97, 80, …
$ NEWS <int> 3, 1, 0, 0, 1, 1, 4, 0, 1, 1, 7, 2, 5, 1, 1, 3, 1, 5, 0, 2, 1, 2…
$ syst <int> 150, 145, 169, 154, 122, 146, 65, 116, 162, 132, 110, 166, 123, …
$ dias <int> 98, 67, 69, 106, 67, 106, 42, 56, 72, 96, 85, 90, 78, 80, 72, 81…
$ temp <dbl> 36.8, 35.0, 36.2, 36.9, 36.4, 35.3, 35.6, 37.2, 35.5, 35.3, 37.0…
$ pulse <int> 78, 62, 54, 80, 62, 73, 72, 90, 60, 67, 95, 87, 93, 65, 89, 145,…
$ resp <int> 26, 18, 18, 17, 20, 20, 12, 16, 16, 16, 24, 16, 26, 12, 16, 16, …
$ sat <int> 96, 96, 96, 96, 95, 98, 99, 97, 99, 97, 87, 95, 96, 96, 98, 99, …
$ sup <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ died <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Rows: 11
Columns: 12
$ male <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ age <int> 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71
$ NEWS <int> 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 3
$ syst <int> 128, 128, 137, 128, 114, 106, 128, 114, 132, 137, 102
$ dias <int> 67, 81, 61, 81, 62, 56, 82, 66, 96, 78, 51
$ temp <dbl> 38.2, 37.9, 36.4, 36.2, 35.9, 36.2, 37.9, 38.2, 35.4, 37.0, 35.8
$ pulse <int> 99, 101, 102, 73, 87, 72, 110, 103, 67, 102, 54
$ resp <int> 17, 18, 17, 18, 16, 15, 17, 14, 17, 19, 18
$ sat <int> 98, 96, 95, 95, 95, 97, 97, 99, 99, 95, 100
$ sup <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ died <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Rows: 11
Columns: 13
$ male <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ age <int> 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71
$ NEWS <int> 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 3
$ syst <int> 128, 128, 137, 128, 114, 106, 128, 114, 132, 137, 102
$ dias <int> 67, 81, 61, 81, 62, 56, 82, 66, 96, 78, 51
$ temp <dbl> 38.2, 37.9, 36.4, 36.2, 35.9, 36.2, 37.9, 38.2, 35.4, 37.0,…
$ pulse <int> 99, 101, 102, 73, 87, 72, 110, 103, 67, 102, 54
$ resp <int> 17, 18, 17, 18, 16, 15, 17, 14, 17, 19, 18
$ sat <int> 98, 96, 95, 95, 95, 97, 97, 99, 99, 95, 100
$ sup <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
$ alert <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ died <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ pulse_pres <int> 61, 47, 76, 47, 52, 50, 46, 48, 36, 59, 51
rename() (and rename_with())
rename() renames columns. Like mutate(), the general syntax for rename() is new_name = old_name:
LOS_model |>
rename(age = Age)
# A tibble: 300 × 5
ID Organisation age LOS Death
<int> <ord> <int> <int> <int>
1 1 Trust1 55 2 0
2 2 Trust2 27 1 0
3 3 Trust3 93 12 0
4 4 Trust4 45 3 1
5 5 Trust5 70 11 0
6 6 Trust6 60 7 0
7 7 Trust7 25 4 0
8 8 Trust8 48 4 0
9 9 Trust9 51 7 1
10 10 Trust10 81 1 0
# ℹ 290 more rows
There’s not much more to say about the basic rename() function beyond that. However, rename_with() (which supersedes the older scoped variants such as rename_all()) provides some useful additional tools. For example, we can use tolower() - which converts strings to all lower-case - to rename multiple columns at once:
LOS_model |>
rename_with(tolower)
# A tibble: 300 × 5
id organisation age los death
<int> <ord> <int> <int> <int>
1 1 Trust1 55 2 0
2 2 Trust2 27 1 0
3 3 Trust3 93 12 0
4 4 Trust4 45 3 1
5 5 Trust5 70 11 0
6 6 Trust6 60 7 0
7 7 Trust7 25 4 0
8 8 Trust8 48 4 0
9 9 Trust9 51 7 1
10 10 Trust10 81 1 0
# ℹ 290 more rows
rename() works well with tidyselect operators too, as we discuss in that session.
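As a quick sketch of that idea (not from the original session), rename_with() takes a tidyselect specification as its .cols argument:
# a sketch: lower-case only the columns whose names start with "O"
LOS_model |>
  rename_with(tolower, starts_with("O"))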
relocate()
relocate() rearranges the order of columns:
LOS_model
# A tibble: 300 × 5
ID Organisation Age LOS Death
<int> <ord> <int> <int> <int>
1 1 Trust1 55 2 0
2 2 Trust2 27 1 0
3 3 Trust3 93 12 0
4 4 Trust4 45 3 1
5 5 Trust5 70 11 0
6 6 Trust6 60 7 0
7 7 Trust7 25 4 0
8 8 Trust8 48 4 0
9 9 Trust9 51 7 1
10 10 Trust10 81 1 0
# ℹ 290 more rows
relocate()’s default behaviour is to move the specified columns to the far left:
LOS_model |>
relocate(LOS)
# A tibble: 300 × 5
LOS ID Organisation Age Death
<int> <int> <ord> <int> <int>
1 2 1 Trust1 55 0
2 1 2 Trust2 27 0
3 12 3 Trust3 93 0
4 3 4 Trust4 45 1
5 11 5 Trust5 70 0
6 7 6 Trust6 60 0
7 4 7 Trust7 25 0
8 4 8 Trust8 48 0
9 7 9 Trust9 51 1
10 1 10 Trust10 81 0
# ℹ 290 more rows
You can specify .before and .after if you need finer control over where your column ends up:
LOS_model |>
relocate(LOS, .after=Death)
# A tibble: 300 × 5
ID Organisation Age Death LOS
<int> <ord> <int> <int> <int>
1 1 Trust1 55 0 2
2 2 Trust2 27 0 1
3 3 Trust3 93 0 12
4 4 Trust4 45 1 3
5 5 Trust5 70 0 11
6 6 Trust6 60 0 7
7 7 Trust7 25 0 4
8 8 Trust8 48 0 4
9 9 Trust9 51 1 7
10 10 Trust10 81 0 1
# ℹ 290 more rows
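The code for the next output isn't shown in the original; a sketch using .before that would produce it (LOS already sits immediately before Death, so the column order is unchanged):
LOS_model |>
  relocate(LOS, .before = Death)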
# A tibble: 300 × 5
ID Organisation Age LOS Death
<int> <ord> <int> <int> <int>
1 1 Trust1 55 2 0
2 2 Trust2 27 1 0
3 3 Trust3 93 12 0
4 4 Trust4 45 3 1
5 5 Trust5 70 11 0
6 6 Trust6 60 7 0
7 7 Trust7 25 4 0
8 8 Trust8 48 4 0
9 9 Trust9 51 7 1
10 10 Trust10 81 1 0
# ℹ 290 more rows
case_when()
The ONS mortality data in NHSRdatasets has several categories:
ons_mortality |>
count(category_1)
# A tibble: 9 × 2
category_1 n
<chr> <int>
1 All respiratory diseases (ICD-10 J00-J99) ICD-10 536
2 Deaths where COVID-19 was mentioned on the death certificate (ICD-10 U0… 14
3 Females 3927
4 Males 3927
5 Persons 3927
6 Region 5350
7 Total deaths 1056
8 average of same week over 5 years 14
9 <NA> 52
Winter (meteorological winter, for clarity) runs from the start of December to the end of February. Can we class the total deaths into winter and non-winter categories? In practice, I’d be tempted to use lubridate::quarter() with an appropriate starting offset for this kind of work if possible, especially if I was grouping into regular groups. Winter/non-winter isn’t regular, though, so case_when() is the better option. Here, we’ll work in stages. First, as we’ll need the months, we’ll extract those using lubridate::month().
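The code for this step isn't reproduced above, but based on the chunks that follow it will have been essentially:
ons_mortality |>
  filter(category_1 == "Total deaths") |>
  mutate(month = lubridate::month(date))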
# A tibble: 1,056 × 6
category_1 category_2 counts date week_no month
<chr> <chr> <dbl> <date> <int> <dbl>
1 Total deaths all ages 12968 2010-01-08 1 1
2 Total deaths all ages 12541 2010-01-15 2 1
3 Total deaths all ages 11762 2010-01-22 3 1
4 Total deaths all ages 11056 2010-01-29 4 1
5 Total deaths all ages 10524 2010-02-05 5 2
6 Total deaths all ages 10117 2010-02-12 6 2
7 Total deaths all ages 10102 2010-02-19 7 2
8 Total deaths all ages 10295 2010-02-26 8 2
9 Total deaths all ages 9981 2010-03-05 9 3
10 Total deaths all ages 9792 2010-03-12 10 3
# ℹ 1,046 more rows
Our definition of winter is any case where that month is 1, 2, or 12. So use case_when() to catch Jan and Feb:
ons_mortality |>
filter(category_1 == "Total deaths") |>
mutate(month = lubridate::month(date)) |>
mutate(thing = case_when(month < 3 ~ "Winter"))
# A tibble: 1,056 × 7
category_1 category_2 counts date week_no month thing
<chr> <chr> <dbl> <date> <int> <dbl> <chr>
1 Total deaths all ages 12968 2010-01-08 1 1 Winter
2 Total deaths all ages 12541 2010-01-15 2 1 Winter
3 Total deaths all ages 11762 2010-01-22 3 1 Winter
4 Total deaths all ages 11056 2010-01-29 4 1 Winter
5 Total deaths all ages 10524 2010-02-05 5 2 Winter
6 Total deaths all ages 10117 2010-02-12 6 2 Winter
7 Total deaths all ages 10102 2010-02-19 7 2 Winter
8 Total deaths all ages 10295 2010-02-26 8 2 Winter
9 Total deaths all ages 9981 2010-03-05 9 3 <NA>
10 Total deaths all ages 9792 2010-03-12 10 3 <NA>
# ℹ 1,046 more rows
and then add an extra case to deal with December:
ons_mortality |>
filter(category_1 == "Total deaths") |>
mutate(month = lubridate::month(date)) |>
mutate(season = case_when(month < 3 ~ "Winter",
month == 12 ~ "Winter"))
# A tibble: 1,056 × 7
category_1 category_2 counts date week_no month season
<chr> <chr> <dbl> <date> <int> <dbl> <chr>
1 Total deaths all ages 12968 2010-01-08 1 1 Winter
2 Total deaths all ages 12541 2010-01-15 2 1 Winter
3 Total deaths all ages 11762 2010-01-22 3 1 Winter
4 Total deaths all ages 11056 2010-01-29 4 1 Winter
5 Total deaths all ages 10524 2010-02-05 5 2 Winter
6 Total deaths all ages 10117 2010-02-12 6 2 Winter
7 Total deaths all ages 10102 2010-02-19 7 2 Winter
8 Total deaths all ages 10295 2010-02-26 8 2 Winter
9 Total deaths all ages 9981 2010-03-05 9 3 <NA>
10 Total deaths all ages 9792 2010-03-12 10 3 <NA>
# ℹ 1,046 more rows
And finally a default to catch all the non-Winter values:
seasonal_ons_mortality <- ons_mortality |>
filter(category_1 == "Total deaths") |>
mutate(month = lubridate::month(date)) |>
mutate(season = case_when(month < 3 ~ "Winter",
month == 12 ~ "Winter",
TRUE ~ "Not winter"))
seasonal_ons_mortality
# A tibble: 1,056 × 7
category_1 category_2 counts date week_no month season
<chr> <chr> <dbl> <date> <int> <dbl> <chr>
1 Total deaths all ages 12968 2010-01-08 1 1 Winter
2 Total deaths all ages 12541 2010-01-15 2 1 Winter
3 Total deaths all ages 11762 2010-01-22 3 1 Winter
4 Total deaths all ages 11056 2010-01-29 4 1 Winter
5 Total deaths all ages 10524 2010-02-05 5 2 Winter
6 Total deaths all ages 10117 2010-02-12 6 2 Winter
7 Total deaths all ages 10102 2010-02-19 7 2 Winter
8 Total deaths all ages 10295 2010-02-26 8 2 Winter
9 Total deaths all ages 9981 2010-03-05 9 3 Not winter
10 Total deaths all ages 9792 2010-03-12 10 3 Not winter
# ℹ 1,046 more rows
seasonal_ons_mortality |>
count(season)
# A tibble: 2 × 2
season n
<chr> <int>
1 Not winter 793
2 Winter 263
case_match()
Rather than evaluating a logical test, case_match() matches directly against values. This is excellent for the winter classification example:
ons_mortality |>
filter(category_1 == "Total deaths") |>
mutate(month = lubridate::month(date)) |>
mutate(season = case_match(month, c(1, 2, 12) ~ "Winter",
.default = "Not winter"))
# A tibble: 1,056 × 7
category_1 category_2 counts date week_no month season
<chr> <chr> <dbl> <date> <int> <dbl> <chr>
1 Total deaths all ages 12968 2010-01-08 1 1 Winter
2 Total deaths all ages 12541 2010-01-15 2 1 Winter
3 Total deaths all ages 11762 2010-01-22 3 1 Winter
4 Total deaths all ages 11056 2010-01-29 4 1 Winter
5 Total deaths all ages 10524 2010-02-05 5 2 Winter
6 Total deaths all ages 10117 2010-02-12 6 2 Winter
7 Total deaths all ages 10102 2010-02-19 7 2 Winter
8 Total deaths all ages 10295 2010-02-26 8 2 Winter
9 Total deaths all ages 9981 2010-03-05 9 3 Not winter
10 Total deaths all ages 9792 2010-03-12 10 3 Not winter
# ℹ 1,046 more rows
consecutive_id()
consecutive_id() generates a group id that increments each time its input changes - very useful for things like grouping consecutive comments by the same speaker in a transcript:
consecutive_id(c(1, 1, 1, 2, 1, 1, 2, 2))
[1] 1 1 1 2 3 3 4 4
repeats <- tibble(speaker = c("steve", "steve", "emma", "steve", "emma", "emma"),
                  comment = letters[1:6]) # some data with repeated speakers
repeats |>
mutate(group = consecutive_id(speaker)) # add a group that increments with each change of speaker
# A tibble: 6 × 3
speaker comment group
<chr> <chr> <int>
1 steve a 1
2 steve b 1
3 emma c 2
4 steve d 3
5 emma e 4
6 emma f 4
repeats |>
mutate(group = consecutive_id(speaker)) |>
group_by(group) |>
mutate(comment = paste(comment, collapse = " ")) |>
distinct()
# A tibble: 4 × 3
# Groups: group [4]
speaker comment group
<chr> <chr> <int>
1 steve a b 1
2 emma c 2
3 steve d 3
4 emma e f 4