No feedback found for this session
Summarising data with dplyr
Session outline
This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to do three things with dplyr:
- show how to approach summarising data
- explain how grouping works
- show some simple summary functions
You might also like some of the other dplyr-themed practical sessions:
summarise()
Let’s do a little bit of package loading, and then start summarising things:
The description on the dplyr reference page for summarise
is admirably clear:
summarise()
creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.
summarise()
is especially strong in concert with group_by()
:
# A tibble: 274 × 2
org_code `sum(attendances)`
<fct> <dbl>
1 8J094 74303
2 AAH 6808
3 AC008 25808
4 AD913 159739
5 AF002 7542
6 AF003 6046
7 AJN 27425
8 ATQ02 127445
9 AXG 36486
10 AXT02 88073
# ℹ 264 more rows
Two important things to note here:
-
group_by
doesn’t change how the data looks - just how it behaves: - Each call to
summarise()
removes a layer of grouping
ae_attendances |>
group_by(org_code) |>
summarise(sum(attendances)) |>
summarise(sum(`sum(attendances)`)) # horrible default column names, which we'll fix in future
# A tibble: 1 × 1
`sum(\`sum(attendances)\`)`
<dbl>
1 72019402
And you can build simple formulae inside summarise()
:
ae_attendances |>
group_by(org_code) |>
summarise(total = sum(attendances)) |> # rename columns like select
arrange(desc(total))
# A tibble: 274 × 2
org_code total
<fct> <dbl>
1 R1H 1471581
2 RW6 1067518
3 R1K 1033563
4 RQM 913440
5 RF4 873909
6 RJ2 873689
7 RYJ 868344
8 RJZ 856644
9 RAL 806571
10 RJE 746107
# ℹ 264 more rows
ae_attendances |>
group_by(org_code) |>
summarise(non_admissions = sum(attendances - admissions)) |>
arrange(desc(non_admissions))
# A tibble: 274 × 2
org_code non_admissions
<fct> <dbl>
1 R1H 1211923
2 RW6 856664
3 R1K 837535
4 RQM 778879
5 RYJ 750941
6 RJ2 712189
7 RF4 705767
8 RJZ 696891
9 RYX 661468
10 RAL 660741
# ℹ 264 more rows
You can, in another recent change, group inside the summarise itself via .by()
:
ae_attendances |>
summarise(non_admissions = sum(attendances - admissions),
.by = org_code) |>
arrange(desc(non_admissions))
# A tibble: 274 × 2
org_code non_admissions
<fct> <dbl>
1 R1H 1211923
2 RW6 856664
3 R1K 837535
4 RQM 778879
5 RYJ 750941
6 RJ2 712189
7 RF4 705767
8 RJZ 696891
9 RYX 661468
10 RAL 660741
# ℹ 264 more rows
This returns an ungrouped tibble - so important to know that it’s not a direct substitute for an ordinary group_by()
. That’s especially confusing with summarise()
owing to the “remove a single grouping layer each time” approach:
# A tibble: 428 × 3
org_code type att
<fct> <fct> <dbl>
1 RF4 1 697635
2 RF4 2 25667
3 RF4 other 150607
4 R1H 1 998934
5 R1H 2 27286
6 R1H other 445361
7 AD913 other 159739
8 RYX other 661468
9 RQM 1 623161
10 RQM other 290279
# ℹ 418 more rows
# A tibble: 428 × 3
# Groups: org_code [274]
org_code type att
<fct> <fct> <dbl>
1 8J094 other 74303
2 AAH other 6808
3 AC008 other 25808
4 AD913 other 159739
5 AF002 other 7542
6 AF003 other 6046
7 AJN other 27425
8 ATQ02 other 127445
9 AXG other 36486
10 AXT02 other 88073
# ℹ 418 more rows
reframe()
A recent change in dplyr 1.1.0 is that summarise()
now will only return one row per group. A new function, reframe()
, has been developed to produce multiple-row summaries. It works exactly like summarise()
except, rather than removing one grouping layer per operation, it always returns an ungrouped tibble. The syntax is the same as summarise()
:
sum <- ae_attendances |>
group_by(year = lubridate::floor_date(period, unit = "year")) |>
summarise(
year = lubridate::year(year),
non_admissions = sum(attendances - admissions)
)
ref <- ae_attendances |>
group_by(year = lubridate::floor_date(period, unit = "year")) |>
reframe(
year = lubridate::year(year),
non_admissions = sum(attendances - admissions)
)
waldo::compare(sum, ref)
`class(old)`: "grouped_df" "tbl_df" "tbl" "data.frame"
`class(new)`: "tbl_df" "tbl" "data.frame"
`attr(old, 'groups')` is an S3 object of class <tbl_df/tbl/data.frame>, a list
`attr(new, 'groups')` is absent
count()
and tally()
count()
counts unique values, and is equivalent to using group_by()
and then summarise()
:
synthetic_news_data |>
count(died)
# A tibble: 2 × 2
died n
<int> <int>
1 0 930
2 1 70
That’s roughly equivalent to:
# A tibble: 2 × 2
died n
<int> <int>
1 0 930
2 1 70
tally()
works similarly, except without the group_by()
:
A possible source of confusion is that adding a column name to tally()
performs a weighted count of that column:
synthetic_news_data |>
tally(age)
# A tibble: 1 × 1
n
<int>
1 69647
Roughly equivalent to:
count()
has a useful sort option to arrange by group size:
synthetic_news_data |>
count(syst, sort=T)
# A tibble: 125 × 2
syst n
<int> <int>
1 130 28
2 120 26
3 139 26
4 150 25
5 135 24
6 126 22
7 134 22
8 119 21
9 123 21
10 112 20
# ℹ 115 more rows
Both count()
and tally()
have add_
variants, which work like mutate()
in that they add a new column containing the count:
#add_ variants
synthetic_news_data |>
add_count(syst, name="syst_BP_count") |>
select(syst, last_col())
# A tibble: 1,000 × 2
syst syst_BP_count
<int> <int>
1 150 25
2 145 16
3 169 4
4 154 12
5 122 11
6 146 16
7 65 2
8 116 12
9 162 10
10 132 9
# ℹ 990 more rows
add_tally()
gives a col with the same value in each row of a group:
# A tibble: 6 × 13
# Groups: died [2]
male age NEWS syst dias temp pulse resp sat sup alert died n
<int> <int> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 0 68 3 150 98 36.8 78 26 96 0 0 0 930
2 1 94 1 145 67 35 62 18 96 0 0 0 930
3 0 85 0 169 69 36.2 54 18 96 0 0 0 930
4 0 80 5 130 81 37 129 18 97 0 0 1 70
5 0 62 6 108 65 34.4 43 18 99 0 0 1 70
6 1 64 0 114 82 36.2 78 18 97 0 0 1 70
rowwise()
rowwise()
forms groups per row. For example, we could take the average of the three columns in ae_attendances:
# to find the daily mean of attendances, breaches, and admissions
ae_attendances |>
rowwise() |>
mutate(mean = mean(c(attendances, breaches, admissions)))
# A tibble: 12,765 × 7
# Rowwise:
period org_code type attendances breaches admissions mean
<date> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 2017-03-01 RF4 1 21289 2879 5060 9743.
2 2017-03-01 RF4 2 813 22 0 278.
3 2017-03-01 RF4 other 2850 6 0 952
4 2017-03-01 R1H 1 30210 5902 6943 14352.
5 2017-03-01 R1H 2 807 11 0 273.
6 2017-03-01 R1H other 11352 136 0 3829.
7 2017-03-01 AD913 other 4381 2 0 1461
8 2017-03-01 RYX other 19562 258 0 6607.
9 2017-03-01 RQM 1 17414 2030 3597 7680.
10 2017-03-01 RQM other 7817 86 0 2634.
# ℹ 12,755 more rows
Compare and contrast with the results we obtain if we omit rowwise()
, where the mean column contains the averages of the three columns overall, rather than for each date and organisation:
# A tibble: 12,765 × 7
period org_code type attendances breaches admissions mean
<date> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 2017-03-01 RF4 1 21289 2879 5060 2450.
2 2017-03-01 RF4 2 813 22 0 2450.
3 2017-03-01 RF4 other 2850 6 0 2450.
4 2017-03-01 R1H 1 30210 5902 6943 2450.
5 2017-03-01 R1H 2 807 11 0 2450.
6 2017-03-01 R1H other 11352 136 0 2450.
7 2017-03-01 AD913 other 4381 2 0 2450.
8 2017-03-01 RYX other 19562 258 0 2450.
9 2017-03-01 RQM 1 17414 2030 3597 2450.
10 2017-03-01 RQM other 7817 86 0 2450.
# ℹ 12,755 more rows
There’s also a c_across()
function to select columns that looks really promising for rowwise()
work, but bafflingly it is extremely slow here, taking 50x longer than the equivalent mutate()
. This is a known issue - “particularly for long, narrow, data”. So this code is switched off and provided here for information only - although do feel free to try it out if you don’t mind a ten second wait.
Summary functions
nth()
/ first()
/ last()
Get the nth, first, or last values. Very useful inside a summarise or similar when you want to be sure that you’re going to return a sensible result.
tibble(speaker = c("steve", "steve", "emma", "steve", "emma", "emma"),
comment = letters[1:6]) |>
mutate(group = consecutive_id(speaker)) |>
group_by(group, speaker) |>
summarise(comment = last(comment)) |>
ungroup() |>
select(-group)
# A tibble: 4 × 2
speaker comment
<chr> <chr>
1 steve b
2 emma c
3 steve d
4 emma f