Summarising data with dplyr

R
intermediate
Published

May 19, 2025

No feedback found for this session

Session outline

This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to do three things with dplyr:

  • show how to approach summarising data
  • explain how grouping works
  • show some simple summary functions

You might also like some of the other dplyr-themed practical sessions:

summarise()

Let’s do a little bit of package loading, and then start summarising things:

The description on the dplyr reference page for summarise is admirably clear:

summarise() creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.

ae_attendances |> 
  summarise(sum(attendances)) 
# A tibble: 1 × 1
  `sum(attendances)`
               <dbl>
1           72019402

summarise() is especially strong in concert with group_by():

ae_attendances |> 
  group_by(org_code) |>
  summarise(sum(attendances)) 
# A tibble: 274 × 2
   org_code `sum(attendances)`
   <fct>                 <dbl>
 1 8J094                 74303
 2 AAH                    6808
 3 AC008                 25808
 4 AD913                159739
 5 AF002                  7542
 6 AF003                  6046
 7 AJN                   27425
 8 ATQ02                127445
 9 AXG                   36486
10 AXT02                 88073
# ℹ 264 more rows

Two important things to note here:

  • group_by doesn’t change how the data looks - just how it behaves:
  • Each call to summarise() removes a layer of grouping
ae_attendances |> 
  group_by(org_code) |>
  summarise(sum(attendances)) |>
  summarise(sum(`sum(attendances)`)) # horrible default column names, which we'll fix in future
# A tibble: 1 × 1
  `sum(\`sum(attendances)\`)`
                        <dbl>
1                    72019402

And you can build simple formulae inside summarise():

ae_attendances |> 
  group_by(org_code) |>
  summarise(total = sum(attendances)) |> # rename columns like select 
  arrange(desc(total)) 
# A tibble: 274 × 2
   org_code   total
   <fct>      <dbl>
 1 R1H      1471581
 2 RW6      1067518
 3 R1K      1033563
 4 RQM       913440
 5 RF4       873909
 6 RJ2       873689
 7 RYJ       868344
 8 RJZ       856644
 9 RAL       806571
10 RJE       746107
# ℹ 264 more rows
ae_attendances |> 
  group_by(org_code) |>
  summarise(non_admissions = sum(attendances - admissions)) |>
  arrange(desc(non_admissions)) 
# A tibble: 274 × 2
   org_code non_admissions
   <fct>             <dbl>
 1 R1H             1211923
 2 RW6              856664
 3 R1K              837535
 4 RQM              778879
 5 RYJ              750941
 6 RJ2              712189
 7 RF4              705767
 8 RJZ              696891
 9 RYX              661468
10 RAL              660741
# ℹ 264 more rows

You can, in another recent change, group inside the summarise itself via .by():

ae_attendances |> 
  summarise(non_admissions = sum(attendances - admissions), 
            .by = org_code) |>
  arrange(desc(non_admissions)) 
# A tibble: 274 × 2
   org_code non_admissions
   <fct>             <dbl>
 1 R1H             1211923
 2 RW6              856664
 3 R1K              837535
 4 RQM              778879
 5 RYJ              750941
 6 RJ2              712189
 7 RF4              705767
 8 RJZ              696891
 9 RYX              661468
10 RAL              660741
# ℹ 264 more rows

This returns an ungrouped tibble - so important to know that it’s not a direct substitute for an ordinary group_by(). That’s especially confusing with summarise() owing to the “remove a single grouping layer each time” approach:

ae_attendances |> 
  summarise(att = sum(attendances), .by = c(org_code, type)) 
# A tibble: 428 × 3
   org_code type     att
   <fct>    <fct>  <dbl>
 1 RF4      1     697635
 2 RF4      2      25667
 3 RF4      other 150607
 4 R1H      1     998934
 5 R1H      2      27286
 6 R1H      other 445361
 7 AD913    other 159739
 8 RYX      other 661468
 9 RQM      1     623161
10 RQM      other 290279
# ℹ 418 more rows
ae_attendances |> 
  group_by(org_code, type) |>
  summarise(att = sum(attendances))
# A tibble: 428 × 3
# Groups:   org_code [274]
   org_code type     att
   <fct>    <fct>  <dbl>
 1 8J094    other  74303
 2 AAH      other   6808
 3 AC008    other  25808
 4 AD913    other 159739
 5 AF002    other   7542
 6 AF003    other   6046
 7 AJN      other  27425
 8 ATQ02    other 127445
 9 AXG      other  36486
10 AXT02    other  88073
# ℹ 418 more rows

reframe()

A recent change in dplyr 1.1.0 is that summarise() now will only return one row per group. A new function, reframe(), has been developed to produce multiple-row summaries. It works exactly like summarise() except, rather than removing one grouping layer per operation, it always returns an ungrouped tibble. The syntax is the same as summarise():

sum <- ae_attendances |>
  group_by(year = lubridate::floor_date(period, unit = "year")) |>
  summarise(
    year = lubridate::year(year),
    non_admissions = sum(attendances - admissions)
  )

ref <- ae_attendances |>
  group_by(year = lubridate::floor_date(period, unit = "year")) |>
  reframe(
    year = lubridate::year(year),
    non_admissions = sum(attendances - admissions)
  )

waldo::compare(sum, ref)
`class(old)`: "grouped_df" "tbl_df" "tbl" "data.frame"
`class(new)`:              "tbl_df" "tbl" "data.frame"

`attr(old, 'groups')` is an S3 object of class <tbl_df/tbl/data.frame>, a list
`attr(new, 'groups')` is absent

count() and tally()

count() counts unique values, and is equivalent to using group_by() and then summarise():

synthetic_news_data |>
  count(died) 
# A tibble: 2 × 2
   died     n
  <int> <int>
1     0   930
2     1    70

That’s roughly equivalent to:

synthetic_news_data |> 
  group_by(died) |> 
  summarise(n = n())  
# A tibble: 2 × 2
   died     n
  <int> <int>
1     0   930
2     1    70

tally() works similarly, except without the group_by():

synthetic_news_data |> 
  # group_by(died) |> 
  summarise(n = n())  
# A tibble: 1 × 1
      n
  <int>
1  1000
synthetic_news_data |>
  tally() 
# A tibble: 1 × 1
      n
  <int>
1  1000

A possible source of confusion is that adding a column name to tally() performs a weighted count of that column:

synthetic_news_data |>
  tally(age)
# A tibble: 1 × 1
      n
  <int>
1 69647

Roughly equivalent to:

synthetic_news_data |>
  pull(age) |>
  sum()
[1] 69647

count() has a useful sort option to arrange by group size:

synthetic_news_data |>
  count(syst, sort=T) 
# A tibble: 125 × 2
    syst     n
   <int> <int>
 1   130    28
 2   120    26
 3   139    26
 4   150    25
 5   135    24
 6   126    22
 7   134    22
 8   119    21
 9   123    21
10   112    20
# ℹ 115 more rows

Both count() and tally() have add_ variants, which work like mutate() in that they add a new column containing the count:

#add_ variants
synthetic_news_data |>
  add_count(syst, name="syst_BP_count") |>
  select(syst, last_col())
# A tibble: 1,000 × 2
    syst syst_BP_count
   <int>         <int>
 1   150            25
 2   145            16
 3   169             4
 4   154            12
 5   122            11
 6   146            16
 7    65             2
 8   116            12
 9   162            10
10   132             9
# ℹ 990 more rows

add_tally() gives a col with the same value in each row of a group:

synthetic_news_data |>
  group_by(died) |>
  add_tally() |>
  slice(1:3) 
# A tibble: 6 × 13
# Groups:   died [2]
   male   age  NEWS  syst  dias  temp pulse  resp   sat   sup alert  died     n
  <int> <int> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1     0    68     3   150    98  36.8    78    26    96     0     0     0   930
2     1    94     1   145    67  35      62    18    96     0     0     0   930
3     0    85     0   169    69  36.2    54    18    96     0     0     0   930
4     0    80     5   130    81  37     129    18    97     0     0     1    70
5     0    62     6   108    65  34.4    43    18    99     0     0     1    70
6     1    64     0   114    82  36.2    78    18    97     0     0     1    70

rowwise()

rowwise() forms groups per row. For example, we could take the average of the three columns in ae_attendances:

# to find the daily mean of attendances, breaches, and admissions
ae_attendances |> 
  rowwise() |>
  mutate(mean = mean(c(attendances, breaches, admissions))) 
# A tibble: 12,765 × 7
# Rowwise: 
   period     org_code type  attendances breaches admissions   mean
   <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl>  <dbl>
 1 2017-03-01 RF4      1           21289     2879       5060  9743.
 2 2017-03-01 RF4      2             813       22          0   278.
 3 2017-03-01 RF4      other        2850        6          0   952 
 4 2017-03-01 R1H      1           30210     5902       6943 14352.
 5 2017-03-01 R1H      2             807       11          0   273.
 6 2017-03-01 R1H      other       11352      136          0  3829.
 7 2017-03-01 AD913    other        4381        2          0  1461 
 8 2017-03-01 RYX      other       19562      258          0  6607.
 9 2017-03-01 RQM      1           17414     2030       3597  7680.
10 2017-03-01 RQM      other        7817       86          0  2634.
# ℹ 12,755 more rows

Compare and contrast with the results we obtain if we omit rowwise(), where the mean column contains the averages of the three columns overall, rather than for each date and organisation:

ae_attendances |> 
  mutate(mean=mean(c(attendances, breaches, admissions))) 
# A tibble: 12,765 × 7
   period     org_code type  attendances breaches admissions  mean
   <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl> <dbl>
 1 2017-03-01 RF4      1           21289     2879       5060 2450.
 2 2017-03-01 RF4      2             813       22          0 2450.
 3 2017-03-01 RF4      other        2850        6          0 2450.
 4 2017-03-01 R1H      1           30210     5902       6943 2450.
 5 2017-03-01 R1H      2             807       11          0 2450.
 6 2017-03-01 R1H      other       11352      136          0 2450.
 7 2017-03-01 AD913    other        4381        2          0 2450.
 8 2017-03-01 RYX      other       19562      258          0 2450.
 9 2017-03-01 RQM      1           17414     2030       3597 2450.
10 2017-03-01 RQM      other        7817       86          0 2450.
# ℹ 12,755 more rows

There’s also a c_across() function to select columns that looks really promising for rowwise() work, but bafflingly it is extremely slow here, taking 50x longer than the equivalent mutate(). This is a known issue - “particularly for long, narrow, data”. So this code is switched off and provided here for information only - although do feel free to try it out if you don’t mind a ten second wait.

ae_attendances |> 
  rowwise() |>
  mutate(mean = mean(c_across(4:6))) 

Summary functions

nth() / first() / last()

Get the nth, first, or last values. Very useful inside a summarise or similar when you want to be sure that you’re going to return a sensible result.

tibble(speaker = c("steve", "steve", "emma", "steve", "emma", "emma"),
       comment = letters[1:6])  |>
  mutate(group = consecutive_id(speaker)) |>
  group_by(group, speaker) |>
  summarise(comment = last(comment)) |>
  ungroup() |>
  select(-group)
# A tibble: 4 × 2
  speaker comment
  <chr>   <chr>  
1 steve   b      
2 emma    c      
3 steve   d      
4 emma    f