## Previous attendees have said...
- 7 previous attendees have left feedback
- 86% would recommend this session to a colleague
- 100% said that this session was pitched correctly
:::{.callout-note}
### Three random comments from previous attendees
- Pitched perfectly to my level, not easy to find training at this level, really useful, extremely relevant to my work as well as highly enjoyable, so yes, cannot recommend the course highly enough, Brendan is a rockstar trainer.
- Great session, providing practical use cases of key concepts around dplyr summarise (and why you should use reframe instead), mutate, group by and so much more. This session provided me with clear understanding of the differences between these functions and when/how to use them. As always, highly recommended session.
- I ended up pretty confused in this session. I use count a lot and summarise never use it so was wanting tidy up my understanding, however the diversions to different code confused me as I'm a slow typer. I do now get group_by etc and I liked the add ons .
:::
## Session outline
This session is an 🌶🌶 intermediate practical designed for those with some R experience. The aim of this session is to do three things with dplyr:
- show how to approach summarising data
- explain how grouping works
- show some simple summary functions
You might also like some of the other dplyr-themed practical sessions:
This dataset is especially good for practising summarising, because it contains several plausible groups that we might like to investigate - especially the intersection between SIMDQuintiles (indicating different levels of deprivation) and the various date-based year/season/month groups that might be of interest for health improvement work:
summarise() creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.
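As a minimal sketch of that behaviour, here is summarise() with and without grouping. The `toy_deaths` tibble and its columns are made up for illustration and are not the session's dataset:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_deaths <- tibble(
  year   = c(2020, 2020, 2021, 2021),
  deaths = c(10, 20, 30, 40)
)

# No grouping variables: a single row summarising every observation
overall <- toy_deaths |>
  summarise(total_deaths = sum(deaths))

# One grouping variable: one row per year, with one column for the
# grouping variable and one column for each summary statistic
by_year <- toy_deaths |>
  group_by(year) |>
  summarise(
    total_deaths = sum(deaths),
    mean_deaths  = mean(deaths)
  )

overall
by_year
```

Naming each summary (`total_deaths = ...`) also avoids the awkward default column names that appear when the summary expression is left unnamed.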
group_by() doesn’t change how the data looks - just how it behaves:
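One way to see this, sketched with a made-up tibble (`toy` is hypothetical, not the session data): the grouped and ungrouped versions hold the same values, but a later verb such as mutate() works within each group once grouping is in place.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  year   = c(2020, 2020, 2021),
  deaths = c(10, 20, 30)
)

grouped <- toy |> group_by(year)

# Same shape and same values as before grouping
identical(dim(toy), dim(grouped))
identical(toy$deaths, grouped$deaths)

# The difference is in the grouping metadata...
group_vars(toy)
group_vars(grouped)

# ...and in how later verbs behave: sum(deaths) is now computed per year
ungrouped_share <- toy |> mutate(share = deaths / sum(deaths))
grouped_share   <- grouped |> mutate(share = deaths / sum(deaths))
```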
Each call to summarise() removes a layer of grouping
```r
SMR_SIMD |>
  group_by(Year) |>
  summarise(sum(NumberOfDeaths)) |>
  summarise(sum(`sum(NumberOfDeaths)`)) # horrible default column names, which we'll fix in future
```

```
# A tibble: 1 × 1
  `sum(\`sum(NumberOfDeaths)\`)`
                           <dbl>
1                         246708
```
This always returns an ungrouped tibble, so it’s important to know that it’s not a direct substitute for an ordinary group_by()…
## summarise() removes one layer of grouping
The most confusing aspect of summarise() is that, by default, it removes the last (innermost) layer of grouping each time it is called. Here, we start with our data grouped by Year and Quarter. After summarising, the data is grouped by Year only:
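The peeling-away of grouping layers can be sketched with a small made-up tibble (`toy` and its lower-case columns are hypothetical stand-ins for the Year and Quarter columns in the session data), checking group_vars() after each step:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  year    = c(2020, 2020, 2020, 2021, 2021),
  quarter = c(1, 1, 2, 1, 2),
  deaths  = c(5, 10, 20, 30, 40)
)

by_year_quarter <- toy |> group_by(year, quarter)
group_vars(by_year_quarter) # both year and quarter

# First summarise(): the last grouping variable (quarter) is dropped
step_one <- by_year_quarter |> summarise(deaths = sum(deaths))
group_vars(step_one)        # year only

# Second summarise(): the remaining layer (year) is dropped
step_two <- step_one |> summarise(deaths = sum(deaths))
group_vars(step_two)        # no grouping left
```

The first summarise() prints a message about the grouped output. Passing `.groups = "drop"` to summarise() returns an ungrouped result straight away if the remaining grouping is not wanted.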
group_vars() is just one of a group of functions in dplyr for understanding grouping metadata. Let’s start with some simple grouped data. We can discover the groups that we’re working with using groups():
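A short sketch of these helpers on a made-up grouped tibble (the data is hypothetical; the functions are all part of dplyr):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
grouped <- tibble(
  year    = c(2020, 2020, 2021, 2021, 2021),
  quarter = c(1, 2, 1, 1, 2),
  deaths  = c(5, 10, 20, 30, 40)
) |>
  group_by(year, quarter)

groups(grouped)     # grouping variables, as a list of symbols
group_vars(grouped) # the same information, as a character vector
n_groups(grouped)   # how many groups there are
group_size(grouped) # how many rows fall in each group
group_keys(grouped) # one row per group, showing its key values
```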
Since dplyr 1.1.0, summarise() is expected to return exactly one row per group; returning any other number of rows per group is deprecated and produces a warning. A new function, reframe(), was introduced to produce multiple-row summaries. It works like summarise() except that, rather than removing one grouping layer per operation, it always returns an ungrouped tibble. The syntax is the same as summarise():
```r
sum <- ae_attendances |>
  group_by(year = lubridate::floor_date(period, unit = "year")) |>
  summarise(
    year = lubridate::year(year),
    non_admissions = sum(attendances - admissions)
  )

ref <- ae_attendances |>
  group_by(year = lubridate::floor_date(period, unit = "year")) |>
  reframe(
    year = lubridate::year(year),
    non_admissions = sum(attendances - admissions)
  )

waldo::compare(sum, ref)
```

```
`class(old)`: "grouped_df" "tbl_df" "tbl" "data.frame"
`class(new)`:              "tbl_df" "tbl" "data.frame"

`attr(old, 'groups')` is an S3 object of class <tbl_df/tbl/data.frame>, a list
`attr(new, 'groups')` is absent
```
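To illustrate the multiple-row case that reframe() is intended for, here is a minimal sketch on a made-up tibble (`toy`, `org` and `bound` are hypothetical names). range() returns two values per group, so each organisation gets two rows:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  org         = c("A", "A", "A", "B", "B", "B"),
  attendances = c(10, 20, 30, 100, 200, 300)
)

# Two rows per group: the minimum and the maximum
ranges <- toy |>
  group_by(org) |>
  reframe(
    bound       = c("min", "max"),
    attendances = range(attendances)
  )

ranges

# reframe() hands back an ungrouped tibble, whatever the input grouping
is_grouped_df(ranges)
```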
Compare and contrast with the results we obtain if we omit rowwise(), where the mean column contains the averages of the three columns overall, rather than for each date and organisation:
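The difference can be sketched with a tiny made-up tibble (`toy` and the columns `x`, `y`, `z` are hypothetical stand-ins for the three columns in the session data):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  org = c("A", "B"),
  x   = c(1, 10),
  y   = c(2, 20),
  z   = c(3, 30)
)

# With rowwise(): mean() sees one row at a time
with_rowwise <- toy |>
  rowwise() |>
  mutate(mean = mean(c(x, y, z))) |>
  ungroup()

# Without rowwise(): mean() sees all six values at once,
# so every row receives the same overall average
without_rowwise <- toy |>
  mutate(mean = mean(c(x, y, z)))

with_rowwise$mean
without_rowwise$mean
```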
There’s also a c_across() function for selecting columns that looks really promising for rowwise() work, but bafflingly it is extremely slow here, taking 50x longer than the equivalent mutate(). This is a known issue - “particularly for long, narrow, data”. So this code is switched off and provided here for information only - although do feel free to try it out if you don’t mind a ten-second wait.
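The session's original code is not reproduced here, but the general shape of a c_across() call can be sketched on a made-up tibble (`toy`, `x`, `y` and `z` are hypothetical). On data this small the speed difference is not noticeable:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  org = c("A", "B"),
  x   = c(1, 10),
  y   = c(2, 20),
  z   = c(3, 30)
)

# c_across() collects the selected columns for the current row
slow_way <- toy |>
  rowwise() |>
  mutate(mean = mean(c_across(x:z))) |>
  ungroup()

# A plain vectorised mutate() that gives the same row means
fast_way <- toy |>
  mutate(mean = (x + y + z) / 3)

all.equal(slow_way$mean, fast_way$mean)
```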
Get the nth, first, or last values. Very useful inside a summarise() or similar when you want to be sure that you’re going to return a sensible result.
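A minimal sketch of first(), nth() and last() inside summarise(), using a made-up tibble (`toy` and its columns are hypothetical). The `order_by` argument is used so that the result does not depend on the row order of the data:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  org    = c("A", "A", "A", "B"),
  period = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-01-01")),
  value  = c(1, 2, 3, 10)
)

picked <- toy |>
  group_by(org) |>
  summarise(
    first_value  = first(value, order_by = period),
    second_value = nth(value, 2, order_by = period),
    last_value   = last(value, order_by = period)
  )

picked
```

Each of these returns a single value per group. Organisation B has only one row, so its `second_value` comes back as `NA` rather than causing an error, which is what makes these functions a safe fit for summarise().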