Scope of the possible with R

R overview

Published December 9, 2025

Welcome

  • this session is a non-technical overview designed for service leads

Session outline

  • Introducing R, and a bit of chat about the aims of this session
  • Practical demo - take some data, load, tidy, analyse, produce outputs
  • Strengths and weaknesses
    • obvious
    • less obvious
  • Alternatives
  • Skill development

Introducing R

  • free and open-source statistical programming language
  • multi-platform
  • large user base
  • prominent in health, industry, biosciences

Why this session?

  • R can be confusing
    • it’s code-based, and most of us don’t have much code experience
    • it’s used for some inherently complicated tasks
    • it’s a big product with lots of add-ons and oddities
  • But R is probably the best general-purpose toolbox we have for data work at present
    • big user base in health and social care
    • focus on health and care-like applications
    • not that hard to learn
    • extensible and flexible
    • capable of enterprise-y, fancy uses
  • yet there’s significant resistance to using R in parts of Scotland’s health and care sector

R demo

  • this is about showing what’s possible, and giving you a flavour of how R works
  • we won’t explain code in detail during this session
  • using live open data https://www.opendata.nhs.scot/dataset/weekly-accident-and-emergency-activity-and-waiting-times

Hello world!

There are lots of ways to run R. In this session, we’ll demonstrate using the RStudio Desktop IDE on Windows.

"Hello world!"

[1] "Hello world!"
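R evaluates whatever you type and prints the result, so the console doubles as a calculator. A couple of throwaway examples (nothing here depends on the demo data):

```r
# R prints the result of each expression straight away
1 + 2                 # simple arithmetic
mean(c(10, 20, 30))   # built-in functions work the same way
```

Running those lines prints `[1] 3` and `[1] 20` respectively.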

Packages

R is highly extensible using packages: collections of useful code

library(readr) # plus, nice to see R's folksy daft names for things in action

Create variables

url <- "https://www.opendata.nhs.scot/dataset/0d57311a-db66-4eaa-bd6d-cc622b6cbdfa/resource/a5f7ca94-c810-41b5-a7c9-25c18d43e5a4/download/weekly_ae_activity_20251123.csv"

Load some open data

That’s a link to data about weekly A&E activity. It’s large-ish (approximately 40,000 rows).

ae_activity <- read_csv(url)

One small bit of cheating: renaming

names(ae_activity) <- c("date", "country", "hb", "loc", "type", "attend", "n", "n_in_4", "n_4", "perc_4", "n_8", "perc_8", "n_12", "perc_12")

Preview

Preview of data
date country hb loc type attend n n_in_4 n_4 perc_4 n_8 perc_8 n_12 perc_12
20171210 S92000003 S08000022 H202H Type 1 All 600 570 30 95.0 9 1.5 0 0.0
20150802 S92000003 S08000022 H202H Type 1 All 646 616 30 95.4 1 0.2 0 0.0
20210207 S92000003 S08000015 A210H Type 1 All 430 367 63 85.3 19 4.4 6 1.4
20210207 S92000003 S08000030 T101H Type 1 New planned 26 26 0 100.0 0 0.0 0 0.0
20220626 S92000003 S08000031 G405H Type 1 Unplanned 1809 756 1053 41.8 302 16.7 71 3.9

Removing data

library(dplyr)

ae_activity <- ae_activity |>
    select(!c(country, contains("perc_")))

Preview of data
date hb loc type attend n n_in_4 n_4 n_8 n_12
20180916 S08000020 N411H Type 1 Unplanned 479 463 16 0 0
20250316 S08000032 L302H Type 1 New planned 1 1 0 0 0
20230219 S08000024 S314H Type 1 All 2246 914 1332 699 411
20250302 S08000020 N121H Type 1 All 394 352 42 0 0
20170423 S08000032 L106H Type 1 All 1351 1288 63 2 0

Tidying data

library(lubridate)

ae_activity <- ae_activity |>
    mutate(date = ymd(date))

Preview of data
date hb loc type attend n n_in_4 n_4 n_8 n_12
2018-04-08 S08000029 F704H Type 1 All 1226 1185 41 0 0
2023-01-08 S08000022 H212H Type 1 New planned 3 3 0 0 0
2022-12-11 S08000030 T101H Type 1 Unplanned 1219 926 293 32 3
2019-05-12 S08000026 Z102H Type 1 All 140 136 4 0 0
2017-08-06 S08000024 S314H Type 1 All 2240 2114 126 28 3

Subset data

We’ll take a selection of 5 health boards to keep things tidy:

boards_sample <- c("NHS Borders", "NHS Fife", "NHS Grampian", "NHS Highland", "NHS Lanarkshire")

Joining data

Those board codes (like S08000020) aren’t very easy to read. Luckily, we can add the proper “NHS Thing & Thing” board names from another data source.

boards <- read_csv("https://www.opendata.nhs.scot/dataset/9f942fdb-e59e-44f5-b534-d6e17229cc7b/resource/652ff726-e676-4a20-abda-435b98dd7bdc/download/hb14_hb19.csv")

NHS boards
HB HBName HBDateEnacted HBDateArchived Country
S08000015 NHS Ayrshire and Arran 20140401 NA S92000003
S08000027 NHS Tayside 20140401 20180201 S92000003
S08000030 NHS Tayside 20180202 NA S92000003
S08000017 NHS Dumfries and Galloway 20140401 NA S92000003
S08000028 NHS Western Isles 20140401 NA S92000003

We can do something very similar with the A&E locations:

locs <- read_csv("https://www.opendata.nhs.scot/dataset/a877470a-06a9-492f-b9e8-992f758894d0/resource/1a4e3f48-3d9b-4769-80e9-3ef6d27852fe/download/ae_hospital_site_list_09_09_2025.csv") |>
  select(2:4)

names(locs) <- c("loc_name", "loc", "postcode") # a bit of renaming to make the names easier

And we can add the postcodes:

locs <- locs |>
  rowwise() |>
  mutate(long = PostcodesioR::postcode_lookup(postcode)$longitude,
         lat = PostcodesioR::postcode_lookup(postcode)$latitude) # adding in some location information

We can then join our three datasets together to give us data with the NHS Board names, A&E names, and locations:

ae_activity_locs <- ae_activity |>
  filter(attend == "All") |>
  left_join(boards, by = join_by(hb == HB)) |>
  filter(HBName %in% boards_sample) |>
  select(date, HBName, loc, type, n, contains("n_")) |>
  left_join(locs, by = join_by(loc)) # the date column was already converted in the tidying step

Data with NHS board names, A&E names, and locations
date HBName loc type n n_in_4 n_4 n_8 n_12 loc_name postcode long lat
2023-09-17 NHS Lanarkshire L106H Type 1 1374 916 458 90 18 University Hospital Monklands ML6 0JS -3.999588 55.86588
2025-05-11 NHS Borders B120H Type 1 663 472 191 40 15 Borders General Hospital TD6 9BS -2.741945 55.59548
2018-05-20 NHS Highland H202H Type 1 692 663 29 1 0 Raigmore Hospital IV2 3UJ -4.192470 57.47381
2015-05-31 NHS Highland H212H Type 1 198 192 6 1 0 Belford Hospital PH336BS -5.104701 56.81941
2017-06-25 NHS Fife F704H Type 1 1308 1245 63 2 0 Victoria Hospital (NHS Fife) KY2 5AH -3.160138 56.12511

Basic plots

library(ggplot2)

ae_activity_locs |>
  filter(HBName == "NHS Highland") |>
  ggplot() +
  geom_line(aes(x = date, y = n, colour = loc_name))

Looking across different measures

library(tidyr)

ae_activity_locs |>
  filter(loc_name == "Raigmore Hospital") |>
  select(date, n_in_4:n_12) |>
  pivot_longer(-date) |>
  group_by(date = floor_date(date, "month"), name) |>
  summarise(value = mean(value)) |>
  ggplot(aes(x = date, y = value, colour = name)) +
  geom_line() +
  geom_smooth(se = FALSE)

Making that re-usable

graphmo <- function(hbname = "NHS Highland"){
  
  ae_activity_locs |>
    filter(HBName %in% hbname) |>
    ggplot() +
    geom_line(aes(x = date, y = n, colour = loc_name)) +
    theme(legend.position = "bottom") +
    xlab("Date") +
    labs(colour = "") # hide the label

}

graphmo(c("NHS Grampian", "NHS Fife"))

A rubbish map

We’ve got latitude and longitude information for our A&E sites, which means we can plot them on a map:

ae_activity_locs |>
  ggplot(aes(x = long, y = lat, label = loc_name, colour = HBName)) +
  geom_point() +
  theme_void() +
  theme(legend.position = "none")

That’s not the most useful map I’ve ever seen. Luckily, there’s a package to help us:

Add to a map

ae_activity_locs |>
    leaflet::leaflet() |>
    leaflet::addTiles() |>
    leaflet::addMarkers(~long, ~lat, label = ~loc_name)

Simple map

Then make that map more useful

ae_activity_locs |>
    group_by(loc_name) |>
    summarise(n = sum(n), long = min(long), lat = min(lat)) |>
    mutate(label = paste(loc_name, "recorded", n, "attendances in total")) |>
    leaflet::leaflet() |>
    leaflet::addTiles() |>
    leaflet::addMarkers(~long, ~lat, label = ~label)

More useful map

Strengths

R offers enormous scope and flexibility, largely because of two features. First, R is built around packages: you’re encouraged to outsource specialist functions to your R installation in a repeatable, standard way, and there’s a package for almost everything (over 20,000 at present). Second, R encourages reproducible analytics: you write your script once, then run it many times as your data changes, producing standardised outputs by design.
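To make the package idea concrete, here’s the pattern almost every R script follows (dplyr is just a familiar example package, not something this slide depends on):

```r
# A typical script header. Packages are installed once per machine
# (the install line is usually commented out after the first run),
# then loaded at the top of every script:

# install.packages("dplyr")   # one-off download from CRAN
library(dplyr)                # load the package for this session

# Because the whole analysis lives in the script, re-running it against
# next week's data regenerates every table and chart with no manual steps.
```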

Together, that design makes R a force-multiplier for fancier data work: use packages to replicate your existing work in a reproducible way, then use the time saved in your routine reporting to improve and extend the work. There are other features of code-based analytics which make collaborating and developing more complex projects typically much smoother than they would be in non-code tools like Excel.

Weaknesses

  • it’s code, and it takes some time (months to years) to achieve real fluency
  • potentially harder to learn than some competitor languages and tools (Power BI, Python)
  • very patchy expertise across H+SC Scotland
  • complex IG landscape
  • messy skills development journey