R beginner’s club 2025-01-09

Authors

Affiliation

KIND Network members

Brendan Clarke

NHS Education for Scotland

Published

July 10, 2025

What’s this session about

This session is about some top tips for working between Excel and R, mainly concentrating on using R for tidying up messy Excel data. We’ll start with a lot of package loading:

# standard toolkit
library(dplyr)
library(tools)

# tidying data
library(tidyr)
library(readr)
library(lubridate)
library(janitor)
library(skimr)

# Excel reading and writing
library(readxl)
library(writexl)

Reading Excel files

R is exceptionally useful as a way of automating tricky Excel tasks. Let’s say we have some slightly broken Excel data. We can read that into R in a couple of different ways. If you’re using Rstudio, the path of least resistance is to find your data in the Files pane, and click on it:

That brings up an interactive data import tool which, joy of joys, writes the import code for you:

You can then copy that code into your script so that your data loading is repeatable - and you can tweak it to suit:

raw <- read_xlsx(here::here("rbc/data/raw.xlsx")) # little bit of here to make the paths behave
head(raw)

# A tibble: 6 × 15
  PracticeCode GPPracticeName         PracticeListSize AddressLine1 AddressLine2
         <dbl> <chr>                  <chr>            <chr>        <chr>       
1        55003 Thurso & Halkirk Medi… 6143             Thurso & Ha… 69 Princes …
2        55041 Nairn Healthcare Group 15756            Nairn Healt… Nairn Town …
3        55075 Dunbeath Surgery       685              Dunbeath Su… Achorn Road 
4        55080 Canisbay & Castletown… 2750             Canisbay & … Murrayfield 
5        55145 The Pearson Practice   2647             The Pearson… Wick Medica…
6        55183 Armadale Medical Prac… 925              Armadale Me… Armadale    
# ℹ 10 more variables: AddressLine3 <chr>, AddressLine4 <chr>, Postcode <chr>,
#   TelephoneNumber <chr>, PracticeType <chr>, `date collected` <chr>,
#   HB <chr>, HSCP <chr>, DataZone <chr>, GPCluster <chr>

(this is a subset of Public Health Scotland’s open data on GP practices. I should emphasise that all the data problems in this sample have been deliberately introduced by me demonstration purposes - the original dataset is lovely and clean)

It’s also worth saying that Excel files are incredibly complicated. In this session, we’ll keep things simple, but there are lots of powerful R tools out there if you need more control (especially the openxlsx2 package).

Initial exploration: some tools for understanding what you’ve loaded

The base-R way, beyond eyeballing the data with View(), is to use summary or str:

raw |>
  summary()

  PracticeCode   GPPracticeName     PracticeListSize   AddressLine1      
 Min.   :55003   Length:64          Length:64          Length:64         
 1st Qu.:55341   Class :character   Class :character   Class :character  
 Median :55589   Mode  :character   Mode  :character   Mode  :character  
 Mean   :55606                                                           
 3rd Qu.:55846                                                           
 Max.   :56576                                                           
 AddressLine2       AddressLine3       AddressLine4         Postcode        
 Length:64          Length:64          Length:64          Length:64         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 TelephoneNumber    PracticeType       date collected          HB           
 Length:64          Length:64          Length:64          Length:64         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
     HSCP             DataZone          GPCluster        
 Length:64          Length:64          Length:64         
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character

raw |>
  str()

tibble [64 × 15] (S3: tbl_df/tbl/data.frame)
 $ PracticeCode    : num [1:64] 55003 55041 55075 55080 55145 ...
 $ GPPracticeName  : chr [1:64] "Thurso & Halkirk Medical Practice" "Nairn Healthcare Group" "Dunbeath Surgery" "Canisbay & Castletown Joint Medical Practice" ...
 $ PracticeListSize: chr [1:64] "6143" "15756" "685" "2750" ...
 $ AddressLine1    : chr [1:64] "Thurso & Halkirk Medical Practice" "Nairn Healthcare Group" "Dunbeath Surgery" "Canisbay & Castletown Joint Medical Practice" ...
 $ AddressLine2    : chr [1:64] "69 Princes Street" "Nairn Town and County Hospital" "Achorn Road" "Murrayfield" ...
 $ AddressLine3    : chr [1:64] "Thurso" "Cawdor Road" "Dunbeath" "Castletown" ...
 $ AddressLine4    : chr [1:64] "Caithness" "Nairn" "Caithness" NA ...
 $ Postcode        : chr [1:64] "KW14 7DH" "IV12 5EE" "KW6 6EZ" "KW14 8TY" ...
 $ TelephoneNumber : chr [1:64] "01847   895   986" "01667   452   096" "01593   731   205" "01847   821   205" ...
 $ PracticeType    : chr [1:64] "17J" "17J" "17J" "17J" ...
 $ date collected  : chr [1:64] "45654" "45659" "45646" "45655" ...
 $ HB              : chr [1:64] "S08000022" "S08000022" "S08000022" "S08000022" ...
 $ HSCP            : chr [1:64] "S37000016" "S37000016" "S37000016" "S37000016" ...
 $ DataZone        : chr [1:64] "S01010805" "S01010556" "S01010772" NA ...
 $ GPCluster       : chr [1:64] "NHSH (North) Caithness" "NHSH (North) Badenoch & Nairn" "NHSH (North) West Sutherland" "NHSH (North) Caithness" ...

Personally, I find summary() actively unhelpful, but str() does have it’s place, especially as a quick check of the data types you have in each column. You might also like to try dplyr::glimpse() for a tweaked version of str() that you can use in piped code to understand what’s going on:

raw |>
  glimpse() |>
  count(GPCluster) |>
  glimpse()

Rows: 64
Columns: 15
$ PracticeCode     <dbl> 55003, 55041, 55075, 55080, 55145, 55183, 55201, 5522…
$ GPPracticeName   <chr> "Thurso & Halkirk Medical Practice", "Nairn Healthcar…
$ PracticeListSize <chr> "6143", "15756", "685", "2750", "2647", "925", "2858"…
$ AddressLine1     <chr> "Thurso & Halkirk Medical Practice", "Nairn Healthcar…
$ AddressLine2     <chr> "69 Princes Street", "Nairn Town and County Hospital"…
$ AddressLine3     <chr> "Thurso", "Cawdor Road", "Dunbeath", "Castletown", "M…
$ AddressLine4     <chr> "Caithness", "Nairn", "Caithness", NA, "Wick", "Caith…
$ Postcode         <chr> "KW14 7DH", "IV12 5EE", "KW6 6EZ", "KW14 8TY", "KW1 5…
$ TelephoneNumber  <chr> "01847   895   986", "01667   452   096", "01593   73…
$ PracticeType     <chr> "17J", "17J", "17J", "17J", "17J", "2C", "17J", "17J"…
$ `date collected` <chr> "45654", "45659", "45646", "45655", "45664", "45661",…
$ HB               <chr> "S08000022", "S08000022", "S08000022", "S08000022", "…
$ HSCP             <chr> "S37000016", "S37000016", "S37000016", "S37000016", "…
$ DataZone         <chr> "S01010805", "S01010556", "S01010772", NA, "S01010781…
$ GPCluster        <chr> "NHSH (North) Caithness", "NHSH (North) Badenoch & Na…
Rows: 16
Columns: 2
$ GPCluster <chr> "NHS highland(North) West Sutherland", "NHSH (Norf) Skye & L…
$ n         <int> 1, 1, 6, 8, 3, 4, 5, 3, 4, 8, 4, 6, 4, 5, 1, 1

A much more capable alternative for inspecting your data is the skimr package:

raw |>
  skimr::skim()

Data summary
Name	raw
Number of rows	64
Number of columns	15
_______________________
Column type frequency:
character	14
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
GPPracticeName	0	1.00	14	51	62
PracticeListSize	1	0.98	3	5	61
AddressLine1	0	1.00	14	51	62
AddressLine2	1	0.98	6	40	56
AddressLine3	2	0.97	5	26	47
AddressLine4	21	0.67	4	17	31
Postcode	0	1.00	6	11	58
TelephoneNumber	1	0.98	17	18	61
PracticeType	0	1.00	2	3	3
date collected	0	1.00	5	13	22
HB	2	0.97	9	9	1
HSCP	2	0.97	9	9	1
DataZone	2	0.97	9	9	51
GPCluster	1	0.98	17	35	15

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
PracticeCode	0	1	55606.09	348.92	55003	55341	55589	55845.75	56576	▅▇▇▂▁

That treats different data types differently, so if we coerce one of our columns to numeric, and skim again, we’ll get different output:

raw |>
  mutate(PracticeListSize = as.numeric(PracticeListSize)) |>
  skimr::skim()

Data summary
Name	mutate(raw, PracticeListS…
Number of rows	64
Number of columns	15
_______________________
Column type frequency:
character	13
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
GPPracticeName	0	1.00	14	51	62
AddressLine1	0	1.00	14	51	62
AddressLine2	1	0.98	6	40	56
AddressLine3	2	0.97	5	26	47
AddressLine4	21	0.67	4	17	31
Postcode	0	1.00	6	11	58
TelephoneNumber	1	0.98	17	18	61
PracticeType	0	1.00	2	3	3
date collected	0	1.00	5	13	22
HB	2	0.97	9	9	1
HSCP	2	0.97	9	9	1
DataZone	2	0.97	9	9	51
GPCluster	1	0.98	17	35	15

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
PracticeCode	0	1.00	55606.09	348.92	55003	55341	55589	55845.75	56576	▅▇▇▂▁
PracticeListSize	4	0.94	4016.20	3516.20	244	1486	2806	5704.75	15756	▇▃▁▁▁

And you can wrangle the output of skim in a few different ways:

raw |>
  skimr::skim() |>
  select(n_missing) # just look at one metric

# A tibble: 15 × 1
   n_missing
       <int>
 1         0
 2         1
 3         0
 4         1
 5         2
 6        21
 7         0
 8         1
 9         0
10         0
11         2
12         2
13         2
14         1
15         0

raw |>
  skimr::skim() |>
  yank("numeric") # an alternative way of looking at one metric

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
PracticeCode	0	1	55606.09	348.92	55003	55341	55589	55845.75	56576	▅▇▇▂▁

raw |>
  mutate(PracticeListSize = as.numeric(PracticeListSize),
         PracticeCode = as.character(PracticeCode)) |>
  group_by(PracticeType) |>
  skim()  # look at groups and columns together

Data summary
Name	group_by(…)
Number of rows	64
Number of columns	15
_______________________
Column type frequency:
character	13
numeric	1
________________________
Group variables	PracticeType

Variable type: character

skim_variable	PracticeType	n_missing	complete_rate	min	max	n_unique
PracticeCode	17C	0	1.00	5	5	1
PracticeCode	17J	0	1.00	5	5	46
PracticeCode	2C	0	1.00	5	5	15
GPPracticeName	17C	0	1.00	29	29	1
GPPracticeName	17J	0	1.00	14	44	46
GPPracticeName	2C	0	1.00	18	51	15
AddressLine1	17C	0	1.00	29	29	1
AddressLine1	17J	0	1.00	14	44	46
AddressLine1	2C	0	1.00	18	51	15
AddressLine2	17C	0	1.00	11	11	1
AddressLine2	17J	1	0.98	6	40	41
AddressLine2	2C	0	1.00	7	23	15
AddressLine3	17C	0	1.00	12	12	1
AddressLine3	17J	1	0.98	5	26	34
AddressLine3	2C	1	0.93	6	20	13
AddressLine4	17C	1	0.00	NA	NA	0
AddressLine4	17J	18	0.62	4	16	22
AddressLine4	2C	2	0.87	6	17	12
Postcode	17C	0	1.00	8	8	1
Postcode	17J	0	1.00	6	11	42
Postcode	2C	0	1.00	7	9	15
TelephoneNumber	17C	0	1.00	17	17	1
TelephoneNumber	17J	1	0.98	17	18	45
TelephoneNumber	2C	0	1.00	17	17	15
date collected	17C	0	1.00	5	5	1
date collected	17J	0	1.00	5	13	19
date collected	2C	0	1.00	5	10	13
HB	17C	0	1.00	9	9	1
HB	17J	2	0.96	9	9	1
HB	2C	0	1.00	9	9	1
HSCP	17C	0	1.00	9	9	1
HSCP	17J	2	0.96	9	9	1
HSCP	2C	0	1.00	9	9	1
DataZone	17C	0	1.00	9	9	1
DataZone	17J	1	0.98	9	9	38
DataZone	2C	1	0.93	9	9	12
GPCluster	17C	0	1.00	21	21	1
GPCluster	17J	0	1.00	17	32	13
GPCluster	2C	1	0.93	17	35	9

Variable type: numeric

skim_variable	PracticeType	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
PracticeListSize	17C	0	1.00	928.00	NA	928	928.0	928.0	928.0	928	▁▁▇▁▁
PracticeListSize	17J	3	0.94	4419.67	3153.60	475	2316.0	3458.0	6131.0	15756	▇▅▁▁▁
PracticeListSize	2C	1	0.93	2939.93	4461.68	244	551.5	1249.5	2364.5	13568	▇▁▁▁▁

raw |>
  skimr::skim_tee() # like glimpse, produce skim output but return the original data to allow use inside with piped code

── Data Summary ────────────────────────
                           Values
Name                       data  
Number of rows             64    
Number of columns          15    
_______________________          
Column type frequency:           
  character                14    
  numeric                  1     
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────────────────────────
   skim_variable    n_missing complete_rate min max empty n_unique whitespace
 1 GPPracticeName           0         1      14  51     0       62          0
 2 PracticeListSize         1         0.984   3   5     0       61          0
 3 AddressLine1             0         1      14  51     0       62          0
 4 AddressLine2             1         0.984   6  40     0       56          0
 5 AddressLine3             2         0.969   5  26     0       47          0
 6 AddressLine4            21         0.672   4  17     0       31          0
 7 Postcode                 0         1       6  11     0       58          0
 8 TelephoneNumber          1         0.984  17  18     0       61          0
 9 PracticeType             0         1       2   3     0        3          0
10 date collected           0         1       5  13     0       22          0
11 HB                       2         0.969   9   9     0        1          0
12 HSCP                     2         0.969   9   9     0        1          0
13 DataZone                 2         0.969   9   9     0       51          0
14 GPCluster                1         0.984  17  35     0       15          0

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate   mean   sd    p0   p25   p50    p75
1 PracticeCode          0             1 55606. 349. 55003 55341 55589 55846.
   p100 hist 
1 56576 ▅▇▇▂▁

raw |>
  skimr::skim() |>
  partition() # split different data types into separate list items

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
GPPracticeName	0	1.00	14	51	62
PracticeListSize	1	0.98	3	5	61
AddressLine1	0	1.00	14	51	62
AddressLine2	1	0.98	6	40	56
AddressLine3	2	0.97	5	26	47
AddressLine4	21	0.67	4	17	31
Postcode	0	1.00	6	11	58
TelephoneNumber	1	0.98	17	18	61
PracticeType	0	1.00	2	3	3
date collected	0	1.00	5	13	22
HB	2	0.97	9	9	1
HSCP	2	0.97	9	9	1
DataZone	2	0.97	9	9	51
GPCluster	1	0.98	17	35	15

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
PracticeCode	0	1	55606.09	348.92	55003	55341	55589	55845.75	56576	▅▇▇▂▁

Fixing data types

Our dataset initially doesn’t look too horrible. But we can generate some very confusing results from it:

min(raw$PracticeListSize, na.rm = T)

[1] "10000"

max(raw$PracticeListSize, na.rm = T) # what!

[1] "990"

What’s happening here is that R is putting our results into alphabetical order, rather than numerical. That’s because these numbers are stored as if they were text owing to a couple of anomalous data points:

raw$PracticeListSize[grep("[^0-9]", raw$PracticeListSize)]

[1] "187*"  "9738a" "5595c"

We could use as.numeric to remove everything other than proper numbers:

raw |>
  mutate(PracticeListSize = as.numeric(PracticeListSize))

# A tibble: 64 × 15
   PracticeCode GPPracticeName        PracticeListSize AddressLine1 AddressLine2
          <dbl> <chr>                            <dbl> <chr>        <chr>       
 1        55003 Thurso & Halkirk Med…             6143 Thurso & Ha… 69 Princes …
 2        55041 Nairn Healthcare Gro…            15756 Nairn Healt… Nairn Town …
 3        55075 Dunbeath Surgery                   685 Dunbeath Su… Achorn Road 
 4        55080 Canisbay & Castletow…             2750 Canisbay & … Murrayfield 
 5        55145 The Pearson Practice              2647 The Pearson… Wick Medica…
 6        55183 Armadale Medical Pra…              925 Armadale Me… Armadale    
 7        55201 Dornoch Medical Prac…             2858 Dornoch Med… Shore Road  
 8        55220 Golspie Medical Prac…             1993 Golspie Med… Golspie Hea…
 9        55249 Lairg Health Centre               1222 Lairg Healt… Main Street 
10        55253 Assynt Medical Pract…              952 Assynt Medi… 6 Main Stre…
# ℹ 54 more rows
# ℹ 10 more variables: AddressLine3 <chr>, AddressLine4 <chr>, Postcode <chr>,
#   TelephoneNumber <chr>, PracticeType <chr>, `date collected` <chr>,
#   HB <chr>, HSCP <chr>, DataZone <chr>, GPCluster <chr>

Or, rather better, we could use readr::parse_number which will do a bit of inference to try and preserve anything that looks like a number:

raw |>
  mutate(PracticeListSize = parse_number(PracticeListSize))

# A tibble: 64 × 15
   PracticeCode GPPracticeName        PracticeListSize AddressLine1 AddressLine2
          <dbl> <chr>                            <dbl> <chr>        <chr>       
 1        55003 Thurso & Halkirk Med…             6143 Thurso & Ha… 69 Princes …
 2        55041 Nairn Healthcare Gro…            15756 Nairn Healt… Nairn Town …
 3        55075 Dunbeath Surgery                   685 Dunbeath Su… Achorn Road 
 4        55080 Canisbay & Castletow…             2750 Canisbay & … Murrayfield 
 5        55145 The Pearson Practice              2647 The Pearson… Wick Medica…
 6        55183 Armadale Medical Pra…              925 Armadale Me… Armadale    
 7        55201 Dornoch Medical Prac…             2858 Dornoch Med… Shore Road  
 8        55220 Golspie Medical Prac…             1993 Golspie Med… Golspie Hea…
 9        55249 Lairg Health Centre               1222 Lairg Healt… Main Street 
10        55253 Assynt Medical Pract…              952 Assynt Medi… 6 Main Stre…
# ℹ 54 more rows
# ℹ 10 more variables: AddressLine3 <chr>, AddressLine4 <chr>, Postcode <chr>,
#   TelephoneNumber <chr>, PracticeType <chr>, `date collected` <chr>,
#   HB <chr>, HSCP <chr>, DataZone <chr>, GPCluster <chr>

We also have some horrible dates. First, we’ve got a few completely non-standard dates:

raw |>
  dplyr::filter(stringr::str_detect(`date collected`, "eve")) |>
  select(PracticeCode, `date collected`)

# A tibble: 1 × 2
  PracticeCode `date collected`
         <dbl> <chr>           
1        55291 Christmas eve

Short of manually replacing them, we can’t do much about those. But we can fix the other dates, which are in Excel 1900 format:

raw |>
  select(`date collected`) |>
  slice_sample(n = 5)

# A tibble: 5 × 1
  `date collected`
  <chr>           
1 45662           
2 45665           
3 45654           
4 45659           
5 45646

Excel counts days from Jan 1st 1900, while R counts days from Jan 1st 1970. That means that you need to subtract those 70 years to convert from one to the other, and then use lubridates as_date to convert that number of days into a proper R date. The quick-and-dirty way - just subtract 70 years-worth of days, which is (of course) 25569 days:

raw |>
  mutate(`date collected` = as_date(as.numeric(`date collected`) - 25569)) |> # 25569 is the magic number
  select(`date collected`) |>
  slice_sample(n = 5)

# A tibble: 5 × 1
  `date collected`
  <date>          
1 2025-01-01      
2 2024-12-20      
3 2025-01-04      
4 2024-12-28      
5 2025-01-05

A nicer, more explicit, approach is to set an origin for your as_date to count from. Weirdly, because of a decades-old glitch in Excel dates, that’s two days before the 1st Jan 1900:

raw |>
  mutate(`date collected` = as_date(as.numeric(`date collected`), origin = "1899-12-30")) |>
  select(`date collected`) |>
  slice_sample(n = 5)

# A tibble: 5 × 1
  `date collected`
  <date>          
1 2024-12-24      
2 2024-12-23      
3 2025-01-05      
4 2024-12-28      
5 2025-01-02

There’s lots to say about date parsing - and would recommend coming to our lubridate session if you’d like to know more.

Writing back to Excel

Once you’ve had a play with your data, and cleaned up as needed, writing back to a basic Excel sheet is straightforward:

raw |>
  mutate(PracticeListSize = parse_number(PracticeListSize)) |>
  mutate(date = parse_number(`date collected`)) |>
  mutate(date = as_date(date, origin = "1899-12-30")) |>
  write_xlsx(here::here("rbc/data/cooked.xlsx"))

You can have a play with the data yourself: