R beginner’s club 2024-12-12

Authors: KIND Network members; Brendan Clarke (NHS Education for Scotland)

Published: January 17, 2025

# packages for this session

library(dplyr)
library(purrr)
library(palmerpenguins)

Don’t repeat yourself!

A key coding principle: don’t repeat yourself. This session is a light introduction to functionals, and related tools, that let you apply functions in an intelligent and concise way.

Lists again

We’ll do rather a lot with lists in this session. Lists are a basic data structure in R. You can think of them as a collection of vectors. They have two distinctive properties. First, and unlike atomic vectors, lists can contain several different types of data:

list("clive", 99:1, penguins[2,])
[[1]]
[1] "clive"

[[2]]
 [1] 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75
[26] 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50
[51] 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25
[76] 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1

[[3]]
# A tibble: 1 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.5          17.4               186        3800
# ℹ 2 more variables: sex <fct>, year <int>

Second, and unlike tibbles, lists can be ragged, containing vectors of different lengths. The example above shows that, and the same holds when the elements are named:

test_list <- list(dunn = "clive", 
                  ready_or_not = 99:1,
                  pengs = penguins) # or with names

There’s lots to say about working with lists, but the most helpful reminder is about subsetting. This can trip people up, because you can either return a vector:

test_list$dunn  # returns a vector
[1] "clive"
test_list[["dunn"]] # equivalent
[1] "clive"
test_list[[1]] # subsetting by extracting a vector
[1] "clive"

Or you can return a smaller list:

test_list[1] # subsetting to return a mini-list
$dunn
[1] "clive"

The important thing is to be sure about exactly which of those you’re planning to do, and then check to make sure that you’re actually getting what you’d planned. And this minor pain-point is entirely worthwhile, because lists are so flexible. If in doubt, use a list.
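One way of doing that check is sketched below. The `shopping` list is made up for illustration and isn't part of the session data; the idea is just to compare what single and double brackets hand back.

```r
# Hypothetical example list, not from the session
shopping <- list(fruit = c("apple", "pear"), count = 1:3)

single <- shopping["fruit"]    # single brackets: still a list, of length 1
double <- shopping[["fruit"]]  # double brackets: the character vector inside

class(single)  # "list"
class(double)  # "character"

# Single brackets can also keep several elements at once
length(shopping[c("fruit", "count")])  # 2

# str() gives a compact overview of whatever you got back
str(single)
```

If the result needs to go into a function that expects a vector, like sum or paste, double brackets (or $) are probably what you want.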

Functions

R is largely functional. We do things by writing expressions that pass objects to functions:

class(penguins)
[1] "tbl_df"     "tbl"        "data.frame"
length(LETTERS)
[1] 26
nums <- c(5:1, 9:2, 8:22)
sum(nums)
[1] 284

Usually that’s simple. But imagine that you want to apply the same function to a group of objects:

sum(nums[1])
[1] 5
sum(nums[2])
[1] 4
sum(nums[3])
[1] 3

This starts to contradict the advice about not repeating yourself. We’re essentially writing the same function call several times. Happily though, R offers several alternative ways of constructing expressions that pass objects to functions. This session will look at two groups of alternative approaches.

do.call

The first is do.call. From the man page:

‘do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.’

# do.call(what = function, args = arguments to that function)
do.call("complex", list(imaginary = 1:3)) # handy if you ever need to calculate a Mandelbrot set in a hurry
[1] 0+1i 0+2i 0+3i
do.call("sum", list(nums)) # the same as just summing everything
[1] 284

For now, that probably doesn’t seem very exciting. But being able to build function calls in a different way, where their arguments are held in a list, can be extremely useful. do.call is also especially useful when you want to use operators as if they were standard functions:

big_nums <- list(c(1:5), c(5:1))
do.call("*", big_nums)
[1] 5 8 9 8 5
# or for collecting several arguments, and then evaluating them
arg <- list(1:10, na.rm = TRUE) # TRUE is safer than T, which can be overwritten
do.call(sum, args = arg)
[1] 55
do.call(mean, args = arg)
[1] 5.5
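Because the arguments live in an ordinary list, they can be built up or changed before the call is made. A small sketch of that idea, using made-up argument lists (`paste_args` and `seq_args` are illustrative names only):

```r
# Hypothetical arguments, collected in a list with one named element
paste_args <- list("a", "b", "c", sep = "-")

# Equivalent to paste("a", "b", "c", sep = "-")
joined <- do.call(paste, paste_args)
joined  # "a-b-c"

# Arguments can be added to the list before the call is made
seq_args <- list(from = 1, to = 10)
seq_args$by <- 3
steps <- do.call(seq, seq_args)  # same as seq(from = 1, to = 10, by = 3)
steps  # 1 4 7 10
```

Named list elements are matched to named arguments, and unnamed elements are matched by position, just as in an ordinary call.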

lapply

lapply is a base-R function that applies a function to each element of an object (such as a list or a vector), and collects the output in a list. Imagine we’ve got a list containing a couple of numeric vectors:

nums <- list(c(1,2), c(3,4))

We can use lapply to sum each vector, and return a new list of those sums:

lapply(nums, sum)
[[1]]
[1] 3

[[2]]
[1] 7
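The function doesn't have to be an existing one. As a rough sketch, lapply will also take an anonymous function written on the spot, and it passes any extra arguments on to every call:

```r
nums <- list(c(1, 2), c(3, 4))

# An anonymous function, for when no existing function does the job
doubled <- lapply(nums, function(x) x * 2)
doubled  # a list holding c(2, 4) and c(6, 8)

# Extra arguments after the function are passed on to every call,
# so this runs round(1.234, digits = 1) and round(5.678, digits = 1)
rounded <- lapply(list(1.234, 5.678), round, digits = 1)
rounded  # a list holding 1.2 and 5.7
```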

Some other simple examples:

lapply(penguins, class) # gives you back a list of the same length
$species
[1] "factor"

$island
[1] "factor"

$bill_length_mm
[1] "numeric"

$bill_depth_mm
[1] "numeric"

$flipper_length_mm
[1] "integer"

$body_mass_g
[1] "integer"

$sex
[1] "factor"

$year
[1] "integer"
lapply(penguins, mean, na.rm = TRUE) # factor columns give NA (with a warning)
$species
[1] NA

$island
[1] NA

$bill_length_mm
[1] 43.92193

$bill_depth_mm
[1] 17.15117

$flipper_length_mm
[1] 200.9152

$body_mass_g
[1] 4201.754

$sex
[1] NA

$year
[1] 2008.029
lapply(penguins, "class") # horrible but possible
$species
[1] "factor"

$island
[1] "factor"

$bill_length_mm
[1] "numeric"

$bill_depth_mm
[1] "numeric"

$flipper_length_mm
[1] "integer"

$body_mass_g
[1] "integer"

$sex
[1] "factor"

$year
[1] "integer"

lapply and do.call play very nicely together:

c(lapply(penguins, class)) # nonsense
$species
[1] "factor"

$island
[1] "factor"

$bill_length_mm
[1] "numeric"

$bill_depth_mm
[1] "numeric"

$flipper_length_mm
[1] "integer"

$body_mass_g
[1] "integer"

$sex
[1] "factor"

$year
[1] "integer"
do.call(c, lapply(penguins, class))
          species            island    bill_length_mm     bill_depth_mm 
         "factor"          "factor"         "numeric"         "numeric" 
flipper_length_mm       body_mass_g               sex              year 
        "integer"         "integer"          "factor"         "integer" 
do.call(tibble, lapply(penguins, class))
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>  <chr>          <chr>         <chr>             <chr>      
1 factor  factor numeric        numeric       integer           integer    
# ℹ 2 more variables: sex <chr>, year <chr>
penguins[do.call(c, lapply(penguins, is.numeric))] # wild
# A tibble: 344 × 5
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
            <dbl>         <dbl>             <int>       <int> <int>
 1           39.1          18.7               181        3750  2007
 2           39.5          17.4               186        3800  2007
 3           40.3          18                 195        3250  2007
 4           NA            NA                  NA          NA  2007
 5           36.7          19.3               193        3450  2007
 6           39.3          20.6               190        3650  2007
 7           38.9          17.8               181        3625  2007
 8           39.2          19.6               195        4675  2007
 9           34.1          18.1               193        3475  2007
10           42            20.2               190        4250  2007
# ℹ 334 more rows
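Another common pairing, sketched here with made-up data, is to use lapply to build a list of one-row data frames and then do.call to bind them together. The helper `summarise_one` is hypothetical and only there for illustration:

```r
# Hypothetical helper: summarise one numeric vector as a one-row data frame
summarise_one <- function(x) {
  data.frame(n = length(x), total = sum(x), average = mean(x))
}

vectors <- list(c(1, 2), c(3, 4, 5), 10)

# lapply makes a list of one-row data frames...
rows <- lapply(vectors, summarise_one)

# ...and do.call passes them all to rbind in a single call,
# the same as rbind(rows[[1]], rows[[2]], rows[[3]])
summary_table <- do.call(rbind, rows)
summary_table
```

The nice part is that the do.call line stays the same however many vectors are in the list.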

There are other *apply functions in base R, like tapply and sapply. My advice is to ignore them completely, as they’re quirky and hard to use consistently. The purrr package is a much stronger option.
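To give a flavour of that quirkiness, here is a small sketch using made-up inputs. The shape of what sapply returns depends on what the function happens to produce, which makes it hard to rely on inside longer code:

```r
# One value per element: sapply returns a plain vector
a <- sapply(1:3, function(i) i * 2)

# Two values per element: sapply returns a matrix
b <- sapply(1:3, function(i) c(i, i * 2))

# Different lengths per element: sapply returns a list
c_out <- sapply(1:3, function(i) seq_len(i))

is.matrix(a)    # FALSE
is.matrix(b)    # TRUE
is.list(c_out)  # TRUE
```

lapply, by contrast, would return a list in all three cases.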

purrr

map(penguins, class) # basically lapply
$species
[1] "factor"

$island
[1] "factor"

$bill_length_mm
[1] "numeric"

$bill_depth_mm
[1] "numeric"

$flipper_length_mm
[1] "integer"

$body_mass_g
[1] "integer"

$sex
[1] "factor"

$year
[1] "integer"
map_vec(penguins, class) # but with lovely programmer-pleasing sweeteners
          species            island    bill_length_mm     bill_depth_mm 
         "factor"          "factor"         "numeric"         "numeric" 
flipper_length_mm       body_mass_g               sex              year 
        "integer"         "integer"          "factor"         "integer" 

map gives you a standard way of applying a function over an object, and of controlling how the output is returned. Say you’ve got some odd non-vectorised function:

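fb <- function(n) {
  out <- ""
  if (n %% 3 == 0) out <- "fizz"               # multiples of 3
  if (n %% 5 == 0) out <- paste0(out, "buzz")  # multiples of 5, appended so 15 gives "fizzbuzz"
  if (nchar(out) == 0) out <- as.character(n)  # anything else: the number itself
  out
}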

That works fine on single values, but chokes when supplied with several values:

fb(8)
[1] "8"
fb(9)
[1] "fizz"
try(fb(8:9))
Error in if (n%%3 == 0) out <- "fizz" : the condition has length > 1

We could lapply this, or map it, to produce a list of output:

lapply(8:9, fb)
[[1]]
[1] "8"

[[2]]
[1] "fizz"
map(8:9, fb)
[[1]]
[1] "8"

[[2]]
[1] "fizz"

But the advantage of map is that we can trivially change the output by tweaking the function name. Rather than map we could return a character vector with map_chr:

map_chr(1:20, fb)
 [1] "1"        "2"        "fizz"     "4"        "buzz"     "fizz"    
 [7] "7"        "8"        "fizz"     "buzz"     "11"       "fizz"    
[13] "13"       "14"       "fizzbuzz" "16"       "17"       "fizz"    
[19] "19"       "buzz"
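map_chr belongs to a family of typed variants in purrr, including map_int, map_dbl and map_lgl. The sketch below uses a made-up `scores` list to show a few of them, and also what happens when the function's output doesn't match the type you asked for. fb is repeated so the block runs on its own.

```r
library(purrr)

# The fizzbuzz function from above, repeated so this block runs on its own
fb <- function(n) {
  out <- ""
  if (n %% 3 == 0) out <- "fizz"
  if (n %% 5 == 0) out <- paste0(out, "buzz")
  if (nchar(out) == 0) out <- as.character(n)
  out
}

# Hypothetical example data
scores <- list(anna = c(3, 5, 8), ben = c(2, 2), cara = 10)

lengths_int <- map_int(scores, length)                  # integer vector
means_dbl   <- map_dbl(scores, mean)                    # double vector
any_high    <- map_lgl(scores, function(x) any(x > 7))  # logical vector

# fb returns character values, so asking for integers is an error
# rather than a silent conversion
result <- tryCatch(map_int(1:5, fb), error = function(e) "map_int failed")
result  # "map_int failed"
```

That strictness is arguably the point: you either get back the type you asked for, or a clear error.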