# packages for this session
library(dplyr)
library(purrr)
library(palmerpenguins)
R beginner’s club 2024-12-12
Don’t repeat yourself!
A key coding principle: don’t repeat yourself. This session is a light introduction to functionals, and related tools, that let you apply functions in an intelligent and concise way.
Lists again
We’ll do rather a lot with lists in this session. Lists are a basic data structure in R. You can think of them as a collection of vectors. They have two distinctive properties. First, and unlike vectors, lists can contain several different types of data:
list("clive", 99:1, penguins[2,])
[[1]]
[1] "clive"
[[2]]
[1] 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75
[26] 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50
[51] 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25
[76] 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
[[3]]
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.5 17.4 186 3800
# ℹ 2 more variables: sex <fct>, year <int>
Second, as that example shows, and unlike tibbles, lists can be ragged, containing vectors of different lengths. Their elements can also be named:
test_list <- list(dunn = "clive",
                  ready_or_not = 99:1,
                  pengs = penguins) # or with names
There’s lots to say about working with lists, but the most helpful reminder is about subsetting. This can trip people up, because you can either return a vector:
test_list$dunn # returns a vector
[1] "clive"
test_list[["dunn"]] # equivalent
[1] "clive"
test_list[[1]] # subsetting by extracting a vector
[1] "clive"
Or you can return a smaller list:
test_list[1] # subsetting to return a mini-list
$dunn
[1] "clive"
The important thing is to be sure about exactly which of those you’re planning to do, and then check to make sure that you’re actually getting what you’d planned. And this minor pain-point is entirely worthwhile, because lists are so flexible. If in doubt, use a list.
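One way to make that check concrete is to look at what each kind of subsetting gives back. The sketch below uses a small made-up list, demo_list (an illustrative name, not one from the session), so that it runs on its own:

```r
# A small named list, similar in shape to test_list (values are illustrative)
demo_list <- list(dunn = "clive", ready_or_not = 3:1)

single <- demo_list[1]    # single brackets: a list of length 1
double <- demo_list[[1]]  # double brackets: the element itself

class(single)  # "list"
class(double)  # "character"

# Quick checks that you got what you planned
is.list(single)
is.character(double)
```

If the next step expects a plain vector, the double-bracket (or $) form is probably the one you want.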
Functions
R is largely functional. We do things by writing expressions that pass objects to functions:
class(penguins)
[1] "tbl_df" "tbl" "data.frame"
length(LETTERS)
[1] 26
nums <- c(5:1, 9:2, 8:22)
sum(nums)
[1] 284
Usually that’s simple. But imagine that you want to apply the same function to a group of objects:
sum(nums[1])
[1] 5
sum(nums[2])
[1] 4
sum(nums[3])
[1] 3
This starts to contradict the advice about not repeating yourself. We’re essentially writing the same function call several times. Happily though, R offers several alternative ways of constructing expressions that pass objects to functions. This session will look at two groups of alternative approaches.
do.call
The first is do.call. From the man page:
‘do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.’
# do.call(what = function, args = arguments to that function)
do.call("complex", list(imaginary = 1:3)) # handy if you ever need to calculate a Mandelbrot set in a hurry
[1] 0+1i 0+2i 0+3i
do.call("sum", list(nums)) # the same as just summing everything
[1] 284
For now, that probably doesn’t seem very exciting. But being able to build function calls in a different way, where their arguments are held in a list, can be extremely useful. do.call is also especially useful when you want to use operators as if they were standard functions:
big_nums <- list(c(1:5), c(5:1))
do.call("*", big_nums)
[1] 5 8 9 8 5
# or for collecting several arguments, and then evaluating them
arg <- list(1:10, na.rm = TRUE)
do.call(sum, args = arg)
[1] 55
do.call(mean, args = arg)
[1] 5.5
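It may help to see that a do.call is just another way of writing an ordinary call. In this minimal sketch (paste_args is an illustrative name), unnamed list elements become positional arguments and named elements are matched to named arguments:

```r
# Arguments collected in a list; the named element is matched to sep
paste_args <- list("fizz", "buzz", sep = "-")

via_do_call <- do.call(paste, paste_args)
written_out <- paste("fizz", "buzz", sep = "-")

# Both routes build the same call, so the results match
identical(via_do_call, written_out)
```

The list form pays off when the arguments are only known at run time, for example when they come out of an earlier step in your code.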
lapply
lapply is a base-R function that applies a function to each element of an object, and collects the output in a list. Imagine we’ve got a list containing a couple of numeric vectors:
nums <- list(c(1, 2), c(3, 4))
We can use lapply to sum each vector, and return a new list of those sums:
lapply(nums, sum)
[[1]]
[1] 3
[[2]]
[1] 7
Some other simple examples:
lapply(penguins, class) # gives you back a list of the same length
$species
[1] "factor"
$island
[1] "factor"
$bill_length_mm
[1] "numeric"
$bill_depth_mm
[1] "numeric"
$flipper_length_mm
[1] "integer"
$body_mass_g
[1] "integer"
$sex
[1] "factor"
$year
[1] "integer"
lapply(penguins, mean, na.rm = T)
$species
[1] NA
$island
[1] NA
$bill_length_mm
[1] 43.92193
$bill_depth_mm
[1] 17.15117
$flipper_length_mm
[1] 200.9152
$body_mass_g
[1] 4201.754
$sex
[1] NA
$year
[1] 2008.029
lapply(penguins, "class") # horrible but possible
$species
[1] "factor"
$island
[1] "factor"
$bill_length_mm
[1] "numeric"
$bill_depth_mm
[1] "numeric"
$flipper_length_mm
[1] "integer"
$body_mass_g
[1] "integer"
$sex
[1] "factor"
$year
[1] "integer"
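Two details in those examples are easy to miss: any extra arguments after the function name are passed on to every call, and the function can be written on the spot. A small sketch, using a made-up list called scores:

```r
# Illustrative data: a named list of numeric vectors, some with missing values
scores <- list(a = c(1, 2, NA), b = c(3, 4), c = c(10, NA, 30))

# Extra arguments (here na.rm = TRUE) are handed to mean() on each call
means <- lapply(scores, mean, na.rm = TRUE)

# An anonymous function works when no ready-made function does the job
n_missing <- lapply(scores, function(x) sum(is.na(x)))

means
n_missing
```

Both results are lists with the same names as the input, which is what makes lapply output predictable to work with.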
lapply and do.call play very nicely together:
c(lapply(penguins, class)) # nonsense
$species
[1] "factor"
$island
[1] "factor"
$bill_length_mm
[1] "numeric"
$bill_depth_mm
[1] "numeric"
$flipper_length_mm
[1] "integer"
$body_mass_g
[1] "integer"
$sex
[1] "factor"
$year
[1] "integer"
do.call(c, lapply(penguins, class))
species island bill_length_mm bill_depth_mm
"factor" "factor" "numeric" "numeric"
flipper_length_mm body_mass_g sex year
"integer" "integer" "factor" "integer"
do.call(tibble, lapply(penguins, class))
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <chr> <chr> <chr> <chr>
1 factor factor numeric numeric integer integer
# ℹ 2 more variables: sex <chr>, year <chr>
penguins[do.call(c, lapply(penguins, is.numeric))] # wild
# A tibble: 344 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
<dbl> <dbl> <int> <int> <int>
1 39.1 18.7 181 3750 2007
2 39.5 17.4 186 3800 2007
3 40.3 18 195 3250 2007
4 NA NA NA NA 2007
5 36.7 19.3 193 3450 2007
6 39.3 20.6 190 3650 2007
7 38.9 17.8 181 3625 2007
8 39.2 19.6 195 4675 2007
9 34.1 18.1 193 3475 2007
10 42 20.2 190 4250 2007
# ℹ 334 more rows
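A common use of this pairing is to build one small data frame per element with lapply, then stack them with do.call and rbind. The sketch below assumes a hypothetical helper, summarise_vec, and some made-up input vectors:

```r
# Hypothetical helper: one summary row per vector
summarise_vec <- function(x) {
  data.frame(n = length(x), total = sum(x))
}

pieces <- list(c(1, 2), c(3, 4), c(5, 6, 7))

rows <- lapply(pieces, summarise_vec)  # a list of one-row data frames
combined <- do.call(rbind, rows)       # same as rbind(rows[[1]], rows[[2]], rows[[3]])

combined
```

Because do.call spreads the list out as separate arguments, the same line works however many pieces there are.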
There are other kinds of *apply functions in base R, like tapply, sapply and so on. My advice is to ignore them completely as they’re very quirky and hard to use consistently (sapply, for instance, can return a vector, a matrix, or a list depending on its input). The purrr package is a much stronger option:
purrr
map(penguins, class) # basically lapply
$species
[1] "factor"
$island
[1] "factor"
$bill_length_mm
[1] "numeric"
$bill_depth_mm
[1] "numeric"
$flipper_length_mm
[1] "integer"
$body_mass_g
[1] "integer"
$sex
[1] "factor"
$year
[1] "integer"
map_vec(penguins, class) # but with lovely programmer-pleasing sweeteners
species island bill_length_mm bill_depth_mm
"factor" "factor" "numeric" "numeric"
flipper_length_mm body_mass_g sex year
"integer" "integer" "factor" "integer"
map gives you a standard way of applying a function over an object, and then being able to control how your output is returned. Say you’ve got some odd non-vectorised function:
fb <- function(n){
  out <- ""
  if (n %% 3 == 0) out <- "fizz"
  if (n %% 5 == 0) out <- paste0(out, "buzz")
  if (nchar(out) == 0) out <- as.character(n)
  out
}
That works fine on single values, but chokes when supplied with several values:
fb(8)
[1] "8"
fb(9)
[1] "fizz"
try(fb(8:9))
Error in if (n%%3 == 0) out <- "fizz" : the condition has length > 1
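The error comes from the test inside if rather than from the arithmetic. A short sketch of what fb sees when it is handed 8:9 (the names n and cond are just for illustration):

```r
n <- 8:9

# The comparison is vectorised, so it yields one answer per input value
cond <- n %% 3 == 0
cond          # FALSE TRUE
length(cond)  # 2

# if () needs exactly one TRUE or FALSE, so a length-2 condition is rejected
```

That is why the function has to be called once per value, rather than once on the whole vector.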
We could lapply this, or map it, to produce a list of output:
lapply(8:9, fb)
[[1]]
[1] "8"
[[2]]
[1] "fizz"
map(8:9, fb)
[[1]]
[1] "8"
[[2]]
[1] "fizz"
But the advantage of map is that we can trivially change the output by tweaking the function name. Rather than map we could return a character vector with map_chr:
map_chr(1:20, fb)
[1] "1" "2" "fizz" "4" "buzz" "fizz"
[7] "7" "8" "fizz" "buzz" "11" "fizz"
[13] "13" "14" "fizzbuzz" "16" "17" "fizz"
[19] "19" "buzz"
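The same pattern extends to other output types. The sketch below, which assumes purrr is installed and uses made-up inputs, shows a few of the typed variants, along with what happens when a function returns something of the wrong type:

```r
library(purrr)

items <- list(1, "a", 3)  # illustrative mixed-type input

as_list <- map(items, is.numeric)      # always a list
as_lgl  <- map_lgl(items, is.numeric)  # logical vector: TRUE FALSE TRUE
lens    <- map_int(list("fizz", "fizzbuzz"), nchar)  # integer vector: 4 8

# The typed variants stop with an error rather than guessing at a conversion
wrong_type <- tryCatch(
  map_dbl(1:2, function(i) "not a number"),
  error = function(e) "error"
)
wrong_type  # "error"
```

That strictness is the main selling point: either you get exactly the type named in the function, or you find out straight away.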