# packages for this session
library(dplyr)
library(purrr)
library(palmerpenguins)
R beginner’s club 2024-12-12
Don’t repeat yourself!
A key coding principle: don’t repeat yourself. This session is a light introduction to functionals, and related tools, that let you apply functions in an intelligent and concise way.
Lists again
We’ll do rather a lot with lists in this session. Lists are a basic data structure in R. You can think of them as a collection of vectors. They have two distinctive properties. First, and unlike atomic vectors, lists can contain several different types of data:
list("clive", 99:1, penguins[2,])
[[1]]
[1] "clive"
[[2]]
[1] 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75
[26] 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50
[51] 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25
[76] 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
[[3]]
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.5 17.4 186 3800
# ℹ 2 more variables: sex <fct>, year <int>
Second, as that example shows, and unlike tibbles, lists can be ragged, containing vectors of different lengths:
test_list <- list(dunn = "clive",
                  ready_or_not = 99:1,
                  pengs = penguins) # or with names
There’s lots to say about working with lists, but the most helpful reminder is about subsetting. This can trip people up, because you can either return a vector:
test_list$dunn # returns a vector
[1] "clive"
test_list[["dunn"]] # equivalent
[1] "clive"
test_list[[1]] # subsetting by extracting a vector
[1] "clive"
Or you can return a smaller list:
test_list[1] # subsetting to return a mini-list
$dunn
[1] "clive"
The important thing is to be sure about exactly which of those you’re planning to do, and then check to make sure that you’re actually getting what you’d planned. And this minor pain-point is entirely worthwhile, because lists are so flexible. If in doubt, use a list.
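One way to check which kind of subsetting you've done is to ask R what came back. The sketch below uses a small made-up list (example_list is just an illustrative name, not part of the session data) so that it runs without any packages:

```r
# A small named list, built here purely for illustration
example_list <- list(name = "clive", countdown = 3:1)

# [[ ]] (like $) extracts the element itself
extracted <- example_list[["name"]]
class(extracted)   # a character vector

# [ ] keeps the list wrapper, returning a shorter list
sub_list <- example_list["name"]
class(sub_list)    # still a list
length(sub_list)   # with one element

# str() gives a compact summary of what you actually got back
str(sub_list)
```

If class() says "list" when you were expecting a vector, that's usually a sign that single brackets were used where double brackets were wanted.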
Functions
R is largely functional. We do things by writing expressions that pass objects to functions:
class(penguins)
[1] "tbl_df" "tbl" "data.frame"
length(LETTERS)
[1] 26
nums <- c(5:1, 9:2, 8:22)
sum(nums)
[1] 284
Usually that’s simple. But imagine that you want to apply the same function to a group of objects:
sum(nums[1])
[1] 5
sum(nums[2])
[1] 4
sum(nums[3])
[1] 3
This starts to contradict the advice about not repeating yourself. We’re essentially writing the same function call several times. Happily though, R offers several alternative ways of constructing expressions that pass objects to functions. This session will look at two groups of alternative approaches.
do.call
The first is do.call. From the man page:
‘do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.’
# do.call(what = function, args = arguments to that function)
do.call("complex", list(imaginary = 1:3)) # handy if you ever need to calculate a Mandelbrot set in a hurry
[1] 0+1i 0+2i 0+3i
do.call("sum", list(nums)) # the same as just summing everything
[1] 284
For now, that probably doesn’t seem very exciting. But being able to build function calls in a different way, where their arguments are held in a list, can be extremely useful. do.call is also especially useful when you want to use operators as if they were standard functions:
big_nums <- list(c(1:5), c(5:1))
do.call("*", big_nums)
[1] 5 8 9 8 5
# or for collecting several arguments, and then evaluating them
arg <- list(1:10, na.rm = TRUE)
do.call(sum, args = arg)
[1] 55
do.call(mean, args = arg)
[1] 5.5
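A common place where holding arguments in a list pays off is stacking several data frames. The following is a minimal sketch using made-up data (pieces and combined are illustrative names only), and it needs nothing beyond base R:

```r
# Three small data frames with matching columns (made-up values)
pieces <- list(
  data.frame(id = 1:2, score = c(10, 20)),
  data.frame(id = 3:4, score = c(30, 40)),
  data.frame(id = 5L, score = 50)
)

# Equivalent to rbind(pieces[[1]], pieces[[2]], pieces[[3]]),
# but it works however many elements the list holds
combined <- do.call(rbind, pieces)
nrow(combined)
```

The point is that you never have to type out each element by hand: the list can have three items or three hundred, and the call stays the same.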
lapply
lapply is a base-R function that applies a function to each element of an object, and collects the output in a list. Imagine we’ve got a list containing a couple of numeric vectors:
nums <- list(c(1,2), c(3,4))
We can use lapply to sum each vector, and return a new list of those sums:
lapply(nums, sum)
[[1]]
[1] 3
[[2]]
[1] 7
Some other simple examples:
lapply(penguins, class) # gives you back a list of the same length
$species
[1] "factor"
$island
[1] "factor"
$bill_length_mm
[1] "numeric"
$bill_depth_mm
[1] "numeric"
$flipper_length_mm
[1] "integer"
$body_mass_g
[1] "integer"
$sex
[1] "factor"
$year
[1] "integer"
lapply(penguins, mean, na.rm = TRUE) # factor columns give NA (with a warning)
$species
[1] NA
$island
[1] NA
$bill_length_mm
[1] 43.92193
$bill_depth_mm
[1] 17.15117
$flipper_length_mm
[1] 200.9152
$body_mass_g
[1] 4201.754
$sex
[1] NA
$year
[1] 2008.029
lapply(penguins, "class") # passing the function name as a string: horrible but possible
$species
[1] "factor"
$island
[1] "factor"
$bill_length_mm
[1] "numeric"
$bill_depth_mm
[1] "numeric"
$flipper_length_mm
[1] "integer"
$body_mass_g
[1] "integer"
$sex
[1] "factor"
$year
[1] "integer"
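The function you hand to lapply doesn't have to be a named one. As a rough sketch, here is lapply with an anonymous function, and with an extra argument passed along to each call. The readings list is made up for illustration and isn't part of the session data:

```r
# A small list with some missing values (made-up data)
readings <- list(a = c(1, NA, 3), b = c(NA, NA, 6), c = 7:9)

# Count the missing values in each element using an anonymous function
na_counts <- lapply(readings, function(x) sum(is.na(x)))

# Extra arguments after the function are passed on to every call
means <- lapply(readings, mean, na.rm = TRUE)

str(na_counts)
str(means)
```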
lapply and do.call play very nicely together:
c(lapply(penguins, class)) # nonsense: c() just hands back the same list
$species
[1] "factor"
$island
[1] "factor"
$bill_length_mm
[1] "numeric"
$bill_depth_mm
[1] "numeric"
$flipper_length_mm
[1] "integer"
$body_mass_g
[1] "integer"
$sex
[1] "factor"
$year
[1] "integer"
do.call(c, lapply(penguins, class))
          species            island    bill_length_mm     bill_depth_mm
         "factor"          "factor"         "numeric"         "numeric"
flipper_length_mm       body_mass_g               sex              year
        "integer"         "integer"          "factor"         "integer"
do.call(tibble, lapply(penguins, class))
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <chr> <chr> <chr> <chr>
1 factor factor numeric numeric integer integer
# ℹ 2 more variables: sex <chr>, year <chr>
penguins[do.call(c, lapply(penguins, is.numeric))] # wild
# A tibble: 344 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
<dbl> <dbl> <int> <int> <int>
1 39.1 18.7 181 3750 2007
2 39.5 17.4 186 3800 2007
3 40.3 18 195 3250 2007
4 NA NA NA NA 2007
5 36.7 19.3 193 3450 2007
6 39.3 20.6 190 3650 2007
7 38.9 17.8 181 3625 2007
8 39.2 19.6 195 4675 2007
9 34.1 18.1 193 3475 2007
10 42 20.2 190 4250 2007
# ℹ 334 more rows
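That last line packs three steps into one expression, so it may help to see them separately. This is only a sketch on a small stand-in data frame (df and its values are hypothetical), using base R alone:

```r
# A small stand-in data frame (hypothetical values)
df <- data.frame(name = c("a", "b"), height = c(1.5, 2.5), count = 1:2)

# Step 1: lapply returns a list with one TRUE/FALSE per column
is_num_list <- lapply(df, is.numeric)

# Step 2: do.call(c, ...) flattens that list into a named logical vector
is_num <- do.call(c, is_num_list)

# Step 3: a logical vector inside [ ] keeps only the TRUE columns
numeric_only <- df[is_num]
names(numeric_only)
```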
There are other kinds of *apply functions in base R, like tapply, sapply and so on. My advice is to ignore them completely as they’re very quirky and hard to use consistently. The purrr package is a much stronger option:
purrr
map(penguins, class) # basically lapply
$species
[1] "factor"
$island
[1] "factor"
$bill_length_mm
[1] "numeric"
$bill_depth_mm
[1] "numeric"
$flipper_length_mm
[1] "integer"
$body_mass_g
[1] "integer"
$sex
[1] "factor"
$year
[1] "integer"
map_vec(penguins, class) # but with lovely programmer-pleasing sweeteners
          species            island    bill_length_mm     bill_depth_mm
         "factor"          "factor"         "numeric"         "numeric"
flipper_length_mm       body_mass_g               sex              year
        "integer"         "integer"          "factor"         "integer"
map gives you a standard way of applying a function over an object, and then being able to control how your output is returned. Say you’ve got some odd non-vectorised function:
fb <- function(n){
  out <- ""
  if (n %% 3 == 0) out <- "fizz"
  if (n %% 5 == 0) out <- paste0(out, "buzz")
  if (nchar(out) == 0) out <- as.character(n)
  out
}
That works fine on single values, but chokes when supplied with several values:
fb(8)
[1] "8"
fb(9)
[1] "fizz"
try(fb(8:9))
Error in if (n%%3 == 0) out <- "fizz" : the condition has length > 1
We could lapply this, or map it, to produce a list of output:
lapply(8:9, fb)
[[1]]
[1] "8"
[[2]]
[1] "fizz"
map(8:9, fb)
[[1]]
[1] "8"
[[2]]
[1] "fizz"
But the advantage of map is that we can trivially change the output by tweaking the function name. Rather than map we could return a character vector with map_chr:
map_chr(1:20, fb)
 [1] "1"        "2"        "fizz"     "4"        "buzz"     "fizz"
 [7] "7"        "8"        "fizz"     "buzz"     "11"       "fizz"
[13] "13"       "14"       "fizzbuzz" "16"       "17"       "fizz"
[19] "19"       "buzz"
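The same trick works for other output types. The sketch below assumes purrr is installed and uses a small made-up list (words is an illustrative name); it shows map_int, map_lgl and map_dbl, and what happens when the function's output doesn't match the type you asked for:

```r
library(purrr)

# A small made-up list of strings
words <- list("fizz", "buzz", "fizzbuzz")

lengths_int <- map_int(words, nchar)                        # integer vector
is_long     <- map_lgl(words, function(w) nchar(w) > 4)     # logical vector
halves      <- map_dbl(words, function(w) nchar(w) / 2)     # double vector

# A type mismatch is an error rather than a silent conversion:
# toupper() returns strings, which map_int() refuses to accept
mismatch <- try(map_int(words, toupper), silent = TRUE)
inherits(mismatch, "try-error")
```

That strictness is arguably the main attraction: if the output isn't the type you planned for, you find out straight away.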