Data masking in R
Introduction
More so than in many other programming languages, R functions are biased towards helping the user do common tasks easily. One excellent example is the way that functions from tidyverse packages (like dplyr) make assumptions about what users mean when they refer to variables. As an example, if you want to select a column in base R you’d use a quoted column name in single brackets:
stranded_data["age"]
# A tibble: 768 × 1
age
<int>
1 50
2 31
3 32
4 69
5 33
6 75
7 26
8 64
9 53
10 63
# ℹ 758 more rows
Doing that in the tidyverse is easier: use an unquoted column name inside select():
stranded_data |>
  select(age)
# A tibble: 768 × 1
age
<int>
1 50
2 31
3 32
4 69
5 33
6 75
7 26
8 64
9 53
10 63
# ℹ 758 more rows
That’s conceptually helpful because we’re used to referring to ordinary R objects using unquoted variable names. When we specify the age column in this select function, we don’t need to tell R that we specifically mean the age column in the stranded_data tibble. That saves us having to spell out which column in which tibble we mean each time we write a line of dplyr. Even if we create another tibble that also has an age column…
new_stranded_data <- stranded_data |>
  select(stranded.label, age)
… we can still just refer to the age column of the original stranded_data without any risk of confusion. This simplification - which we’ll call data masking - is a great advantage of working with dplyr and the pipe, and most of the time data masking just works without giving rise to any problems at all. For example, we can write a vector of column names, pass it to select() via a helper such as any_of(), and R will figure out that we want to use those names as column names with very little extra effort on our part:
my_cols <- c("age")
# recent tidyselect versions expect a helper such as any_of() or all_of()
# around an external vector of column names
stranded_data |>
  select(any_of(my_cols))
# A tibble: 768 × 1
age
<int>
1 50
2 31
3 32
4 69
5 33
6 75
7 26
8 64
9 53
10 63
# ℹ 758 more rows
However, when that data masking goes wrong it can be very challenging to fix. To demonstrate, let’s start taking the column-selecting code snippets above, and translating them into functions. Base-R first:
column_pick_base <- function(col){
  stranded_data[col]
}
column_pick_base("age")
# A tibble: 768 × 1
age
<int>
1 50
2 31
3 32
4 69
5 33
6 75
7 26
8 64
9 53
10 63
# ℹ 758 more rows
That’s simple - we pass a quoted column name to the function, and it returns us the relevant column of data. But if we do the apparently-simple translation to use select(), things start going wrong:
column_pick_tidy <- function(col){
  stranded_data |>
    select(col)
}
That works almost as expected if we pass a quoted column name:
column_pick_tidy("age")
# A tibble: 768 × 1
age
<int>
1 50
2 31
3 32
4 69
5 33
6 75
7 26
8 64
9 53
10 63
# ℹ 758 more rows
Okay, so we get a deprecation warning to the effect that selecting functions expect unquoted column names. So let’s update the function call to do that now:
try(column_pick_tidy(age))
Error in eval(expr, envir) : object 'age' not found
Uh-oh. This has gone wrong. Unfortunately, to explain why it’s gone wrong, we’re going to need to think about how data masking really works. We shouldn’t rely on the quoted column name “kludge”, as that will complicate things like using our new function inside another function. A stronger approach is to adjust our function code in the first place, so that we don’t have to call our function in a non-standard way (why write age in some functions, but “age” in others, to refer to the same thing?).
In this section, we’ll give a bit of helpful theoretical background about data masking. We’ll then go on to look at four ways of resolving some of the difficulties that data masking can cause.
Background
The rlang page on data-masking is very helpful here in setting out a key distinction between kinds of variables that we’ve previously been using synonymously:
- env-variables (things you create with assignment)
- data-variables (e.g. imported data in a tibble)
For beginners, this distinction is not that important, particularly because tidyverse functions do lots of helpful blurring between these different types of variable. Base R functions, by contrast, often require the user to bear this distinction in mind. For instance, in base R you would specify a data variable differently from an environment variable:
mtcars$cyl # a data variable
cyl <- c(4,6,8) # an environment variable
Whereas in the tidyverse, you can refer to a data variable as if it were an environment variable:
mtcars |>
  select(cyl) # specifying a data variable like an environment variable inside select
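To see where the name "masking" comes from, it can help to give an env-variable the same name as a column. The sketch below uses the built-in mtcars data as a stand-in and assumes dplyr is installed; the value 100 is made up purely for illustration.

```r
library(dplyr)

# An env-variable that shares its name with a column of mtcars
# (the value 100 is arbitrary, chosen only for illustration)
cyl <- 100

# Inside filter(), `cyl` means the column: the data masks the environment
eight_cyl <- mtcars |>
  filter(cyl > 6)

nrow(eight_cyl)  # the 14 cars with more than 6 cylinders
cyl              # the env-variable is untouched: still 100
```

Inside the dplyr verb the column takes priority; outside it, the ordinary env-variable is all R can see.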
Most of the time, data masking doesn’t cause any problems. However, when you start wanting to include tidyverse functions inside other functions - say, if you’re trying to purrr something - that blurring raises a problem. We won’t give much of an explanation as to the reasons for this, although do read this introduction to the topic and this more detailed account if you are interested in the technical aspects. Here, we’ll concentrate on four strategies for resolving these kinds of data masking problems. These strategies are:
Problem | Solution |
---|---|
data-variable in a function argument | embracing with {{var}} |
env-variable in a vector | .data[[var]] and .env[[var]] pronouns |
variables in output | injection with := |
complex cases | quasiquotation with the injection operator !! |
Embracing
Slightly confusingly, this practice is also referred to as tunneling data variables.
If you want to use a data variable in the argument of a function, you need to {{embrace}} the argument. Let’s add some {{}} to our earlier function:
column_pick_curly <- function(col){
  stranded_data |>
    select({{col}})
}
column_pick_curly(age)
# A tibble: 768 × 1
age
<int>
1 50
2 31
3 32
4 69
5 33
6 75
7 26
8 64
9 53
10 63
# ℹ 758 more rows
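Embracing is not limited to select(). As a minimal sketch, assuming dplyr is installed and using the built-in mtcars data as a stand-in, the hypothetical helper mean_by() below tunnels two unquoted column names into group_by() and summarise():

```r
library(dplyr)

# Hypothetical helper: mean of `value` within each level of `group`
mean_by <- function(data, group, value) {
  data |>
    group_by({{ group }}) |>
    summarise(mean_value = mean({{ value }}), .groups = "drop")
}

# Column names are passed unquoted, just as in a direct dplyr call
mean_by(mtcars, cyl, mpg)
```

Taking the data as an argument, rather than hard-coding one tibble inside the function, is a design choice made here so that the helper can be reused.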
Pronouns
If you want to use quoted strings to select columns, use pronouns:
column_pick_pronouns <- function(col){
  stranded_data |>
    select(.data[[col]])
}
column_pick_pronouns("age")
# A tibble: 768 × 1
age
<int>
1 50
2 31
3 32
4 69
5 33
6 75
7 26
8 64
9 53
10 63
# ℹ 758 more rows
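One caveat, offered as a sketch rather than a rule: recent tidyselect releases (1.2.0 onwards) warn when .data is used inside selecting verbs such as select(), and point towards all_of() or any_of() for column names held as strings. The .data pronoun remains the usual tool inside data-masking verbs such as filter(), mutate() and summarise(). The two helpers below are hypothetical and use the built-in mtcars data as a stand-in.

```r
library(dplyr)

# Selecting by string: all_of() errors if the column is missing
pick_by_name <- function(data, col) {
  data |>
    select(all_of(col))
}

# Computing on a column named by a string: use the .data pronoun
mean_by_name <- function(data, col) {
  data |>
    summarise(mean_value = mean(.data[[col]]))
}

pick_by_name(mtcars, "mpg")
mean_by_name(mtcars, "mpg")
```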
For completeness, we can also play with the .env pronoun, which allows us to explicitly refer to an object in the environment. Let’s do something horrible, and create an age variable in the global environment:
age <- 50
column_pick_env <- function(col){
  stranded_data |>
    filter(.data[[col]] > .env[[col]])
}
column_pick_env("age")
# A tibble: 440 × 9
stranded.label age care.home.referral medicallysafe hcop
<chr> <int> <int> <int> <int>
1 Not Stranded 69 1 1 0
2 Stranded 75 1 1 0
3 Not Stranded 64 0 1 1
4 Not Stranded 53 0 1 0
5 Not Stranded 63 1 0 0
6 Not Stranded 77 1 1 1
7 Stranded 80 1 1 0
8 Not Stranded 72 1 0 0
9 Stranded 60 0 1 1
10 Not Stranded 70 1 1 0
# ℹ 430 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
# admit_date <chr>, frailty_index <chr>
Okay, like so much in this session this is hardly best practice, but if you do ever need to make 100% super-safety-sure that you’re referring to unfortunately-named data/env variables, this is probably the least-worst way of working.
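A more common way to hit this clash is a function argument that happens to share its name with a column. The sketch below assumes dplyr is installed and uses the built-in mtcars data; both helper functions are hypothetical and exist only to contrast the behaviour with and without .env.

```r
library(dplyr)

# The argument `cyl` clashes with the mtcars column `cyl`;
# .env makes it explicit that the right-hand side is the argument
filter_above <- function(data, col, cyl) {
  data |>
    filter(.data[[col]] > .env$cyl)
}

# Without the pronoun, `cyl` on the right is the column,
# so for col = "cyl" the test becomes cyl > cyl and is never TRUE
filter_above_masked <- function(data, col, cyl) {
  data |>
    filter(.data[[col]] > cyl)
}

nrow(filter_above(mtcars, "cyl", 6))         # 14
nrow(filter_above_masked(mtcars, "cyl", 6))  # 0
```

The second version fails silently, returning an empty result rather than an error.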
Injection
:= lets you inject variables into your output. Yes, this is what the tidyverse people really call it, and no, this isn’t a very helpful bit of terminology for the rest of us. Say you want to supply an unquoted column name, and return a renamed selection:
column_pick_rename <- function(col){
  stranded_data |>
    select("new_{{col}}" := {{col}})
}
column_pick_rename(age)
# A tibble: 768 × 1
new_age
<int>
1 50
2 31
3 32
4 69
5 33
6 75
7 26
8 64
9 53
10 63
# ℹ 758 more rows
There’s a bit going on here. First, replacing = with := injects the supplied column name. Then the new column name is created using glue() syntax. glue() is a neat replacement for base-R tools like paste0(). Empirically, though, a quoted string with {{}} containing the function argument will do the work for you.
:= is borrowed from mathematics, where it is used when defining something new, which is the apparent logic behind its use here. If you try to use = in this context it won’t do what you want: R reads whatever sits to the left of = as a fixed argument name, so the name is never built from the argument you supplied.
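The same naming trick works in other verbs. As a minimal sketch, assuming dplyr is installed and using the built-in mtcars data, the two hypothetical helpers below build a column name with := inside summarise(): one from an unquoted column name, one from a column name held as a string.

```r
library(dplyr)

# Unquoted column name: embrace it on both sides of :=
mean_named <- function(data, col) {
  data |>
    summarise("mean_{{ col }}" := mean({{ col }}))
}

# Column name held as a string: single braces in the name, .data on the right
mean_named_chr <- function(data, col) {
  data |>
    summarise("mean_{col}" := mean(.data[[col]]))
}

names(mean_named(mtcars, mpg))        # "mean_mpg"
names(mean_named_chr(mtcars, "mpg"))  # "mean_mpg"
```

Double braces in the name string go with an embraced argument, while single braces pick up the value of an ordinary env-variable such as a string.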
Quasiquotation
A lot of this messing around is effectively concerned with switching between quoted- and unquoted-versions of column names. Quasiquotation is the fancy-sounding name for that messing around, and the tools required to do so:
Quasiquotation is the combination of quoting an expression while allowing immediate evaluation (unquoting) of part of that expression. (rlang quasiquotation manual page)
To give the simplest possible example, !! gives a generic way of unquoting an argument: