Data masking in R

R
intermediate
functions
debugging
Published

August 5, 2025

Session materials

Introduction

More so than in many other programming languages, R functions lean towards helping the user do common tasks easily. One excellent example is the way that tidyverse packages (like dplyr) make assumptions about what users mean when they refer to variables. As an example, if you want to select a column in base R you’d use a quoted column name in single brackets:

stranded_data["age"]
# A tibble: 768 × 1
     age
   <int>
 1    50
 2    31
 3    32
 4    69
 5    33
 6    75
 7    26
 8    64
 9    53
10    63
# ℹ 758 more rows

Doing that in the tidyverse is easier: use an unquoted column name inside select:

stranded_data |> 
  select(age) 
# A tibble: 768 × 1
     age
   <int>
 1    50
 2    31
 3    32
 4    69
 5    33
 6    75
 7    26
 8    64
 9    53
10    63
# ℹ 758 more rows

That’s conceptually helpful because we’re used to referring to ordinary R objects using unquoted variable names. When we specify the age column in this select function, we don’t need to tell R that we specifically mean the age column in the stranded_data tibble. That’s very helpful, because it saves us having to specify that we want to refer to a specific column in a specific tibble each time we write a line of dplyr. Even if we create another tibble that also has an age column…

new_stranded_data <- stranded_data |>
  select(stranded.label, age)

… we can still just refer to the age column of the original stranded_data without any risk of confusion. This simplification - which we’ll call data masking - is a great advantage of working in the tidyverse, and most of the time data masking just works without giving rise to any problems at all. For example, we can write a vector of column names, and then pass it to select() through a small helper function, and R will figure out that we want to use those names as column names:

my_cols <- c("age")

stranded_data |>
  select(any_of(my_cols)) # wrap external vectors in all_of() or any_of(); bare vectors are deprecated since tidyselect 1.1.0
# A tibble: 768 × 1
     age
   <int>
 1    50
 2    31
 3    32
 4    69
 5    33
 6    75
 7    26
 8    64
 9    53
10    63
# ℹ 758 more rows
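As a rough sketch of how the two selection helpers differ, the snippet below uses the built-in mtcars data rather than stranded_data so that it runs on its own. The vector name wanted and the deliberately missing column name are made up for illustration.

```r
library(dplyr)

# Hypothetical vector of wanted columns; "not_a_column" is deliberately absent from mtcars
wanted <- c("mpg", "not_a_column")

# any_of() keeps the names that exist and silently ignores the rest
lenient <- mtcars |>
  select(any_of(wanted))

# all_of() is strict: a missing name is an error, which we catch here
strict <- tryCatch(
  mtcars |> select(all_of(wanted)),
  error = function(e) "all_of() raised an error"
)
```

Which helper is appropriate depends on whether a missing column should be tolerated or should stop the code.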

However, when that data masking goes wrong it can be very challenging to fix. To demonstrate, let’s start taking the column-selecting code snippets above, and translating them into functions. Base-R first:

column_pick_base <- function(col){
  stranded_data[col]
}
column_pick_base("age")
# A tibble: 768 × 1
     age
   <int>
 1    50
 2    31
 3    32
 4    69
 5    33
 6    75
 7    26
 8    64
 9    53
10    63
# ℹ 758 more rows

That’s simple - we pass a quoted column name to the function, and it returns us the relevant column of data. But if we do the apparently-simple translation to use select, things start going wrong:

column_pick_tidy <- function(col){
  stranded_data |>
    select(col)
}

That works almost as expected if we pass a quoted column name:

column_pick_tidy("age")
# A tibble: 768 × 1
     age
   <int>
 1    50
 2    31
 3    32
 4    69
 5    33
 6    75
 7    26
 8    64
 9    53
10    63
# ℹ 758 more rows

Okay, so we get a deprecation warning to the effect that selecting functions expect unquoted column names. So let’s update the function call to do that now:

try(column_pick_tidy(age))
Error in eval(expr, envir) : object 'age' not found

Uh-oh. This has gone wrong. Unfortunately, to explain why it’s gone wrong, we’re going to need to think about how data masking really works. We shouldn’t fall back on the non-standard quoted column “kludge”, as that will complicate things later, e.g. when using our new function inside another function. A stronger approach is to adjust our function code in the first place, so that we don’t have to call our function in a non-standard way (why write age in some functions, but “age” in others, to refer to the same thing?).

In this section, we’ll give a bit of helpful theoretical background about data masking. We’ll then go on to look at four ways of resolving some of the difficulties that data masking can cause.

Background

The rlang page on data-masking is very helpful here in setting out a key distinction between kinds of variables that we’ve previously been using synonymously:

  • env-variables (things you create with assignment)
  • data-variables (e.g. imported data in a tibble)

For beginners, this distinction is not that important, particularly because tidyverse functions do lots of helpful blurring between these different types of variable. Note that many base R functions do often require the user to bear this distinction in mind. For instance, in base R you would specify a data variable differently from an environment variable:

mtcars$cyl      # a data variable
cyl <- c(4,6,8) # an environment variable

Whereas in the tidyverse, you can write both in the same way:

mtcars |>
  select(cyl) # specifying a data variable like an environment variable inside select
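To make the blurring concrete, here is a minimal sketch of how names are looked up inside a data-masking verb, assuming dplyr is loaded. The name threshold is invented for the example.

```r
library(dplyr)

cyl <- 4        # env-variable that shares its name with a column in mtcars
threshold <- 6  # env-variable with no matching column (hypothetical name)

# Inside filter(), `cyl` is looked up in the data first, so it means mtcars$cyl.
# `threshold` is not a column, so the lookup falls back to the environment.
six_plus <- mtcars |>
  filter(cyl >= threshold)
```

The env-variable cyl is never used here: the column of the same name masks it, which is where the term data masking comes from.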

Most of the time, data masking doesn’t cause any problems. However, when you start wanting to include tidyverse functions inside other functions - say, if you’re trying to purrr something - that blurring raises a problem. We won’t give much of an explanation as to the reasons for this, although do read this introduction to the topic and this more detailed account if you are interested in the technical aspects. Here, we’ll concentrate on four strategies for resolving these kinds of data masking problems. These strategies are:

Problem                                 Solution
data-variable in a function argument    embracing with {{var}}
env-variable in a character vector      .data[[var]] and .env[[var]] pronouns
variables in output                     injection with :=
complex cases                           quasiquotation with the injection operator !!

Embracing

Slightly confusingly, this practice is also referred to as tunneling data variables.

If you want to use a data variable in the argument of a function, you need to {{embrace}} the argument. Let’s add some {{}} to our earlier function:

column_pick_curly <- function(col){
  stranded_data |>
    select({{col}})
}

column_pick_curly(age)
# A tibble: 768 × 1
     age
   <int>
 1    50
 2    31
 3    32
 4    69
 5    33
 6    75
 7    26
 8    64
 9    53
10    63
# ℹ 758 more rows
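Embracing is not limited to select(). As a hedged sketch, the same pattern should carry over to other dplyr verbs; the function name mean_by and its arguments are hypothetical, and mtcars stands in for stranded_data so the example runs on its own.

```r
library(dplyr)

# Hypothetical helper: group by one embraced column, average another
mean_by <- function(data, group, value) {
  data |>
    group_by({{ group }}) |>
    summarise(mean_value = mean({{ value }}), .groups = "drop")
}

# Both columns are passed unquoted, just as they would be to dplyr directly
res <- mean_by(mtcars, cyl, mpg)
```

Passing the data as an argument, rather than hard-coding it inside the function, also makes the helper reusable with other tibbles.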

Pronouns

If you want to use quoted strings to select columns, use pronouns:

column_pick_pronouns <- function(col){
  stranded_data |>
    select(.data[[col]]) # tidyselect 1.2.0 onwards warns that .data inside select() is deprecated in favour of all_of(col)
}

column_pick_pronouns("age")
# A tibble: 768 × 1
     age
   <int>
 1    50
 2    31
 3    32
 4    69
 5    33
 6    75
 7    26
 8    64
 9    53
10    63
# ℹ 758 more rows
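The .data pronoun is most at home in data-masking verbs such as filter() and mutate(), while selection verbs have their own string helpers. A minimal sketch combining the two, assuming mtcars as stand-in data and a hypothetical function called pick_and_filter:

```r
library(dplyr)

# Hypothetical helper: `col` is a column name supplied as a string
pick_and_filter <- function(data, col, min_value) {
  data |>
    filter(.data[[col]] >= .env$min_value) |>  # data-masking verb: use the .data pronoun
    select(all_of(col))                        # selection verb: use all_of() for strings
}

res <- pick_and_filter(mtcars, "cyl", 6)
```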

For completeness, we can also play with the .env pronoun, which allows us to explicitly refer to an object in the environment. Let’s do something horrible, and create an age variable in the global environment:

age <- 50

column_pick_env <- function(col){
  stranded_data |>
    filter(.data[[col]] > .env[[col]]) 
}

column_pick_env("age")
# A tibble: 440 × 9
   stranded.label   age care.home.referral medicallysafe  hcop
   <chr>          <int>              <int>         <int> <int>
 1 Not Stranded      69                  1             1     0
 2 Stranded          75                  1             1     0
 3 Not Stranded      64                  0             1     1
 4 Not Stranded      53                  0             1     0
 5 Not Stranded      63                  1             0     0
 6 Not Stranded      77                  1             1     1
 7 Stranded          80                  1             1     0
 8 Not Stranded      72                  1             0     0
 9 Stranded          60                  0             1     1
10 Not Stranded      70                  1             1     0
# ℹ 430 more rows
# ℹ 4 more variables: mental_health_care <int>, periods_of_previous_care <int>,
#   admit_date <chr>, frailty_index <chr>

Okay, like so much in this session this is hardly best practice, but if you do ever need to make 100% super-safety-sure that you’re referring to unfortunately-named data/env variables, this is probably the least-worst way of working.
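To see what the pronouns buy you when names collide, here is a small sketch using mtcars. The function count_above is hypothetical, and its argument is deliberately given the same name as a column.

```r
library(dplyr)

# Hypothetical helper whose argument `cyl` clashes with the mtcars column `cyl`
count_above <- function(data, cyl) {
  # Without pronouns both sides resolve to the column, so no row can pass
  masked <- data |> filter(cyl > cyl) |> nrow()
  # With pronouns the left side is the column and the right side is the argument
  explicit <- data |> filter(.data$cyl > .env$cyl) |> nrow()
  c(masked = masked, explicit = explicit)
}

res <- count_above(mtcars, 6)
```

The unqualified version fails silently, returning zero rows rather than an error, which is what makes this kind of clash hard to spot.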

Injection

:= lets you inject variables into your output. Yes, this is what the tidyverse people really call it, and no, this isn’t a very helpful bit of terminology for the rest of us. Say you want to supply an unquoted column name, and return a renamed selection:

column_pick_rename <- function(col){
  stranded_data |>
    select("new_{{col}}" := {{col}}) 
}

column_pick_rename(age)
# A tibble: 768 × 1
   new_age
     <int>
 1      50
 2      31
 3      32
 4      69
 5      33
 6      75
 7      26
 8      64
 9      53
10      63
# ℹ 758 more rows

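There’s a bit going on here. First, using := instead of = lets the left-hand side be a string that is treated as a template rather than as a literal name. Then the new column name is built using glue() syntax inside that string: {{col}} is replaced by the name of the column supplied as the function argument. glue() is a neat replacement for base-R tools like paste0(), but you don’t need to call it yourself here: a quoted string with {{}} containing the function argument will do the work for you.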

:= is borrowed from mathematics, where it is used when defining something new, which is the apparent logic behind its use here. If you use = in this context, it won’t do what you want, because only names on the left-hand side of := are processed as glue() templates.
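The name template can mix both kinds of braces: {{ }} for an embraced function argument and single { } for an ordinary string. A minimal sketch, assuming mtcars as stand-in data; the function mean_with_suffix and its suffix argument are hypothetical.

```r
library(dplyr)

# Hypothetical helper: builds the output column name from the column and a suffix
mean_with_suffix <- function(data, col, suffix = "mean") {
  data |>
    summarise("{{ col }}_{suffix}" := mean({{ col }}))
}

res <- mean_with_suffix(mtcars, mpg)
names(res)
```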

Quasiquotation

A lot of this messing around is effectively concerned with switching between quoted- and unquoted-versions of column names. Quasiquotation is the fancy-sounding name for that messing around, and the tools required to do so:

Quasiquotation is the combination of quoting an expression while allowing immediate evaluation (unquoting) of part of that expression. (rlang quasiquotation manual page)

To give the simplest possible example, !! gives a generic way of unquoting an argument:

quoted_variable <- "age_from_var"

stranded_data |> 
  rename("{quoted_variable}" := age) |>
  select(!!quoted_variable)
# A tibble: 768 × 1
   age_from_var
          <int>
 1           50
 2           31
 3           32
 4           69
 5           33
 6           75
 7           26
 8           64
 9           53
10           63
# ℹ 758 more rows
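Injecting a bare string works in select() because selection verbs accept strings. In data-masking verbs such as summarise(), an injected string would just be a string constant, so one common approach is to convert it to a symbol first. The sketch below assumes the rlang functions sym() and syms(), which are not used elsewhere in this session, and uses mtcars as stand-in data.

```r
library(dplyr)
library(rlang)

value_col <- "mpg"              # column name held as a string
group_cols <- c("cyl", "gear")  # several column names held as strings

res <- mtcars |>
  group_by(!!!syms(group_cols)) |>  # !!! splices a list of symbols into the call
  summarise(avg = mean(!!sym(value_col)), .groups = "drop")  # !! injects one symbol
```

For a single string the .data[[value_col]] pronoun does the same job with less ceremony, so this style is probably best kept for the complex cases.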