4 min read

Very basic use of tidyverse in R packages

tl;dr

90% of my usecases when putting a function that uses tidyverse with non-standard evaluation (NSE) into an R package can be resolved by importFrom rlang .data. In these cases, I’m only looking for a way to be able to use tidyverse internally in the package, and have it pass the R package check, rather than allowing the user to supply function arguments in the NSE form.

Why bother?

I don’t think there is any reason to specifically use tidyverse (dplyr, tidyr, ggplot2 and so on) in R packages, and I’d prefer base R solutions. However, if one already uses tidyverse on a daily basis and has created a bunch of functions that she wants to make into a package, this seems a reasonable usecase.

However, I can’t use tidyverse functions in a package the same way as I use them interactively. Otherwise when I run package check (devtools::check() or R CMD check), it will complain because I have used NSE in my functions. For example, this function:

#' Using non-standard evaluation in a function
#' Results in NOTE: no visible binding for global variable "state"
#'
#' @param df A tibble
#' @return A tibble
#' @importFrom dplyr filter
#' @export
filter_nse <- function(df) {
  filter(df, state == "treated")
}

when included in a dummy package pkgsandbox, leads to the following NOTE during package check.

checking R code for possible problems ... NOTE
filter_nse: no visible binding for global variable ‘state’
Undefined global functions or variables:
  state

.data

To suppress it, to my knowledge the way to go as of now (end of 2018) is to use .data from rlang. The function looks like this:

#' Using .data in a function, existing column
#'
#' @param df A tibble
#' @return A tibble
#' @importFrom dplyr filter
#' @importFrom rlang .data
#' @export
filter_dotdata <- function(df) {
  filter(df, .data$state == "treated")
}

Note that our filter_* functions are only intended to work with data Puromycin because of the specific column state:

head(Puromycin, n=4)
##   conc rate   state
## 1 0.02   76 treated
## 2 0.02   47 treated
## 3 0.06   97 treated
## 4 0.06  107 treated
table(Puromycin$state)
## 
##   treated untreated 
##        12        11

As expected, this works:

res <- pkgsandbox::filter_dotdata(Puromycin)
table(res$state)
## 
##   treated untreated 
##        12         0

Alternatives and problems

!! + sym

After reading the “Programming with dplyr” tutorial, I had initially assumed I should always be using the !!sym("column_name") syntax. However it’s not necessarily the case when "column_name" is internal to the function, i.e. not supplied to the function as an argument. Consider this function:

#' Using !! sym in a function, existing column
#'
#' @param df A tibble
#' @return A tibble
#' @importFrom dplyr filter
#' @importFrom rlang !! sym
#' @export
filter_bangbangsym <- function(df) {
  filter(df, !!sym("state") == "treated")
}

It works fine when df contains the state column:

res <- pkgsandbox::filter_bangbangsym(Puromycin)
table(res$state)
## 
##   treated untreated 
##        12         0

However, the problem comes when the input data doesn’t have column state, but for some reason a variable state is available in the global environment. We expect this to fail because iris doesn’t have a state column:

state <- "North Carolina"
pkgsandbox::filter_dotdata(iris)
## Column `state` not found in `.data`

But the following doesn’t fail! And the result I get is dependent on the value of state in my global environment. This can be dangerous when the function and environment gets more complicated.

state <- "North Carolina"
res_wrong_iris1 <- pkgsandbox::filter_bangbangsym(iris)
nrow(res_wrong_iris1) # 0 because "North Carolina" != "treated"
## [1] 0
state <- "treated"
res_wrong_iris2 <- pkgsandbox::filter_bangbangsym(iris)
nrow(res_wrong_iris2) # this didn't apply any filtering  because "treated" == "treated"
## [1] 150

So indeed, these results silently depend on what’s there in the evironment where the function gets called. This is obviously not ideal, and the same problem happens when we ignore the package check note and use NSE in our package:

state <- "treated"
res_wrong_iris3 <- pkgsandbox::filter_nse(iris)
nrow(res_wrong_iris3)
## [1] 150

filter_

The other alternative is to use the underscore verbs, in this case, filter_. Both of the following works and passes package check: filter_(df, ~ state == "treated") and filter_(Puromycin, "state == 'treated'"). But these *_ verbs are phasing out, along with aes_string from ggplot2. So I guess these are not recommended either.