gentle tidy eval with examples

copy-pasteable example code for programming with the tidyverse.

rstats
tidyverse
notes to self
Author

Alex Hayes

Published

August 7, 2017

I’ve been using the tidy eval framework introduced with dplyr 0.7 for about two months now, and it’s time for an update to my original post on tidy eval. My goal is not to explain tidy eval to you, but rather to show you some simple examples that you can easily generalize from.

library(tidyverse)

starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Using strings to refer to column names

dplyr verbs expect input that looks like code, not strings. To refer to columns in a data frame with strings, we first convert those strings into symbol objects with rlang::sym (for one string) and rlang::syms (for several strings). We then unquote the symbols inside dplyr verbs with the prefixes !! (for a single symbol) and !!! (for a list of symbols).

mass <- rlang::sym("mass")                        # create a single symbol
groups <- rlang::syms(c("homeworld", "species"))  # create a list of symbols

starwars %>%
  group_by(!!!groups) %>%               # use list of symbols with !!!
  summarize(avg_mass = mean(!!mass))    # use single symbol with !!
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species   avg_mass
   <chr>          <chr>        <dbl>
 1 Alderaan       Human         NA  
 2 Aleen Minor    Aleena        15  
 3 Bespin         Human         79  
 4 Bestine IV     Human        110  
 5 Cato Neimoidia Neimodian     90  
 6 Cerea          Cerean        82  
 7 Champala       Chagrian      NA  
 8 Chandrila      Human         NA  
 9 Concord Dawn   Human         79  
10 Corellia       Human         78.5
# … with 48 more rows
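
To see what sym and syms actually hand to dplyr, it can help to inspect the objects directly. This is only a sketch, assuming rlang is installed; the names mass_sym, group_syms, built_summary and built_groups are made up for illustration, and rlang::expr is used purely to show the code that unquoting builds, without running it against a data frame.

```r
library(rlang)

mass_sym <- sym("mass")                         # a single symbol
group_syms <- syms(c("homeworld", "species"))   # a list of two symbols

is.symbol(mass_sym)    # TRUE: a bare name, not a string
is.list(group_syms)    # TRUE
length(group_syms)     # 2

# expr() performs the unquoting and returns the resulting code unevaluated
built_summary <- expr(summarize(avg_mass = mean(!!mass_sym)))
built_groups <- expr(group_by(!!!group_syms))

built_summary   # summarize(avg_mass = mean(mass))
built_groups    # group_by(homeworld, species)
```

The printed calls are the same code we would have typed by hand, which is why dplyr treats the unquoted symbols as ordinary column references.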

The usage mass <- rlang::sym("mass") is Hadley approved, and I believe it is also the current tidyverse code style standard. We use rlang::sym and rlang::syms identically inside functions.

summarize_by <- function(df, groups, to_summarize) {
  df %>%
    group_by(!!!rlang::syms(groups)) %>%
    summarize(summarized_mean = mean(!!rlang::sym(to_summarize)))
}

summarize_by(starwars, c("homeworld", "species"), "mass")
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species   summarized_mean
   <chr>          <chr>               <dbl>
 1 Alderaan       Human                NA  
 2 Aleen Minor    Aleena               15  
 3 Bespin         Human                79  
 4 Bestine IV     Human               110  
 5 Cato Neimoidia Neimodian            90  
 6 Cerea          Cerean               82  
 7 Champala       Chagrian             NA  
 8 Chandrila      Human                NA  
 9 Concord Dawn   Human                79  
10 Corellia       Human                78.5
# … with 48 more rows
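
As a point of comparison, here is a sketch of the same string-based function written without sym and syms. It swaps in a different technique: the .data pronoun for the summarized column and across(all_of()) for the grouping columns. It assumes dplyr 1.0.0 or later; summarize_by_strings and the toy data frame are hypothetical names invented for this example.

```r
library(dplyr)

# Hypothetical variant of summarize_by(); assumes dplyr >= 1.0.0 for across()
summarize_by_strings <- function(df, groups, to_summarize) {
  df %>%
    group_by(across(all_of(groups))) %>%   # pick grouping columns by name
    summarize(summarized_mean = mean(.data[[to_summarize]]))  # look up one column by name
}

# Small made-up data set so the result is easy to check by hand
toy <- tibble(
  planet  = c("A", "A", "B"),
  species = c("x", "x", "y"),
  mass    = c(1, 3, 5)
)

result <- summarize_by_strings(toy, c("planet", "species"), "mass")
result
```

On the toy data this gives one row per planet and species pair, with means 2 and 5. Either style works with strings; the sym/syms version above stays closer to the rest of this post.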

Details about unquoting

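!! and !!! are syntactic sugar on top of the functions UQ() and UQS(), respectively. It used to be that !! and !!! had low operator precedence, meaning that in terms of PEMDAS they came pretty much last. An expression like !!homeworld == "Alderaan" was therefore read as !!(homeworld == "Alderaan"), and you had to write (!!homeworld) == "Alderaan" instead. But now we can use them more intuitively: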

homeworld <- rlang::sym("homeworld")

filter(starwars, !!homeworld == "Alderaan")
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Bail Pre…    191    NA black      tan        brown             67 male  mascu…
3 Raymus A…    188    79 brown      light      brown             NA male  mascu…
# … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
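
The precedence point is easier to see by comparing what base R's parser produces with what rlang builds. A minimal sketch, assuming a reasonably recent rlang; the names col, parsed and unquoted are illustrative only.

```r
library(rlang)

col <- sym("homeworld")

# Base R's parser gives `!` lower precedence than `==`,
# so the outermost call in the parsed code is `!`, not `==`.
parsed <- quote(!!col == "Alderaan")
as.character(parsed[[1]])   # "!"

# rlang's quoting functions reinterpret `!!` so that it binds tightly
# to the object that follows it, leaving `==` as the outermost call.
unquoted <- expr(!!col == "Alderaan")
unquoted                    # homeworld == "Alderaan"
```

dplyr verbs such as filter process their arguments the same way expr does here, which is why the filter call above works without extra parentheses.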

We can also use UQ and UQS directly to be explicit about what we’re unquoting.

filter(starwars, UQ(homeworld) == "Alderaan")
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Bail Pre…    191    NA black      tan        brown             67 male  mascu…
3 Raymus A…    188    79 brown      light      brown             NA male  mascu…
# … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Creating non-standard functions

Sometimes it is nice to write functions that accept non-standard inputs, like dplyr verbs. For example, we might want to write a function with the same effect as

starwars %>% 
  group_by(homeworld, species) %>% 
  summarize(avg_mass = mean(mass))
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species   avg_mass
   <chr>          <chr>        <dbl>
 1 Alderaan       Human         NA  
 2 Aleen Minor    Aleena        15  
 3 Bespin         Human         79  
 4 Bestine IV     Human        110  
 5 Cato Neimoidia Neimodian     90  
 6 Cerea          Cerean        82  
 7 Champala       Chagrian      NA  
 8 Chandrila      Human         NA  
 9 Concord Dawn   Human         79  
10 Corellia       Human         78.5
# … with 48 more rows

To do this we need to capture our input in quosures with quo and quos when programming interactively.

groups <- quos(homeworld, species)   # capture a list of variables as raw input
mass <- quo(mass)                    # capture a single variable as raw input

starwars %>% 
  group_by(!!!groups) %>%            # use !!! to access variables from `quos`
  summarize(avg_mass = mean(!!mass)) # use !! to access the variable in `quo`
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species   avg_mass
   <chr>          <chr>        <dbl>
 1 Alderaan       Human         NA  
 2 Aleen Minor    Aleena        15  
 3 Bespin         Human         79  
 4 Bestine IV     Human        110  
 5 Cato Neimoidia Neimodian     90  
 6 Cerea          Cerean        82  
 7 Champala       Chagrian      NA  
 8 Chandrila      Human         NA  
 9 Concord Dawn   Human         79  
10 Corellia       Human         78.5
# … with 48 more rows
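
If you are curious what quo and quos return, the objects can be inspected with a few rlang helpers. This is a sketch only, assuming rlang is installed; mass_quo and group_quos are illustrative names.

```r
library(rlang)

mass_quo <- quo(mass)                    # one quosure
group_quos <- quos(homeworld, species)   # a list of quosures

is_quosure(mass_quo)       # TRUE
quo_get_expr(mass_quo)     # the captured code: mass
is.list(group_quos)        # TRUE
length(group_quos)         # 2

# Unlike a bare symbol, a quosure also remembers the environment
# in which it was created.
is.environment(quo_get_env(mass_quo))   # TRUE
```

The stored environment is the main difference from the symbols created by rlang::sym: it lets dplyr look up any non-column variables in the place where the code was originally written.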

There’s some nice symmetry here in that we unwrap both rlang::sym and quo with !! and both rlang::syms and quos with !!!.

We might be interested in using this behavior in a function. To do this we replace calls to quo with calls to enquo.

summarize_by <- function(df, to_summarize, ...) {

  to_summarize <- enquo(to_summarize)  # enquo captures a single argument
  groups <- quos(...)                  # quos captures multiple arguments

  df %>%
    group_by(!!!groups) %>%                 # unwrap quos with !!!
    summarize(summ = sum(!!to_summarize))   # unwrap enquo with !!
}

Now our function accepts non-standard input, just like a dplyr verb. Note that quos can capture an arbitrary number of arguments passed through ..., like we have here, so both of the following calls are valid:

summarize_by(starwars, mass, homeworld)
# A tibble: 49 × 2
   homeworld       summ
   <chr>          <dbl>
 1 Alderaan          NA
 2 Aleen Minor       15
 3 Bespin            79
 4 Bestine IV       110
 5 Cato Neimoidia    90
 6 Cerea             82
 7 Champala          NA
 8 Chandrila         NA
 9 Concord Dawn      79
10 Corellia         157
# … with 39 more rows

summarize_by(starwars, mass, homeworld, species)
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species    summ
   <chr>          <chr>     <dbl>
 1 Alderaan       Human        NA
 2 Aleen Minor    Aleena       15
 3 Bespin         Human        79
 4 Bestine IV     Human       110
 5 Cato Neimoidia Neimodian    90
 6 Cerea          Cerean       82
 7 Champala       Chagrian     NA
 8 Chandrila      Human        NA
 9 Concord Dawn   Human        79
10 Corellia       Human       157
# … with 48 more rows
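
Because enquo captures code rather than a column name, the same function should also accept whole expressions, and it works with no grouping variables at all. The sketch below repeats the definition of summarize_by so it runs on its own; the toy data frame and its columns (planet, kg, cm) are made up so the results are easy to verify by hand.

```r
library(dplyr)
library(rlang)

summarize_by <- function(df, to_summarize, ...) {
  to_summarize <- enquo(to_summarize)  # capture a single argument
  groups <- quos(...)                  # capture zero or more arguments

  df %>%
    group_by(!!!groups) %>%
    summarize(summ = sum(!!to_summarize))
}

# Hypothetical data for illustration
toy <- tibble(
  planet = c("A", "A", "B"),
  kg     = c(10, 20, 30),
  cm     = c(100, 200, 300)
)

by_planet <- summarize_by(toy, kg, planet)       # a bare column name
ratio     <- summarize_by(toy, kg / cm, planet)  # an arbitrary expression
overall   <- summarize_by(toy, kg)               # no grouping variables

by_planet
ratio
overall
```

With no grouping arguments, quos(...) returns an empty list, group_by(!!!groups) leaves the data ungrouped, and the result is a single summary row.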

For more details, see the programming with dplyr vignette.