gentle tidy eval with examples

copy-pasteable example code for programming with the tidyverse.

rstats
tidyverse
notes to self
Published

August 7, 2017

I’ve been using the tidy eval framework introduced with dplyr 0.7 for about two months now, and it’s time for an update to my original post on tidy eval. My goal is not to explain tidy eval to you, but rather to show you some simple examples that you can easily generalize from.

library(tidyverse)

starwars
# A tibble: 87 × 14
   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
 1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
 2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
 3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
 4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
 5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
 6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
 7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
 8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
 9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
# … with 77 more rows, 4 more variables: species <chr>, films <list>,
#   vehicles <list>, starships <list>, and abbreviated variable names
#   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

Using strings to refer to column names

To refer to columns in a data frame with strings, we need to convert those strings into symbol objects with rlang::sym and rlang::syms. We then use the created symbol objects in dplyr functions with the prefixes !! and !!!. This is because dplyr verbs expect input that looks like code. Using the sym/syms functions we can convert strings into objects that look like code.

mass <- rlang::sym("mass")                        # create a single symbol
groups <- rlang::syms(c("homeworld", "species"))  # create a list of symbols

starwars %>%
  group_by(!!!groups) %>%               # use list of symbols with !!!
  summarize(avg_mass = mean(!!mass))    # use single symbol with !!
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species   avg_mass
   <chr>          <chr>        <dbl>
 1 Alderaan       Human         NA  
 2 Aleen Minor    Aleena        15  
 3 Bespin         Human         79  
 4 Bestine IV     Human        110  
 5 Cato Neimoidia Neimodian     90  
 6 Cerea          Cerean        82  
 7 Champala       Chagrian      NA  
 8 Chandrila      Human         NA  
 9 Concord Dawn   Human         79  
10 Corellia       Human         78.5
# … with 48 more rows
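To see what "objects that look like code" means, it can help to inspect the symbols outside of any dplyr verb. The following is a small sketch that uses rlang::expr to show the code that results from unquoting. The names built_mean and built_group, and the placeholder df inside the second expression, are only for illustration and are never evaluated.

```r
library(rlang)

mass <- sym("mass")                         # a single symbol
groups <- syms(c("homeworld", "species"))   # a list of two symbols

is_symbol(mass)    # the string has become a symbol
is.list(groups)    # syms() returns a plain list of symbols

# expr() builds code without running it, so we can look at the result
built_mean <- expr(mean(!!mass))               # same code as mean(mass)
built_group <- expr(group_by(df, !!!groups))   # same code as group_by(df, homeworld, species)

print(built_mean)
print(built_group)
```

Printing the two expressions shows that !! inserts one symbol and !!! splices each element of the list in as a separate argument.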

The usage mass <- rlang::sym("mass") is Hadley approved, and I believe it is also the current tidyverse code style standard. We use rlang::sym and rlang::syms identically inside functions.

summarize_by <- function(df, groups, to_summarize) {
  df %>%
    group_by(!!!rlang::syms(groups)) %>%
    summarize(summarized_mean = mean(!!rlang::sym(to_summarize)))
}

summarize_by(starwars, c("homeworld", "species"), "mass")
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species   summarized_mean
   <chr>          <chr>               <dbl>
 1 Alderaan       Human                NA  
 2 Aleen Minor    Aleena               15  
 3 Bespin         Human                79  
 4 Bestine IV     Human               110  
 5 Cato Neimoidia Neimodian            90  
 6 Cerea          Cerean               82  
 7 Champala       Chagrian             NA  
 8 Chandrila      Human                NA  
 9 Concord Dawn   Human                79  
10 Corellia       Human                78.5
# … with 48 more rows
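One payoff of the string interface is that column names can be computed at run time. As a rough sketch, here is the same function applied to several columns in a loop. The toy data frame and the names cols and results are made up for this example and are not part of starwars.

```r
library(dplyr)

# Same definition as above, repeated so the example runs on its own
summarize_by <- function(df, groups, to_summarize) {
  df %>%
    group_by(!!!rlang::syms(groups)) %>%
    summarize(summarized_mean = mean(!!rlang::sym(to_summarize)))
}

# `toy` is a hypothetical data frame used only for illustration
toy <- tibble(
  planet = c("A", "A", "B"),
  height = c(100, 200, 150),
  mass   = c(10, 30, 50)
)

# Column names arrive as plain strings, so we can iterate over them
cols <- c("height", "mass")
results <- lapply(cols, function(col) summarize_by(toy, "planet", col))
names(results) <- cols

results$mass
```

Each element of results is a two-row tibble with one mean per planet.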

Details about unquoting

!! and !!! are syntactic sugar on top of the functions UQ() and UQS(), respectively. It used to be that !! and !!! had low operator precedence, meaning that in terms of PEMDAS they came pretty much last, so a comparison like the one below had to be written with extra parentheses as (!!homeworld) == "Alderaan". But now we can use them more intuitively:

homeworld <- rlang::sym("homeworld")

filter(starwars, !!homeworld == "Alderaan")
# A tibble: 3 × 14
  name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
  <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
1 Leia Organa     150    49 brown   light   brown        19 fema… femin… Aldera…
2 Bail Presto…    191    NA black   tan     brown        67 male  mascu… Aldera…
3 Raymus Anti…    188    79 brown   light   brown        NA male  mascu… Aldera…
# … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
#   ³​eye_color, ⁴​birth_year, ⁵​homeworld
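To check how an unquoted expression is actually read, we can build it with rlang::expr instead of running it. This is a minimal sketch assuming rlang 0.2.0 or later, where !! binds as tightly as a unary minus. The name parsed is just for illustration.

```r
library(rlang)

homeworld <- sym("homeworld")

# expr() performs the unquoting but does not evaluate the result
parsed <- expr(!!homeworld == "Alderaan")

print(parsed)   # only `homeworld` was unquoted, not the whole comparison
```

The result is the comparison homeworld == "Alderaan", which is exactly what filter receives.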

We can also use UQ and UQS directly to be explicit about what we’re unquoting, although rlang has since deprecated both in favor of !! and !!!.

filter(starwars, UQ(homeworld) == "Alderaan")
# A tibble: 3 × 14
  name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
  <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
1 Leia Organa     150    49 brown   light   brown        19 fema… femin… Aldera…
2 Bail Presto…    191    NA black   tan     brown        67 male  mascu… Aldera…
3 Raymus Anti…    188    79 brown   light   brown        NA male  mascu… Aldera…
# … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
#   ³​eye_color, ⁴​birth_year, ⁵​homeworld

Creating non-standard functions

Sometimes it is nice to write functions that accept non-standard inputs, like dplyr verbs. For example, we might want to write a function with the same effect as:

starwars %>% 
  group_by(homeworld, species) %>% 
  summarize(avg_mass = mean(mass))
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species   avg_mass
   <chr>          <chr>        <dbl>
 1 Alderaan       Human         NA  
 2 Aleen Minor    Aleena        15  
 3 Bespin         Human         79  
 4 Bestine IV     Human        110  
 5 Cato Neimoidia Neimodian     90  
 6 Cerea          Cerean        82  
 7 Champala       Chagrian      NA  
 8 Chandrila      Human         NA  
 9 Concord Dawn   Human         79  
10 Corellia       Human         78.5
# … with 48 more rows

To do this, we need to capture our input in quosures with quo and quos when programming interactively.

groups <- quos(homeworld, species)   # capture a list of variables as raw input
mass <- quo(mass)                    # capture a single variable as raw input

starwars %>% 
  group_by(!!!groups) %>%            # use !!! to access variables from `quos`
  summarize(avg_mass = mean(!!mass)) # use !! to access the variable in `quo`
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species   avg_mass
   <chr>          <chr>        <dbl>
 1 Alderaan       Human         NA  
 2 Aleen Minor    Aleena        15  
 3 Bespin         Human         79  
 4 Bestine IV     Human        110  
 5 Cato Neimoidia Neimodian     90  
 6 Cerea          Cerean        82  
 7 Champala       Chagrian      NA  
 8 Chandrila      Human         NA  
 9 Concord Dawn   Human         79  
10 Corellia       Human         78.5
# … with 48 more rows

There’s some nice symmetry here in that we unwrap both rlang::sym and quo with !! and both rlang::syms and quos with !!!.
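The difference between the two pairs is what gets stored. A symbol is just a name, while a quosure holds an expression together with the environment it was captured in. Here is a short sketch using rlang's quosure helpers. The names mass_quo and groups_quos are only for illustration.

```r
library(rlang)

mass_quo <- quo(mass)                      # one quosure
groups_quos <- quos(homeworld, species)    # a list of quosures

is_quosure(mass_quo)        # a quosure, not a bare symbol
quo_get_expr(mass_quo)      # the captured expression: the symbol `mass`
quo_get_env(mass_quo)       # the environment where quo() was called

length(groups_quos)                 # one quosure per captured argument
quo_get_expr(groups_quos[[2]])      # the symbol `species`
```

Because the environment travels with the expression, a quosure can still find the right variables after being passed into another function.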

We might be interested in using this behavior in a function. To do this we replace calls to quo with calls to enquo.

summarize_by <- function(df, to_summarize, ...) {

  to_summarize <- enquo(to_summarize)  # enquo captures a single argument
  groups <- quos(...)                  # quos captures multiple arguments

  df %>%
    group_by(!!!groups) %>%                 # unwrap quos with !!!
    summarize(summ = sum(!!to_summarize))   # unwrap enquo with !!
}

Now our function accepts non-standard input, just like a dplyr verb. Note that quos can capture an arbitrary number of arguments, like we have here. So both of the following calls are valid:

summarize_by(starwars, mass, homeworld)
# A tibble: 49 × 2
   homeworld       summ
   <chr>          <dbl>
 1 Alderaan          NA
 2 Aleen Minor       15
 3 Bespin            79
 4 Bestine IV       110
 5 Cato Neimoidia    90
 6 Cerea             82
 7 Champala          NA
 8 Chandrila         NA
 9 Concord Dawn      79
10 Corellia         157
# … with 39 more rows

summarize_by(starwars, mass, homeworld, species)
# A tibble: 58 × 3
# Groups:   homeworld [49]
   homeworld      species    summ
   <chr>          <chr>     <dbl>
 1 Alderaan       Human        NA
 2 Aleen Minor    Aleena       15
 3 Bespin         Human        79
 4 Bestine IV     Human       110
 5 Cato Neimoidia Neimodian    90
 6 Cerea          Cerean       82
 7 Champala       Chagrian     NA
 8 Chandrila      Human        NA
 9 Concord Dawn   Human        79
10 Corellia       Human       157
# … with 48 more rows
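Because enquo captures an expression rather than a column name, the argument does not have to be a bare column. As a small sketch, here is the function called with a computed expression on a made-up data frame. The toy data frame and the name result are hypothetical and used only for illustration.

```r
library(dplyr)

# Same definition as above, repeated so the example runs on its own
summarize_by <- function(df, to_summarize, ...) {
  to_summarize <- enquo(to_summarize)  # capture a single argument
  groups <- quos(...)                  # capture any number of grouping arguments

  df %>%
    group_by(!!!groups) %>%
    summarize(summ = sum(!!to_summarize))
}

# `toy` is a hypothetical data frame used only for illustration
toy <- tibble(
  planet = c("A", "A", "B"),
  mass   = c(1, 2, 10)
)

# The second argument is an expression, evaluated inside summarize()
result <- summarize_by(toy, mass * 2, planet)
result
```

Here summ is the sum of the doubled masses within each planet, so the function behaves like summarize itself with respect to what it accepts.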

For more details, see the programming with dplyr vignette.