'.' in the Tidyverse

What does a . (the symbol for a period) mean in the Tidyverse? This post offers a rough answer, since the question has come up in class recently. I am not sure that it is 100% correct. Ideally, I will find a thorough explanation to which I can point students in the future.

Summary: In the Tidyverse, a . refers to the object passed forward from the most recent pipe. Within a map function which uses an anonymous function, however, a . refers to each element of the .x argument to the map function. These are different things!

In Pipes

After a pipe, i.e., after %>%, the . refers to the object which passes out of the last pipe. Example:

suppressPackageStartupMessages(library(tidyverse))

mtcars %>% 
  nrow(x = .)
## [1] 32

The . refers to whatever came out of %>%, which is just the mtcars data frame that was fed into it. Experienced users will know that you don’t really need the . here. This simpler code works the same.

mtcars %>% 
  nrow()
## [1] 32

The reason this works is that the pipe operator automagically passes its result as the first argument to the function which follows. (This description isn’t exactly right.) See the help page for %>% for details. Key section:

Description

Pipe an object forward into a function or call expression.

Usage

lhs %>% rhs

Arguments

lhs – A value or the magrittr placeholder.
rhs – A function call using the magrittr semantics.

Details

Using %>% with unary function calls
When functions require only one argument, x %>% f is equivalent to f(x) (not exactly equivalent; see technical note below.)
Placing lhs as the first argument in rhs call
The default behavior of %>% when multiple arguments are required in the rhs call, is to place lhs as the first argument, i.e. x %>% f(y) is equivalent to f(x, y).
Placing lhs elsewhere in rhs call
Often you will want lhs to the rhs call at another position than the first. For this purpose you can use the dot (.) as placeholder. For example, y %>% f(x, .) is equivalent to f(x, y) and z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z).
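
The placement rules quoted above can be checked with a small sketch. The function show_args() is a made-up helper, not part of magrittr or any other package; it only reports which value ended up in which argument.

```r
library(magrittr)

# Hypothetical helper (not from any package): reports where each value landed.
show_args <- function(a, b) {
  paste0("a=", a, ", b=", b)
}

# No dot: the left-hand side becomes the first argument.
first_arg <- "x" %>% show_args("y")              # like show_args("x", "y")

# Dot as a direct argument: the left-hand side goes where the dot is.
second_arg <- "x" %>% show_args("y", .)          # like show_args("y", "x")

# The dot also works with a named argument.
named_arg <- "x" %>% show_args(a = "y", b = .)   # like show_args(a = "y", b = "x")
```

Printing first_arg, second_arg and named_arg shows that the value from the pipe moves from a to b once the dot is used.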

Most of the functions in the Tidyverse are built to take advantage of this behavior. The first argument of Tidyverse-compatible functions is almost always a tibble. So, everything “just works.” Unfortunately, this is not true for older functions.

mtcars %>% 
  lm(mpg ~ cyl)
## Error in as.data.frame.default(data): cannot coerce class '"formula"' to a data.frame

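As the help page for lm() shows, its first argument is formula, not data. The pipe therefore passes mtcars as formula, which pushes mpg ~ cyl into the second argument, data. The error message reports the result: lm() cannot turn that formula into a data frame. The solution is to use the . explicitly.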

mtcars %>% 
  lm(mpg ~ cyl, data = .)
## 
## Call:
## lm(formula = mpg ~ cyl, data = .)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876

In other words, the . refers to the output of the pipe, which is mtcars in this case.
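
As a quick check, the piped call with data = . should give the same fit as the ordinary call with data = mtcars. The object names fit_piped, fit_plain and no_dot below are only for illustration.

```r
library(magrittr)

# The dot puts mtcars into the data argument instead of the first argument.
fit_piped <- mtcars %>% lm(mpg ~ cyl, data = .)

# The same model written without a pipe.
fit_plain <- lm(mpg ~ cyl, data = mtcars)

# TRUE if both calls estimate the same coefficients.
same_fit <- isTRUE(all.equal(coef(fit_piped), coef(fit_plain)))

# Without the dot the call fails; keep the error message instead of stopping.
no_dot <- tryCatch(
  mtcars %>% lm(mpg ~ cyl),
  error = function(e) conditionMessage(e)
)
```

Here same_fit should be TRUE, and no_dot holds the text of the error shown earlier.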

In Map Functions Using Anonymous Functions

If that were the only common use of . in the Tidyverse, we might be OK. However, . is commonly (?) used in a completely different way within map functions which define their own anonymous functions. Before we look at an example, recall the use of nest().

mtcars %>% 
  group_by(cyl) %>% 
  nest()
## # A tibble: 3 x 2
## # Groups:   cyl [3]
##     cyl data              
##   <dbl> <list>            
## 1     6 <tibble [7 × 10]> 
## 2     4 <tibble [11 × 10]>
## 3     8 <tibble [14 × 10]>

This is a very common idiom in data science, especially when we want to create the same statistical model within each level of a category in the data. A map function is the next step.

mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  mutate(obs = map_int(data, ~ nrow(.)))
## # A tibble: 3 x 3
## # Groups:   cyl [3]
##     cyl data                 obs
##   <dbl> <list>             <int>
## 1     6 <tibble [7 × 10]>      7
## 2     4 <tibble [11 × 10]>    11
## 3     8 <tibble [14 × 10]>    14

You see the difficulty? If . still referred, as above, to the result out of the last pipe, then obs would be 3 for each level of cyl, since there are 3 rows in the tibble which is output by nest(). Instead, because the . is used within an anonymous function, it refers to each element of the data column in turn, i.e., to the nested tibble in each row. And that is why it, correctly, provides a different value for obs in each row.

To unpack this, let’s consider three different ways we might perform the same calculation:

mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  mutate(obs_1 = map_int(data, ~ nrow(.)),
         obs_2 = map_int(data, nrow),
         obs_3 = map_int(data, function(z){nrow(z)}))
## # A tibble: 3 x 5
## # Groups:   cyl [3]
##     cyl data               obs_1 obs_2 obs_3
##   <dbl> <list>             <int> <int> <int>
## 1     6 <tibble [7 × 10]>      7     7     7
## 2     4 <tibble [11 × 10]>    11    11    11
## 3     8 <tibble [14 × 10]>    14    14    14

Following this discussion in R4DS, we see that there are 3 different ways to use/create a function in this context. (Recall that the second argument to all map functions is .f, which must be a function.) The first, and most common, is an anonymous function written with a ~, which purrr converts into a function for us. The second is the name of an existing function. That works, but only if there is already a function which does exactly what we want. (And note that you just give the name, without the parentheses.) The third option is a full function definition which we have constructed on the fly. It is verbose and annoying to type. In every case, map_int(), like all other map functions, applies the .f function to each element of the .x argument which, in this case, is the data column.
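
The same three styles can be tried on a plain list, without nest() or mutate(), which may make it easier to see that the function is applied to each element of .x. This sketch uses base R's split() to build the list; the name pieces is just for illustration.

```r
library(purrr)

# A plain named list of data frames, one for each value of cyl.
pieces <- split(mtcars, mtcars$cyl)

# 1. Anonymous function written with ~; the . is each element of pieces.
n_formula <- map_int(pieces, ~ nrow(.))

# 2. The name of an existing function, without parentheses.
n_named <- map_int(pieces, nrow)

# 3. A full function definition; z is each element of pieces.
n_function <- map_int(pieces, function(z) { nrow(z) })
```

All three results are the same named integer vector: 11 rows for cyl 4, 7 for cyl 6 and 14 for cyl 8.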

What is perhaps most confusing to students is why these versions don’t work:

mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  mutate(obs_1 = map_int(data, ~ nrow()))
## Error in nrow(): argument "x" is missing, with no default

As the error message explains, nrow() requires an argument. We don’t provide one, so it fails. One might hope that map_int() would just “figure out” that we want to send each element of data into nrow(), but that does not happen automagically.

mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  mutate(obs_1 = map_int(data, nrow(.)))
## Error: Result 1 must be a single integer, not a double vector of length 11

The above fails for reasons which are unclear to me. Where does the 11 come from?
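
One plausible answer, offered as a reading of the error rather than a confirmed diagnosis: without the ~, nrow(.) is not a function at all. It is evaluated right away, and its . is the pipe's placeholder, so it returns 3, the number of rows in the nested tibble. purrr treats a number given as .f as a request to extract the element at that position, so map_int() would pull the third column out of each nested tibble. If dplyr works through the groups in sorted order (an assumption), the first result comes from the cyl 4 group, whose columns hold 11 doubles. The sketch below checks these pieces one at a time; the names nested, n_rows and third_cols are made up for illustration.

```r
suppressPackageStartupMessages(library(tidyverse))

nested <- mtcars %>%
  group_by(cyl) %>%
  nest()

# After a pipe, the . is the whole nested tibble, so nrow(.) is 3.
n_rows <- nested %>% nrow(x = .)

# A number used as .f extracts by position: here, the third column
# of each nested tibble.
third_cols <- map(nested$data, 3)

# Each extracted column is a double vector with one value per car.
col_lengths <- lengths(third_cols)
col_is_double <- map_lgl(third_cols, is.double)
```

The values in col_lengths are 7, 11 and 14, so one of the extracted columns is indeed a double vector of length 11.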

Let’s look more closely at the anonymous function example. The . is just the name of its first argument. The job of map_int() is to take the list which is data and iterate through it, applying the anonymous function, a function which takes one argument, to each piece. In other words, once we have defined the anonymous function, the . belongs to that function alone. It no longer has anything to do with the pipe, which is why it does not refer to the incoming data frame from the pipe.
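
To see that the ~ version really is an ordinary function, we can build it by hand with purrr's as_mapper(), which converts a formula into a function. The name count_rows is made up for illustration.

```r
library(purrr)

# Convert the formula into a function, as the map functions do for .f.
count_rows <- as_mapper(~ nrow(.))

# It is a regular function ...
made_a_function <- is.function(count_rows)

# ... and the . is whatever we pass as its first argument.
rows_dot <- count_rows(mtcars)

# .x is another name for the same first argument.
rows_dot_x <- as_mapper(~ nrow(.x))(mtcars)
```

Both rows_dot and rows_dot_x are 32, the same answer as nrow(mtcars). No pipe is involved anywhere, so this . cannot be the pipe's placeholder.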

Lesson

This sure is confusing, especially for students new to R. Perhaps the best plan is to only teach the full function approach? Although it is more verbose, it is also clearer. Also, perhaps we should always define the function outside of the map function call. Time to rewrite the Primer?
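
As a sketch of what defining the function outside of the map function call might look like (the name count_rows is hypothetical and not taken from the Primer):

```r
suppressPackageStartupMessages(library(tidyverse))

# A named function, defined once, with an explicit argument name.
count_rows <- function(df) {
  nrow(df)
}

# No . anywhere inside map_int(): just the name of the function.
result <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(obs = map_int(data, count_rows))
```

The obs column in result matches the earlier examples: 7, 11 and 14 rows for 6, 4 and 8 cylinders.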

Sources

These Stack Overflow answers (1 and 2) are useful.

David Kane
Preceptor in Statistical Methods and Mathematics