Reproducible preprocessing with recipes

class: center, middle, title-slide

# Reproducible preprocessing with recipes
## Happy Scientist Seminar
### Emil Hvitfeldt
### 2019-4-27

---

```
## Loading required package: dplyr
```

```
## 
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

```
## 
## Attaching package: 'recipes'
```

```
## The following object is masked from 'package:stats':
## 
##     step
```

.blue {
 color: #3381F7;
}
</style>

## What happens to the data between `read_data()` and `fit_model()`?

---

## Prices of 54,000 round cut diamonds

```r
library(ggplot2)
diamonds
```

```
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
```

---

## Formula expression in modeling

<code class ='r hljs remark-code'>model <- lm(price ~ cut:color + carat + log(depth), &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data = diamonds)</code>

- Select .orange[outcome] & .blue[predictors]

---

## Formula expression in modeling

<code class ='r hljs remark-code'>model <- lm(price ~ cut:color + carat + log(depth), &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data = diamonds)</code>

- Select outcome & predictors
- .orange[Operators] to matrix of predictors

---

## Formula expression in modeling

<code class ='r hljs remark-code'>model <- lm(price ~ cut:color + carat + log(depth), &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data = diamonds)</code>

- Select outcome & predictors
- Operators to matrix of predictors
- .orange[Inline functions]

---

## Work under the hood - model.matrix

```r
model.matrix(price ~ cut:color + carat + log(depth) + table, 
             data = diamonds)
```

```
## Rows: 53,940
## Columns: 39
## $ `(Intercept)` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26,…
## $ `log(depth)` <dbl> 4.119037, 4.091006, 4.041295, 4.133565, 4.14788…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56,…
## $ `cutFair:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `cutGood:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `cutVery Good:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `cutPremium:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `cutIdeal:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `cutFair:colorE` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ `cutGood:colorE` <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
...
```
 
---

## Downsides

- **Tidious typing with many variables**

---

## Downsides

- Tedious typing with many variables
- **Functions have to manually be applied to each variable**

```r
lm(y ~ log(x01) + log(x02) + log(x03) + log(x04) + log(x05) + log(x06) + log(x07) +
       log(x08) + log(x09) + log(x10) + log(x11) + log(x12) + log(x13) + log(x14) + 
       log(x15) + log(x16) + log(x17) + log(x18) + log(x19) + log(x20) + log(x21) + 
       log(x22) + log(x23) + log(x24) + log(x25) + log(x26) + log(x27) + log(x28) + 
       log(x29) + log(x30) + log(x31) + log(x32) + log(x33) + log(x34) + log(x35),
   data = dat)
```

---

## Downsides

- Tedious typing with many variables
- Functions have to manually be applied to each variable
- **Operations are constrained to single columns**

```r
# Not possible
lm(y ~ pca(x01, x02, x03, x04, x05), data = dat)
```

---

## Downsides

- Tedious typing with many variables
- Functions have to manually be applied to each variable
- Operations are constrained to single columns
- **Everything happens at once**

You can't apply multiple transformations to the same variable.

---

## Downsides

- Tedious typing with many variables
- Functions have to manually be applied to each variable
- Operations are constrained to single columns
- Everything happens at once
- **Connected to the model, calculations are not saved between models**

One could manually use `model.matrix` and pass the result to the modeling function.

---

.center[
![:scale 45%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/recipes.png)
]

---

# Recipes