---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# workflowsets
<!-- badges: start -->
[![R-CMD-check](https://github.com/tidymodels/workflowsets/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/workflowsets/actions/workflows/R-CMD-check.yaml)
[![R-CMD-check-no-suggests](https://github.com/tidymodels/workflowsets/actions/workflows/R-CMD-check-no-suggests.yaml/badge.svg)](https://github.com/tidymodels/workflowsets/actions/workflows/R-CMD-check-no-suggests.yaml)
[![Codecov test coverage](https://codecov.io/gh/tidymodels/workflowsets/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/workflowsets?branch=main)
[![Lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html)
<!-- badges: end -->
The goal of workflowsets is to allow users to create and easily fit a large number of models. workflowsets can create a _workflow set_ that holds multiple workflow objects. These objects can be created by crossing all combinations of preprocessors (e.g., formula, recipe, etc.) and model specifications. The resulting set can then be tuned or resampled with dedicated functions.
## Installation
You can install the released version of workflowsets from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("workflowsets")
```
And the development version from [GitHub](https://github.com/) with:
``` r
install.packages("pak")
pak::pak("tidymodels/workflowsets")
```
## Example
Sometimes it is a good idea to try different types of models and preprocessing methods on a specific data set. The tidymodels framework provides tools for this purpose: [recipes](https://recipes.tidymodels.org/) for preprocessing/feature engineering and [parsnip model](https://parsnip.tidymodels.org/) specifications. The workflowsets package has functions for creating and evaluating combinations of these modeling elements.
For example, the Chicago train ridership data has many numeric predictors that are highly correlated. There are a few approaches to compensating for this issue during modeling:
1. Use a feature filter to remove redundant predictors.
2. Apply principal component analysis to decorrelate the data.
3. Use a regularized model to make the estimation process insensitive to correlated predictors.
The first two methods can be used with any model while the last option is only available for specific models. Let's create a basic recipe that we will build on:
```{r sshhh, include = FALSE}
library(tidymodels)
library(glmnet)
library(rpart)
library(vctrs)
library(Matrix)
library(rlang)
theme_set(theme_bw())
```
```{r recs}
library(tidymodels)
data(Chicago)
# Use a small sample to keep file sizes down:
Chicago <- Chicago %>% slice(1:365)
base_recipe <-
recipe(ridership ~ ., data = Chicago) %>%
# create date features
step_date(date) %>%
step_holiday(date) %>%
# remove date from the list of predictors
update_role(date, new_role = "id") %>%
# create dummy variables from factor columns
step_dummy(all_nominal()) %>%
# remove any columns with a single unique value
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
```
To enact a correlation filter, an additional step is used:
```{r filter}
filter_rec <-
base_recipe %>%
step_corr(all_of(stations), threshold = tune())
```
Similarly, for PCA:
```{r pca}
pca_rec <-
base_recipe %>%
step_pca(all_of(stations), num_comp = tune()) %>%
step_normalize(all_predictors())
```
We might want to assess a few different models, including a regularized method (`glmnet`):
```{r models}
regularized_spec <-
linear_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet")
cart_spec <-
decision_tree(cost_complexity = tune(), min_n = tune()) %>%
set_engine("rpart") %>%
set_mode("regression")
knn_spec <-
nearest_neighbor(neighbors = tune(), weight_func = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
```
Rather than creating all 9 combinations of these preprocessors and models, we can create a _workflow set_:
```{r set}
chi_models <-
workflow_set(
preproc = list(
simple = base_recipe, filter = filter_rec,
pca = pca_rec
),
models = list(
glmnet = regularized_spec, cart = cart_spec,
knn = knn_spec
),
cross = TRUE
)
chi_models
```
It doesn't make sense to use PCA or a filter with a `glmnet` model. We can remove these easily:
```{r rm}
chi_models <-
chi_models %>%
anti_join(tibble(wflow_id = c("pca_glmnet", "filter_glmnet")),
by = "wflow_id"
)
```
These models all have tuning parameters. To resolve these, we'll need a resampling set. In this case, a time-series resampling method is used:
```{r rs}
splits <-
sliding_period(
Chicago,
date,
"day",
lookback = 300, # Each resample has 300 days for modeling
assess_stop = 7, # One week for performance assessment
step = 7 # Ensure non-overlapping weeks for assessment
)
splits
```
We'll use simple grid search for these models by running `workflow_map()`. This will execute a resampling or tuning function over the workflows in the `workflow` column:
```{r tune}
set.seed(123)
chi_models <-
chi_models %>%
  # The first argument is a function name from the {tune} package
# such as `tune_grid()`, `fit_resamples()`, etc.
workflow_map("tune_grid",
resamples = splits, grid = 10,
metrics = metric_set(mae), verbose = TRUE
)
chi_models
```
The `results` column contains the results of each call to `tune_grid()` for the workflows.
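If you'd rather work with these results as data than plot them, the package has helper functions such as `collect_metrics()` and `extract_workflow_set_result()`. The workflow id used below (`"simple_glmnet"`) is illustrative; ids follow the `<preproc>_<model>` naming seen when printing the set:

``` r
# Gather the resampled performance metrics across all workflows:
collect_metrics(chi_models)

# Pull the full tune_grid() results for a single workflow by its id:
extract_workflow_set_result(chi_models, "simple_glmnet")
```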
The `autoplot()` method shows the rankings of the workflows:
```{r plot, fig.height = 4, dev = "svg"}
autoplot(chi_models)
```
or the best from each workflow:
```{r plot-best, fig.height = 4, dev = "svg"}
autoplot(chi_models, select_best = TRUE)
```
We can determine how well each combination did by looking at the best results per workflow:
```{r best}
rank_results(chi_models, rank_metric = "mae", select_best = TRUE) %>%
select(rank, mean, model, wflow_id, .config)
```
```{r save, eval = FALSE, echo = FALSE}
save(chi_models, file = "data/chi_models.rda", compress = "bzip2", version = 2)
```
## Contributing
This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on Posit Community](https://forum.posit.co/new-topic?category_id=15&tags=tidymodels,question).
- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/workflowsets/issues).
- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.
- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).