Add `forecast()` method and "deprecate" `get_test_data()` #293
Comments
Re side issue: why not …

Pros: …

Cons: …
We also saw, in some hardhat/tidymodels reading, another place where they say not to look at test data first / near train time; I forget how they worded it. I think the caution we should take here is to maybe not save data at `epi_workflow` creation time. We handle not cheating via …

Random thoughts:
Potentially related feature/application to think about below: combining models into a joint model via sampling. We're considering trying out separate backfill-projection-&-missingness-imputation and forecasting steps. One way to get reasonable uncertainty propagation in this approach is to sample many trajectories (FinalizedPast | Provisional) from the backfill-projection model, then take each one of those partial trajectories and tack on a sample from FinalizedFuture | FinalizedPast, then finally marginalize the samples into samples for each target (TargetFn(FinalizedFuture, FinalizedPast)) and kernel smooth to get a distribution for each target. This isn't the only way you might combine these two types of models, but it seems like one of the simplest. It'd be great if there was something within epipredict to do this sort of composition for you, but if there's not, then having a …

Notes: …
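As a rough illustration of the sampling composition described above (the helper functions and `provisional_data` are hypothetical stand-ins, not epipredict functions):

```r
# Hypothetical sketch of the composition idea above: draw joint trajectories,
# reduce each to the target, then kernel smooth. None of these helpers exist.
n_samples <- 1000

target_samples <- vapply(seq_len(n_samples), function(i) {
  # One draw of the finalized past given what's observable now:
  # FinalizedPast | Provisional, from the backfill-projection model.
  past <- sample_finalized_past(provisional_data)
  # One draw of the future conditional on that completed past:
  # FinalizedFuture | FinalizedPast, from the forecasting model.
  future <- sample_finalized_future(past)
  # Marginalize the joint draw down to the scalar target of interest.
  target_fn(past, future)
}, numeric(1))

# Kernel-smooth the target samples into an approximate predictive distribution.
target_density <- density(target_samples)
```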
Looking into the details:
If additional data were passed, I suspect you would want to concatenate and then re-prep/bake. We could probably add flags in various places that we inspect to determine if this is necessary, but that might get us back to the "need to adjust the new data handling specifically for every possible step" problem that we face now with …

Aside: if we are super concerned with space (not clear we are at the moment; it seems a "nice to have" rather than a "mandatory for adding new features"), we may want to investigate ways to use …
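For concreteness, the "concatenate and then re-prep/bake" path above might look roughly like this with plain {recipes} verbs (`r`, `train_data`, and `additional_data` are placeholders):

```r
# Sketch only: re-estimate and re-apply the recipe on the combined data.
library(recipes)
library(dplyr)

full_data <- bind_rows(train_data, additional_data)

prepped   <- prep(r, training = full_data)        # redo any data-dependent estimation
processed <- bake(prepped, new_data = full_data)  # bake the concatenated data
```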
If we throw out space considerations and there are no recalculation issues, there are still interface issues regarding storing/using the template data.
After realizing that Option 2 and Option 3's main con is fairly easily detectable after the fact (I think you get predictions for the wrong times, not cheating predictions for the right times), the only no-go seems to be Option 4. This would change if …
I'm still catching up to the discussion here, so apologies if my comments are missing something obvious. Something that's not quite clear to me is how storing all the data with the original recipe helps us remove the difficult logic that's in …
On the options for interface:

Meta: The …

To me, the main interface question is "when do you call …"
Either of these are maybe closest to Option 4? I think my option 2 is a bit more flexible, because you can always train with any data you want and then forecast from it. But my option 1 is more user-friendly, because you only pass in training data once. The option 2 flexibility is still "available" to you, because that's how things currently work: you have to call …

@dshemetov This is potentially correct, but hidden from the user. Also important: if there is stored data, when calling …

So, now having thought through @dshemetov's question and writing it down, the choice of which data to store, and when, impacts the available options. I think that forecasting a fit workflow should not allow …

To deal with the fact that (by default) we would be storing data in the workflow, we could also include some help. First, we add an argument to …
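To make the stored-data question concrete, here is one very rough sketch of such an interface; the stored-data slot and the precedence rule are assumptions, not the actual implementation:

```r
# Hypothetical sketch: forecast a fitted workflow, preferring explicitly supplied
# data over whatever was stored at fit time.
forecast_sketch <- function(ewf, new_data = NULL) {
  stopifnot(workflows::is_trained_workflow(ewf))
  if (is.null(new_data)) {
    # Fall back to data stored in the workflow; this slot name is made up.
    new_data <- ewf$original_data
  }
  # Reuse the existing helper to pull the rows needed at prediction time,
  # then predict as usual.
  td <- get_test_data(workflows::extract_preprocessor(ewf), new_data)
  predict(ewf, new_data = td)
}
```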
So above, I think I got mixed up and felt part of any change here would also include: …

Thus why I kept talking about …

```r
ewf <- fit(ewf, tib)
p <- predict(ewf)
```

or, with the changes I was imagining above, to do

```r
ewf <- fit(ewf, tib)
p <- predict(ewf, new_data = tib)
```

basically just turning … But for the forecasters I can think of, we'd normally want to just …
@dajmcdon I think you're proposing "Maybe in template, maybe in …"
One more thought here. While I still think we can probably get away without time-window logic for version-unaware forecasters, it would likely be useful for efficiency purposes when preparing version-aware training sets, if we really buy into recipes for preprocessing. For version-aware forecasting, we very commonly want to line up test-instance data with "analogous" versioned training data, particularly lags, or lags of 7-day averages. We could …
In all of these cases, there's the question of how to do things efficiently. In the first two cases, we can use something like …

There's yet another context where we might want to know the time window needed for a computation, and that's archive -> archive slides (like we want here). Though I think Dmitry/David pointed out we could try something time-window-unaware first and see if it actually is slow. Plus this could be even more complicated, because it's probably about prep + bake, not just bake.
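As a tiny illustration of the window arithmetic involved in the lags-of-7-day-averages case (numbers are made up):

```r
# Illustrative only: minimum lookback needed for lags of a 7-day trailing average.
lags      <- c(0, 7, 14)  # lags (in days) of the smoothed signal
avg_width <- 7            # width of the trailing average

min_lookback <- max(lags) + (avg_width - 1)   # 14 + 6 = 20 days in this example

forecast_date       <- as.Date("2022-01-01")  # placeholder forecast date
needed_window_start <- forecast_date - min_lookback
```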
This is a proposal for an addition to the procedure for preprocessing -> fitting -> predicting, currently used in the package.
Current behaviour:
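A hedged sketch of the current pattern, with placeholder column names and steps:

```r
library(dplyr)
library(epipredict)

# train_data: an epi_df with a case_rate column (both are placeholders).
r <- epi_recipe(train_data) %>%
  step_epi_lag(case_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(case_rate, ahead = 7) %>%
  step_epi_naomit()

ewf <- epi_workflow(r, parsnip::linear_reg()) %>%
  fit(train_data)

# Test-time data has to be built separately, then handed to predict().
td <- get_test_data(r, train_data)
p  <- predict(ewf, new_data = td)
```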
Alternative, non-forecast as currently implemented. Not used, really, but should work:
Proposed adjustment:
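A sketch of the proposed pattern, assuming `forecast()` needs only the fitted workflow:

```r
ewf <- epi_workflow(r, parsnip::linear_reg()) %>%
  fit(train_data)

# No separate get_test_data() + predict() pair; the workflow supplies what it needs.
p <- forecast(ewf)
```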
Side issue: inheritance from {tidymodels} means that we store template information about the original data frame in the `epi_recipe` S3 object. {recipes} stores the entire data; an `epi_recipe` only stores a 0-row tibble with the column names. To get this proposal to work, we would need to change to match the {recipes} behaviour and store the original data. This could potentially be large (the reason I avoided doing this before), though note that it is the original data, not the processed data.

As currently implemented, certain test-time preprocessing operations that could benefit from access to the training data (smoothing, rolling averages, etc.) can potentially be buggy, because they are applied only to the test-time data (`td`). Storing the training data would help here. However, {tidymodels} actually doesn't want to merge train-time and test-time data, because it tries to emphasize (pedagogically?) that operations performed on train-time data should save the necessary summary statistics to be reused on test-time data. For example, centering and scaling a predictor should save the mean and sd at train time and use those to adjust the test-time data (rather than computing the mean and sd of the test data and using those). As with most things, time series makes this complicated, and forecasts can potentially depend on all available data (rather than just "new" data). It's likely worth thinking carefully about this problem (though perhaps that's exactly what we're doing here).
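For reference, the train-time/test-time convention described above looks like this in plain {recipes} (placeholder data and columns):

```r
library(dplyr)
library(recipes)

rec <- recipe(y ~ x, data = train_data) %>%
  step_center(x) %>%
  step_scale(x)

# prep() estimates mean(x) and sd(x) on the training data only ...
prepped <- prep(rec, training = train_data)

# ... and bake() reuses those stored statistics on new data,
# rather than recomputing them from the test set.
baked_test <- bake(prepped, new_data = test_data)
```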
`forecast()` would only need the workflow as an argument, though we could potentially allow an optional `additional_data` argument. That data would be added to the train-time data, with the forecast then produced after the end of the `additional_data`.
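Usage might then look something like this (the `additional_data` behaviour described above is assumed, not implemented; `newly_observed` is a placeholder):

```r
# Forecast right after the end of the stored training data ...
p1 <- forecast(ewf)

# ... or append data observed after training, and forecast after the end of it.
p2 <- forecast(ewf, additional_data = newly_observed)
```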