Handling insufficient training data.

@dsweber2 @dshemetov 

Brainstorming about how `epipredict` should handle insufficient training data.

A workflow is a container with (1) preprocessor; (2) engine; (3) postprocessor. By itself, it hasn't seen any data (training or test).
When `fit()` is called on a workflow + training data, the preprocessor processes that training data. 

Typically, a `check_*()` in the recipe (when placed after any lagging/leading/growth rate calculations), would error with a message like "insufficient training data". We could make this a warning (or perhaps have an argument to the check that determines whether we get a warning or an error). The most obvious result in that case, would be an "unfit workflow". The downstream impact is that you can't `predict()` an unfit workflow (this produces an error through it's `workflow` class). This makes sense typically, but perhaps not in the "retrospective evaluation task". Would we actually prefer that `predict(unfit_workflow, new_data)` produce another warning? Silently/verbosely return an empty tibble? Silently/verbosely return a tibble with all the keys in `new_data` but NA predictions?

It's possible that the `new_data` could have insufficient information to create the necessary lags. So from a fit workflow, we would not be able to produce a prediction. `get_test_data()` tries to ensure that there is enough (and only enough), but you could pass in your own (say if you don't wish to refit the model but predict using the most recent available data with the old model). So we could add a `check_*()` here as well. Again, now should `predict(fit_workflow, bad_new_data` produce a warning? Silently/verbosely return an empty tibble? Silently/verbosely return a tibble with all the keys in `new_data` but NA predictions?

Finally, we have `arx_forecaster()` (or any of the other canned forecasters). It's not just a `workflow`, but also sees the training data (and the testing data) at workflow build time. So we have more flexibility. So we could do any combination of the above options, or we could check at the beginning, outside the workflow, and return an appropriate object. For your current task, you're only using `$predictions`, so the rest doesn't matter (it also returns the fit workflow and a list of metadata). So we could just skip the postprocessor and return an empty tibble or NA predictions (even if we do something different in general).

I suspect there are some other paths in the tree I'm not thinking of at the moment. What do you guys think?


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Handling insufficient training data. #272

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Handling insufficient training data. #272

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions