Description
Brainstorming about how `epipredict` should handle insufficient training data.
A workflow is a container with (1) a preprocessor, (2) an engine, and (3) a postprocessor. By itself, it hasn't seen any data (training or test).
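For reference, here is roughly how those three pieces fit together in epipredict. A minimal sketch, assuming the `case_death_rate_subset` example data and placeholder steps/layers; nothing here is specific to the proposal:

```r
library(epipredict)

# (1) preprocessor: an epi_recipe with lag/ahead steps
r <- epi_recipe(case_death_rate_subset) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7)

# (3) postprocessor: a frosting
f <- frosting() %>%
  layer_predict() %>%
  layer_naomit(.pred)

# (2) engine: a parsnip model spec; the assembled workflow holds no data yet
wf <- epi_workflow(r, parsnip::linear_reg(), f)
```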
When `fit()` is called on a workflow plus training data, the preprocessor processes that training data. Typically, a `check_*()` in the recipe (when placed after any lagging/leading/growth-rate calculations) would error with a message like "insufficient training data". We could make this a warning (or perhaps give the check an argument that determines whether we get a warning or an error). The most obvious result in that case would be an "unfit workflow". The downstream impact is that you can't `predict()` an unfit workflow (this produces an error through its `workflow` class).
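A sketch of what that check's interface could look like. `check_enough_train_data()` and its arguments are hypothetical, invented for illustration; only the recipe steps and `fit()` are real epipredict/workflows calls:

```r
# Hypothetical check with an argument controlling severity.
# train_data is assumed to be an epi_df.
r <- epi_recipe(train_data) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  check_enough_train_data(death_rate, min_rows = 21, action = "warn")

wf <- fit(epi_workflow(r, parsnip::linear_reg()), train_data)
# With action = "warn", a too-short training set warns instead of erroring,
# and wf comes back unfit; predict(wf, new_data) then errors downstream.
```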
This makes sense typically, but perhaps not in the "retrospective evaluation" task. Would we actually prefer that `predict(unfit_workflow, new_data)` produce another warning? Silently/verbosely return an empty tibble? Silently/verbosely return a tibble with all the keys in `new_data` but `NA` predictions?
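For concreteness, the last of those options might look like this. `safe_predict()` is a made-up wrapper; `is_trained_workflow()` is a real helper from the workflows package:

```r
library(dplyr)

# Hypothetical wrapper illustrating "keys plus NA predictions".
safe_predict <- function(wf, new_data) {
  if (!workflows::is_trained_workflow(wf)) {
    warning("workflow is not trained; returning NA predictions")
    return(
      new_data %>%
        distinct(geo_value, time_value) %>%  # the epi_df keys
        mutate(.pred = NA_real_)
    )
  }
  predict(wf, new_data)
}
```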
It's possible that the `new_data` could have insufficient information to create the necessary lags, so even from a fit workflow we would not be able to produce a prediction. `get_test_data()` tries to ensure that there is enough (and only enough), but you could pass in your own `new_data` (say, if you don't wish to refit the model but want to predict from the most recent available data with the old model). So we could add a `check_*()` here as well.
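For reference, the usual path through `get_test_data()`, assuming a fit workflow `wf` built from recipe `r` and training epi_df `train_data`:

```r
# get_test_data() pulls just enough trailing rows to build the recipe's
# lags; passing your own new_data to predict() bypasses this guarantee.
test_data <- get_test_data(recipe = r, x = train_data)
preds <- predict(wf, test_data)
```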
Again, should `predict(fit_workflow, bad_new_data)` produce a warning? Silently/verbosely return an empty tibble? Silently/verbosely return a tibble with all the keys in `new_data` but `NA` predictions?
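Such a check could be as simple as counting rows per key against the recipe's largest lag. A sketch only: in practice `max_lag` would be derived from the recipe rather than hard-coded:

```r
library(dplyr)

# Hypothetical pre-flight check on new_data before predicting.
max_lag <- 14  # assumption: the largest lag used by the recipe
short_geos <- new_data %>%
  count(geo_value) %>%
  filter(n < max_lag + 1)
if (nrow(short_geos) > 0) {
  warning(
    "insufficient new_data to build lags for: ",
    paste(short_geos$geo_value, collapse = ", ")
  )
}
```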
Finally, we have `arx_forecaster()` (or any of the other canned forecasters). It's not just a `workflow`; it also sees the training data (and the testing data) at workflow build time, so we have more flexibility. We could do any combination of the above options, or we could check at the beginning, outside the workflow, and return an appropriate object. For your current task, you're only using `$predictions`, so the rest doesn't matter (it also returns the fit workflow and a list of metadata). So we could just skip the postprocessor and return an empty tibble or `NA` predictions (even if we do something different in general).
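The "check at the beginning, outside the workflow" option could look like this. `safe_arx()` and `n_required` are invented for illustration; the return shape just mirrors the predictions/workflow/metadata pieces described above:

```r
library(dplyr)

# Hypothetical guard around the canned forecaster: skip fitting entirely
# when training data is too short, but keep the same return shape.
safe_arx <- function(epi_data, outcome, predictors, n_required = 30, ...) {
  if (nrow(epi_data) < n_required) {
    warning("insufficient training data; returning NA predictions")
    return(list(
      predictions = epi_data %>%
        distinct(geo_value) %>%
        mutate(.pred = NA_real_),
      epi_workflow = NULL,
      metadata = list()
    ))
  }
  arx_forecaster(epi_data, outcome, predictors, ...)
}
```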
I suspect there are some other paths in the tree I'm not thinking of at the moment. What do you guys think?