
Handling insufficient training data. #272

Closed
@dajmcdon

Description


@dsweber2 @dshemetov

Brainstorming about how epipredict should handle insufficient training data.

A workflow is a container with (1) a preprocessor, (2) an engine, and (3) a postprocessor. By itself, it hasn't seen any data (training or test).
When fit() is called on a workflow + training data, the preprocessor processes that training data.
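To make the moving parts concrete, here's a minimal sketch of that container and the fit step, assuming the usual epipredict constructors (epi_recipe(), step_epi_lag(), step_epi_ahead(), frosting(), layer_predict(), epi_workflow()) and the package's case_death_rate_subset example data:

```r
library(epipredict)
library(parsnip)
library(dplyr)

# (1) preprocessor: a recipe that builds lagged predictors and the lead outcome
r <- epi_recipe(case_death_rate_subset) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7)

# (3) postprocessor: a frosting that just emits predictions
f <- frosting() %>%
  layer_predict()

# (2) engine, bundled with the other two into a workflow; no data seen yet
wf <- epi_workflow(r, linear_reg()) %>%
  add_frosting(f)

# Only here does the preprocessor actually run on the training data
wf_fit <- fit(wf, case_death_rate_subset)
```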

Typically, a check_*() in the recipe (when placed after any lagging/leading/growth rate calculations) would error with a message like "insufficient training data". We could make this a warning instead (or perhaps give the check an argument that determines whether we get a warning or an error). The most obvious result in that case would be an "unfit workflow". The downstream impact is that you can't predict() an unfit workflow (this produces an error through its workflow class). That makes sense typically, but perhaps not in the "retrospective evaluation task". Would we actually prefer that predict(unfit_workflow, new_data) produce another warning? Silently/verbosely return an empty tibble? Silently/verbosely return a tibble with all the keys in new_data but NA predictions?
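One hypothetical shape for that warn-vs-error switch (the function name, arguments, and message below are illustrative only, not an existing API):

```r
# Illustrative only: core logic of a check that can either warn or error.
check_enough_train_data <- function(x, min_rows, action = c("error", "warn")) {
  action <- match.arg(action)
  if (nrow(x) < min_rows) {
    msg <- sprintf("insufficient training data: %d rows available, %d required",
                   nrow(x), min_rows)
    if (action == "error") rlang::abort(msg) else rlang::warn(msg)
  }
  invisible(x)
}
```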

It's possible that the new_data has insufficient information to create the necessary lags, so even from a fit workflow we would not be able to produce a prediction. get_test_data() tries to ensure that there is enough (and only enough) data, but you could pass in your own (say, if you don't wish to refit the model but want to predict using the most recent available data with the old model). So we could add a check_*() here as well. Again, should predict(fit_workflow, bad_new_data) produce a warning? Silently/verbosely return an empty tibble? Silently/verbosely return a tibble with all the keys in new_data but NA predictions?
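As a sketch of the NA-predictions option at prediction time (get_test_data() is the package helper; the guard, its min_rows argument, and the choice of geo_value as the key are assumptions for illustration):

```r
# The helper that tries to assemble just-enough recent data for the lags
latest <- get_test_data(recipe = r, x = case_death_rate_subset)

# Hypothetical guard: warn and return keyed NA predictions instead of erroring
safe_predict <- function(wf_fit, new_data, min_rows) {
  if (nrow(new_data) < min_rows) {
    rlang::warn("new_data is too short to build the required lags; returning NA predictions")
    return(new_data %>% distinct(geo_value) %>% mutate(.pred = NA_real_))
  }
  predict(wf_fit, new_data)
}

safe_predict(wf_fit, latest, min_rows = 15)
```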

Finally, we have arx_forecaster() (or any of the other canned forecasters). It's not just a workflow: it also sees the training data (and the testing data) at workflow build time, so we have more flexibility. We could do any combination of the above options, or we could check at the beginning, outside the workflow, and return an appropriate object. For your current task, you're only using $predictions, so the rest doesn't matter (it also returns the fit workflow and a list of metadata). So we could just skip the postprocessor and return an empty tibble or NA predictions (even if we do something different in general).
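A rough sketch of the check-up-front option for a canned forecaster (the wrapper, its min_rows cutoff, and the shape of the early return are one possible choice, not the package's current behavior):

```r
# Hypothetical wrapper: bail out before building the workflow if there
# isn't enough training data, returning NA predictions keyed by geo_value.
safe_arx_forecaster <- function(epi_data, outcome, predictors, ..., min_rows = 28) {
  if (nrow(epi_data) < min_rows) {
    rlang::warn("insufficient training data; returning NA predictions")
    preds <- dplyr::distinct(epi_data, geo_value) %>%
      dplyr::mutate(.pred = NA_real_)
    return(list(predictions = preds, epi_workflow = NULL, metadata = list()))
  }
  arx_forecaster(epi_data, outcome, predictors, ...)
}
```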

I suspect there are some other paths in the tree I'm not thinking of at the moment. What do you guys think?
