Add `forecast()` method and "deprecate" `get_test_data()` #293
Comments
Re side issue: why not …

Pros: …

Cons: …
We also saw, in some hardhat/tidymodels reading, another place where they say not to look at test data first / near train time; I forget how they worded it. I think the caution we should take here is to maybe not save data at `epi_workflow` creation time. We handle not cheating via …

Random thoughts:
Potentially related feature/application to think about below: combining models into a joint model via sampling. We're considering trying out separate backfill-projection-&-missingness-imputation and forecasting steps. One way to get reasonable uncertainty propagation in this approach is to sample many trajectories (FinalizedPast | Provisional) from the backfill-projection model, then take each one of those partial trajectories and tack on a sample from FinalizedFuture | FinalizedPast, then finally marginalize the samples into samples for each target (TargetFn(FinalizedFuture, FinalizedPast)) and kernel smooth to get a distribution for each target. This isn't the only way you might combine these two types of models, but it seems like one of the simplest. It'd be great if there was something within epipredict to do this sort of composition for you, but if there's not, then having a …

Notes: …
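As a rough illustration of the sampling composition described above (the helper functions and `provisional_data` are hypothetical stand-ins, not epipredict functions):

```r
# Hypothetical sketch of the composition idea above: draw joint trajectories,
# reduce each to the target, then kernel smooth. None of these helpers exist.
n_samples <- 1000

target_samples <- vapply(seq_len(n_samples), function(i) {
  # One draw of the finalized past given what's observable now:
  # FinalizedPast | Provisional, from the backfill-projection model.
  past <- sample_finalized_past(provisional_data)
  # One draw of the future conditional on that completed past:
  # FinalizedFuture | FinalizedPast, from the forecasting model.
  future <- sample_finalized_future(past)
  # Marginalize the joint draw down to the scalar target of interest.
  target_fn(past, future)
}, numeric(1))

# Kernel-smooth the target samples into an approximate predictive distribution.
target_density <- density(target_samples)
```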
Looking into the details:
If additional data were passed, I suspect you would want to concatenate and then re-prep/bake. We could probably add flags in various places that we inspect to determine if this is necessary, but that might get us back to the "need to adjust the new data handling specifically for every possible step" problem that we face now with …

Aside: if we are super concerned with space (not clear we are at the moment; it seems a "nice to have" rather than a "mandatory for adding new features"), we may want to investigate ways to use …
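For concreteness, the "concatenate and then re-prep/bake" path above might look roughly like this with plain {recipes} verbs (`r`, `train_data`, and `additional_data` are placeholders):

```r
# Sketch only: re-estimate and re-apply the recipe on the combined data.
library(recipes)
library(dplyr)

full_data <- bind_rows(train_data, additional_data)

prepped   <- prep(r, training = full_data)        # redo any data-dependent estimation
processed <- bake(prepped, new_data = full_data)  # bake the concatenated data
```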
If we throw out space considerations and there are no recalculation issues, there are still interface issues regarding storing/using the template data.
After realizing that Option 2 and Option 3's main con is fairly easily detectable after the fact (I think you get predictions for the wrong times, not cheating predictions for the right times), the only no-go seems to be Option 4. This would change if …
I'm still catching up to the discussion here, so apologies if my comments are missing something obvious. Something that's not quite clear to me is how storing all the data with the original recipe helps us remove the difficult logic that's in …
On the options for interface:

Meta: The …

To me, the main interface question is "when do you call …"
Either of these are maybe closest to Option 4? I think my option 2 is a bit more flexible, because you can always train with any data you want and then forecast from it. But my option 1 is more user-friendly, because you only pass in training data once. The option 2 flexibility is still "available" to you, because that's how things currently work: you have to call …

@dshemetov This is potentially correct, but hidden from the user. Also important: if there is stored data, when calling …

So, now having thought through @dshemetov's question and writing it down, the choice of which data to store, and when, impacts the available options. I think that forecasting a fit workflow should not allow …

To deal with the fact that (by default) we would be storing data in the workflow, we could also include some help. First, we add an argument to …
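To make the stored-data question concrete, here is one very rough sketch of such an interface; the stored-data slot and the precedence rule are assumptions, not the actual implementation:

```r
# Hypothetical sketch: forecast a fitted workflow, preferring explicitly supplied
# data over whatever was stored at fit time.
forecast_sketch <- function(ewf, new_data = NULL) {
  stopifnot(workflows::is_trained_workflow(ewf))
  if (is.null(new_data)) {
    # Fall back to data stored in the workflow; this slot name is made up.
    new_data <- ewf$original_data
  }
  # Reuse the existing helper to pull the rows needed at prediction time,
  # then predict as usual.
  td <- get_test_data(workflows::extract_preprocessor(ewf), new_data)
  predict(ewf, new_data = td)
}
```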
So above, I think I got mixed up and felt part of any change here would also include: …

Thus why I kept talking about …

```r
ewf <- fit(ewf, tib)
p <- predict(ewf)
```

or, with the changes I was imagining above, to do

```r
ewf <- fit(ewf, tib)
p <- predict(ewf, new_data = tib)
```

basically just turning … But for the forecasters I can think of, we'd normally want to just …
@dajmcdon I think you're proposing "Maybe in template, maybe in …"
One more thought here. While I still think we can probably get away without time-window logic for version-unaware forecasters, it would likely be useful for efficiency purposes when preparing version-aware training sets, if we really buy into recipes for preprocessing. For version-aware forecasting, we very commonly want to line up test-instance data with "analogous" versioned training data, particularly lags, or lags of 7-day averages. We could …
In all of these cases, there's the question of how to do things efficiently. In the first two cases, we can use something like …

There's yet another context where we might want to know the time window needed for a computation, and that's archive -> archive slides (like we want here). Though I think Dmitry/David pointed out we could try something time-window-unaware first and see if it actually is slow. Plus this could be even more complicated, because it's probably about prep + bake, not just bake.
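As a tiny illustration of the window arithmetic involved in the lags-of-7-day-averages case (numbers are made up):

```r
# Illustrative only: minimum lookback needed for lags of a 7-day trailing average.
lags      <- c(0, 7, 14)  # lags (in days) of the smoothed signal
avg_width <- 7            # width of the trailing average

min_lookback <- max(lags) + (avg_width - 1)   # 14 + 6 = 20 days in this example

forecast_date       <- as.Date("2022-01-01")  # placeholder forecast date
needed_window_start <- forecast_date - min_lookback
```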
This is a proposal for an addition to the procedure for preprocessing -> fitting -> predicting, currently used in the package.
Current behaviour:
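A hedged sketch of the current pattern, with placeholder column names and steps:

```r
library(dplyr)
library(epipredict)

# train_data: an epi_df with a case_rate column (both are placeholders).
r <- epi_recipe(train_data) %>%
  step_epi_lag(case_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(case_rate, ahead = 7) %>%
  step_epi_naomit()

ewf <- epi_workflow(r, parsnip::linear_reg()) %>%
  fit(train_data)

# Test-time data has to be built separately, then handed to predict().
td <- get_test_data(r, train_data)
p  <- predict(ewf, new_data = td)
```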
Alternative, non-forecast as currently implemented. Not used, really, but should work:
Proposed adjustment:
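A sketch of the proposed pattern, assuming `forecast()` needs only the fitted workflow:

```r
ewf <- epi_workflow(r, parsnip::linear_reg()) %>%
  fit(train_data)

# No separate get_test_data() + predict() pair; the workflow supplies what it needs.
p <- forecast(ewf)
```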
Side issue: inheritance from {tidymodels} means that we store template information about the original data frame in the `epi_recipe` S3 object. {recipes} stores the entire data; an `epi_recipe` only stores a 0-row tibble with the column names. To get this proposal to work, we would need to change to match the {recipes} behaviour and store the original data. This could potentially be large (the reason I avoided doing this before), though note that it is the original data, not the processed data.

As currently implemented, certain test-time preprocessing operations that could benefit from access to the training data (smoothing, rolling averages, etc.) can potentially be buggy, because they are applied only to the test-time data (`td`). Storing the training data would help here. However, {tidymodels} actually doesn't want to merge train-time and test-time data, because it tries to emphasize (pedagogically?) that operations performed on train-time data should save the necessary summary statistics to be reused on test-time data. For example, centering and scaling a predictor should save the mean and sd at train time and use those to adjust the test-time data (rather than computing the mean and sd of the test data and using those). As with most things, time series makes this complicated, and forecasts can potentially depend on all available data (rather than just "new" data). It's likely worth thinking carefully about this problem (though perhaps that's exactly what we're doing here).
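For reference, the train-time/test-time convention described above looks like this in plain {recipes} (placeholder data and columns):

```r
library(dplyr)
library(recipes)

rec <- recipe(y ~ x, data = train_data) %>%
  step_center(x) %>%
  step_scale(x)

# prep() estimates mean(x) and sd(x) on the training data only ...
prepped <- prep(rec, training = train_data)

# ... and bake() reuses those stored statistics on new data,
# rather than recomputing them from the test set.
baked_test <- bake(prepped, new_data = test_data)
```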
`forecast()` would only need the workflow as an argument, though we could potentially allow an optional `additional_data` argument. That data would be added to the train-time data, with the forecast then produced after the end of the `additional_data`.
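Usage might then look something like this (the `additional_data` behaviour described above is assumed, not implemented; `newly_observed` is a placeholder):

```r
# Forecast right after the end of the stored training data ...
p1 <- forecast(ewf)

# ... or append data observed after training, and forecast after the end of it.
p2 <- forecast(ewf, additional_data = newly_observed)
```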