Added multiseries VARMAX regressor #4238
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##            main    #4238    +/-   ##
=======================================
+ Coverage   99.7%    99.7%    +0.1%
=======================================
  Files        353      355       +2
  Lines      38643    38915     +272
=======================================
+ Hits       38522    38794     +272
  Misses       121      121
```
Force-pushed from de50687 to 7854455.
Force-pushed from 13f9f33 to 9001204.
Since you're working on this, just blasting this out now.
```python
return X, y

def _set_forecast(self, X: pd.DataFrame):
    from sktime.forecasting.base import ForecastingHorizon
```
And should probably move this to the top.
Import is set here just in case the user does not have sktime installed (since it's an optional dependency).
Don't we have another pattern that we use for this?
I just double checked and we do this pattern for ARIMA only but we import at the top for most other sktime items (such as metrics). I'll update this and ARIMA to import at the top.
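For posterity, the deferred-import pattern under discussion looks roughly like this sketch (the helper name and error message are hypothetical, not evalml's actual code):

```python
def _lazy_forecasting_horizon():
    # sktime is an optional dependency: deferring the import means this
    # module can be imported even when sktime isn't installed, and the
    # failure only surfaces when the forecaster is actually used.
    try:
        from sktime.forecasting.base import ForecastingHorizon
    except ImportError as err:
        # Illustrative message; evalml's real error text may differ.
        raise ImportError(
            "sktime is not installed. Install it to use the VARMAX regressor.",
        ) from err
    return ForecastingHorizon
```

Moving the import to the top of the module trades this laziness for simplicity, which is fine as long as the module itself is only imported when the optional dependency is present.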
```python
# we can only calculate the difference if the indices are of the same type
units_diff = 1
if isinstance(X.index[0], type(self.last_X_index)) and isinstance(
```
This reference to self.last_X_index seems a little shaky... I know I'm reviewing the code from top to bottom, but it seems we could run into trouble with the order of the calls and whether this property is defined yet or not. Maybe it's better to just pass it in.
Reset to be at parity with the ARIMA version. I can throw in a check for whether it's defined yet, if that helps?
```python
    [units_diff + i for i in range(len(X))],
    is_relative=True,
)
return fh_
```
Super nit, but should X here be more like X_forecast? It's what we're forecasting on, right?
Isn't X what we're forecasting on?
We're forecasting on the entire dataset? or a subset of X?
Ah, I see what you mean. X in this case is the covariate data, which is indeed what we're forecasting on. I can change X to X_forecast in this case if that clarifies it.
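Putting the rename together, the method would read roughly like this sketch (assembled from the snippets quoted above; the units_diff computation is elided, and the final merged code may differ):

```python
import pandas as pd
from sktime.forecasting.base import ForecastingHorizon  # now a top-level import


def _set_forecast(self, X_forecast: pd.DataFrame):
    # We can only calculate the difference if the indices are of the same type.
    units_diff = 1
    if isinstance(X_forecast.index[0], type(self.last_X_index)):
        # Compute the gap between the last training timestamp and the start
        # of the forecast window (abridged; depends on the index type).
        ...
    # A relative horizon: steps ahead of the final training observation.
    return ForecastingHorizon(
        [units_diff + i for i in range(len(X_forecast))],
        is_relative=True,
    )
```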
Force-pushed from a118f03 to bff6c9c.
I need to come back to take a closer look at the tests, but I wanted to get my first round in sooner rather than later! This is a beast, thanks so much for tackling it!
This continues to be a beast. I left a bunch of nitpicks and a few suggestions, but this is really looking great.
```python
def __init__(
    self,
    series_id: Optional[Hashable] = None,
```
Looks like this is missing from the docstring! Wonder how that made it through lint 🤔
Checking on its usage, however, it doesn't seem like it's actually used anywhere. Am I missing anything, or is this parameter actually unnecessary? I assume we should have some check to ensure we don't run VARMAX outside of the multiseries case, but if we don't actually use series_id we may need to check it elsewhere instead.
Note - if we end up removing the series_id parameter (it's not currently included in the multiseries baseline, fwiw), we'd need to update a bunch of tests as well.
Agreed, I think it's a remnant from when I thought about having to stack/unstack by component. I'm fine with removing it for the project scope for now.
I'm guessing it made it through the linter because the docstring is on the class rather than the __init__() function. I wonder if it's worth moving the args under __init__, or whether we even want to make that change. But that's out of scope of this PR 😅
```python
if y is None:
    raise ValueError("VARMAX Regressor requires y as input.")
```
If I recall correctly, VARMAX can predict on single series data just fine. I'm curious about your thoughts: should we enforce that we only run it for multiseries? We can do that easily here by double-checking y's shape. It would also help prevent the case where we try to predict on stacked multiseries data (I did that in local VARMAX testing out of curiosity, and it didn't error but also didn't do what we'd want it to do).
I think sktime enforces that there must be two or more columns for y, iirc. The docstring for sktime's fit is not super clear on this, but it says it errors out if the forecaster has a specific multivariate tag.
I'm open to raising a ValueError when there's a single column, saying the data might need to be unstacked, if you think a more specific error message is worth having!
If VARMAX raises it, I'm sure that's fine 😀
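As a sketch, the stricter check discussed above could look like this (the helper name and error message are hypothetical, and sktime's own validation may make it redundant):

```python
import pandas as pd


def _check_multiseries_y(y):
    # Hypothetical helper for the explicit multiseries check discussed above.
    if y is None:
        raise ValueError("VARMAX Regressor requires y as input.")
    # VARMAX is a multivariate model: require two or more target columns.
    if isinstance(y, pd.Series) or y.shape[1] < 2:
        raise ValueError(
            "VARMAX requires multiseries data: y must have two or more "
            "columns. Stacked multiseries data may need to be unstacked first.",
        )
```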
```python
if X is not None:
    self.last_X_index = X.index[-1]
    X = X.ww.select(exclude=["Datetime"])

    X.ww.set_types(
        {
            col: "Double"
            for col in X.ww.select(["Boolean"], return_schema=True).columns
        },
    )
    X, y = match_indices(X, y)

    if not X.empty and self.use_covariates:
        self._component_obj.fit(y=y, X=X)
    else:
        self._component_obj.fit(y=y)
```
Small potential simplification here - can we just check if self.use_covariates and X is not None and not X.empty for this section, rather than nesting that final if statement? It'd save us some time when use_covariates is false, and save us from the edge case where an improperly sized X is passed in but use_covariates is false.
We'll still need the X.empty check just in case excluding the datetime column makes X empty, but I'll move the use_covariates check up.
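The reordered condition would read roughly as follows (a sketch of the change being agreed on here, not the final merged code):

```python
# use_covariates is checked first, so a missing or improperly sized X is
# ignored whenever covariates are disabled. The X.empty guard stays because
# excluding the datetime column can leave X with no columns at all.
if self.use_covariates and X is not None and not X.empty:
    self._component_obj.fit(y=y, X=X)
else:
    self._component_obj.fit(y=y)
```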
```python
vx = VARMAXRegressor(time_index="dates", series_id="series_id")
vx.fit(X, y)
preds = vx.predict(X)
assert all(preds.isna().eq(False))
```
This test is identical to the previous one save for the mocking of fit and the very last line... that's a lot of repeated code. Can we pull the data out into a new fixture, add the possibility of booleans to an existing fixture, combine these two tests, or something?
Really, as I'm thinking about it, I don't think we need the first test if we have the second - it doesn't really matter that VARMAX converts bools to floats so long as, at the end of the day, it handles them. Right?
I can combine the tests! I think we should test to make sure that the values are converted and that we are getting results.
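A combined test along the lines suggested might look like this sketch (the fixture data, test name, and import path are all hypothetical; the real test should use the project's fixtures):

```python
import numpy as np
import pandas as pd

from evalml.pipelines.components import VARMAXRegressor  # import path assumed


def test_varmax_handles_boolean_covariates():
    # Hypothetical toy data standing in for the project's fixtures:
    # two target series plus a boolean covariate column.
    X = pd.DataFrame(
        {
            "dates": pd.date_range("2020-01-01", periods=20),
            "is_holiday": [i % 7 == 0 for i in range(20)],
        },
    )
    y = pd.DataFrame(
        {
            "series_0": np.random.rand(20),
            "series_1": np.random.rand(20),
        },
    )
    vx = VARMAXRegressor(time_index="dates")
    vx.fit(X, y)
    preds = vx.predict(X)
    # If booleans weren't converted to floats internally, fit/predict would
    # have errored; here we just confirm the predictions are usable.
    assert not preds.isna().any().any()
```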
```python
pytest.param(
    True,
    marks=pytest.mark.xfail(
        reason="Currently, using covariates with VARMAX causes inconsistent results when predicting",
    ),
),
```
Do you have any guesses (for posterity/documentation) why this might be? Does this test always fail, or only fail intermittently?
Yep, it took me a while to understand, but basically X_test_last_5 only has covariate data for the last 5 values. Since we cut off the first 3 rows (the test dataset is 8 rows) of covariate information, the X data differs across predict(X_test) and predict(X_test_last_5) in that X_test_last_5 has all 0 values for the first 3 rows. As a result, the estimator has less covariate data to use to make the prediction, making the predictions very slightly different.
@christopherbunn is this fixable? If it's not a big lift, let's do it in this PR, and if not, let's file an issue to resolve this.
Actually, the more that I think about it, I don't think there's really anything to fix here, since we should expect a different result if we're able to feed in more past covariate data. I'm going to update this to clarify that we should expect equal results only for the case where we train the model without covariate data.
Force-pushed from 76cf807 to 459528c.
LGTM, just some comments.
```python
X.ww.set_types(
    {
        col: "Double"
        for col in X.ww.select(["Boolean"], return_schema=True).columns
    },
)
```
This block shows up a couple times - can we consolidate into a method?
Sure, consolidated into convert_bool_to_double().
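For reference, the consolidated helper would look roughly like this (a sketch built from the block quoted above; the signature in the merged code may differ):

```python
import pandas as pd


def convert_bool_to_double(X: pd.DataFrame) -> pd.DataFrame:
    # Cast every Woodwork Boolean column to Double so the underlying
    # sktime/statsmodels estimator sees purely numeric covariates.
    X.ww.set_types(
        {
            col: "Double"
            for col in X.ww.select(["Boolean"], return_schema=True).columns
        },
    )
    return X
```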
Force-pushed from f1a26b0 to aa262e9.
Resolves #4234