
Added multiseries VARMAX regressor #4238

Merged · 7 commits · Aug 8, 2023
Conversation

@christopherbunn (Contributor) commented Jul 14, 2023

Resolves #4234

codecov bot commented Jul 14, 2023

Codecov Report

Merging #4238 (aa262e9) into main (4982dd1) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4238     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        353     355      +2     
  Lines      38643   38915    +272     
=======================================
+ Hits       38522   38794    +272     
  Misses       121     121             
Files Changed Coverage Δ
evalml/pipelines/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/estimators/__init__.py 100.0% <ø> (ø)
evalml/pipelines/time_series_pipeline_base.py 100.0% <ø> (ø)
...valml/pipelines/time_series_regression_pipeline.py 100.0% <ø> (ø)
...alml/tests/component_tests/test_arima_regressor.py 100.0% <ø> (ø)
...alml/tests/model_family_tests/test_model_family.py 100.0% <ø> (ø)
evalml/utils/gen_utils.py 99.3% <ø> (ø)
evalml/model_family/model_family.py 100.0% <100.0%> (ø)
...lines/components/estimators/regressors/__init__.py 100.0% <100.0%> (ø)
... and 7 more

@christopherbunn changed the title from "Add multiseries VARMAX regressor" to "Added multiseries VARMAX regressor" on Jul 18, 2023
@christopherbunn marked this pull request as ready for review on July 18, 2023 17:59
@chukarsten (Contributor) left a comment

Since you're working on this, just blasting this out now.

return X, y

def _set_forecast(self, X: pd.DataFrame):
from sktime.forecasting.base import ForecastingHorizon
Reviewer:

And should probably move this to the top.

Author (christopherbunn):

The import is done here just in case the user does not have sktime installed (since it's an optional dependency).

Reviewer:

Don't we have another pattern that we use for this?

Author (christopherbunn):

I just double checked and we do this pattern for ARIMA only but we import at the top for most other sktime items (such as metrics). I'll update this and ARIMA to import at the top.
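The optional-dependency pattern discussed here can be sketched as a small guard; `load_forecasting_horizon` is a hypothetical helper name, not evalml's actual code:

```python
def load_forecasting_horizon():
    """Lazily import ForecastingHorizon so the surrounding module still
    imports cleanly when the optional sktime dependency is missing."""
    try:
        from sktime.forecasting.base import ForecastingHorizon
    except ImportError as err:
        raise ImportError(
            "sktime is not installed. It is an optional dependency; "
            "install it to use the VARMAX regressor.",
        ) from err
    return ForecastingHorizon
```

Importing at module top level instead (as the author agrees to do) moves this failure to import time, which is why optional dependencies are sometimes deferred like this.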


# we can only calculate the difference if the indices are of the same type
units_diff = 1
if isinstance(X.index[0], type(self.last_X_index)) and isinstance(
Reviewer:

This reference to self.last_X_index seems a little shaky...I know I'm reviewing the code from top to bottom, but it seems we could run into trouble with the order of the calls and whether this property is defined yet or not. Maybe it's better to just pass it in.

Author (christopherbunn):

Reset to be at parity with the ARIMA version. I can throw in a check for the case where it isn't defined yet, if that helps?

[units_diff + i for i in range(len(X))],
is_relative=True,
)
return fh_
Reviewer:

Super nit, but should X here be more like X_forecast? It's what we're forecasting on, right?

Author (christopherbunn):

Isn't X what we're forecasting on?

Reviewer:

We're forecasting on the entire dataset? or a subset of X?

Author (christopherbunn):

Ah I see what you mean. X in this case is the covariate data which is indeed what we're forecasting on. I can change X to be X_forecast in this case if that clarifies it.
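The relative forecast-horizon construction being discussed can be sketched roughly as follows. `relative_horizon` and its arguments are hypothetical names; the real code builds an sktime `ForecastingHorizon` and handles more index types than this:

```python
import pandas as pd


def relative_horizon(last_train_index, X_forecast):
    # Offset of the first forecast row relative to the end of training data.
    # We can only compute a gap when the indices are of the same type;
    # otherwise fall back to an offset of 1 (forecast right after training).
    units_diff = 1
    first = X_forecast.index[0]
    if isinstance(first, type(last_train_index)) and isinstance(first, pd.Timestamp):
        # Illustrative assumption: a daily frequency when measuring the gap.
        units_diff = (first - last_train_index).days
    return [units_diff + i for i in range(len(X_forecast))]
```

For training data ending 2023-01-05 and daily covariates starting 2023-01-06, this yields [1, 2, 3, ...] — the relative steps that `ForecastingHorizon(..., is_relative=True)` expects.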

@eccabay (Contributor) left a comment

I need to come back to take a closer look at the tests, but I wanted to get my first round in sooner rather than later! This is a beast, thanks so much for tackling it!

@eccabay (Contributor) left a comment

This continues to be a beast. I left a bunch of nitpicks and a few suggestions, but this is really looking great.


def __init__(
self,
series_id: Optional[Hashable] = None,
Reviewer:

Looks like this is missing from the docstring! Wonder how that made it through lint 🤔

Checking on its usage, however, it doesn't seem like it's actually used anywhere. Am I missing anything, or is this parameter actually unnecessary? I assume we should have some check to ensure we don't run VARMAX outside of the multiseries case, but if we don't actually use series_id we may need to check it elsewhere instead.

Note: if we end up removing the series_id parameter (it's not currently included in the multiseries baseline, fwiw), we'd need to update a bunch of tests as well.

Author (christopherbunn):

Agreed, I think it's a remnant from when I thought about having to stack/unstack by component. I'm fine with removing it for now, given the project scope.

I'm guessing it made it through the linter because the docstring is on the class rather than the __init__() function. I wonder whether it's worth moving the args under __init__, or whether we even want to make that change. But that's out of scope for this PR 😅

Comment on lines +133 to +136
if y is None:
raise ValueError("VARMAX Regressor requires y as input.")
Reviewer:

If I recall correctly, VARMAX can predict on single-series data just fine. I'm curious about your thoughts: should we enforce that we only run it for multiseries? We can do that easily here by double-checking y's shape. It would also help prevent the case where we try to predict on stacked multiseries data (I did that in local VARMAX testing out of curiosity, and it didn't error but also didn't do what we'd want it to do).

Author (christopherbunn):

I think sktime enforces that there must be two or more columns for y, IIRC. The docstring for sktime's fit is not super clear on this, but it says it errors out if the forecaster has a specific multivariate tag.

I'm open to having a ValueError raised if there's a single column saying it might need to be unstacked if you think it's worth having a more specific error message!

Reviewer:

If VARMAX raises it, I'm sure that's fine 😀
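If the more specific error message mentioned above were wanted, the explicit guard could look something like this (a hypothetical sketch — the PR ultimately relies on sktime's own multivariate check instead):

```python
import pandas as pd


def check_multiseries_target(y):
    # VARMAX models interactions between multiple series, so require a
    # target with two or more columns; a single stacked column likely
    # needs to be unstacked into wide format first.
    if not isinstance(y, pd.DataFrame) or y.shape[1] < 2:
        raise ValueError(
            "VARMAX requires a multiseries target with two or more columns; "
            "stacked multiseries data may need to be unstacked first.",
        )
```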

Comment on lines 139 to 146
if X is not None:
self.last_X_index = X.index[-1]
X = X.ww.select(exclude=["Datetime"])

X.ww.set_types(
{
col: "Double"
for col in X.ww.select(["Boolean"], return_schema=True).columns
},
)
X, y = match_indices(X, y)

if not X.empty and self.use_covariates:
self._component_obj.fit(y=y, X=X)
else:
self._component_obj.fit(y=y)
Reviewer:

Small potential simplification here - can we just check if self.use_covariates and X is not None and not X.empty for this section, rather than nesting that final if statement? It'd save us some time when use_covariates is false, and save us from the edge case where an improperly sized X is passed in but use_covariates is false.

Author (christopherbunn):

We'll still need the X.empty check just in case excluding the datetime column makes X empty, but I'll move the use_covariates check up.
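Combining the reviewer's suggestion with the author's caveat, the resulting control flow might look like this sketch. It uses plain pandas dtype selection and a generic sktime-style `model` object, not the actual component code (which works through woodwork):

```python
import pandas as pd


def fit_with_optional_covariates(model, X, y, use_covariates):
    # Check use_covariates first so covariate preprocessing is skipped
    # entirely when disabled; keep the X.empty check in case dropping
    # datetime columns leaves no covariates at all.
    if use_covariates and X is not None:
        X = X.select_dtypes(exclude=["datetime", "datetimetz"]).copy()
        bool_cols = X.select_dtypes(include=["bool"]).columns
        X[bool_cols] = X[bool_cols].astype("float64")
        if not X.empty:
            model.fit(y=y, X=X)
            return model
    model.fit(y=y)
    return model
```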

(Three outdated review threads on evalml/tests/component_tests/test_varmax_regressor.py were marked resolved.)
vx = VARMAXRegressor(time_index="dates", series_id="series_id")
vx.fit(X, y)
preds = vx.predict(X)
assert all(preds.isna().eq(False))
Reviewer:

This test is identical to the previous one save the mocking of fit and the very last line.... that's a lot of repeated code. Can we pull out the data into a new fixture/add the possibility of boolean to an existing fixture, or combine these two tests, or something?

Really, as I'm thinking about it, I don't think we need the first test if we have the second - it doesn't really matter that VARMAX converts bools to floats so long as at the end of the day, it handles them. Right?

Author (christopherbunn):

I can combine the tests! I think we should test to make sure that the values are converted and that we are getting results.

pytest.param(
True,
marks=pytest.mark.xfail(
reason="Currently, using covariates with VARMAX causes inconsistent results when predicting",
Reviewer:

Do you have any guesses (for posterity/documentation) why this might be? Does this test always fail, or only fail intermittently?

Author (christopherbunn):

Yep, it took me a while to understand, but basically X_test_last_5 only has covariate data for the last 5 values. Since we cut off the first 3 rows of the 8-row test dataset, the X data differs between predict(X_test) and predict(X_test_last_5): X_test_last_5 has all-zero values for its first 3 rows. As a result, the estimator has less covariate data to use when making the prediction, so the predictions are very slightly different.

Collaborator:

@christopherbunn is this fixable? If it's not a big lift, let's do it in this PR; if not, let's file an issue to resolve it.

Author (christopherbunn):

Actually, the more that I think about it, I don't think there's really anything to fix here, since we should expect a different result if we're able to feed in more past covariate data. I'm going to update this to clarify that we should expect equal results only in the case where we train the model without covariate data.
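The xfail parametrization shown in the diff excerpt above works roughly like this toy stand-in (the real test compares VARMAX predictions with and without truncated covariate history):

```python
import pytest


@pytest.mark.parametrize(
    "use_covariates",
    [
        False,
        pytest.param(
            True,
            marks=pytest.mark.xfail(
                reason="more past covariate data legitimately changes predictions",
            ),
        ),
    ],
)
def test_predictions_match(use_covariates):
    # Toy stand-in: with covariates enabled, the truncated history differs,
    # so the equality assertion is expected to fail for that parameter.
    full_history = [1.0, 2.0, 3.0]
    truncated = [0.0, 2.0, 3.0] if use_covariates else full_history
    assert full_history == truncated
```

With xfail, the failing covariate case is recorded as expected rather than breaking the suite, which documents the known behavior without hiding it.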

@christopherbunn force-pushed the TML-7804_VARMAX branch 3 times, most recently from 76cf807 to 459528c, on August 7, 2023 19:41

@jeremyliweishih (Collaborator) left a comment

LGTM just some comments

Comment on lines 140 to 145
X.ww.set_types(
{
col: "Double"
for col in X.ww.select(["Boolean"], return_schema=True).columns
},
)
Collaborator:

This block shows up a couple times - can we consolidate into a method?

Author (christopherbunn):

Sure, consolidated into convert_bool_to_double()
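A plain-pandas sketch of such a consolidated helper follows; the actual component works through the woodwork (X.ww) typing accessor rather than raw pandas dtypes:

```python
import pandas as pd


def convert_bool_to_double(X):
    # Cast boolean columns to float64 so the downstream statsmodels/sktime
    # numerics treat them as ordinary numeric regressors.
    X = X.copy()
    bool_cols = X.select_dtypes(include=["bool"]).columns
    X[bool_cols] = X[bool_cols].astype("float64")
    return X
```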




@christopherbunn enabled auto-merge (squash) on August 8, 2023 20:14
@christopherbunn merged commit 8d7e235 into main on Aug 8, 2023 (22 checks passed)
@christopherbunn deleted the TML-7804_VARMAX branch on August 8, 2023 20:29
Closes: VARMAX for Independent Multiseries Time Series (EvalML)