Added multiseries VARMAX regressor #4238
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##            main    #4238    +/-   ##
=======================================
+ Coverage   99.7%    99.7%    +0.1%
=======================================
  Files        353      355       +2
  Lines      38643    38915     +272
=======================================
+ Hits       38522    38794     +272
  Misses       121      121
```
Force-pushed from de50687 to 7854455.
Force-pushed from 13f9f33 to 9001204.
Since you're working on this, just blasting this out now.
```python
return X, y

def _set_forecast(self, X: pd.DataFrame):
    from sktime.forecasting.base import ForecastingHorizon
```
And should probably move this to the top.
Import is set here just in case the user does not have sktime installed (since it's an optional dependency).
Don't we have another pattern that we use for this?
I just double checked and we do this pattern for ARIMA only but we import at the top for most other sktime items (such as metrics). I'll update this and ARIMA to import at the top.
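For posterity, the deferred-import pattern under discussion looks roughly like this sketch (the helper name and error message are hypothetical, not evalml's actual code):

```python
def _lazy_forecasting_horizon():
    # sktime is an optional dependency: deferring the import means this
    # module can be imported even when sktime isn't installed, and the
    # failure only surfaces when the forecaster is actually used.
    try:
        from sktime.forecasting.base import ForecastingHorizon
    except ImportError as err:
        # Illustrative message; evalml's real error text may differ.
        raise ImportError(
            "sktime is not installed. Install it to use the VARMAX regressor.",
        ) from err
    return ForecastingHorizon
```

Moving the import to the top of the module trades this laziness for simplicity, which is fine as long as the module itself is only imported when the optional dependency is present.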
```python
# we can only calculate the difference if the indices are of the same type
units_diff = 1
if isinstance(X.index[0], type(self.last_X_index)) and isinstance(
```
This reference to self.last_X_index seems a little shaky... I know I'm reviewing the code from top to bottom, but it seems we could run into trouble with the order of the calls and whether this property is defined yet or not. Maybe it's better to just pass it in.
Reset to be at parity with the ARIMA version. I can throw in a check for whether it's defined yet, if that helps?
```python
    [units_diff + i for i in range(len(X))],
    is_relative=True,
)
return fh_
```
Super nit, but should X here be more like X_forecast? It's what we're forecasting on, right?
Isn't X what we're forecasting on?
We're forecasting on the entire dataset? or a subset of X?
Ah, I see what you mean. X in this case is the covariate data, which is indeed what we're forecasting on. I can change X to X_forecast in this case if that clarifies it.
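Putting the rename together, the method would read roughly like this sketch (assembled from the snippets quoted above; the units_diff computation is elided, and the final merged code may differ):

```python
import pandas as pd
from sktime.forecasting.base import ForecastingHorizon  # now a top-level import


def _set_forecast(self, X_forecast: pd.DataFrame):
    # We can only calculate the difference if the indices are of the same type.
    units_diff = 1
    if isinstance(X_forecast.index[0], type(self.last_X_index)):
        # Compute the gap between the last training timestamp and the start
        # of the forecast window (abridged; depends on the index type).
        ...
    # A relative horizon: steps ahead of the final training observation.
    return ForecastingHorizon(
        [units_diff + i for i in range(len(X_forecast))],
        is_relative=True,
    )
```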
Force-pushed from a118f03 to bff6c9c.
I need to come back to take a closer look at the tests, but I wanted to get my first round in sooner rather than later! This is a beast, thanks so much for tackling it!
This continues to be a beast. I left a bunch of nitpicks and a few suggestions, but this is really looking great.
```python
def __init__(
    self,
    series_id: Optional[Hashable] = None,
```
Looks like this is missing from the docstring! Wonder how that made it through lint 🤔
Checking on its usage, however, it doesn't seem like it's actually used anywhere. Am I missing anything, or is this parameter actually unnecessary? I assume we should have some check to ensure we don't run VARMAX outside of the multiseries case, but if we don't actually use series_id we may need to check it elsewhere instead.
Note - if we end up removing the series_id parameter (it's not currently included in the multiseries baseline, fwiw), we'd need to update a bunch of tests as well.
Agreed, I think it's a remnant from when I thought about having to stack/unstack by component. I'm fine with removing it for the project scope for now.
I'm guessing it made it through the linter because the docstring is on the class rather than the __init__() function. I wonder if it's worth moving the args under __init__, or whether we even want to make that change. But that's out of scope of this PR 😅
```python
if y is None:
    raise ValueError("VARMAX Regressor requires y as input.")
```
If I recall correctly, VARMAX can predict on single series data just fine. I'm curious about your thoughts: should we enforce that we only run it for multiseries? We can do that easily here by double-checking y's shape. It would also help prevent the case where we try to predict on stacked multiseries data (I did that in local VARMAX testing out of curiosity, and it didn't error but also didn't do what we'd want it to do).
I think sktime enforces that there must be two or more columns for y, iirc. The docstring for sktime's fit is not super clear on this, but it says it errors out if the forecaster has a specific multivariate tag.
I'm open to raising a ValueError when there's a single column, saying the data might need to be unstacked, if you think a more specific error message is worth having!
If VARMAX raises it, I'm sure that's fine 😀
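As a sketch, the stricter check discussed above could look like this (the helper name and error message are hypothetical, and sktime's own validation may make it redundant):

```python
import pandas as pd


def _check_multiseries_y(y):
    # Hypothetical helper for the explicit multiseries check discussed above.
    if y is None:
        raise ValueError("VARMAX Regressor requires y as input.")
    # VARMAX is a multivariate model: require two or more target columns.
    if isinstance(y, pd.Series) or y.shape[1] < 2:
        raise ValueError(
            "VARMAX requires multiseries data: y must have two or more "
            "columns. Stacked multiseries data may need to be unstacked first.",
        )
```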
```python
if X is not None:
    self.last_X_index = X.index[-1]
    X = X.ww.select(exclude=["Datetime"])

    X.ww.set_types(
        {
            col: "Double"
            for col in X.ww.select(["Boolean"], return_schema=True).columns
        },
    )
    X, y = match_indices(X, y)

    if not X.empty and self.use_covariates:
        self._component_obj.fit(y=y, X=X)
    else:
        self._component_obj.fit(y=y)
```
Small potential simplification here - can we just check if self.use_covariates and X is not None and not X.empty for this section, rather than nesting that final if statement? It'd save us some time when use_covariates is false, and save us from the edge case where an improperly sized X is passed in but use_covariates is false.
We'll still need the X.empty check just in case excluding the datetime column makes X empty, but I'll move the use_covariates check up.
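The reordered condition would read roughly as follows (a sketch of the change being agreed on here, not the final merged code):

```python
# use_covariates is checked first, so a missing or improperly sized X is
# ignored whenever covariates are disabled. The X.empty guard stays because
# excluding the datetime column can leave X with no columns at all.
if self.use_covariates and X is not None and not X.empty:
    self._component_obj.fit(y=y, X=X)
else:
    self._component_obj.fit(y=y)
```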
```python
vx = VARMAXRegressor(time_index="dates", series_id="series_id")
vx.fit(X, y)
preds = vx.predict(X)
assert all(preds.isna().eq(False))
```
This test is identical to the previous one save for the mocking of fit and the very last line... that's a lot of repeated code. Can we pull the data out into a new fixture, add the possibility of booleans to an existing fixture, combine these two tests, or something?
Really, as I'm thinking about it, I don't think we need the first test if we have the second - it doesn't really matter that VARMAX converts bools to floats so long as, at the end of the day, it handles them. Right?
I can combine the tests! I think we should test to make sure that the values are converted and that we are getting results.
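A combined test along the lines suggested might look like this sketch (the fixture data, test name, and import path are all hypothetical; the real test should use the project's fixtures):

```python
import numpy as np
import pandas as pd

from evalml.pipelines.components import VARMAXRegressor  # import path assumed


def test_varmax_handles_boolean_covariates():
    # Hypothetical toy data standing in for the project's fixtures:
    # two target series plus a boolean covariate column.
    X = pd.DataFrame(
        {
            "dates": pd.date_range("2020-01-01", periods=20),
            "is_holiday": [i % 7 == 0 for i in range(20)],
        },
    )
    y = pd.DataFrame(
        {
            "series_0": np.random.rand(20),
            "series_1": np.random.rand(20),
        },
    )
    vx = VARMAXRegressor(time_index="dates")
    vx.fit(X, y)
    preds = vx.predict(X)
    # If booleans weren't converted to floats internally, fit/predict would
    # have errored; here we just confirm the predictions are usable.
    assert not preds.isna().any().any()
```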
```python
pytest.param(
    True,
    marks=pytest.mark.xfail(
        reason="Currently, using covariates with VARMAX causes inconsistent results when predicting",
    ),
),
```
Do you have any guesses (for posterity/documentation) why this might be? Does this test always fail, or only fail intermittently?
Yep, it took me a while to understand, but basically X_test_last_5 only has covariate data for the last 5 values. Since we cut off the first 3 rows (the test dataset is 8 rows) of covariate information, the X data differs across predict(X_test) and predict(X_test_last_5) in that X_test_last_5 has all 0 values for the first 3 rows. As a result, the estimator has less covariate data to use to make the prediction, making the predictions very slightly different.
@christopherbunn is this fixable? If it's not a big lift, let's do it in this PR, and if not, let's file an issue to resolve this.
Actually, the more that I think about it, I don't think there's really anything to fix here, since we should expect a different result if we're able to feed in more past covariate data. I'm going to update this to clarify that we should expect equal results only for the case where we train the model without covariate data.
Force-pushed from 76cf807 to 459528c.
LGTM, just some comments.
```python
X.ww.set_types(
    {
        col: "Double"
        for col in X.ww.select(["Boolean"], return_schema=True).columns
    },
)
```
This block shows up a couple times - can we consolidate into a method?
Sure, consolidated into convert_bool_to_double().
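For reference, the consolidated helper would look roughly like this (a sketch built from the block quoted above; the signature in the merged code may differ):

```python
import pandas as pd


def convert_bool_to_double(X: pd.DataFrame) -> pd.DataFrame:
    # Cast every Woodwork Boolean column to Double so the underlying
    # sktime/statsmodels estimator sees purely numeric covariates.
    X.ww.set_types(
        {
            col: "Double"
            for col in X.ww.select(["Boolean"], return_schema=True).columns
        },
    )
    return X
```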
Force-pushed from f1a26b0 to aa262e9.
Resolves #4234