
Add baseline multiseries regressor #4246

Merged: 17 commits merged into main on Aug 1, 2023

Conversation

@eccabay (Contributor) commented Jul 20, 2023

Closes #4241

Implementation assumes one column per target series, which is not the final state of the input. Changes will probably have to be made once stacking/unstacking functions are implemented and this is integrated into search.

Implementation also assumes integer indices.
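For context, a rough sketch of the wide-format input this assumes (column names, series count, and the shift-based forecast below are illustrative only, not the component's actual API):

```python
import pandas as pd

# Wide-format target: one column per series, plain integer index.
y = pd.DataFrame({
    "target_0": [1, 2, 3, 4, 5],
    "target_1": [10, 20, 30, 40, 50],
})

# A naive baseline forecast echoes each series' value from
# `forecast_horizon` steps earlier (here, 1 step).
forecast_horizon = 1
baseline_predictions = y.shift(forecast_horizon)
print(baseline_predictions)
```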

codecov bot commented Jul 24, 2023

Codecov Report

Merging #4246 (38eb434) into main (5b80a8e) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4246     +/-   ##
=======================================
- Coverage   99.7%   99.7%   -0.0%     
=======================================
  Files        349     351      +2     
  Lines      38413   38497     +84     
=======================================
+ Hits       38293   38376     +83     
- Misses       120     121      +1     
Files Changed Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/estimators/__init__.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 99.2% <ø> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.6% <ø> (-<0.1%) ⬇️
evalml/utils/gen_utils.py 99.3% <ø> (ø)
evalml/pipelines/components/component_base.py 100.0% <100.0%> (ø)
...lines/components/estimators/regressors/__init__.py 100.0% <100.0%> (ø)
...sors/multiseries_time_series_baseline_regressor.py 100.0% <100.0%> (ø)
...ansformers/preprocessing/time_series_featurizer.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.4% <100.0%> (-0.2%) ⬇️
... and 4 more

@eccabay eccabay marked this pull request as ready for review July 24, 2023 20:56
@eccabay eccabay marked this pull request as draft July 25, 2023 19:31
@eccabay eccabay marked this pull request as ready for review July 26, 2023 19:40
@jeremyliweishih (Collaborator) left a comment

Good work 😸

for t in self.statistically_significant_lags:
    lagged_features[self.target_colname_prefix.format(t)] = y.shift(t)
if isinstance(y, pd.DataFrame):
    lagged_features.update(self._delay_df(y, y.columns))
Collaborator

thought: should we just run self._encode_y_while_preserving_index(y) even though we won't expect categorical columns just yet?

Contributor Author

🤔 interesting point, will we ever expect categorical columns? We're only supporting regression problems for multiseries

Collaborator

Potentially sometime in the distant future! Just thought it would make it one step easier for whoever implements that 😄. It'll be a no-op anyway right now.

Contributor Author

I like the idea in theory, but I'm worried about increasing runtime by checking whether y contains any categorical columns. We'd have to do so in all cases, which feels wasteful when we know we won't be dealing with them.

@christopherbunn (Contributor) left a comment

A few questions for my own clarification plus a suggestion, but once those are answered, LGTM.

@@ -830,6 +830,22 @@ def X_y_regression():
    return X, y


@pytest.fixture
def X_y_multiseries_regression():
Contributor

I think I'm going to utilize this for my VARMAX testing. In that case, does it make sense to have it align more with the inputs we expect for a multiseries regressor (e.g. a series_id column and the column names having series_id value suffixes)?

@christopherbunn (Contributor) commented Jul 31, 2023

Also, since this is a time series pipeline, should we be extending ts_data() instead? If we do decide to keep this, we should also make it a time series dataset by adding a datetime column and changing the name to identify it as a time series dataset.

My bad, I forgot we have the multiseries_ts_data_stacked and multiseries_ts_data_unstacked fixtures. In that case, is there a reason why the test_multiseries_baseline_regressor.py test cases don't use those?

Contributor Author

Good callout that this should more closely match the actual expected input - I wrote these before I wrote the stacking and unstacking functions and never updated them 😅 - that's also why these tests don't use multiseries_ts_data_unstacked. I'll refactor the test fixtures a bit so they're consolidated and up to date.
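As a rough sketch of what a consolidated multiseries test fixture might look like (the fixture name, series count, and column names below are hypothetical, not the merged conftest code):

```python
import numpy as np
import pandas as pd
import pytest


@pytest.fixture
def multiseries_regression_data_unstacked():
    """Hypothetical fixture: unstacked multiseries data with a datetime column
    and one target column per series (column names suffixed with the series id)."""
    X = pd.DataFrame({
        "date": pd.date_range("2023-01-01", periods=20, freq="D"),
        "feature_a": np.arange(20),
    })
    y = pd.DataFrame({f"target_{i}": np.arange(20) + 10 * i for i in range(5)})
    return X, y
```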

Comment on lines +127 to +128
if isinstance(y, pd.DataFrame):
    self.statistically_significant_lags = [self.start_delay]
Contributor

So for the multiseries case, do we not try to find the significant lags? Does this just use all or none of the lags?

Contributor Author

We don't need to find the significant lags because we're not actually doing feature engineering here, just getting the properly lagged column that our baseline regressor relies on. By setting the lags that we calculate to be just self.start_delay, we only compute the one we know we need.
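A minimal sketch of that behavior, assuming an unstacked target; the column-name format and `start_delay` value here are illustrative:

```python
import pandas as pd

# Unstacked multiseries target: one column per series.
y = pd.DataFrame({"target_0": range(10), "target_1": range(10, 20)})

# With statistically_significant_lags pinned to [start_delay], only the single
# lagged column per series that the baseline relies on gets computed.
start_delay = 3  # e.g. gap + forecast_horizon
lagged = pd.DataFrame(
    {f"{col}_delay_{start_delay}": y[col].shift(start_delay) for col in y.columns}
)
print(lagged.head())
```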

Contributor

Might want to add an explicit comment here for the case we're splitting on...e.g. if y is a dataframe, we expect it to be multiseries.

Comment on lines 82 to 86
Args:
    X (pd.DataFrame): Data of shape [n_samples, n_features].

Returns:
    pd.Series: Predicted values.
Contributor

I'm pretty sure this is correct, but just to double check: we're returning the predictions with the predicted values stacked, right (i.e. in series form and not as a dataframe)?

Contributor Author

Lol, this is a copypasta fail. We're returning the unstacked dataframe here
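For reference, a small illustration of the stacked vs. unstacked shapes being discussed (column naming is illustrative):

```python
import pandas as pd

# Unstacked: one column per series (what predict returns here).
unstacked = pd.DataFrame({"target_0": [1, 2, 3], "target_1": [10, 20, 30]})

# Stacked: a single series, with (time index, series id) identifying each value.
stacked = unstacked.stack()
print(stacked)
```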

@chukarsten (Contributor) left a comment

Just a few nits, great work

Comment on lines +102 to +112
@property
def feature_importance(self):
    """Returns importance associated with each feature.

    Since baseline estimators do not use input features to calculate predictions, returns an array of zeroes.

    Returns:
        np.ndarray (float): An array of zeroes.
    """
    importance = np.array([0] * self._num_features)
    return importance
Contributor

Probably another nit, but if you're calling out all baseline estimators... is it worth putting together a story to add a BaselineEstimator class to the inheritance chain and have them all inherit this property definition?

Contributor Author

Filed #4255
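For illustration, a minimal sketch of the kind of shared base class being proposed (the class name and placement here are hypothetical, not the code from #4255):

```python
import numpy as np


class BaselineEstimatorMixin:
    """Hypothetical shared mixin for baseline estimators.

    Baseline estimators ignore the input features, so feature importance is an
    array of zeroes regardless of the data.
    """

    @property
    def feature_importance(self):
        """Returns an array of zeroes, one per input feature."""
        return np.zeros(self._num_features)
```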

if categorical_columns and col_name in categorical_columns:
    col = X_categorical[col_name]
for t in self.statistically_significant_lags:
    lagged_features[f"{col_name}_delay_{t}"] = col.shift(t)
Contributor

We're not going to be doing any external matching on this name format, right? If we are, I think we might want to establish a pattern of making this string format a module-level constant or accessible via the class.

Contributor Author

Good call - adjusting!
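A minimal sketch of the adjustment being suggested (the attribute and method names below are illustrative; the merged code may name them differently):

```python
import pandas as pd


class TimeSeriesFeaturizerSketch:
    """Illustrative only: hoist the delayed-column name format onto the class so
    other components can reference it instead of rebuilding the f-string."""

    # Shared name format for delayed feature columns, e.g. "target_0_delay_3".
    df_colname_prefix = "{}_delay_{}"

    def _delay_column(self, col: pd.Series, col_name: str, t: int):
        # Returns the (name, values) pair for a single delayed column.
        return self.df_colname_prefix.format(col_name, t), col.shift(t)
```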

@eccabay eccabay enabled auto-merge (squash) July 31, 2023 20:53
@eccabay eccabay merged commit 7468580 into main Aug 1, 2023
22 checks passed
@eccabay eccabay deleted the 4241_baseline_multiseries branch August 1, 2023 15:44