
Add baseline multiseries regressor #4246

Merged: 17 commits merged into main on Aug 1, 2023

Conversation

@eccabay (Contributor) commented Jul 20, 2023

Closes #4241

Implementation assumes one column per target series, which is not the final state of the input. Changes will probably have to be made once stacking/unstacking functions are implemented and this is integrated into search.

Implementation also assumes integer indices.
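For context, a rough sketch of the wide-format input this assumes (column names, series count, and the shift-based forecast below are illustrative only, not the component's actual API):

```python
import pandas as pd

# Wide-format target: one column per series, plain integer index.
y = pd.DataFrame({
    "target_0": [1, 2, 3, 4, 5],
    "target_1": [10, 20, 30, 40, 50],
})

# A naive baseline forecast echoes each series' value from
# `forecast_horizon` steps earlier (here, 1 step).
forecast_horizon = 1
baseline_predictions = y.shift(forecast_horizon)
print(baseline_predictions)
```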

codecov bot commented Jul 24, 2023

Codecov Report

Merging #4246 (38eb434) into main (5b80a8e) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4246     +/-   ##
=======================================
- Coverage   99.7%   99.7%   -0.0%     
=======================================
  Files        349     351      +2     
  Lines      38413   38497     +84     
=======================================
+ Hits       38293   38376     +83     
- Misses       120     121      +1     
Files Changed Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/estimators/__init__.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 99.2% <ø> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.6% <ø> (-<0.1%) ⬇️
evalml/utils/gen_utils.py 99.3% <ø> (ø)
evalml/pipelines/components/component_base.py 100.0% <100.0%> (ø)
...lines/components/estimators/regressors/__init__.py 100.0% <100.0%> (ø)
...sors/multiseries_time_series_baseline_regressor.py 100.0% <100.0%> (ø)
...ansformers/preprocessing/time_series_featurizer.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.4% <100.0%> (-0.2%) ⬇️
... and 4 more

@eccabay eccabay marked this pull request as ready for review July 24, 2023 20:56
@eccabay eccabay marked this pull request as draft July 25, 2023 19:31
@eccabay eccabay marked this pull request as ready for review July 26, 2023 19:40
@jeremyliweishih (Collaborator) left a comment

Good work 😸

for t in self.statistically_significant_lags:
    lagged_features[self.target_colname_prefix.format(t)] = y.shift(t)
if isinstance(y, pd.DataFrame):
    lagged_features.update(self._delay_df(y, y.columns))
Collaborator

thought: should we just run self._encode_y_while_preserving_index(y) even though we won't expect categorical columns just yet?

Contributor Author

🤔 interesting point, will we ever expect categorical columns? We're only supporting regression problems for multiseries

Collaborator

Potentially sometime in the distant future! Just thought it would make it one step easier for whoever implements that 😄. It'll be a no-op anyway right now.

Contributor Author

I like the idea in theory, but I'm worried about increasing runtime by checking whether y contains any categorical columns. We'd have to do so in all cases, which feels wasteful when we know we won't be dealing with them.

@christopherbunn (Contributor) left a comment

A few questions for my own clarification plus a suggestion, but once those are answered, LGTM.

@@ -830,6 +830,22 @@ def X_y_regression():
    return X, y


@pytest.fixture
def X_y_multiseries_regression():
Contributor

I think I'm going to utilize this for my VARMAX testing. In that case, does it make sense to have it align more with the inputs we expect for a multiseries regressor (e.g. a series_id column and the column names having series_id value suffixes)?

@christopherbunn (Contributor) commented Jul 31, 2023

Also, since this is a time series pipeline, should we be extending ts_data() instead? If we do decide to keep this, we should also make it a time series dataset by adding a datetime column and changing the name to identify it as a time series dataset.

My bad, I forgot we have the multiseries_ts_data_stacked and multiseries_ts_data_unstacked fixtures. In that case, is there a reason why the test_multiseries_baseline_regressor.py test cases don't use those?

Contributor Author

Good callout that this should more closely match the actual expected input - I wrote these before I wrote the stacking and unstacking functions and never updated them 😅 - that's also why these tests don't use multiseries_ts_data_unstacked. I'll refactor the test fixtures a bit so they're consolidated and up to date.
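As a rough sketch of what a consolidated multiseries test fixture might look like (the fixture name, series count, and column names below are hypothetical, not the merged conftest code):

```python
import numpy as np
import pandas as pd
import pytest


@pytest.fixture
def multiseries_regression_data_unstacked():
    """Hypothetical fixture: unstacked multiseries data with a datetime column
    and one target column per series (column names suffixed with the series id)."""
    X = pd.DataFrame({
        "date": pd.date_range("2023-01-01", periods=20, freq="D"),
        "feature_a": np.arange(20),
    })
    y = pd.DataFrame({f"target_{i}": np.arange(20) + 10 * i for i in range(5)})
    return X, y
```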

Comment on lines +127 to +128
if isinstance(y, pd.DataFrame):
    self.statistically_significant_lags = [self.start_delay]
Contributor

So for the multiseries case, do we not try to find the significant lags? Does this just use all or none of the lags?

Contributor Author

We don't need to find the significant lags because we're not actually doing feature engineering here, just getting the properly lagged column that our baseline regressor relies on. By setting the lags that we calculate to be just self.start_delay, we only compute the one we know we need.
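A minimal sketch of that behavior, assuming an unstacked target; the column-name format and `start_delay` value here are illustrative:

```python
import pandas as pd

# Unstacked multiseries target: one column per series.
y = pd.DataFrame({"target_0": range(10), "target_1": range(10, 20)})

# With statistically_significant_lags pinned to [start_delay], only the single
# lagged column per series that the baseline relies on gets computed.
start_delay = 3  # e.g. gap + forecast_horizon
lagged = pd.DataFrame(
    {f"{col}_delay_{start_delay}": y[col].shift(start_delay) for col in y.columns}
)
print(lagged.head())
```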

Contributor

Might want to add an explicit comment here for the case we're splitting on...e.g. if y is a dataframe, we expect it to be multiseries.

Comment on lines 82 to 86
Args:
    X (pd.DataFrame): Data of shape [n_samples, n_features].

Returns:
    pd.Series: Predicted values.
Contributor

I'm pretty sure this is correct, but just to double check: we're returning the predictions with the predicted values stacked, right (i.e. in series form and not as a dataframe)?

Contributor Author

Lol, this is a copypasta fail. We're returning the unstacked dataframe here
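For reference, a small illustration of the stacked vs. unstacked shapes being discussed (column naming is illustrative):

```python
import pandas as pd

# Unstacked: one column per series (what predict returns here).
unstacked = pd.DataFrame({"target_0": [1, 2, 3], "target_1": [10, 20, 30]})

# Stacked: a single series, with (time index, series id) identifying each value.
stacked = unstacked.stack()
print(stacked)
```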

@chukarsten (Contributor) left a comment

Just a few nits, great work

Comment on lines +102 to +112
@property
def feature_importance(self):
    """Returns importance associated with each feature.

    Since baseline estimators do not use input features to calculate predictions, returns an array of zeroes.

    Returns:
        np.ndarray (float): An array of zeroes.
    """
    importance = np.array([0] * self._num_features)
    return importance
Contributor

Probably another nit, but if you're calling out all baseline estimators... is it worth putting together a story to add a BaselineEstimator class to the inheritance chain and have them all inherit this property definition?

Contributor Author

Filed #4255
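For illustration, a minimal sketch of the kind of shared base class being proposed (the class name and placement here are hypothetical, not the code from #4255):

```python
import numpy as np


class BaselineEstimatorMixin:
    """Hypothetical shared mixin for baseline estimators.

    Baseline estimators ignore the input features, so feature importance is an
    array of zeroes regardless of the data.
    """

    @property
    def feature_importance(self):
        """Returns an array of zeroes, one per input feature."""
        return np.zeros(self._num_features)
```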

if categorical_columns and col_name in categorical_columns:
    col = X_categorical[col_name]
for t in self.statistically_significant_lags:
    lagged_features[f"{col_name}_delay_{t}"] = col.shift(t)
Contributor

We're not going to be doing any external matching on this name format, right? If we are, I think we might want to establish a pattern of making this string format a module-level constant or accessible via the class.

Contributor Author

Good call - adjusting!
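A minimal sketch of the adjustment being suggested (the attribute and method names below are illustrative; the merged code may name them differently):

```python
import pandas as pd


class TimeSeriesFeaturizerSketch:
    """Illustrative only: hoist the delayed-column name format onto the class so
    other components can reference it instead of rebuilding the f-string."""

    # Shared name format for delayed feature columns, e.g. "target_0_delay_3".
    df_colname_prefix = "{}_delay_{}"

    def _delay_column(self, col: pd.Series, col_name: str, t: int):
        # Returns the (name, values) pair for a single delayed column.
        return self.df_colname_prefix.format(col_name, t), col.shift(t)
```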

@eccabay eccabay enabled auto-merge (squash) July 31, 2023 20:53
@eccabay eccabay merged commit 7468580 into main Aug 1, 2023
22 checks passed
@eccabay eccabay deleted the 4241_baseline_multiseries branch August 1, 2023 15:44