Extend TimeSeriesImputer to handle multiple series #4291

Merged 19 commits on Sep 5, 2023
Changes from 5 commits
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -4,6 +4,7 @@ Release Notes
* Enhancements
* Added support for prediction intervals for VARMAX regressor :pr:`4267`
* Integrated multiseries time series into AutoMLSearch :pr:`4270`
* Extended TimeSeriesImputer to handle multiple series :pr:`4291`
* Fixes
* Fixed error when stacking data with no exogenous variables :pr:`4275`
* Changes
23 changes: 16 additions & 7 deletions evalml/pipelines/components/component_base.py
@@ -3,6 +3,7 @@
from abc import ABC, abstractmethod

import cloudpickle
import pandas as pd

from evalml.exceptions import MethodPropertyNotFoundError
from evalml.pipelines.components.component_base_meta import ComponentBaseMeta
@@ -256,7 +257,8 @@
Args:
X (pd.DataFrame, optional): Input data to a component of shape [n_samples, n_features].
May contain nullable types.
y (pd.Series, optional): The target of length [n_samples]. May contain nullable types.
y (pd.Series or pd.DataFrame, optional): The target of length [n_samples] or the unstacked target for a multiseries problem.
May contain nullable types.

Returns:
X, y with any incompatible nullable types downcasted to compatible equivalents.
@@ -273,10 +275,17 @@
y_bool_incompatible = "y" in self._boolean_nullable_incompatibilities
y_int_incompatible = "y" in self._integer_nullable_incompatibilities
if y is not None and (y_bool_incompatible or y_int_incompatible):
y = _downcast_nullable_y(
y,
handle_boolean_nullable=y_bool_incompatible,
handle_integer_nullable=y_int_incompatible,
)

if isinstance(y, pd.Series):
y = _downcast_nullable_y(
y,
handle_boolean_nullable=y_bool_incompatible,
handle_integer_nullable=y_int_incompatible,
)
# if y is a dataframe (from unstacked multiseries) use _downcast_nullable_X since downcast_nullable_y is for series
else:
y = _downcast_nullable_X(
y,
handle_boolean_nullable=y_bool_incompatible,
handle_integer_nullable=y_int_incompatible,
)
return X, y
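
Note on the hunk above: the change dispatches on the type of `y`, so a single-series target goes through `_downcast_nullable_y` and an unstacked multiseries target goes through `_downcast_nullable_X`. The internals of those helpers are not shown in this diff, so the following is only a pandas-only sketch of the same dispatch pattern. The function name `downcast_nullable_target` and the choice to convert `Int64` columns to `float64` are assumptions made for illustration, not evalml behavior.

```python
import pandas as pd


def downcast_nullable_target(y):
    """Hypothetical stand-in for the Series/DataFrame dispatch in the diff."""
    if isinstance(y, pd.Series):
        # Single-series target: one nullable column to convert.
        return y.astype("float64") if str(y.dtype) == "Int64" else y
    # Unstacked multiseries target: convert each nullable integer column.
    nullable_cols = [c for c in y.columns if str(y[c].dtype) == "Int64"]
    if not nullable_cols:
        return y
    return y.astype({c: "float64" for c in nullable_cols})


single = pd.Series([1, None, 3], dtype="Int64")
multi = pd.DataFrame({"target_0": single, "target_1": single})

single_out = downcast_nullable_target(single)
multi_out = downcast_nullable_target(multi)
```

Both outputs keep their missing values as `NaN`, so later fill or interpolation steps can treat the Series and DataFrame cases uniformly.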
evalml/pipelines/components/transformers/imputers/time_series_imputer.py
@@ -93,6 +93,7 @@
self._backwards_cols = None
self._interpolate_cols = None
self._impute_target = None
self._y_all_null_cols = None
super().__init__(
parameters=parameters,
component_obj=None,
@@ -137,11 +138,17 @@
self._backwards_cols = _filter_cols("backwards_fill", X)
self._interpolate_cols = _filter_cols("interpolate", X)

if y is not None:
if isinstance(y, pd.Series):
y = infer_feature_types(y)
if y.isnull().any():
self._impute_target = self.parameters["target_impute_strategy"]

elif isinstance(y, pd.DataFrame):
y = infer_feature_types(y)
y_nan_ratio = y.isna().sum() / y.shape[0]
self._y_all_null_cols = y_nan_ratio[y_nan_ratio == 1].index.tolist()
if y.isnull().values.any():
self._impute_target = self.parameters["target_impute_strategy"]

return self

def transform(self, X, y=None):
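
Note on the `fit` hunk above: for a DataFrame target, the columns recorded in `_y_all_null_cols` are those whose NaN ratio equals 1. A small standalone illustration of that computation, using made-up column values (the data below is an assumption for the example, not taken from the PR's fixtures):

```python
import numpy as np
import pandas as pd

# Toy unstacked target: one partially missing series, one fully missing series.
y = pd.DataFrame(
    {
        "target_0": [1.0, np.nan, 3.0],
        "target_1": [np.nan, np.nan, np.nan],
    },
)

# Same expressions as in the diff: fraction of NaN values per column.
y_nan_ratio = y.isna().sum() / y.shape[0]
all_null_cols = y_nan_ratio[y_nan_ratio == 1].index.tolist()

# The target is flagged for imputation if any cell is missing.
needs_imputation = bool(y.isnull().values.any())

# An equivalent way to find fully missing columns.
all_null_cols_alt = y.columns[y.isna().all()].tolist()
```

Here only `target_1` is recorded as all-null, while `target_0` still triggers imputation because it has at least one missing value.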
@@ -212,19 +219,33 @@
new_ltypes.update(new_int_ltypes)
X_not_all_null.ww.init(schema=original_schema, logical_types=new_ltypes)

y_imputed = pd.Series(y)
y_imputed = (
y.ww.drop(self._y_all_null_cols)
if isinstance(y, pd.DataFrame)
else pd.Series(y)
)
if y is not None and len(y) > 0:
if self._impute_target == "forwards_fill":
y_imputed = y.pad()
y_imputed = y_imputed.pad()
y_imputed.bfill(inplace=True)
elif self._impute_target == "backwards_fill":
y_imputed = y.bfill()
y_imputed = y_imputed.bfill()
y_imputed.pad(inplace=True)
elif self._impute_target == "interpolate":
y_imputed = y.interpolate()
y_imputed = y_imputed.interpolate()
y_imputed.bfill(inplace=True)
# Re-initialize woodwork with the downcast logical type
y_imputed = ww.init_series(y_imputed, logical_type=y.ww.logical_type)
if isinstance(y, pd.Series):
y_imputed = ww.init_series(y_imputed, logical_type=y.ww.logical_type)
else:
y_original_schema = y.ww.schema.get_subset_schema(
list(y_imputed.columns),
)
y_new_ltypes = {
col: _determine_non_nullable_equivalent(ltype)
for col, ltype in y_original_schema.logical_types.items()
}
y_imputed.ww.init(schema=y_original_schema, logical_types=y_new_ltypes)

return X_not_all_null, y_imputed
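
Note on the `transform` hunk above: each target strategy applies a primary fill and then a second fill in the opposite direction, so gaps at the start or end of a series are also covered. The sketch below reproduces that ordering in plain pandas. The helper `impute_target` is hypothetical, and it uses `ffill` where the diff calls `pad` (to my knowledge `pad` is an older alias that recent pandas releases deprecate); it also reassigns instead of filling in place.

```python
import numpy as np
import pandas as pd


def impute_target(y, strategy):
    """Hypothetical helper mirroring the three target strategies in the hunk above."""
    if strategy == "forwards_fill":
        # Fill forward, then backward to cover leading gaps.
        return y.ffill().bfill()
    if strategy == "backwards_fill":
        # Fill backward, then forward to cover trailing gaps.
        return y.bfill().ffill()
    if strategy == "interpolate":
        # Linear interpolation, then backward fill for leading gaps.
        return y.interpolate().bfill()
    return y


y = pd.DataFrame({"target_0": [np.nan, 1.0, np.nan, 3.0, np.nan]})
forward = impute_target(y, "forwards_fill")["target_0"].tolist()
backward = impute_target(y, "backwards_fill")["target_0"].tolist()
linear = impute_target(y, "interpolate")["target_0"].tolist()
```

Because these methods operate column by column on a DataFrame, the same calls cover both the single-series and the unstacked multiseries target.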

@@ -234,15 +255,21 @@
Args:
X (pd.DataFrame, optional): Input data to a component of shape [n_samples, n_features].
May contain nullable types.
y (pd.Series, optional): The target of length [n_samples]. May contain nullable types.
y (pd.Series or pd.DataFrame, optional): The target of length [n_samples] or the unstacked target for a multiseries problem.
May contain nullable types.

Returns:
X, y with any incompatible nullable types downcasted to compatible equivalents when interpolate is used. Is NoOp otherwise.
"""
if self._impute_target == "interpolate":
# For BooleanNullable, we have to avoid Categorical columns
# since the category dtype also has incompatibilities with linear interpolate, which is expected
if isinstance(y.ww.logical_type, BooleanNullable):
# TODO: Avoid categorical columns for BooleanNullable in multiseries when
# multiseries timeseries supports categorical
if isinstance(y, pd.Series) and isinstance(
y.ww.logical_type,
BooleanNullable,
):
y = ww.init_series(y, Double)
else:
_, y = super()._handle_nullable_types(None, y)
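
Note on the `_handle_nullable_types` hunk above: for a single series with the `BooleanNullable` logical type, the diff re-initializes the target as `Double` before interpolating. A rough pandas-only analogue is shown below; using `astype("float64")` in place of `ww.init_series(y, Double)` is an assumption made so the example runs without Woodwork.

```python
import pandas as pd

# A nullable boolean target with one missing value.
y = pd.Series([True, None, False], dtype="boolean")

# Stand-in for re-initializing the series as Double: missing values become NaN.
y_double = y.astype("float64")

# Linear interpolation can now fill the gap with an intermediate value.
y_interpolated = y_double.interpolate()
```

The interpolated value is no longer a valid boolean, which is one reason a floating-point type is the natural target for this conversion.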
37 changes: 37 additions & 0 deletions evalml/tests/component_tests/test_time_series_imputer.py
@@ -722,3 +722,40 @@
_, nullable_series = imputer._handle_nullable_types(None, nullable_series)

nullable_series.interpolate()


@pytest.mark.parametrize(
"nans_present",
[True, False],
)
def test_time_series_imputer_multiseries(multiseries_ts_data_unstacked, nans_present):
X, y = multiseries_ts_data_unstacked
imputer = TimeSeriesImputer(target_impute_strategy="interpolate")
if nans_present:
c = 1
for x in y:
y[x][c] = np.nan
c += 1
imputer.fit(X, y)
assert imputer._y_all_null_cols == []
_, y_imputed = imputer.transform(X, y)
assert isinstance(y_imputed, pd.DataFrame)
y_expected = pd.DataFrame({f"target_{i}": range(i, 100, 5) for i in range(5)})
assert_frame_equal(y_imputed, y_expected, check_dtype=False)


def test_imputer_multiseries_drops_columns_with_all_nan(multiseries_ts_data_unstacked):
X, y = multiseries_ts_data_unstacked
for col in y:
y[col] = np.nan
Comment on lines +787 to +788
Contributor: I think we would benefit from another test (parametrized here) where only some of the columns are NaN, but not all of them!
Contributor: ^ Sorry, should have clarified 😅 I meant a test where some of the columns are all NaN, so we drop some columns and impute or pass through others!

imputer = TimeSeriesImputer(target_impute_strategy="interpolate")
imputer.fit(X, y)
assert imputer._y_all_null_cols == y.columns.tolist()
_, y_imputed = imputer.transform(X, y)
expected = y.drop(y.columns.tolist(), axis=1)
assert_frame_equal(
y_imputed,
expected,
check_column_type=False,
check_index_type=False,
)
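
Note on the review comment above: the requested case is one where some series are entirely NaN and are dropped while the remaining series are imputed or passed through. The sketch below shows the shape such a check could take. It does not use `TimeSeriesImputer` or the `multiseries_ts_data_unstacked` fixture; `impute_unstacked_target` and `make_target` are hypothetical pandas-only stand-ins, and the frame merely copies the `target_{i}` layout of `y_expected` from the earlier test.

```python
import numpy as np
import pandas as pd


def impute_unstacked_target(y):
    """Hypothetical stand-in: drop all-NaN series, then interpolate the rest."""
    all_null_cols = y.columns[y.isna().all()].tolist()
    y_imputed = y.drop(columns=all_null_cols).interpolate().bfill()
    return y_imputed, all_null_cols


def make_target():
    # Five evenly spaced series of 20 rows each, as floats.
    return pd.DataFrame(
        {f"target_{i}": range(i, 100, 5) for i in range(5)},
    ).astype("float64")


results = {}
for null_cols in (["target_0"], ["target_1", "target_3"]):
    y = make_target()
    y[null_cols] = np.nan  # these series are entirely missing
    y.loc[2, "target_2"] = np.nan  # a partially missing series to be imputed
    results[tuple(null_cols)] = impute_unstacked_target(y)
```

In an actual evalml test, the two cases would more naturally be expressed with `pytest.mark.parametrize` and compared with `assert_frame_equal`, following the existing tests in this file.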