
New feature: Lag or windows features grouped by #727

Open
wants to merge 24 commits into base: main
Changes from 15 commits
Commits (24)
4d653a9
add group by variables to base forecast transformer
Ezzaldin97 Feb 23, 2024
4e9d849
add group by variables to lag_features
Ezzaldin97 Feb 23, 2024
7f40391
add group by window features
Ezzaldin97 Feb 25, 2024
b476748
add group by expanding window features
Ezzaldin97 Feb 25, 2024
02c59bd
add test cases of groupby timeseries features
Ezzaldin97 Feb 25, 2024
0dd92cc
ensure code style tests
Ezzaldin97 Feb 25, 2024
47de2d6
fixing typehint errors
Ezzaldin97 Feb 25, 2024
dd43c27
fixing docs indentation issue
Ezzaldin97 Feb 25, 2024
7459811
fixing docs indentation issue in lag_features
Ezzaldin97 Feb 25, 2024
12aa825
adjust formatting and code style in tests
Ezzaldin97 Feb 29, 2024
c3bee66
refactoring timeseries & reformatting the code
Ezzaldin97 Feb 29, 2024
67725dc
adjust code formatting & style in tests
Ezzaldin97 Mar 2, 2024
9cb01ea
fix create lag features using groupby & freq parameters
Ezzaldin97 Mar 2, 2024
72ce43c
adjust code style
Ezzaldin97 Mar 2, 2024
9d999b0
add test cases to ensure code coverage
Ezzaldin97 Mar 2, 2024
b7b8bc9
add group_by docstring to _docstring
Ezzaldin97 Apr 1, 2024
ba375a4
remove check input of group_by
Ezzaldin97 Apr 1, 2024
90f08f4
enhance performance of group_by window features operations
Ezzaldin97 Apr 1, 2024
66baa75
enhance performance of group_by expanding window features operations
Ezzaldin97 Apr 1, 2024
92f996d
fix reindexing to original index after grouping bug
Ezzaldin97 Apr 1, 2024
152c037
fix reindexing to original index after grouping operation bug
Ezzaldin97 Apr 1, 2024
5343e50
replacing group_by docstring with group_by_docstring
Ezzaldin97 Apr 1, 2024
ef1eaa8
adjust code-style and formatting
Ezzaldin97 Apr 1, 2024
09db782
remove white spaces
Ezzaldin97 Apr 2, 2024
4 changes: 2 additions & 2 deletions feature_engine/selection/drop_psi_features.py
@@ -1,5 +1,5 @@
import datetime
from typing import List, Union
from typing import Dict, List, Union

import numpy as np
import pandas as pd
@@ -475,7 +475,7 @@ def fit(self, X: pd.DataFrame, y: pd.Series = None):
threshold_cat = self.threshold

# Compute the PSI by looping over the features
self.psi_values_ = {}
solegalli marked this conversation as resolved.
self.psi_values_: Dict = {}
Collaborator: We resolved this in a different PR. Could we remove this change from here please?

self.features_to_drop_ = []

# Compute PSI for numerical features
@@ -51,6 +51,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu

{drop_original}

group_by: int, str, or list of strings or integers, default=None
Collaborator: we are using pandas groupby under the hood, so the docs here should probably be identical or just a summary of what we see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html and then refer the user to pandas groupby's documentation for more details

Author: I have updated it using a summary of pandas groupby

variable or list of variables to group the data by before creating the features.

Attributes
----------
{feature_names_in_}
@@ -64,6 +67,7 @@ def __init__(
variables: Union[None, int, str, List[Union[str, int]]] = None,
missing_values: str = "raise",
drop_original: bool = False,
group_by: Union[None, int, str, List[Union[str, int]]] = None,
) -> None:

if missing_values not in ["raise", "ignore"]:
@@ -81,6 +85,7 @@ def __init__(
self.variables = _check_variables_input_value(variables)
self.missing_values = missing_values
self.drop_original = drop_original
self.group_by = _check_variables_input_value(group_by)
Collaborator: if we defer the functionality to pandas, then we don't need this check. We just assign and let pandas handle the rest.

Author: That's right, pandas will handle it. Thanks for your help 🙏
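For illustration, a minimal sketch on a toy frame (not feature_engine code) of the point above — pandas validates the group_by key on its own, so assigning it without an extra check is enough:

import pandas as pd

df = pd.DataFrame({"x1": [1, 2, 3], "x3": ["a", "a", "b"]})

lagged = df.groupby("x3")["x1"].shift(1)  # valid key: per-group lag, no extra check needed
try:
    df.groupby("not_a_column").mean()     # invalid key: pandas raises KeyError itself
except KeyError as err:
    print(err)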


def _check_index(self, X: pd.DataFrame):
"""
82 changes: 74 additions & 8 deletions feature_engine/timeseries/forecasting/expanding_window_features.py
@@ -3,7 +3,7 @@

from __future__ import annotations

from typing import List
from typing import List, Union

import pandas as pd

@@ -93,6 +93,9 @@ class ExpandingWindowFeatures(BaseForecastTransformer):

{drop_original}

group_by: int, str, or list of strings or integers, default=None
Collaborator: if we are repeating the same string over and over, instead of writing it multiple times, we'd create a single text in the _docstrings module, and import it instead. Like we do with fit_transform for example.

Author: added to _docstring, Thanks for the clarification.
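As a rough sketch of the shared-docstring pattern the reviewer describes — the module text, decorator name, and class below are hypothetical, not feature_engine's actual _docstrings API:

# hypothetical module-level text, e.g. in a _docstrings module
_group_by_docstring = (
    "group_by: int, str, or list of strings or integers, default=None\n"
    "    Variable or list of variables to group the data by before creating\n"
    "    the features. See pandas.DataFrame.groupby for details."
)

def substitute_group_by_doc(cls):
    # class decorator that fills the {group_by} placeholder in the class docstring
    cls.__doc__ = cls.__doc__.format(group_by=_group_by_docstring)
    return cls

@substitute_group_by_doc
class ExampleTransformer:
    """Example transformer.

    Parameters
    ----------
    {group_by}
    """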

variable or list of variables to group the data by before creating the expanding window features.

Attributes
----------
variables_:
@@ -139,6 +142,36 @@ class ExpandingWindowFeatures(BaseForecastTransformer):
2 2022-09-20 3 8 1.5 6.5
3 2022-09-21 4 9 2.0 7.0
4 2022-09-22 5 10 2.5 7.5
Create expanding window features grouped by another variable.
Collaborator: the example in the class' docstrings is just meant for the user to "copy and paste" a simple example, not a full blown demo. For that we have the user guide. Could we please keep the original example?

>>> import pandas as pd
>>> from feature_engine.timeseries.forecasting import ExpandingWindowFeatures
>>> X = pd.DataFrame(dict(date = ["2022-09-18",
>>> "2022-09-19",
>>> "2022-09-20",
>>> "2022-09-21",
>>> "2022-09-22",
>>> "2022-09-18",
>>> "2022-09-19",
>>> "2022-09-20",
>>> "2022-09-21",
>>> "2022-09-22"],
>>> x1 = [1,2,3,4,5, 3,5,6,8,11],
>>> x2 = [6,7,8,9,10, 2,9,10,15,2],
>>> x3=['a','a','a','a','a', 'b','b','b','b','b']
>>> ))
>>> ewf = ExpandingWindowFeatures(group_by='x3')
>>> ewf.fit_transform(X)
date x1 x2 x3 x1_expanding_mean x2_expanding_mean
0 2022-09-18 1 6 a NaN NaN
1 2022-09-19 2 7 a 1.000000 6.0
2 2022-09-20 3 8 a 1.500000 6.5
3 2022-09-21 4 9 a 2.000000 7.0
4 2022-09-22 5 10 a 2.500000 7.5
5 2022-09-18 3 2 b NaN NaN
6 2022-09-19 5 9 b 3.000000 2.0
7 2022-09-20 6 10 b 4.000000 5.5
8 2022-09-21 8 15 b 4.666667 7.0
9 2022-09-22 11 2 b 5.500000 9.0
"""

def __init__(
@@ -151,6 +184,7 @@
sort_index: bool = True,
missing_values: str = "raise",
drop_original: bool = False,
group_by: Union[None, int, str, List[Union[str, int]]] = None,
) -> None:

if not isinstance(functions, (str, list)) or not all(
@@ -168,7 +202,7 @@
f"periods must be a non-negative integer. Got {periods} instead."
)

super().__init__(variables, missing_values, drop_original)
super().__init__(variables, missing_values, drop_original, group_by)

self.min_periods = min_periods
self.functions = functions
@@ -193,12 +227,17 @@ def transform(self, X: pd.DataFrame) -> pd.DataFrame:
# Common dataframe checks and setting up.
X = self._check_transform_input_and_state(X)

tmp = (
X[self.variables_]
.expanding(min_periods=self.min_periods)
.agg(self.functions)
.shift(periods=self.periods, freq=self.freq)
)
if self.group_by:
tmp = self._agg_expanding_window_features(
grouped_df=X.groupby(self.group_by)
)
else:
tmp = (
X[self.variables_]
.expanding(min_periods=self.min_periods)
.agg(self.functions)
.shift(periods=self.periods, freq=self.freq)
)

tmp.columns = self._get_new_features_name()

@@ -224,3 +263,30 @@ def _get_new_features_name(self) -> List:
]

return feature_names

def _agg_expanding_window_features(
self,
grouped_df: pd.core.groupby.generic.DataFrameGroupBy,
) -> Union[pd.Series, pd.DataFrame]:
"""generate expanding window features based on groups
Parameters
----------
grouped_df : pd.core.groupby.generic.DataFrameGroupBy
dataframe of groups
solegalli marked this conversation as resolved.

Returns
-------
Union[pd.Series, pd.DataFrame]
returned expanding window features
"""
tmp_data = []
for _, group in grouped_df:
Collaborator: Why do we need to loop?

Are we creating a grouped df for every variable passed to group_by_variables?

And is this the desired functionality? For time series forecasting, would we not have all ts in 1 col and then we would group by one or more variables that identify the ts, but we would not create many groups?

When would we need to create many groups?

Author: let me explain what I need to do here. The reason behind adding group_by_variables to the time series transformers is issue #668: we sometimes need to create lag, rolling window, or expanding window features based on a set of groups. The code above loops over the set of groups to create the features for every group, then concatenates them and sorts by index to return the dataframe to its original order. Let me explain it with the following code:

X = pd.DataFrame(dict(date = ["2022-09-18",
                             "2022-09-19",
                             "2022-09-20",
                             "2022-09-21",
                             "2022-09-22",
                             "2022-09-18",
                             "2022-09-19",
                             "2022-09-20",
                             "2022-09-21",
                             "2022-09-22"],
                     x1 = [1,2,3,4,5, 3,5,6,8,11],
                     x2 = [6,7,8,9,10, 2,9,10,15,2],
                     x3=['a','a','a','a','a', 'b','b','b','b','b'],
                     x4=['c','c','c','w','w','c','c','w','w','w']
))

X_grouped = X.groupby(['x3', 'x4'])
for _, group in X_grouped:
    print(group)

the result is the dataframes of every group of ('x3', 'x4')

date  x1  x2 x3 x4
0  2022-09-18   1   6  a  c
1  2022-09-19   2   7  a  c
2  2022-09-20   3   8  a  c
         date  x1  x2 x3 x4
3  2022-09-21   4   9  a  w
4  2022-09-22   5  10  a  w
         date  x1  x2 x3 x4
5  2022-09-18   3   2  b  c
6  2022-09-19   5   9  b  c
         date  x1  x2 x3 x4
7  2022-09-20   6  10  b  w
8  2022-09-21   8  15  b  w
9  2022-09-22  11   2  b  w

Collaborator: I see. Thank you for the explanation. Pandas should apply shift and rolling and expanding to the groups out of the box, there is no need to loop, as far as I understand. See for example these resources: https://www.statology.org/pandas-lag-by-group/

tmp = (
Collaborator: I don't think we need to loop over each group. Pandas does that under the hood if I recall correctly. So we'd just add groupby before .expanding. Check these resources:

https://www.statology.org/pandas-lag-by-group/
https://stackoverflow.com/questions/37231844/pandas-creating-a-lagged-column-with-grouped-data

Author: I found a simple way to perform the group_by operation to calculate expanding window features using the .apply() method in pandas (see the sketch after this method).

group[self.variables_]
.expanding(min_periods=self.min_periods)
.agg(self.functions)
.shift(periods=self.periods, freq=self.freq)
)
tmp_data.append(tmp)
tmp = pd.concat(tmp_data).sort_index()
return tmp
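To complement the thread above, a loop-free sketch on toy data — an illustration of the reviewer's suggestion, not the PR's actual implementation — of how pandas can apply the lag and the expanding window per group directly on the groupby object:

import pandas as pd

X = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 3, 5, 6, 8, 11],
    "x3": ["a"] * 5 + ["b"] * 5,
})

# per-group lag: groupby().shift() keeps the original index, so it aligns with X
x1_lag_1 = X.groupby("x3")["x1"].shift(1)

# per-group expanding mean followed by a one-step shift; group_keys=False keeps
# the original index, so no droplevel or manual re-sorting is required
x1_expanding_mean = (
    X.groupby("x3", group_keys=False)["x1"]
    .apply(lambda g: g.expanding(min_periods=1).mean().shift(1))
)

X.assign(x1_lag_1=x1_lag_1, x1_expanding_mean=x1_expanding_mean)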
107 changes: 90 additions & 17 deletions feature_engine/timeseries/forecasting/lag_features.py
@@ -74,6 +74,9 @@ class LagFeatures(BaseForecastTransformer):

{drop_original}

group_by: int, str, or list of strings or integers, default=None
variable or list of variables to group the data by before creating the lag features.

Attributes
----------
variables_:
@@ -117,6 +120,26 @@ class LagFeatures(BaseForecastTransformer):
2 2022-09-20 3 8 2.0 7.0 1.0 6.0
3 2022-09-21 4 9 3.0 8.0 2.0 7.0
4 2022-09-22 5 10 4.0 9.0 3.0 8.0
Create lags grouped by another variable.
>>> import pandas as pd
Collaborator: Could we please keep the original example? Demos go in the user-guide :)

>>> from feature_engine.timeseries.forecasting import LagFeatures
>>> X = pd.DataFrame(dict(date = ["2022-09-18",
>>> "2022-09-19",
>>> "2022-09-20",
>>> "2022-09-21",
>>> "2022-09-22"],
>>> x1 = [1,2,3,4,5],
>>> x2 = [6,7,8,9,10],
>>> x3 = ['a','b','a','b','a']
>>> ))
>>> lf = LagFeatures(periods=[1,2], group_by='x3')
>>> lf.fit_transform(X)
date x1 x2 x3 x1_lag_1 x2_lag_1 x1_lag_2 x2_lag_2
0 2022-09-18 1 6 a NaN NaN NaN NaN
1 2022-09-19 2 7 b NaN NaN NaN NaN
2 2022-09-20 3 8 a 1.0 6.0 NaN NaN
3 2022-09-21 4 9 b 2.0 7.0 NaN NaN
4 2022-09-22 5 10 a 3.0 8.0 1.0 6.0
"""

def __init__(
@@ -127,6 +150,7 @@
sort_index: bool = True,
missing_values: str = "raise",
drop_original: bool = False,
group_by: Union[None, int, str, List[Union[str, int]]] = None,
) -> None:

if not (
@@ -151,7 +175,7 @@
"sort_index takes values True and False." f"Got {sort_index} instead."
)

super().__init__(variables, missing_values, drop_original)
super().__init__(variables, missing_values, drop_original, group_by)

self.periods = periods
self.freq = freq
@@ -180,35 +204,57 @@ def transform(self, X: pd.DataFrame) -> pd.DataFrame:
if isinstance(self.freq, list):
df_ls = []
for fr in self.freq:
tmp = X[self.variables_].shift(
freq=fr,
axis=0,
)
if self.group_by:
tmp = self._agg_freq_lags(
grouped_df=X.groupby(self.group_by),
freq=fr,
)
else:
tmp = X[self.variables_].shift(
freq=fr,
axis=0,
)
df_ls.append(tmp)
tmp = pd.concat(df_ls, axis=1)

else:
tmp = X[self.variables_].shift(
freq=self.freq,
axis=0,
)
if self.group_by:
tmp = self._agg_freq_lags(
grouped_df=X.groupby(self.group_by),
freq=self.freq,
)
else:
tmp = X[self.variables_].shift(
freq=self.freq,
axis=0,
)

else:
if isinstance(self.periods, list):
df_ls = []
for pr in self.periods:
tmp = X[self.variables_].shift(
periods=pr,
axis=0,
)
if self.group_by:
tmp = X.groupby(self.group_by)[self.variables_].shift(
periods=pr,
)
else:
tmp = X[self.variables_].shift(
periods=pr,
axis=0,
)
df_ls.append(tmp)
tmp = pd.concat(df_ls, axis=1)

else:
tmp = X[self.variables_].shift(
periods=self.periods,
axis=0,
)
if self.group_by:
tmp = X.groupby(self.group_by)[self.variables_].shift(
periods=self.periods,
)
else:
tmp = X[self.variables_].shift(
periods=self.periods,
axis=0,
)

tmp.columns = self._get_new_features_name()

@@ -243,3 +289,30 @@ def _get_new_features_name(self) -> List:
]

return feature_names

def _agg_freq_lags(
self,
grouped_df: pd.core.groupby.generic.DataFrameGroupBy,
freq: Union[str, List[str]],
) -> Union[pd.Series, pd.DataFrame]:
"""_summary_

Parameters
----------
grouped_df : pd.core.groupby.generic.DataFrameGroupBy
dataframe of groups
freq : Union[str, List[str]]
Offset to use from the tseries module or time rule.

Returns
-------
Union[pd.Series, pd.DataFrame]
lag feature or dataframe of lag features
"""
tmp_data = []
for _, group in grouped_df:
Collaborator: Do we need to loop over the groups to apply the lags? pandas does the lags per group automatically.

Author: I tried many approaches to simplify this, but it only works when using the periods argument with the shift() method, like the line in 231; however, when using the freq argument with shift() it doesn't work, so I used a loop to make it work. Kindly advise if we can simplify it (a short illustration of the freq behaviour follows the method below).

original_idx = group.index
tmp = group[self.variables_].shift(freq=freq).reindex(original_idx)
tmp_data.append(tmp)
tmp = pd.concat(tmp_data).sort_index()
return tmp
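As a minimal sketch on a toy series (illustration only, not feature_engine code) of the freq behaviour discussed above: shift(freq=...) moves the index rather than the values, which is why the method reindexes back to the original index.

import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2022-09-18", "2022-09-19", "2022-09-20"]),
)

shifted = s.shift(freq="1D")        # index moves to 09-19 .. 09-21, values unchanged
aligned = shifted.reindex(s.index)  # NaN on 09-18, then the one-day-lagged values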