Factor out time series-related functionality into a time series Task object #989

Merged (171 commits) on Jun 19, 2023

Conversation

@EgorKraevTransferwise (Collaborator) commented Apr 10, 2023

Why are these changes needed?

This continues the process that began with the task-based-refactor PR, now merged. Time series-related models are moved into a subpackage, and the time series-related bits of the Task object are moved into a separate Task. In the process, some existing bugs are fixed (for example, the hcrystalball wrapper with certain configs only supported forecasting as far ahead as the length of the validation set).
Also, input data is auto-enriched with features derived from the timestamps, as sketched below.
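For illustration, here is a minimal sketch of the kind of timestamp-derived enrichment described above, written in plain pandas. The exact feature set and column names FLAML generates are not spelled out in this PR, so treat them as assumptions.

```python
import pandas as pd


def enrich_with_time_features(df: pd.DataFrame, time_col: str) -> pd.DataFrame:
    """Add simple calendar features derived from a timestamp column.

    Illustrative only: the specific features FLAML derives may differ.
    """
    ts = pd.to_datetime(df[time_col])
    out = df.copy()
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day_of_week"] = ts.dt.dayofweek
    out["hour"] = ts.dt.hour
    return out


# Toy usage with an hourly timestamp column.
toy = pd.DataFrame({"ds": pd.date_range("2023-01-01", periods=4, freq="H"), "y": [1, 2, 3, 4]})
print(enrich_with_time_features(toy, "ds"))
```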

Related issue number

Checks

markharley and others added 30 commits November 14, 2022 19:44
Moved some of the packages into an automl subpackage to tidy before the
task-based refactor. This is in response to discussions with the group
and a comment on the first task-based PR.

The only changes here are moving subpackages and modules into the new
automl subpackage, fixing imports to work with this structure, and fixing some
dependencies in setup.py.
I'd moved this to automl as that's where it's used internally, but had
missed that it is actually part of the public interface, so it makes sense
for it to live where it was.
flaml.data, flaml.ml and flaml.model are re-added to the top level,
re-exported from flaml.automl for backwards compatibility. A deprecation
warning is added so that we can plan their removal later (see the sketch after this commit list).
Got to the point where the methods from AutoML are pulled to
GenericTask. Started removing private markers and removing the passing
of automl to these methods. Done with decide_split_type, started on
prepare_data. Need to do the others after
…into multivariate_target

# Conflicts:
#	flaml/automl/model.py
#	flaml/default/suggest.py
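For context, here is a minimal sketch of the backward-compatibility re-export shim described in the commits above. The actual module contents and warning text in flaml may differ; this is an assumed shape, not the real file.

```python
# flaml/model.py -- illustrative shim, not the actual file contents.
import warnings

# Re-export everything from the new location so existing imports keep working.
from flaml.automl.model import *  # noqa: F401,F403

warnings.warn(
    "Importing from 'flaml.model' is deprecated; import from 'flaml.automl.model' instead. "
    "The old path is kept only for backwards compatibility and may be removed later.",
    DeprecationWarning,
    stacklevel=2,
)
```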
@sonichi (Contributor) left a comment


Is the ts_forecast notebook tested with this PR?

return val_loss, metric, train_time, pred_time

def default_estimator_list(self, estimator_list: List[str], is_spark_dataframe: bool) -> List[str]:
assert not is_spark_dataframe, "Spark is not yet supported for time series"
Contributor


Spark -> Spark dataframe

Comment on lines 130 to 131
print(lgbm.feature_names_in_)
print(lgbm.feature_importances_)
Contributor


Are these two properties covered by any other test? If not, do not remove them.

Collaborator Author


Why? Why the special treatment for just these properties of just this estimator?

Contributor


Most FLAML estimators have these properties. Here only one estimator is tested. Better than no test.
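
For what it's worth, here is a minimal sketch of a test that would keep these two properties covered. It assumes flaml's LGBMEstimator wrapper (imported here from flaml.automl.model) exposes feature_names_in_ and feature_importances_ after fit, as the reviewer notes most FLAML estimators do; the data and column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from flaml.automl.model import LGBMEstimator

# Tiny, made-up dataset with named columns.
X = pd.DataFrame({"f1": np.arange(6, dtype=float), "f2": [0, 1, 0, 1, 0, 1]})
y = np.array([0, 1, 0, 1, 0, 0])

lgbm = LGBMEstimator(task="binary", n_estimators=4)
lgbm.fit(X, y)

# The wrapper should report the input feature names and one importance per feature.
assert list(lgbm.feature_names_in_) == ["f1", "f2"]
assert len(lgbm.feature_importances_) == 2
```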

)
y = np.array([0, 1, 0, 1, 0, 0])
lgbm.predict(X[:2])
lgbm.fit(X, y, period=2)
Contributor


Is this use case (period=2) covered by any other test? If not, or if unsure, do not remove it.

Collaborator Author


Why would we want to support this use case, with repeating dates? As phrased above, this is not a time series problem: the same date repeats three times with different labels, and the dates are irregularly spaced.

Contributor


The repeating dates are for the deduplication feature, I suppose. Not sure about the irregular spacing. @int-chaos Could you comment?

@ZmeiGorynych

The test failures come from merging main and have nothing to do with this PR, as best I can tell.

Comment on lines +516 to +521
logger.warning("Duplicate timestamp values found in timestamp column. " f"\n{X.loc[duplicates, X][time_col]}")
X = X.drop_duplicates()
logger.warning("Removed duplicate rows based on all columns")
assert (
X[[X.columns[0]]].duplicated() is None
), "Duplicate timestamp values with different values for other columns."
Contributor


The coverage of this part of the code gets removed by the change in the test code.
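
One way to keep that branch covered would be a test input with repeated timestamps, along these lines; the frame below is made up and only mirrors the drop_duplicates / duplicated logic quoted above rather than calling any FLAML API.

```python
import pandas as pd

# Toy frame: the same timestamp appears twice with identical rows, so
# drop_duplicates removes the repeat; a remaining duplicated timestamp with
# different values in the other columns would indicate conflicting data.
X = pd.DataFrame(
    {
        "ds": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02"]),
        "y": [1.0, 1.0, 2.0],
    }
)
deduped = X.drop_duplicates()
assert not deduped["ds"].duplicated().any(), "Duplicate timestamps with conflicting values remain."
```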

@sonichi sonichi enabled auto-merge June 13, 2023 13:40
@sonichi sonichi added this pull request to the merge queue Jun 19, 2023
Merged via the queue into microsoft:main with commit 5245efb Jun 19, 2023