Factor out time series-related functionality into a time series Task object #989
Conversation
Moved some of the packages into an automl subpackage to tidy up before the task-based refactor. This is in response to discussions with the group and a comment on the first task-based PR. The only changes here are moving subpackages and modules into the new automl package, fixing imports to work with this structure, and fixing some dependencies in setup.py.
I'd moved this to automl as that's where it's used internally, but had missed that it is actually part of the public interface, so it makes sense for it to live where it was.
flaml.data, flaml.ml and flaml.model are re-added to the top level, being re-exported from flaml.automl for backwards compatibility. A deprecation warning is added so that we can plan a removal later.
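A minimal sketch of what such a compatibility shim could look like (the exact warning text and removal timeline are assumptions; only the re-export itself is described in this PR):

```python
# flaml/ml.py -- hypothetical backwards-compatibility shim: re-export the
# moved module from its new home and warn that the old path is deprecated.
import warnings

from flaml.automl.ml import *  # noqa: F401,F403

warnings.warn(
    "Importing from 'flaml.ml' is deprecated; use 'flaml.automl.ml' instead. "
    "The old import path will be removed in a future release.",
    DeprecationWarning,
    stacklevel=2,
)
```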
Got to the point where the methods from AutoML are pulled into GenericTask. Started removing private markers and removing the passing of automl to these methods. Done with decide_split_type; started on prepare_data. The remaining methods still need doing.
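The shape of that refactor, roughly: logic that used to live as private AutoML methods taking an `automl` argument becomes ordinary methods on a Task object that AutoML delegates to. A sketch under those assumptions (the method names come from the comment above; the signatures are guesses):

```python
# Hypothetical sketch of the AutoML -> Task delegation described above.
from abc import ABC, abstractmethod


class Task(ABC):
    @abstractmethod
    def decide_split_type(self, split_type, X, y, fit_kwargs):
        """Pick a data-split strategy appropriate for this task."""

    @abstractmethod
    def prepare_data(self, X, y, **kwargs):
        """Task-specific preprocessing before the search starts."""


class GenericTask(Task):
    def decide_split_type(self, split_type, X, y, fit_kwargs):
        # was a private AutoML method; no longer needs `automl` passed in
        return split_type or "auto"

    def prepare_data(self, X, y, **kwargs):
        return X, y
```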
…into multivariate_target
# Conflicts:
#	flaml/automl/model.py
#	flaml/default/suggest.py
Is the ts_forecast notebook tested with this PR?
return val_loss, metric, train_time, pred_time

def default_estimator_list(self, estimator_list: List[str], is_spark_dataframe: bool) -> List[str]:
    assert not is_spark_dataframe, "Spark is not yet supported for time series"
Spark -> Spark dataframe
test/test_model.py
print(lgbm.feature_names_in_)
print(lgbm.feature_importances_)
Are these two properties covered by any other test? If not, do not remove them.
Why? Why the special treatment for just these properties of just this estimator?
Most FLAML estimators have these properties. Here only one estimator is tested, which is better than no test at all.
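If coverage is the concern, a minimal regression test along these lines would keep the two properties exercised (the constructor arguments and task string are assumptions; the real test lives in test/test_model.py):

```python
# Hypothetical minimal test for the two estimator properties discussed above.
import numpy as np
import pandas as pd
from flaml.automl.model import LGBMEstimator


def test_lgbm_exposes_feature_metadata():
    X = pd.DataFrame(np.random.rand(20, 3), columns=["a", "b", "c"])
    y = np.random.randint(0, 2, 20)
    lgbm = LGBMEstimator(task="binary", n_estimators=4)
    lgbm.fit(X, y)
    assert lgbm.feature_names_in_ is not None
    assert len(lgbm.feature_importances_) == 3
```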
)
y = np.array([0, 1, 0, 1, 0, 0])
lgbm.predict(X[:2])
lgbm.fit(X, y, period=2)
Is this use case (period=2) covered by another test? If not, or if unsure, do not remove it.
Why would we want to support this use case, with repeating dates? As phrased above, this is not a time series problem: the same date repeats three times with different labels, and the dates are irregularly spaced.
The repeating dates are for the deduplication feature, I suppose. Not sure about the irregular spacing. @int-chaos Could you comment?
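For context, the two duplicate cases being debated can be sketched with pandas (the column names here are made up for illustration):

```python
# Hypothetical illustration of the two duplicate cases discussed above.
import pandas as pd

X = pd.DataFrame(
    {
        "ds": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
        "y": [1, 1, 0],
    }
)

# Case 1: rows identical in every column -- safely removable by dedup.
deduped = X.drop_duplicates()

# Case 2: same timestamp but different values -- dedup cannot resolve this,
# which is what the assertion in the snippet below is meant to catch.
print(len(deduped), deduped["ds"].duplicated().any())  # 2 False
```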
The test failures come from merging main and have nothing to do with this PR, as best I can tell.
logger.warning("Duplicate timestamp values found in timestamp column. " f"\n{X.loc[duplicates, time_col]}")
X = X.drop_duplicates()
logger.warning("Removed duplicate rows based on all columns")
assert (
    not X[[X.columns[0]]].duplicated().any()
), "Duplicate timestamp values with different values for other columns."
The coverage of this part of the code gets removed by the change in the test code.
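A test along these lines could restore that coverage (the column names and expected behaviour are assumptions based on the snippet above):

```python
# Hypothetical tests restoring coverage of the duplicate-timestamp handling.
import pandas as pd


def test_identical_duplicate_rows_are_dropped():
    X = pd.DataFrame({"ds": pd.to_datetime(["2020-01-01"] * 2), "y": [1, 1]})
    # identical rows: dedup should remove one, leaving no conflict
    assert len(X.drop_duplicates()) == 1


def test_conflicting_duplicates_are_detected():
    X = pd.DataFrame({"ds": pd.to_datetime(["2020-01-01"] * 2), "y": [1, 2]})
    deduped = X.drop_duplicates()
    # same timestamp, different label: the assertion in the PR should fire
    assert deduped["ds"].duplicated().any()
```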
cover dedup
Why are these changes needed?
This continues the process that began with the task-based-refactor PR, now merged. Time series-related models are moved into a subpackage, and the related bits in the Task object are moved into a separate Task. In the process, some existing bugs are fixed (for example, the hcrystalball wrapper with certain configs only supported forecasting as far ahead as the length of the validation set).
Also, input data is auto-enriched with features derived from the timestamps.
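The PR does not list the exact derived features, but the kind of enrichment meant here might look like this (the feature names are illustrative):

```python
# Hypothetical illustration of timestamp-derived feature enrichment.
import pandas as pd

df = pd.DataFrame({"ds": pd.date_range("2020-01-01", periods=4, freq="D")})
df["year"] = df["ds"].dt.year
df["month"] = df["ds"].dt.month
df["dayofweek"] = df["ds"].dt.dayofweek
print(df)
```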
Related issue number
Checks