Enhance spark API. #114
Merged
huan-dec approved these changes on Jul 14, 2022
This avoids problems with aggregation when attempting to use NA values for integer columns.
…/Merlion into aadyot/spark-operator
Tree models can now accept max_forecast_steps=None and return_prev=True
aadyotb changed the title from "Refactor spark apps to work with spark operator." to "Enhance spark API." on Jul 14, 2022
Derived from https://www.kaggle.com/datasets/manjeetsingh/retaildataset which is released under a CC0 license.
We refactor the spark apps and Dockerfile so that Merlion can be deployed with the spark-on-k8s-operator. To this end, the apps now accept individual, documented arguments instead of a single opaque config file. Additionally, we add tests which follow the same workflow as the pyspark apps. The tests are run on a subset of an open source dataset released under a CC0 license and cover hierarchical time series forecasting.
Next, we further improve the functionality of the forecasting spark app in 4 ways:
- We allow `return_prev=True` to be passed to `model.forecast()` in case one wants to obtain a model's historical predictions on the train data. This can be done by supplying the `--predict_on_train` argument. Notably, hierarchical reconciliation is skipped for historical timestamps which do not have sampled values from all time series.
- The way each data column is aggregated can be specified by passing the `--agg_dict` argument as an appropriate JSON string. Previously, all data columns were summed. Moreover, if a data column is not specified in the aggregation dictionary, that column will not be used to model any aggregated time series. This can be important for e.g. categorical columns that are useful at the base of the hierarchy, but not at higher levels.
- We use the `"__aggregated__"` keyword to indicate that a particular time series (in the output) has been aggregated.

Finally, we enhance tree models so that they can accept `max_forecast_steps=None` and `return_prev=True`.
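
To make the `--agg_dict` behaviour described above concrete, here is a minimal standalone sketch. It does not use Merlion or Spark and is not the app's actual code. The assumptions are: `--agg_dict` maps a column name to an aggregation name, `--predict_on_train` is a boolean flag, and the helper `aggregate` and the table `AGG_FUNCS` are hypothetical names introduced only for illustration.

```python
import argparse
import json
from statistics import mean

# Hypothetical mapping from aggregation names to functions. The names the
# real app accepts are not stated in this PR.
AGG_FUNCS = {"sum": sum, "mean": mean, "min": min, "max": max}


def parse_args(argv):
    """Parse the two flags named in the description (their types are assumptions)."""
    parser = argparse.ArgumentParser()
    # Assumed: the JSON string decodes to {column name: aggregation name}.
    parser.add_argument("--agg_dict", type=json.loads, default={})
    # Assumed: a boolean switch with no value.
    parser.add_argument("--predict_on_train", action="store_true")
    return parser.parse_args(argv)


def aggregate(children, agg_dict):
    """Aggregate time-aligned child series column by column.

    children: list of {column: [values...]} dicts whose columns have equal length.
    Columns missing from agg_dict are dropped from the aggregated series.
    """
    out = {}
    for col, how in agg_dict.items():
        func = AGG_FUNCS[how]
        # zip(*...) yields one tuple per timestamp, holding that column's
        # value from every child series.
        rows = zip(*(child[col] for child in children))
        out[col] = [func(row) for row in rows]
    return out


args = parse_args(
    ["--agg_dict", '{"sales": "sum", "price": "mean"}', "--predict_on_train"]
)
stores = [
    {"sales": [10, 20], "price": [2.0, 4.0], "is_holiday": [0, 1]},
    {"sales": [5, 15], "price": [4.0, 6.0], "is_holiday": [0, 1]},
]
total = aggregate(stores, args.agg_dict)
print(total)
```

In this sketch `sales` is summed and `price` is averaged across the two stores, while `is_holiday` is absent from the aggregated series because it is not listed in the dictionary. That mirrors the case of a column that is useful at the base of the hierarchy but not at higher levels.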