[HOPSWORKS-3323] Fix TDS creation in PySpark client and add explicit caching #784
Conversation
if training_dataset.coalesce:
    split_dataset[key] = split_dataset[key].coalesce(1)

split_dataset[key] = split_dataset[key].cache()
Should we not cache before splitting?
This article suggests adding it after the split: https://medium.com/udemy-engineering/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc. We need it afterwards, because otherwise the randomSplit() will be executed twice (once for transformation function statistics, once for writing). If no seed is set for the randomSplit, the splits are potentially different between the transformation stats and the writing. Whether it is worth caching before the split as well is debatable: randomSplit scans the data only once while creating the splits (because it samples while passing over it), so for the split itself we only have one pass. In the training dataset statistics we do a df.head() to determine the length of the dataframe; if we want to cache, we would need to do it before that. But that should be a separate PR in the future.
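For illustration, a minimal self-contained PySpark sketch of the effect described above (the DataFrame and output path are made up for the example, not the actual hsfs code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)  # stand-in for the training dataset query

    # Without a seed, every action re-executes randomSplit, so the
    # statistics pass and the write pass can land on different splits.
    train_df, test_df = df.randomSplit([0.8, 0.2])

    # Caching right after the split pins its result, so both passes
    # (transformation-function statistics and the write) see the same rows.
    train_df = train_df.cache()
    test_df = test_df.cache()

    train_df.count()  # first action, e.g. statistics computation
    train_df.write.mode("overwrite").parquet("/tmp/train")  # second action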
python/hsfs/engine/spark.py
if training_dataset.coalesce:
    dataset = dataset.coalesce(1)

dataset = dataset.cache()
There is no split. Why cache here?
I think I did this to cache the query. If we do the split we cache the split result instead. I would have liked to cache the query as well when we split, but I was a bit concerned about the potential memory consumption when caching before and after the split.
If there is no split, I think we should not cache the result; users can cache it themselves. The purpose of caching is to return a consistent result.
Well okay, the reasoning behind it was to prevent 2x execution in the case of 1. having transformation functions and needing to calculate statistics and 2. having to write the df to disk. But I agree that placing the caching here without checking for this particular case is not great. Shall we still keep it for the special case, though?
Ah right. Need to calculate statistics. We should cache then.
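A rough sketch of what that conditional caching could look like (the transformation_functions attribute here is an assumption for illustration, not necessarily the actual hsfs field):

    # Hypothetical: only cache when a second pass is known to happen,
    # i.e. when transformation functions force a statistics pass before writing.
    if training_dataset.coalesce:
        dataset = dataset.coalesce(1)
    if training_dataset.transformation_functions:  # assumed attribute
        # statistics + write = two actions over the same query; cache once
        dataset = dataset.cache()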
I created a Jira issue to add it later: https://hopsworks.atlassian.net/browse/FSTORE-317
LGTM!
… caching (logicalclocks#784) * fix create training dataset in pyspark, added caching * fixed stylecheck * fixed stylecheck * revert caching for non-split dfs (cherry picked from commit 2c6864d)
… caching (logicalclocks#784) * fix create training dataset in pyspark, added caching * fixed stylecheck * fixed stylecheck * revert caching for non-split dfs
This PR contains fixes and improvements for the FeatureView API
_write_training_dataset_single instead of training dataset
JIRA Issue: https://hopsworks.atlassian.net/browse/HOPSWORKS-3323
Priority for Review: high
Related PRs: -
How Has This Been Tested?
Checklist For The Assigned Reviewer: