[HOPSWORKS-3323] Fix TDS creation in PySpark client and add explicit caching #784
Changes from 3 commits
@@ -405,14 +405,17 @@ def write_training_dataset(
        else:
            raise ValueError("Dataset should be a query.")

        if training_dataset.coalesce:
            dataset = dataset.coalesce(1)

        dataset = dataset.cache()

        transformation_function_engine.TransformationFunctionEngine.populate_builtin_transformation_functions(
            training_dataset, feature_view_obj, dataset
        )
        if training_dataset.coalesce:
            dataset = dataset.coalesce(1)
        path = training_dataset.location + "/" + training_dataset.name
        return self._write_training_dataset_single(
            training_dataset,
            training_dataset.transformation_functions,
            dataset,
            training_dataset.storage_connector,
            training_dataset.data_format,
@@ -425,12 +428,15 @@ def write_training_dataset(
        split_dataset = self._split_df(
            query_obj, training_dataset, read_options=read_options
        )
        for key in split_dataset:
            if training_dataset.coalesce:
                split_dataset[key] = split_dataset[key].coalesce(1)

            split_dataset[key] = split_dataset[key].cache()
Review thread on the cache() line above:

Should we not cache before splitting?

This article suggests adding it after the split: https://medium.com/udemy-engineering/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc. We need it afterwards, because otherwise randomSplit() would be executed twice (once for the transformation function statistics, once for writing), and if no seed is set for randomSplit, the splits are potentially different between the transformation statistics and the write. Whether it is worth caching before the split as well is debatable: randomSplit scans the data once while creating the splits (it samples while passing over the data), so for the split itself there is only one pass. For the training dataset statistics we do a df.head() to determine the length of the dataframe; if we wanted to cache, we would need to do it before that, but that should be a separate PR in the future. A sketch of this randomSplit re-evaluation behaviour follows the hunk below.
        transformation_function_engine.TransformationFunctionEngine.populate_builtin_transformation_functions(
            training_dataset, feature_view_obj, split_dataset
        )
        if training_dataset.coalesce:
            for key in split_dataset:
                split_dataset[key] = split_dataset[key].coalesce(1)
        return self._write_training_dataset_splits(
            training_dataset, split_dataset, write_options, save_mode, to_df=to_df
        )
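To make the re-evaluation argument in the thread above concrete, here is a minimal, self-contained PySpark sketch (a local session and a toy dataframe, not the Hopsworks engine code): without a seed, every action on an uncached split can re-run randomSplit over the lineage, so caching the split results pins them for both the statistics pass and the write.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(0, 100000)  # toy dataframe standing in for the query result

# No seed, which is the case the comment above worries about: the split is
# re-sampled whenever the lineage is re-evaluated.
train_df, test_df = df.randomSplit([0.8, 0.2])

# Cache the split results so subsequent actions (statistics, write) reuse the
# same materialized rows instead of re-running randomSplit.
train_df = train_df.cache()
test_df = test_df.cache()

print(train_df.count(), test_df.count())  # first action fills the cache
print(train_df.count(), test_df.count())  # identical counts on reuse

train_df.unpersist()
test_df.unpersist()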
@@ -499,7 +505,7 @@ def _write_training_dataset_splits(
        for split_name, feature_dataframe in feature_dataframes.items():
            split_path = training_dataset.location + "/" + str(split_name)
            feature_dataframes[split_name] = self._write_training_dataset_single(
                training_dataset,
                training_dataset.transformation_functions,
                feature_dataframes[split_name],
                training_dataset.storage_connector,
                training_dataset.data_format,
@@ -539,6 +545,8 @@ def _write_training_dataset_single(
            save_mode
        ).save(path)

        feature_dataframe.unpersist()

    def read(self, storage_connector, data_format, read_options, location):
        if isinstance(location, str):
            if data_format.lower() in ["delta", "parquet", "hudi", "orc", "bigquery"]:
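As a rough illustration of the cache/write/unpersist lifecycle the hunk above closes (a standalone toy example with an assumed local session and a throwaway output path, not the engine code), unpersisting after the save releases the executor storage once the dataframe has served both the statistics pass and the write:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(0, 1000).cache()  # materialized once, reused below
df.count()                         # stand-in for the statistics pass

# coalesce(1) mirrors the training_dataset.coalesce option: a single
# partition, hence a single output file for the written dataset.
df.coalesce(1).write.format("parquet").mode("overwrite").save("/tmp/tds_example")

df.unpersist()                     # free the cached blocks after writing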
Review thread on the unsplit-path dataset.cache():

There is no split. Why cache here?

I think I did this to cache the query. If we do the split, we cache the split result instead. I would have liked to cache the query as well when we split, but I was a bit concerned about the potential memory consumption when caching both before and after the split.

If there is no split, I think we should not cache the result; users can cache it themselves. The purpose of caching is to return consistent results.

Well okay, the reasoning behind it was to prevent 2x execution in the case of 1. having transformation functions and needing to calculate statistics, and 2. having to write the dataframe to disk. But I agree that placing the caching here without checking for this particular case is not great. Shall we still keep it for the special case though?

Ah right. Need to calculate statistics. We should cache then.

I created a Jira here to add it later: https://hopsworks.atlassian.net/browse/FSTORE-317
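For the unsplit case discussed in this thread, the special-case behaviour would look roughly like the sketch below (compute_statistics and write_dataset are hypothetical placeholders, not Hopsworks APIs): caching only when transformation-function statistics are needed avoids evaluating the source query twice, while plain writes stay uncached and are left to the user.

def materialize(dataset, needs_statistics, compute_statistics, write_dataset):
    # Cache only when a statistics pass precedes the write; otherwise the
    # single write action evaluates the query exactly once anyway.
    if needs_statistics:
        dataset = dataset.cache()
        compute_statistics(dataset)  # first action populates the cache

    write_dataset(dataset)           # reuses the cached rows if present

    if needs_statistics:
        dataset.unpersist()          # release executor storage afterwards
    return dataset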