[HOPSWORKS-3323] Fix TDS creation in PySpark client and add explicit caching #784
Conversation
if training_dataset.coalesce:
    split_dataset[key] = split_dataset[key].coalesce(1)

split_dataset[key] = split_dataset[key].cache()
Should we not cache before splitting?
This article suggests adding it after the split: https://medium.com/udemy-engineering/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc. We need it afterwards, because otherwise the randomSplit() will be executed twice (once for transformation function statistics, once for writing). If no seed is set for the randomSplit, the splits are potentially different between the transformation stats and the writing. Whether it is worth caching before the split as well is debatable: randomSplit scans the data only once while creating the splits (because it samples while passing over it), so for the split itself we only have one pass. In the training dataset statistics we do a df.head() to determine the length of the dataframe; if we want to cache, we would need to do it before that. But that should be a separate PR in the future.
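For illustration, a minimal self-contained PySpark sketch of the effect described above (the DataFrame and output path are made up for the example, not the actual hsfs code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)  # stand-in for the training dataset query

    # Without a seed, every action re-executes randomSplit, so the
    # statistics pass and the write pass can land on different splits.
    train_df, test_df = df.randomSplit([0.8, 0.2])

    # Caching right after the split pins its result, so both passes
    # (transformation-function statistics and the write) see the same rows.
    train_df = train_df.cache()
    test_df = test_df.cache()

    train_df.count()  # first action, e.g. statistics computation
    train_df.write.mode("overwrite").parquet("/tmp/train")  # second action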
python/hsfs/engine/spark.py
if training_dataset.coalesce:
    dataset = dataset.coalesce(1)

dataset = dataset.cache()
There is no split. Why cache here?
I think I did this to cache the query. If we do the split we cache the split result instead. I would have liked to cache the query as well when we split, but I was a bit concerned about the potential memory consumption when caching before and after the split.
If there is no split, I think we should not cache the result; users can cache it themselves. The purpose of caching is to return a consistent result.
Well okay, the reasoning behind it was to prevent 2x execution in the case of 1. having transformation functions and needing to calculate statistics and 2. having to write the df to disk. But I agree that placing the caching here without checking for this particular case is not great. Shall we still keep it for the special case, though?
Ah right. Need to calculate statistics. We should cache then.
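A rough sketch of what that conditional caching could look like (the transformation_functions attribute here is an assumption for illustration, not necessarily the actual hsfs field):

    # Hypothetical: only cache when a second pass is known to happen,
    # i.e. when transformation functions force a statistics pass before writing.
    if training_dataset.coalesce:
        dataset = dataset.coalesce(1)
    if training_dataset.transformation_functions:  # assumed attribute
        # statistics + write = two actions over the same query; cache once
        dataset = dataset.cache()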
I created a Jira issue to add it later: https://hopsworks.atlassian.net/browse/FSTORE-317
LGTM!
… caching (logicalclocks#784) * fix create training dataset in pyspark, added caching * fixed stylecheck * fixed stylecheck * revert caching for non-split dfs (cherry picked from commit 2c6864d)
… caching (logicalclocks#784) * fix create training dataset in pyspark, added caching * fixed stylecheck * fixed stylecheck * revert caching for non-split dfs
This PR contains fixes and improvements for the FeatureView API
_write_training_dataset_single instead of training dataset
JIRA Issue: https://hopsworks.atlassian.net/browse/HOPSWORKS-3323
Priority for Review: high
Related PRs: -
How Has This Been Tested?
Checklist For The Assigned Reviewer: