[HOPSWORKS-3323] Fix TDS creation in PySpark client and add explicit caching #784

Merged: 5 commits into logicalclocks:master on Sep 21, 2022

Conversation

@tdoehmen (Contributor) commented on Sep 15, 2022

This PR contains fixes and improvements for the FeatureView API:

  • Bug fix: pass the transformation functions to _write_training_dataset_single instead of the training dataset object.
  • Improvement: cache the DataFrame before calculating transformation function statistics, to save re-computation and to keep the randomSplit result consistent between computing statistics and writing the DataFrame (see the sketch below).
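
A minimal sketch of the pattern described above (illustrative names and paths, not the actual hsfs code): cache each split before the first action, so that the statistics pass and the write both reuse the same materialized rows instead of re-running randomSplit.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "feature")

# Split once; without caching, every later action would re-evaluate this split.
train_df, test_df = df.randomSplit([0.8, 0.2])
train_df, test_df = train_df.cache(), test_df.cache()

# First action (stands in for the transformation-function statistics):
train_df.describe().show()

# Second action: the write reuses the cached split instead of re-sampling.
train_df.write.mode("overwrite").parquet("/tmp/tds/train")
test_df.write.mode("overwrite").parquet("/tmp/tds/test")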

JIRA Issue: https://hopsworks.atlassian.net/browse/HOPSWORKS-3323

Priority for Review: high

Related PRs: -

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] (Checked if all type annotations were added and/or updated appropriately)

@tdoehmen tdoehmen requested a review from kennethmhc September 15, 2022 22:18
@tdoehmen changed the title from "fix create training dataset in pyspark, added caching" to "[HOPSWORKS-3323] Fix TDS creation in PySpark client and add explicit caching" on Sep 15, 2022
if training_dataset.coalesce:
    split_dataset[key] = split_dataset[key].coalesce(1)

split_dataset[key] = split_dataset[key].cache()
@kennethmhc (Contributor) commented:

Should we not cache before splitting?

@tdoehmen (Author) replied on Sep 16, 2022:

This article suggests adding it after the split: https://medium.com/udemy-engineering/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc. We need it afterwards because otherwise randomSplit() will be executed twice (once for the transformation function statistics, once for writing). If no seed is set for randomSplit, the splits are potentially different between the statistics pass and the write. Whether it is worth caching before the split as well is debatable: randomSplit scans the data only once while creating the splits (because it samples while passing over it), so for the split itself we only have one pass. In the training dataset statistics we do a df.head() to determine the length of the dataframe; if we wanted to cache, we would need to do it before that. But that should be a separate PR in the future.
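
To make the failure mode concrete, a hedged illustration (illustrative only, not hsfs code): every action on an uncached split re-executes the upstream query, and when the upstream row order is not deterministic across re-computations, rows can silently move between splits.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000)

# Not cached: each action below re-runs the full lineage, including randomSplit.
train, test = df.randomSplit([0.8, 0.2])

# Pass 1 (statistics) and pass 2 (the write) each re-sample; if the source is
# re-read in a different order between passes, the two passes can see
# different splits.
pass_one = train.count()
pass_two = train.count()

# Caching right after the split pins one evaluation for all later actions:
train, test = train.cache(), test.cache()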

if training_dataset.coalesce:
    dataset = dataset.coalesce(1)

dataset = dataset.cache()
@kennethmhc (Contributor) commented on Sep 19, 2022:

There is no split. Why cache here?

@tdoehmen (Author) replied:

I think I did this to cache the query. If we do the split, we cache the split result instead. I would have liked to cache the query as well when we split, but I was a bit concerned about the potential memory consumption of caching both before and after the split.

@kennethmhc (Contributor) replied:

If there is no split, I think we should not cache the result; users can cache it themselves. The purpose of caching is to return consistent results.

@tdoehmen (Author) replied:

Well okay, the reasoning behind it was to prevent double execution in the case where we (1) have transformation functions and need to calculate statistics, and (2) have to write the df to disk. But I agree that placing the caching here without checking for that particular case is not great. Shall we still keep it for the special case though?

@kennethmhc (Contributor) replied:

Ah right. Need to calculate statistics. We should cache then.

@tdoehmen (Author) replied:

I created a Jira issue to add it later: https://hopsworks.atlassian.net/browse/FSTORE-317
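
For reference, a rough sketch of the conditional caching agreed on in this thread (hypothetical function and names, not the actual hsfs implementation): cache only when transformation-function statistics force a second pass over the data, and otherwise leave caching to the user.

def write_training_dataset(dataset, transformation_functions, path):
    # Two actions (statistics + write) only occur when transformation
    # functions require statistics; cache just for that case.
    needs_stats = bool(transformation_functions)
    if needs_stats:
        dataset = dataset.cache()
        dataset.describe().show()  # stand-in for the real statistics step
    dataset.write.mode("overwrite").parquet(path)
    if needs_stats:
        dataset.unpersist()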

@kennethmhc (Contributor) left a review comment:

LGTM!

@tdoehmen tdoehmen merged commit 2c6864d into logicalclocks:master Sep 21, 2022
kennethmhc pushed a commit to kennethmhc/feature-store-api that referenced this pull request Sep 22, 2022
… caching (logicalclocks#784)

* fix create training dataset in pyspark, added caching

* fixed stylecheck

* fixed stylecheck

* revert caching for non-split dfs

(cherry picked from commit 2c6864d)
SirOibaf pushed a commit that referenced this pull request Sep 26, 2022
… caching (#784)

* fix create training dataset in pyspark, added caching

* fixed stylecheck

* fixed stylecheck

* revert caching for non-split dfs

(cherry picked from commit 2c6864d)
kennethmhc pushed a commit to kennethmhc/feature-store-api that referenced this pull request Nov 16, 2022
… caching (logicalclocks#784)

* fix create training dataset in pyspark, added caching

* fixed stylecheck

* fixed stylecheck

* revert caching for non-split dfs