Add Dataset.from_spark #5701
Conversation
The documentation is not available anymore as the PR was closed or merged.
@mariosasko Would you or another HF datasets maintainer be able to review this, please?
Thanks for the work done, @maddiedawson.
Please note there was a previous PR adding this functionality. However, that conversion was rejected by @lhoestq: see comment #5401 (comment):
"Thinking more about it I don't really see how those two methods help in practice, since one can already do datasets <-> pandas <-> spark and those two methods don't add value over this."
Amazing! Great job @maddiedawson. Do you know if it's possible to also support writing to Parquet using the HF ParquetWriter? Parquet is often used when people want to stream the data to train models, which is suitable for big datasets. On the other hand, Arrow is generally used for local memory mapping with random access.
Am I right to say that it uses the Spark workers to prepare the Arrow files? If so, this should make the data preparation fast and avoid filling up the executor's memory as in the previously proposed PR.
Thanks for taking a look! Unlike the previous PR's approach, this implementation takes advantage of Spark mapping to distribute file writing over multiple tasks. (Also, it doesn't load the entire dataset into memory :) ) Supporting Parquet here sgtm; I'll modify the PR. I also updated the PR description with a common Spark-HF use case that we want to improve.
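A minimal sketch of the idea described here, under stated assumptions: `df` is the Spark DataFrame being converted, the cache path and helper names are illustrative, and this is not the PR's actual code. Each Spark task converts its own partition and writes its own file, so the driver never materializes the dataset.

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

CACHE_DIR = "/tmp/hf_spark_cache"  # hypothetical cache location

def write_partition(index, rows):
    """Runs on a Spark worker: convert one partition into a Parquet file."""
    records = [row.asDict() for row in rows]
    if records:
        os.makedirs(CACHE_DIR, exist_ok=True)
        table = pa.Table.from_pylist(records)
        pq.write_table(table, f"{CACHE_DIR}/part-{index:05d}.parquet")
    yield index  # only the partition index is sent back to the driver

# Distribute the writes across the cluster; driver memory stays flat.
written_parts = df.rdd.mapPartitionsWithIndex(write_partition).collect()
```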
Hey @albertvillanova @lhoestq, would one of you be able to re-review please? Thank you!
Good job with this, I left a few comments :)
("3", 3, 3.0), | ||
] | ||
df = spark.createDataFrame(data, "col_1: string, col_2: int, col_3: float") | ||
dataset = Dataset.from_spark(df) |
Could you add a test to make sure that calling from_spark twice in a row with two different DataFrames results in two different Dataset objects?
The DatasetBuilder uses a cache_dir with subdirectories named after the builder ("spark") and the arguments of the dataset to generate. Since the **kwargs passed to DatasetBuilder.__init__ won't change between two from_spark calls, it may reuse the same cache and return the same dataset.
If this is the case, we can add a df_id parameter to the SparkConfig, which can be set to the Python id of the Spark DataFrame, for example (or any other identifier of the Spark DataFrame).
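A hedged sketch of the test being requested; the `spark` fixture, the data, and the assertions are illustrative and not necessarily what the PR ended up adding:

```python
from datasets import Dataset

def test_from_spark_two_dataframes_give_two_datasets(spark):
    # `spark` is assumed to be a pytest fixture providing a SparkSession.
    df1 = spark.createDataFrame([("a", 1)], "col_1: string, col_2: int")
    df2 = spark.createDataFrame([("b", 2)], "col_1: string, col_2: int")
    ds1 = Dataset.from_spark(df1)
    ds2 = Dataset.from_spark(df2)
    # If the cache key ignored the DataFrame, ds2 would silently reuse ds1's files.
    assert ds1[0]["col_1"] == "a"
    assert ds2[0]["col_1"] == "b"
```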
I modified the DatasetBuilder to cache based on the DataFrame's semanticHash and added a test for it. This isn't perfect; e.g., if a user generates a dataset from the query "SELECT * FROM T", adds rows to the table, and then regenerates the dataset, the semantic hash for the DataFrame will be the same, so the user will get stale data. However, they can pass the force_download=True option to from_spark to force redownloading. Wdyt?
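Roughly, the idea is to fold a DataFrame-derived fingerprint into the cache key; the snippet below only illustrates that idea and is not the builder's actual code (`cache_subdir` is a made-up name):

```python
# pyspark.sql.DataFrame.semanticHash() returns an int derived from the logical plan,
# so two DataFrames with the same plan map to the same cache entry.
fingerprint = df.semanticHash()
cache_subdir = f"spark/default-{fingerprint}"  # illustrative cache-key layout only
```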
Looks good! (I was off the last few days, sorry for the late reply)
Just one question: would someone be able to reuse the cached dataset if they call .from_spark using the same DataFrame but in different sessions?
Unfortunately no, the same df will generate a different semantic hash in different Spark sessions.
(force-pushed from 2eb0566 to 81dc1c8)
@lhoestq this is ready for another pass! Thanks so much 🙏
(force-pushed from 0f3c346 to 34b80db)
Friendly ping @lhoestq, also cc @polinaeterna who may be able to help take a look?
Looks good to me :) cc @albertvillanova as well. I added my final comments/questions above.
("3", 3, 3.0), | ||
] | ||
df = spark.createDataFrame(data, "col_1: string, col_2: int, col_3: float") | ||
dataset = Dataset.from_spark(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one question: could someone be able to reuse the cached dataset if they call .from_spark
using the same DataFrame but on different sessions ?
Merging
(force-pushed from 03f9d73 to fb25c42)
Just rebased @lhoestq
Awesome! There's just a test that needs to be fixed before merging.
I'd also be interested in having someone else review it, and then we're good to merge :)
Thanks @lhoestq! Is there a way for me to trigger the GitHub workflow myself to triage the test failure? I'm not able to repro the test failures locally.
There were two test issues in the workflow that I wasn't able to reproduce locally:
Also, one more question @lhoestq: when is the next datasets release? We're hoping this can make it in.
I just re-ran the CI.
(force-pushed from 0cb3f37 to 4abc961)
(force-pushed from 4abc961 to db85f2f)
The remaining CI issues have been addressed! They were
I ran the CI on my forked branch: maddiedawson#2. Everything passes except one instance of tests/test_metric_common.py::LocalMetricTest::test_load_metric_frugalscore; it looks like a flake. @lhoestq, granted that the CI passes here, is this ok to merge and release? We'd like to put out a blog post tomorrow to broadcast this to Spark users!
Thanks @lhoestq! Could you help take a look at the error please? Seems unrelated:
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_map_multiprocessing_on_disk - NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\Users\RUNNER~1\AppData\Local\Temp\tmptfnrdj4x\cache-5c5687cf5629c97a_00000_of_00002.arrow'
The blog is live btw! https://www.databricks.com/blog/contributing-spark-loader-for-hugging-face-datasets Hopefully there can be a release today?
Python 3.9.2. Did the dataset import load data from a Spark DataFrame using a multi-node Spark cluster? Error:
Hi @yanzia12138! Could you open a new issue please and share the full stack trace? This will help to know what happened exactly.
Adds a static method Dataset.from_spark to create datasets from Spark DataFrames.
This approach relieves users of the need to materialize their DataFrame: a common use case is that the user loads their dataset into a DataFrame, uses Spark to apply some transformation to some of the columns, and then wants to train on the dataset.
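A minimal usage sketch of that workflow (the source path and the preprocessing step are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F
from datasets import Dataset

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw/")      # hypothetical source table
df = df.withColumn("text", F.lower(F.col("text")))  # Spark-side preprocessing
ds = Dataset.from_spark(df)                         # no driver-side materialization
ds = ds.train_test_split(test_size=0.1)             # ready to hand to a trainer
```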
Related issue: #5678