
Add IterableDataset.from_spark #5770

Merged: 7 commits into huggingface:main from maddiedawson's streaming branch, May 17, 2023

Conversation

@maddiedawson (Contributor) commented Apr 18, 2023

Follow-up from #5701

Related issue: #5678

@HuggingFaceDocBuilderDev commented Apr 19, 2023

The documentation is not available anymore as the PR was closed or merged.

@maddiedawson force-pushed the streaming branch 6 times, most recently from cd77191 to e182947, on April 27, 2023 16:18
@maddiedawson changed the title from "[WIP] Support streaming in Dataset.from_spark" to "Support streaming in Dataset.from_spark" on Apr 27, 2023
@maddiedawson (Contributor, Author)

Hi again @lhoestq, this is ready for review! I'm not sure I have permission to add people to the reviewers list...

@lhoestq (Member) commented Apr 27, 2023

Cool! I think you can define IterableDataset.from_spark instead of adding streaming= to Dataset.from_spark; it could be more intuitive IMO :)

```python
    return generate_fn


class SparkExamplesIterable(_BaseExamplesIterable):
```
@lhoestq (Member) commented on this diff:

love it!

@maddiedawson (Contributor, Author)

Thanks for reviewing! I moved the streaming behavior to IterableDataset.from_spark.

@maddiedawson changed the title from "Support streaming in Dataset.from_spark" to "Add IterableDataset.from_spark" on Apr 27, 2023
@lhoestq (Member) left a comment


Cool, thanks! I added one last comment.

Also, if you want, feel free to enrich the documentation in use_with_spark.mdx, but we can also do that in a subsequent PR, e.g. add a dedicated new section for it and explain the advantages compared to Dataset.from_spark.
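For illustration, a minimal sketch of the kind of example such a docs section might show (a sketch, not the PR's actual docs; it assumes a local SparkSession, the made-up DataFrame below, Dataset.from_spark from #5701, and the IterableDataset.from_spark added in this PR):

```python
from pyspark.sql import SparkSession
from datasets import Dataset, IterableDataset

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(i, f"text {i}") for i in range(1000)], ["id", "text"])

# Dataset.from_spark materializes the Spark DataFrame into cache files first.
ds = Dataset.from_spark(df)

# IterableDataset.from_spark streams examples instead of writing cache files.
ids = IterableDataset.from_spark(df)
for example in ids:
    print(example)
    break
```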

(Resolved review comment on src/datasets/packaged_modules/spark/spark.py, now outdated)
@maddiedawson force-pushed the streaming branch 2 times, most recently from bc84cd7 to cf2402c, on May 3, 2023 23:31
@maddiedawson requested a review from lhoestq on May 3, 2023 23:36
@maddiedawson (Contributor, Author)

Thanks, Quentin! I'll flesh out the docs in a follow-up PR.

@lu-wang-dl left a comment


LGTM

@maddiedawson (Contributor, Author)

Friendly ping @lhoestq

@lhoestq (Member) left a comment


Awesome! I just added comments to fix a few things.

Also, partition_order doesn't seem to have any impact right now; can you check if something is missing? Feel free to add tests for shuffle_data_sources and shard_data_sources to make sure they work as expected ;)
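For illustration, a hypothetical sketch of such tests (not the PR's actual test code; it assumes SparkExamplesIterable follows the _BaseExamplesIterable interface, with shuffle_data_sources taking a numpy random generator and shard_data_sources taking a worker id and worker count, and that iterating yields (key, example) pairs):

```python
import numpy as np
from pyspark.sql import SparkSession

# Assumed import path for the class shown in the diff above.
from datasets.packaged_modules.spark.spark import SparkExamplesIterable

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.range(100).repartition(4)  # 4 partitions to shuffle/shard

ex_iterable = SparkExamplesIterable(df)

# Shuffling should permute the partition order without dropping rows:
# the same keys come back, just in a different order.
shuffled = ex_iterable.shuffle_data_sources(np.random.default_rng(42))
assert sorted(k for k, _ in shuffled) == sorted(k for k, _ in ex_iterable)

# Sharding across 2 workers should split the partitions disjointly,
# so the union of both shards covers all 100 rows exactly once.
shards = [ex_iterable.shard_data_sources(worker_id, num_workers=2) for worker_id in (0, 1)]
keys = [k for shard in shards for k, _ in shard]
assert len(keys) == len(set(keys)) == 100
```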

(Resolved review comment on docs/source/use_with_spark.mdx)
```python
):
    def generate_fn():
        row_id = 0
        # passing True enables prefetching of the next partition
        for row in df.rdd.toLocalIterator(True):
```
@lhoestq (Member) commented on May 9, 2023

Do you know if there's a way to iterate over pyarrow Tables? Or over pandas DataFrames? I expect it would also make it much faster to query the rows by batch.

Also asking because I'm implementing #5821 and it will be helpful :)

@maddiedawson (Contributor, Author)

Hmm, depending on your use case, maybe pyarrow.Table's to_reader (https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)? Either that or to_batches; both are zero-copy. For pandas, there's iterrows (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas-dataframe-iterrows), but I think that makes a copy.
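For illustration, a minimal sketch of both approaches (the table contents are made up; it assumes pyarrow >= 8, where Table.to_reader is available):

```python
import pyarrow as pa

table = pa.table({"id": list(range(100_000)), "label": [0] * 100_000})

# to_reader returns a RecordBatchReader that yields RecordBatches lazily.
for batch in table.to_reader(max_chunksize=1024):
    print(batch.num_rows)  # process each batch of up to 1024 rows

# to_batches returns the same data as a list of (zero-copy) RecordBatches.
for batch in table.to_batches(max_chunksize=1024):
    print(batch.num_rows)
```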

@lhoestq (Member)

Yeah, I mean: is it possible to get a Table or pandas DataFrame iterator from a Spark DataFrame?

@maddiedawson (Contributor, Author)

I don't know of a great way besides using toPandas on the df, which will collect the entire thing to the driver.
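For reference, a minimal sketch of that approach and its caveat (df stands for any pyspark DataFrame here):

```python
# toPandas collects every row of the distributed DataFrame onto the
# driver, so it only works when the full dataset fits in driver memory.
pdf = df.toPandas()
for row in pdf.itertuples(index=False):
    print(row)
```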

@lhoestq (Member)

Ok thanks for the info :)

Have you had a chance to test the speed of toLocalIterator? Ideally it should be fast enough to be able to train models efficiently 🤞

@maddiedawson (Contributor, Author)

It's about twice as fast as the non-streaming version.

@lhoestq (Member)

Awesome! This should then be the preferred way showcased in the documentation 😎

(Three resolved review comments on src/datasets/packaged_modules/spark/spark.py, now outdated)
@maddiedawson (Contributor, Author)

Thanks @lhoestq! I fixed the partition order issue and added more unit tests.

@maddiedawson requested a review from lhoestq on May 10, 2023 05:40
@lhoestq (Member) left a comment


LGTM, good job implementing this!

@lhoestq merged commit 7790ebd into huggingface:main on May 17, 2023
@github-actions (bot)

Benchmarks

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006165 / 0.011353 (-0.005188) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004497 / 0.011008 (-0.006511) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.099142 / 0.038508 (0.060634) |
| read_batch_unformated after write_array2d | 0.027479 / 0.023109 (0.004369) |
| read_batch_unformated after write_flattened_sequence | 0.352491 / 0.275898 (0.076593) |
| read_batch_unformated after write_nested_sequence | 0.402993 / 0.323480 (0.079513) |
| read_col_formatted_as_numpy after write_array2d | 0.004885 / 0.007986 (-0.003100) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003315 / 0.004328 (-0.001013) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.075787 / 0.004250 (0.071537) |
| read_col_unformated after write_array2d | 0.035320 / 0.037052 (-0.001732) |
| read_col_unformated after write_flattened_sequence | 0.368401 / 0.258489 (0.109912) |
| read_col_unformated after write_nested_sequence | 0.409090 / 0.293841 (0.115249) |
| read_formatted_as_numpy after write_array2d | 0.030125 / 0.128546 (-0.098421) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011670 / 0.075646 (-0.063976) |
| read_formatted_as_numpy after write_nested_sequence | 0.324381 / 0.419271 (-0.094890) |
| read_unformated after write_array2d | 0.050815 / 0.043533 (0.007283) |
| read_unformated after write_flattened_sequence | 0.352598 / 0.255139 (0.097460) |
| read_unformated after write_nested_sequence | 0.389189 / 0.283200 (0.105989) |
| write_array2d | 0.092873 / 0.141683 (-0.048810) |
| write_flattened_sequence | 1.485140 / 1.452155 (0.032986) |
| write_nested_sequence | 1.545586 / 1.492716 (0.052869) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.199522 / 0.018006 (0.181516) |
| get_batch_of_1024_rows | 0.404576 / 0.000490 (0.404087) |
| get_first_row | 0.003322 / 0.000200 (0.003122) |
| get_last_row | 0.000074 / 0.000054 (0.000020) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.022945 / 0.037411 (-0.014466) |
| shard | 0.095512 / 0.014526 (0.080987) |
| shuffle | 0.103077 / 0.176557 (-0.073480) |
| sort | 0.163918 / 0.737135 (-0.573217) |
| train_test_split | 0.105560 / 0.296338 (-0.190779) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.417360 / 0.215209 (0.202151) |
| read 50000 | 4.161693 / 2.077655 (2.084039) |
| read_batch 50000 10 | 1.851941 / 1.504120 (0.347821) |
| read_batch 50000 100 | 1.649872 / 1.541195 (0.108677) |
| read_batch 50000 1000 | 1.682099 / 1.468490 (0.213609) |
| read_formatted numpy 5000 | 0.693187 / 4.584777 (-3.891590) |
| read_formatted pandas 5000 | 3.462528 / 3.745712 (-0.283184) |
| read_formatted tensorflow 5000 | 1.839893 / 5.269862 (-3.429968) |
| read_formatted torch 5000 | 1.155945 / 4.565676 (-3.409731) |
| read_formatted_batch numpy 5000 10 | 0.082611 / 0.424275 (-0.341664) |
| read_formatted_batch numpy 5000 1000 | 0.012076 / 0.007607 (0.004469) |
| shuffled read 5000 | 0.514325 / 0.226044 (0.288280) |
| shuffled read 50000 | 5.155052 / 2.268929 (2.886123) |
| shuffled read_batch 50000 10 | 2.307280 / 55.444624 (-53.137345) |
| shuffled read_batch 50000 100 | 1.966483 / 6.876477 (-4.909994) |
| shuffled read_batch 50000 1000 | 2.018892 / 2.142072 (-0.123181) |
| shuffled read_formatted numpy 5000 | 0.803068 / 4.805227 (-4.002159) |
| shuffled read_formatted_batch numpy 5000 10 | 0.152213 / 6.500664 (-6.348451) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.066320 / 0.075469 (-0.009149) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.218578 / 1.841788 (-0.623209) |
| map fast-tokenizer batched | 13.563869 / 8.074308 (5.489561) |
| map identity | 13.954596 / 10.191392 (3.763204) |
| map identity batched | 0.151527 / 0.680424 (-0.528897) |
| map no-op batched | 0.016655 / 0.534201 (-0.517546) |
| map no-op batched numpy | 0.380637 / 0.579283 (-0.198646) |
| map no-op batched pandas | 0.395854 / 0.434364 (-0.038509) |
| map no-op batched pytorch | 0.459111 / 0.540337 (-0.081226) |
| map no-op batched tensorflow | 0.560219 / 1.386936 (-0.826717) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006427 / 0.011353 (-0.004926) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004728 / 0.011008 (-0.006280) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.080525 / 0.038508 (0.042017) |
| read_batch_unformated after write_array2d | 0.027294 / 0.023109 (0.004185) |
| read_batch_unformated after write_flattened_sequence | 0.414688 / 0.275898 (0.138790) |
| read_batch_unformated after write_nested_sequence | 0.449882 / 0.323480 (0.126402) |
| read_col_formatted_as_numpy after write_array2d | 0.004771 / 0.007986 (-0.003214) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003402 / 0.004328 (-0.000926) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.078748 / 0.004250 (0.074497) |
| read_col_unformated after write_array2d | 0.037046 / 0.037052 (-0.000007) |
| read_col_unformated after write_flattened_sequence | 0.417398 / 0.258489 (0.158909) |
| read_col_unformated after write_nested_sequence | 0.462921 / 0.293841 (0.169080) |
| read_formatted_as_numpy after write_array2d | 0.030364 / 0.128546 (-0.098182) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011810 / 0.075646 (-0.063837) |
| read_formatted_as_numpy after write_nested_sequence | 0.089787 / 0.419271 (-0.329485) |
| read_unformated after write_array2d | 0.039806 / 0.043533 (-0.003727) |
| read_unformated after write_flattened_sequence | 0.403401 / 0.255139 (0.148262) |
| read_unformated after write_nested_sequence | 0.439477 / 0.283200 (0.156278) |
| write_array2d | 0.088431 / 0.141683 (-0.053252) |
| write_flattened_sequence | 1.534373 / 1.452155 (0.082219) |
| write_nested_sequence | 1.592316 / 1.492716 (0.099600) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.217701 / 0.018006 (0.199695) |
| get_batch_of_1024_rows | 0.384770 / 0.000490 (0.384280) |
| get_first_row | 0.000437 / 0.000200 (0.000237) |
| get_last_row | 0.000061 / 0.000054 (0.000006) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.024952 / 0.037411 (-0.012459) |
| shard | 0.098728 / 0.014526 (0.084202) |
| shuffle | 0.106324 / 0.176557 (-0.070233) |
| sort | 0.155484 / 0.737135 (-0.581651) |
| train_test_split | 0.109503 / 0.296338 (-0.186836) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.450639 / 0.215209 (0.235430) |
| read 50000 | 4.523110 / 2.077655 (2.445455) |
| read_batch 50000 10 | 2.224810 / 1.504120 (0.720690) |
| read_batch 50000 100 | 2.119516 / 1.541195 (0.578321) |
| read_batch 50000 1000 | 2.225192 / 1.468490 (0.756702) |
| read_formatted numpy 5000 | 0.695397 / 4.584777 (-3.889380) |
| read_formatted pandas 5000 | 3.433559 / 3.745712 (-0.312153) |
| read_formatted tensorflow 5000 | 2.633127 / 5.269862 (-2.636735) |
| read_formatted torch 5000 | 1.448471 / 4.565676 (-3.117206) |
| read_formatted_batch numpy 5000 10 | 0.082262 / 0.424275 (-0.342013) |
| read_formatted_batch numpy 5000 1000 | 0.012246 / 0.007607 (0.004639) |
| shuffled read 5000 | 0.561243 / 0.226044 (0.335199) |
| shuffled read 50000 | 5.652711 / 2.268929 (3.383782) |
| shuffled read_batch 50000 10 | 2.689771 / 55.444624 (-52.754853) |
| shuffled read_batch 50000 100 | 2.359512 / 6.876477 (-4.516965) |
| shuffled read_batch 50000 1000 | 2.471098 / 2.142072 (0.329026) |
| shuffled read_formatted numpy 5000 | 0.802955 / 4.805227 (-4.002272) |
| shuffled read_formatted_batch numpy 5000 10 | 0.151142 / 6.500664 (-6.349522) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.067494 / 0.075469 (-0.007975) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.306879 / 1.841788 (-0.534909) |
| map fast-tokenizer batched | 14.030775 / 8.074308 (5.956467) |
| map identity | 12.917790 / 10.191392 (2.726398) |
| map identity batched | 0.141269 / 0.680424 (-0.539155) |
| map no-op batched | 0.016264 / 0.534201 (-0.517937) |
| map no-op batched numpy | 0.411957 / 0.579283 (-0.167326) |
| map no-op batched pandas | 0.393235 / 0.434364 (-0.041129) |
| map no-op batched pytorch | 0.505144 / 0.540337 (-0.035193) |
| map no-op batched tensorflow | 0.590660 / 1.386936 (-0.796276) |
