
Add IterableDataset.from_spark #5770

Merged: 7 commits into huggingface:main from maddiedawson's streaming branch, May 17, 2023

Conversation

@maddiedawson (Contributor) commented Apr 18, 2023

Follow-up from #5701

Related issue: #5678

@HuggingFaceDocBuilderDev commented Apr 19, 2023

The documentation is not available anymore as the PR was closed or merged.

@maddiedawson force-pushed the streaming branch 6 times, most recently from cd77191 to e182947, on April 27, 2023 16:18
@maddiedawson changed the title from "[WIP] Support streaming in Dataset.from_spark" to "Support streaming in Dataset.from_spark" on Apr 27, 2023
@maddiedawson (Contributor, Author)

Hi again @lhoestq, this is ready for review! I'm not sure I have permission to add people to the reviewers list...

@lhoestq (Member) commented Apr 27, 2023

Cool! I think you can define IterableDataset.from_spark instead of adding streaming= to Dataset.from_spark; it could be more intuitive IMO :)

```python
    return generate_fn


class SparkExamplesIterable(_BaseExamplesIterable):
```
@lhoestq (Member) commented on this diff:

love it!

@maddiedawson (Contributor, Author)

Thanks for reviewing! I moved the streaming behavior to IterableDataset.from_spark.

@maddiedawson changed the title from "Support streaming in Dataset.from_spark" to "Add IterableDataset.from_spark" on Apr 27, 2023
@lhoestq (Member) left a comment


Cool, thanks! I added one last comment.

Also, if you want, feel free to enrich the documentation in use_with_spark.mdx, but we can also do that in a subsequent PR, e.g. add a dedicated new section for it and explain the advantages compared to Dataset.from_spark.
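For illustration, a minimal sketch of the kind of example such a docs section might show (a sketch, not the PR's actual docs; it assumes a local SparkSession, the made-up DataFrame below, Dataset.from_spark from #5701, and the IterableDataset.from_spark added in this PR):

```python
from pyspark.sql import SparkSession
from datasets import Dataset, IterableDataset

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(i, f"text {i}") for i in range(1000)], ["id", "text"])

# Dataset.from_spark materializes the Spark DataFrame into cache files first.
ds = Dataset.from_spark(df)

# IterableDataset.from_spark streams examples instead of writing cache files.
ids = IterableDataset.from_spark(df)
for example in ids:
    print(example)
    break
```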

(Resolved review comment on src/datasets/packaged_modules/spark/spark.py, now outdated)
@maddiedawson force-pushed the streaming branch 2 times, most recently from bc84cd7 to cf2402c, on May 3, 2023 23:31
@maddiedawson requested a review from lhoestq on May 3, 2023 23:36
@maddiedawson (Contributor, Author)

Thanks, Quentin! I'll flesh out the docs in a follow-up PR.

@lu-wang-dl left a comment


LGTM

@maddiedawson (Contributor, Author)

Friendly ping @lhoestq

@lhoestq (Member) left a comment


Awesome! I just added comments to fix a few things.

Also, partition_order doesn't seem to have any impact right now; can you check if something is missing? Feel free to add tests for shuffle_data_sources and shard_data_sources to make sure they work as expected ;)
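For illustration, a hypothetical sketch of such tests (not the PR's actual test code; it assumes SparkExamplesIterable follows the _BaseExamplesIterable interface, with shuffle_data_sources taking a numpy random generator and shard_data_sources taking a worker id and worker count, and that iterating yields (key, example) pairs):

```python
import numpy as np
from pyspark.sql import SparkSession

# Assumed import path for the class shown in the diff above.
from datasets.packaged_modules.spark.spark import SparkExamplesIterable

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.range(100).repartition(4)  # 4 partitions to shuffle/shard

ex_iterable = SparkExamplesIterable(df)

# Shuffling should permute the partition order without dropping rows:
# the same keys come back, just in a different order.
shuffled = ex_iterable.shuffle_data_sources(np.random.default_rng(42))
assert sorted(k for k, _ in shuffled) == sorted(k for k, _ in ex_iterable)

# Sharding across 2 workers should split the partitions disjointly,
# so the union of both shards covers all 100 rows exactly once.
shards = [ex_iterable.shard_data_sources(worker_id, num_workers=2) for worker_id in (0, 1)]
keys = [k for shard in shards for k, _ in shard]
assert len(keys) == len(set(keys)) == 100
```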

(Resolved review comment on docs/source/use_with_spark.mdx)
```python
):
    def generate_fn():
        row_id = 0
        # passing True enables prefetching of the next partition
        for row in df.rdd.toLocalIterator(True):
```
@lhoestq (Member) commented on May 9, 2023

Do you know if there's a way to iterate over pyarrow Tables? Or over pandas DataFrames? I expect it would also make it much faster to query the rows by batch.

Also asking because I'm implementing #5821 and it will be helpful :)

@maddiedawson (Contributor, Author)

Hmm, depending on your use case, maybe pyarrow.Table's to_reader (https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)? Either that or to_batches; both are zero-copy. For pandas, there's iterrows (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas-dataframe-iterrows), but I think that makes a copy.
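For illustration, a minimal sketch of both approaches (the table contents are made up; it assumes pyarrow >= 8, where Table.to_reader is available):

```python
import pyarrow as pa

table = pa.table({"id": list(range(100_000)), "label": [0] * 100_000})

# to_reader returns a RecordBatchReader that yields RecordBatches lazily.
for batch in table.to_reader(max_chunksize=1024):
    print(batch.num_rows)  # process each batch of up to 1024 rows

# to_batches returns the same data as a list of (zero-copy) RecordBatches.
for batch in table.to_batches(max_chunksize=1024):
    print(batch.num_rows)
```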

@lhoestq (Member)

Yeah, I mean: is it possible to get a Table or pandas DataFrame iterator from a Spark DataFrame?

@maddiedawson (Contributor, Author)

I don't know of a great way besides using toPandas on the df, which will collect the entire thing to the driver.
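For reference, a minimal sketch of that approach and its caveat (df stands for any pyspark DataFrame here):

```python
# toPandas collects every row of the distributed DataFrame onto the
# driver, so it only works when the full dataset fits in driver memory.
pdf = df.toPandas()
for row in pdf.itertuples(index=False):
    print(row)
```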

@lhoestq (Member)

Ok thanks for the info :)

Have you had a chance to test the speed of toLocalIterator? Ideally it should be fast enough to be able to train models efficiently 🤞

@maddiedawson (Contributor, Author)

It's about twice as fast as the non-streaming version.

@lhoestq (Member)

Awesome! This should then be the preferred way showcased in the documentation 😎

(Three resolved review comments on src/datasets/packaged_modules/spark/spark.py, now outdated)
@maddiedawson (Contributor, Author)

Thanks @lhoestq! I fixed the partition order issue and added more unit tests.

@maddiedawson requested a review from lhoestq on May 10, 2023 05:40
@lhoestq (Member) left a comment


LGTM, good job implementing this!

@lhoestq merged commit 7790ebd into huggingface:main on May 17, 2023
@github-actions (bot)

Benchmarks

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006165 / 0.011353 (-0.005188) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004497 / 0.011008 (-0.006511) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.099142 / 0.038508 (0.060634) |
| read_batch_unformated after write_array2d | 0.027479 / 0.023109 (0.004369) |
| read_batch_unformated after write_flattened_sequence | 0.352491 / 0.275898 (0.076593) |
| read_batch_unformated after write_nested_sequence | 0.402993 / 0.323480 (0.079513) |
| read_col_formatted_as_numpy after write_array2d | 0.004885 / 0.007986 (-0.003100) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003315 / 0.004328 (-0.001013) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.075787 / 0.004250 (0.071537) |
| read_col_unformated after write_array2d | 0.035320 / 0.037052 (-0.001732) |
| read_col_unformated after write_flattened_sequence | 0.368401 / 0.258489 (0.109912) |
| read_col_unformated after write_nested_sequence | 0.409090 / 0.293841 (0.115249) |
| read_formatted_as_numpy after write_array2d | 0.030125 / 0.128546 (-0.098421) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011670 / 0.075646 (-0.063976) |
| read_formatted_as_numpy after write_nested_sequence | 0.324381 / 0.419271 (-0.094890) |
| read_unformated after write_array2d | 0.050815 / 0.043533 (0.007283) |
| read_unformated after write_flattened_sequence | 0.352598 / 0.255139 (0.097460) |
| read_unformated after write_nested_sequence | 0.389189 / 0.283200 (0.105989) |
| write_array2d | 0.092873 / 0.141683 (-0.048810) |
| write_flattened_sequence | 1.485140 / 1.452155 (0.032986) |
| write_nested_sequence | 1.545586 / 1.492716 (0.052869) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.199522 / 0.018006 (0.181516) |
| get_batch_of_1024_rows | 0.404576 / 0.000490 (0.404087) |
| get_first_row | 0.003322 / 0.000200 (0.003122) |
| get_last_row | 0.000074 / 0.000054 (0.000020) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.022945 / 0.037411 (-0.014466) |
| shard | 0.095512 / 0.014526 (0.080987) |
| shuffle | 0.103077 / 0.176557 (-0.073480) |
| sort | 0.163918 / 0.737135 (-0.573217) |
| train_test_split | 0.105560 / 0.296338 (-0.190779) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.417360 / 0.215209 (0.202151) |
| read 50000 | 4.161693 / 2.077655 (2.084039) |
| read_batch 50000 10 | 1.851941 / 1.504120 (0.347821) |
| read_batch 50000 100 | 1.649872 / 1.541195 (0.108677) |
| read_batch 50000 1000 | 1.682099 / 1.468490 (0.213609) |
| read_formatted numpy 5000 | 0.693187 / 4.584777 (-3.891590) |
| read_formatted pandas 5000 | 3.462528 / 3.745712 (-0.283184) |
| read_formatted tensorflow 5000 | 1.839893 / 5.269862 (-3.429968) |
| read_formatted torch 5000 | 1.155945 / 4.565676 (-3.409731) |
| read_formatted_batch numpy 5000 10 | 0.082611 / 0.424275 (-0.341664) |
| read_formatted_batch numpy 5000 1000 | 0.012076 / 0.007607 (0.004469) |
| shuffled read 5000 | 0.514325 / 0.226044 (0.288280) |
| shuffled read 50000 | 5.155052 / 2.268929 (2.886123) |
| shuffled read_batch 50000 10 | 2.307280 / 55.444624 (-53.137345) |
| shuffled read_batch 50000 100 | 1.966483 / 6.876477 (-4.909994) |
| shuffled read_batch 50000 1000 | 2.018892 / 2.142072 (-0.123181) |
| shuffled read_formatted numpy 5000 | 0.803068 / 4.805227 (-4.002159) |
| shuffled read_formatted_batch numpy 5000 10 | 0.152213 / 6.500664 (-6.348451) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.066320 / 0.075469 (-0.009149) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.218578 / 1.841788 (-0.623209) |
| map fast-tokenizer batched | 13.563869 / 8.074308 (5.489561) |
| map identity | 13.954596 / 10.191392 (3.763204) |
| map identity batched | 0.151527 / 0.680424 (-0.528897) |
| map no-op batched | 0.016655 / 0.534201 (-0.517546) |
| map no-op batched numpy | 0.380637 / 0.579283 (-0.198646) |
| map no-op batched pandas | 0.395854 / 0.434364 (-0.038509) |
| map no-op batched pytorch | 0.459111 / 0.540337 (-0.081226) |
| map no-op batched tensorflow | 0.560219 / 1.386936 (-0.826717) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006427 / 0.011353 (-0.004926) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004728 / 0.011008 (-0.006280) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.080525 / 0.038508 (0.042017) |
| read_batch_unformated after write_array2d | 0.027294 / 0.023109 (0.004185) |
| read_batch_unformated after write_flattened_sequence | 0.414688 / 0.275898 (0.138790) |
| read_batch_unformated after write_nested_sequence | 0.449882 / 0.323480 (0.126402) |
| read_col_formatted_as_numpy after write_array2d | 0.004771 / 0.007986 (-0.003214) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003402 / 0.004328 (-0.000926) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.078748 / 0.004250 (0.074497) |
| read_col_unformated after write_array2d | 0.037046 / 0.037052 (-0.000007) |
| read_col_unformated after write_flattened_sequence | 0.417398 / 0.258489 (0.158909) |
| read_col_unformated after write_nested_sequence | 0.462921 / 0.293841 (0.169080) |
| read_formatted_as_numpy after write_array2d | 0.030364 / 0.128546 (-0.098182) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011810 / 0.075646 (-0.063837) |
| read_formatted_as_numpy after write_nested_sequence | 0.089787 / 0.419271 (-0.329485) |
| read_unformated after write_array2d | 0.039806 / 0.043533 (-0.003727) |
| read_unformated after write_flattened_sequence | 0.403401 / 0.255139 (0.148262) |
| read_unformated after write_nested_sequence | 0.439477 / 0.283200 (0.156278) |
| write_array2d | 0.088431 / 0.141683 (-0.053252) |
| write_flattened_sequence | 1.534373 / 1.452155 (0.082219) |
| write_nested_sequence | 1.592316 / 1.492716 (0.099600) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.217701 / 0.018006 (0.199695) |
| get_batch_of_1024_rows | 0.384770 / 0.000490 (0.384280) |
| get_first_row | 0.000437 / 0.000200 (0.000237) |
| get_last_row | 0.000061 / 0.000054 (0.000006) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.024952 / 0.037411 (-0.012459) |
| shard | 0.098728 / 0.014526 (0.084202) |
| shuffle | 0.106324 / 0.176557 (-0.070233) |
| sort | 0.155484 / 0.737135 (-0.581651) |
| train_test_split | 0.109503 / 0.296338 (-0.186836) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.450639 / 0.215209 (0.235430) |
| read 50000 | 4.523110 / 2.077655 (2.445455) |
| read_batch 50000 10 | 2.224810 / 1.504120 (0.720690) |
| read_batch 50000 100 | 2.119516 / 1.541195 (0.578321) |
| read_batch 50000 1000 | 2.225192 / 1.468490 (0.756702) |
| read_formatted numpy 5000 | 0.695397 / 4.584777 (-3.889380) |
| read_formatted pandas 5000 | 3.433559 / 3.745712 (-0.312153) |
| read_formatted tensorflow 5000 | 2.633127 / 5.269862 (-2.636735) |
| read_formatted torch 5000 | 1.448471 / 4.565676 (-3.117206) |
| read_formatted_batch numpy 5000 10 | 0.082262 / 0.424275 (-0.342013) |
| read_formatted_batch numpy 5000 1000 | 0.012246 / 0.007607 (0.004639) |
| shuffled read 5000 | 0.561243 / 0.226044 (0.335199) |
| shuffled read 50000 | 5.652711 / 2.268929 (3.383782) |
| shuffled read_batch 50000 10 | 2.689771 / 55.444624 (-52.754853) |
| shuffled read_batch 50000 100 | 2.359512 / 6.876477 (-4.516965) |
| shuffled read_batch 50000 1000 | 2.471098 / 2.142072 (0.329026) |
| shuffled read_formatted numpy 5000 | 0.802955 / 4.805227 (-4.002272) |
| shuffled read_formatted_batch numpy 5000 10 | 0.151142 / 6.500664 (-6.349522) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.067494 / 0.075469 (-0.007975) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.306879 / 1.841788 (-0.534909) |
| map fast-tokenizer batched | 14.030775 / 8.074308 (5.956467) |
| map identity | 12.917790 / 10.191392 (2.726398) |
| map identity batched | 0.141269 / 0.680424 (-0.539155) |
| map no-op batched | 0.016264 / 0.534201 (-0.517937) |
| map no-op batched numpy | 0.411957 / 0.579283 (-0.167326) |
| map no-op batched pandas | 0.393235 / 0.434364 (-0.041129) |
| map no-op batched pytorch | 0.505144 / 0.540337 (-0.035193) |
| map no-op batched tensorflow | 0.590660 / 1.386936 (-0.796276) |
