Reduce the number of commits in push_to_hub
#6269
Conversation
Nice! Have you tried it? I made a quick review and I think the integration should look like this indeed 👍
I used this Colab to test the new `push_to_hub` API. One thing that could be improved is the performance of …
Awesome!
I just added some comments. My main concerns are:
- a single commit can fail (time out) if there are too many operations, so we might have to do multi-commits anyway in that case (see the sketch below)
- how to let users resume a `push_to_hub` that failed mid-way, e.g. because of a connection error
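A minimal sketch of that multi-commit fallback, assuming `huggingface_hub`'s `create_commit` and `CommitOperationAdd`; the helper name and batch size are illustrative, not the final implementation:

```python
from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()

def push_in_batches(repo_id, operations, batch_size=50):
    """Create one commit per batch of operations instead of a single
    huge commit that could time out server-side."""
    for start in range(0, len(operations), batch_size):
        batch = operations[start : start + batch_size]
        api.create_commit(
            repo_id=repo_id,
            repo_type="dataset",
            operations=batch,
            commit_message=f"Upload shards (batch starting at {start})",
        )

# Usage, with one CommitOperationAdd per Parquet shard (paths/buffers assumed):
# push_in_batches("user/dataset-name",
#                 [CommitOperationAdd(path_in_repo=p, path_or_fileobj=b) for p, b in shards])
```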
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Multiple commits complicate the logic significantly. Maybe let's keep things simple and emit a warning if there are more than 100 additions (we can suggest increasing `max_shard_size` in that case). Note that we split the dataset based on the Arrow data size, which means the Parquet shards will be considerably smaller, unless the dataset contains binary fields such as image JPEGs, which are hard to compress efficiently.

They can resume by rerunning the failed `push_to_hub`.
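To make the size point concrete, here is roughly how the shard count falls out of the Arrow size (a sketch; `ds` is assumed to be a loaded `Dataset`, and 500 MB is the documented `max_shard_size` default):

```python
import math

max_shard_size = 500 * 1024**2   # push_to_hub's default of "500MB"
dataset_nbytes = ds.data.nbytes  # uncompressed in-memory Arrow size
num_shards = max(1, math.ceil(dataset_nbytes / max_shard_size))
# Shards are cut so each holds ~500 MB of *Arrow* data; once written to
# Parquet, text and numeric columns compress well, so the uploaded files
# are typically much smaller than 500 MB.
```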
I don't think we can do that; many people are uploading datasets with 100+ files, and it would break their workflow.
Awesome! I love it :)
```python
# Files that remain in the repo once the stale shards are scheduled for deletion
repo_files = list(set(files) - set(data_files_to_delete))

# Deterministic shard name, e.g. data/train-00000-of-00008.parquet
shard_path_in_repo = f"{data_dir}/{split}-{index:05d}-of-{num_shards:05d}.parquet"

# Serialize the shard to Parquet in an in-memory buffer
buffer = BytesIO()
shard.to_parquet(buffer)
```
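Continuing from the buffer above, the serialized shard can be wrapped in a `CommitOperationAdd` and pre-uploaded ahead of the batched commit; a sketch with a hypothetical repo id:

```python
from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()

# Register the in-memory Parquet bytes as an addition at the shard's path.
buffer.seek(0)
operation = CommitOperationAdd(path_in_repo=shard_path_in_repo, path_or_fileobj=buffer)

# preupload_lfs_files uploads the content right away; the final
# create_commit (at most 50 files per commit) then only references it.
api.preupload_lfs_files("user/dataset-name", additions=[operation], repo_type="dataset")
```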
(maybe for another PR)
We could only show the tqdm bar of the Parquet conversion if it takes more than 5 seconds, using the `delay` argument in tqdm.
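For reference, a sketch of what that would look like (`shards` is a stand-in iterable):

```python
from tqdm import tqdm

# With delay=5, the bar is only drawn if the loop is still running
# after 5 seconds; fast pushes stay silent.
for shard in tqdm(shards, desc="Converting shards to Parquet", delay=5):
    ...  # conversion work here
```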
Makes sense. I think we can address this in a later PR (I think our entire logging requires a little overhaul)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Well done! 👏 🔥
Congrats on merging this! 👏
Reduces the number of commits in `push_to_hub` by using the `preupload` API from huggingface/huggingface_hub#1699. Each commit contains a maximum of 50 uploaded files. A shard's fingerprint no longer needs to be added as a suffix to support resuming an upload, meaning the shards' naming scheme is the same as the initial one.

Also, it adds support for the following params: `create_pr`, `commit_message` and `revision` (`branch` is deprecated; unlike the previous implementation, this one creates the branch if it does not exist, to be consistent with `transformers`).

(Nit) This implementation keeps the markdown section of the generated README.md empty to enable importing the card template (when the card is accessed on the Hub).

Fixes #5492, fixes #6257, fixes #5045, fixes #6271

TODO:
- set the minimal required `hfh` version to the next release (once it's published)
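Because the shard names are deterministic, a rerun can skip anything that already reached the repo; a rough sketch of that resume behaviour, with hypothetical repo and shard-count values:

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "user/dataset-name"  # hypothetical
num_shards = 8                 # hypothetical

# Shard names no longer embed fingerprints, so the paths are stable
# across runs and a rerun can diff against what is already uploaded.
existing = set(api.list_repo_files(repo_id, repo_type="dataset"))
shard_paths = [f"data/train-{i:05d}-of-{num_shards:05d}.parquet" for i in range(num_shards)]
to_upload = [p for p in shard_paths if p not in existing]
```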