Reduce the number of commits in push_to_hub #6269

Merged: 19 commits, Oct 16, 2023

mariosasko (Collaborator) commented Sep 29, 2023

Reduces the number of commits in push_to_hub by using the preupload API from huggingface/huggingface_hub#1699. Each commit contains a maximum of 50 uploaded files.
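
As a rough illustration of the 50-files-per-commit cap, the sketch below splits a list of shard paths into commit-sized batches. It is not the code from this PR: `batch_for_commits` and the shard names are hypothetical, and the actual upload and commit calls (done through huggingface_hub) are left out.

```python
from typing import Iterator, List, Sequence

MAX_FILES_PER_COMMIT = 50  # cap stated in the PR description


def batch_for_commits(
    paths: Sequence[str], max_files: int = MAX_FILES_PER_COMMIT
) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `max_files` paths, one per commit."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    for start in range(0, len(paths), max_files):
        yield list(paths[start : start + max_files])


# Hypothetical shard names, for illustration only.
shards = [f"data/train-{i:05d}.parquet" for i in range(120)]
batches = list(batch_for_commits(shards))
print([len(b) for b in batches])  # [50, 50, 20]
```

With 120 files this gives three commits instead of one commit per file.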

A shard's fingerprint no longer needs to be added as a suffix to support resuming an upload, meaning the shards' naming scheme is the same as the initial one.

It also adds support for the following params: create_pr, commit_message, and revision. The branch param is deprecated in favor of revision; unlike the previous implementation, this one creates the branch if it does not exist, to be consistent with transformers.
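
A minimal usage sketch of these parameters, assuming a `datasets` release that includes this PR. The repo ID, branch name, and commit message are placeholders, and `push_to_branch` is a hypothetical helper; calling it on a real `Dataset` needs network access and a Hub token.

```python
def push_to_branch(dataset, repo_id: str, branch: str, open_pr: bool = False):
    """Push `dataset` to `branch` of `repo_id` with a custom commit message.

    `dataset` is expected to be a `datasets.Dataset` (or `DatasetDict`).
    Per the PR description, the branch is created if it does not exist.
    """
    return dataset.push_to_hub(
        repo_id,
        revision=branch,  # replaces the deprecated `branch` parameter
        commit_message=f"Upload dataset to {branch}",
        create_pr=open_pr,  # open a pull request instead of committing directly
    )


# Example call (placeholders; requires authentication):
# from datasets import Dataset
# ds = Dataset.from_dict({"text": ["a", "b"]})
# push_to_branch(ds, "username/my-dataset", "v2")
```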

(Nit) This implementation keeps the markdown section of the generated README.md empty to enable importing the card template (when the card is accessed on the Hub).

Fixes #5492, fixes #6257, fixes #5045, fixes #6271

TODO:

  • set the minimal version to the next hfh release (once it's published)

HuggingFaceDocBuilderDev commented Sep 29, 2023

The documentation is not available anymore as the PR was closed or merged.

github-actions bot commented

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.005864 / 0.011353 (-0.005489) 0.003535 / 0.011008 (-0.007474) 0.080732 / 0.038508 (0.042224) 0.057072 / 0.023109 (0.033963) 0.334342 / 0.275898 (0.058444) 0.361345 / 0.323480 (0.037865) 0.003290 / 0.007986 (-0.004696) 0.003794 / 0.004328 (-0.000534) 0.063414 / 0.004250 (0.059163) 0.046901 / 0.037052 (0.009848) 0.335973 / 0.258489 (0.077484) 0.377929 / 0.293841 (0.084088) 0.027199 / 0.128546 (-0.101348) 0.008049 / 0.075646 (-0.067597) 0.261810 / 0.419271 (-0.157462) 0.044669 / 0.043533 (0.001136) 0.333600 / 0.255139 (0.078461) 0.356362 / 0.283200 (0.073162) 0.020325 / 0.141683 (-0.121358) 1.458138 / 1.452155 (0.005984) 1.505923 / 1.492716 (0.013207)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.216456 / 0.018006 (0.198450) 0.421750 / 0.000490 (0.421261) 0.007359 / 0.000200 (0.007159) 0.000246 / 0.000054 (0.000191)
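
Each cell in these tables reads `new / old (diff)`. The sketch below parses one such row and checks that the diff matches new minus old up to rounding. That interpretation is an assumption inferred from the numbers, and `parse_cells` is a hypothetical helper, not part of the benchmark tooling.

```python
import re
from typing import List, Tuple

# One cell looks like "0.216456 / 0.018006 (0.198450)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row: str) -> List[Tuple[float, float, float]]:
    """Extract (new, old, diff) triples from a flattened benchmark row."""
    return [tuple(float(x) for x in m.groups()) for m in CELL.finditer(row)]


row = (
    "0.216456 / 0.018006 (0.198450) 0.421750 / 0.000490 (0.421261) "
    "0.007359 / 0.000200 (0.007159) 0.000246 / 0.000054 (0.000191)"
)
for new, old, diff in parse_cells(row):
    # Values are printed with 6 decimals, so allow a small rounding tolerance.
    assert abs((new - old) - diff) < 2e-6
```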

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.023400 / 0.037411 (-0.014012) 0.073363 / 0.014526 (0.058838) 0.083533 / 0.176557 (-0.093023) 0.144045 / 0.737135 (-0.593090) 0.084050 / 0.296338 (-0.212288)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.398354 / 0.215209 (0.183145) 3.982875 / 2.077655 (1.905220) 2.047299 / 1.504120 (0.543180) 1.873780 / 1.541195 (0.332585) 1.977044 / 1.468490 (0.508554) 0.497038 / 4.584777 (-4.087739) 3.039743 / 3.745712 (-0.705969) 2.832885 / 5.269862 (-2.436977) 1.827300 / 4.565676 (-2.738377) 0.057503 / 0.424275 (-0.366772) 0.006272 / 0.007607 (-0.001335) 0.468681 / 0.226044 (0.242637) 4.696551 / 2.268929 (2.427622) 2.413805 / 55.444624 (-53.030819) 2.157199 / 6.876477 (-4.719278) 2.345986 / 2.142072 (0.203914) 0.584632 / 4.805227 (-4.220595) 0.124684 / 6.500664 (-6.375980) 0.060090 / 0.075469 (-0.015379)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.293551 / 1.841788 (-0.548236) 17.198292 / 8.074308 (9.123984) 13.677910 / 10.191392 (3.486518) 0.146633 / 0.680424 (-0.533791) 0.016711 / 0.534201 (-0.517490) 0.331644 / 0.579283 (-0.247639) 0.360148 / 0.434364 (-0.074215) 0.381194 / 0.540337 (-0.159143) 0.537952 / 1.386936 (-0.848984)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006020 / 0.011353 (-0.005333) 0.003557 / 0.011008 (-0.007451) 0.061926 / 0.038508 (0.023418) 0.056246 / 0.023109 (0.033137) 0.446679 / 0.275898 (0.170781) 0.479843 / 0.323480 (0.156363) 0.004656 / 0.007986 (-0.003330) 0.002823 / 0.004328 (-0.001505) 0.061366 / 0.004250 (0.057115) 0.045793 / 0.037052 (0.008740) 0.460807 / 0.258489 (0.202318) 0.485467 / 0.293841 (0.191626) 0.028555 / 0.128546 (-0.099991) 0.007973 / 0.075646 (-0.067674) 0.068305 / 0.419271 (-0.350966) 0.040844 / 0.043533 (-0.002689) 0.463715 / 0.255139 (0.208576) 0.474553 / 0.283200 (0.191354) 0.019959 / 0.141683 (-0.121723) 1.432527 / 1.452155 (-0.019628) 1.485410 / 1.492716 (-0.007307)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.205555 / 0.018006 (0.187549) 0.408271 / 0.000490 (0.407781) 0.004325 / 0.000200 (0.004125) 0.000076 / 0.000054 (0.000022)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.026338 / 0.037411 (-0.011074) 0.080534 / 0.014526 (0.066008) 0.093935 / 0.176557 (-0.082622) 0.146446 / 0.737135 (-0.590689) 0.092890 / 0.296338 (-0.203448)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.463879 / 0.215209 (0.248670) 4.646411 / 2.077655 (2.568756) 2.567320 / 1.504120 (1.063200) 2.384376 / 1.541195 (0.843181) 2.412738 / 1.468490 (0.944248) 0.510240 / 4.584777 (-4.074537) 3.094988 / 3.745712 (-0.650724) 2.837700 / 5.269862 (-2.432161) 1.850163 / 4.565676 (-2.715513) 0.059320 / 0.424275 (-0.364955) 0.006330 / 0.007607 (-0.001277) 0.537770 / 0.226044 (0.311726) 5.385556 / 2.268929 (3.116627) 3.036088 / 55.444624 (-52.408536) 2.650464 / 6.876477 (-4.226013) 2.755676 / 2.142072 (0.613603) 0.607353 / 4.805227 (-4.197875) 0.124589 / 6.500664 (-6.376075) 0.060778 / 0.075469 (-0.014691)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.343243 / 1.841788 (-0.498545) 17.630281 / 8.074308 (9.555973) 14.401219 / 10.191392 (4.209827) 0.143252 / 0.680424 (-0.537172) 0.017880 / 0.534201 (-0.516321) 0.337391 / 0.579283 (-0.241892) 0.373531 / 0.434364 (-0.060833) 0.398408 / 0.540337 (-0.141929) 0.558925 / 1.386936 (-0.828011)

Wauplin (Contributor) left a comment

Nice! Have you tried it? I did a quick review, and I think the integration should indeed look like this 👍

(3 review comments on src/datasets/arrow_dataset.py, outdated and resolved)

github-actions bot commented Oct 2, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006552 / 0.011353 (-0.004801) 0.003853 / 0.011008 (-0.007155) 0.077673 / 0.038508 (0.039165) 0.066043 / 0.023109 (0.042934) 0.289858 / 0.275898 (0.013960) 0.299009 / 0.323480 (-0.024471) 0.004806 / 0.007986 (-0.003179) 0.003517 / 0.004328 (-0.000811) 0.058227 / 0.004250 (0.053977) 0.052134 / 0.037052 (0.015082) 0.328800 / 0.258489 (0.070311) 0.317616 / 0.293841 (0.023776) 0.028344 / 0.128546 (-0.100202) 0.007853 / 0.075646 (-0.067794) 0.291207 / 0.419271 (-0.128065) 0.052977 / 0.043533 (0.009444) 0.287548 / 0.255139 (0.032409) 0.307647 / 0.283200 (0.024448) 0.023899 / 0.141683 (-0.117784) 1.382267 / 1.452155 (-0.069888) 1.589915 / 1.492716 (0.097199)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.246244 / 0.018006 (0.228238) 0.478255 / 0.000490 (0.477766) 0.014115 / 0.000200 (0.013915) 0.000305 / 0.000054 (0.000250)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.027033 / 0.037411 (-0.010378) 0.073988 / 0.014526 (0.059462) 0.088337 / 0.176557 (-0.088219) 0.144067 / 0.737135 (-0.593069) 0.091295 / 0.296338 (-0.205043)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.365904 / 0.215209 (0.150695) 3.537330 / 2.077655 (1.459675) 1.678341 / 1.504120 (0.174221) 1.530297 / 1.541195 (-0.010898) 1.605634 / 1.468490 (0.137144) 0.437461 / 4.584777 (-4.147316) 3.419040 / 3.745712 (-0.326672) 3.203549 / 5.269862 (-2.066312) 1.913214 / 4.565676 (-2.652463) 0.052675 / 0.424275 (-0.371600) 0.006681 / 0.007607 (-0.000926) 0.429269 / 0.226044 (0.203225) 4.214051 / 2.268929 (1.945122) 2.217928 / 55.444624 (-53.226696) 1.842679 / 6.876477 (-5.033798) 1.867961 / 2.142072 (-0.274111) 0.550566 / 4.805227 (-4.254661) 0.118015 / 6.500664 (-6.382649) 0.054749 / 0.075469 (-0.020720)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.170547 / 1.841788 (-0.671241) 18.410567 / 8.074308 (10.336259) 12.729992 / 10.191392 (2.538600) 0.160426 / 0.680424 (-0.519998) 0.021259 / 0.534201 (-0.512942) 0.369573 / 0.579283 (-0.209710) 0.440350 / 0.434364 (0.005986) 0.443755 / 0.540337 (-0.096582) 0.645614 / 1.386936 (-0.741322)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.005913 / 0.011353 (-0.005440) 0.003542 / 0.011008 (-0.007466) 0.057621 / 0.038508 (0.019113) 0.065822 / 0.023109 (0.042713) 0.390847 / 0.275898 (0.114949) 0.393127 / 0.323480 (0.069647) 0.005040 / 0.007986 (-0.002945) 0.002944 / 0.004328 (-0.001384) 0.069058 / 0.004250 (0.064808) 0.051594 / 0.037052 (0.014542) 0.383745 / 0.258489 (0.125256) 0.414372 / 0.293841 (0.120531) 0.030038 / 0.128546 (-0.098508) 0.008109 / 0.075646 (-0.067538) 0.065444 / 0.419271 (-0.353828) 0.045974 / 0.043533 (0.002441) 0.401695 / 0.255139 (0.146556) 0.417834 / 0.283200 (0.134635) 0.020137 / 0.141683 (-0.121546) 1.452130 / 1.452155 (-0.000025) 1.455259 / 1.492716 (-0.037458)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.228262 / 0.018006 (0.210255) 0.455155 / 0.000490 (0.454665) 0.006667 / 0.000200 (0.006467) 0.000207 / 0.000054 (0.000153)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030159 / 0.037411 (-0.007252) 0.098478 / 0.014526 (0.083952) 0.101409 / 0.176557 (-0.075147) 0.148689 / 0.737135 (-0.588446) 0.103067 / 0.296338 (-0.193272)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.444095 / 0.215209 (0.228886) 3.991588 / 2.077655 (1.913934) 2.147845 / 1.504120 (0.643725) 2.007871 / 1.541195 (0.466676) 2.042074 / 1.468490 (0.573584) 0.451592 / 4.584777 (-4.133185) 3.439400 / 3.745712 (-0.306312) 3.107756 / 5.269862 (-2.162106) 1.909785 / 4.565676 (-2.655891) 0.051718 / 0.424275 (-0.372558) 0.006597 / 0.007607 (-0.001010) 0.480822 / 0.226044 (0.254777) 4.913235 / 2.268929 (2.644307) 2.631882 / 55.444624 (-52.812742) 2.397209 / 6.876477 (-4.479267) 2.487191 / 2.142072 (0.345119) 0.566321 / 4.805227 (-4.238906) 0.121741 / 6.500664 (-6.378924) 0.053399 / 0.075469 (-0.022070)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.256599 / 1.841788 (-0.585189) 18.891127 / 8.074308 (10.816819) 13.219662 / 10.191392 (3.028270) 0.154570 / 0.680424 (-0.525854) 0.022599 / 0.534201 (-0.511602) 0.361998 / 0.579283 (-0.217286) 0.413287 / 0.434364 (-0.021077) 0.464867 / 0.540337 (-0.075470) 0.638880 / 1.386936 (-0.748056)

github-actions bot commented Oct 2, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.010625 / 0.011353 (-0.000728) 0.005129 / 0.011008 (-0.005879) 0.119975 / 0.038508 (0.081467) 0.100128 / 0.023109 (0.077019) 0.448678 / 0.275898 (0.172780) 0.533150 / 0.323480 (0.209670) 0.005881 / 0.007986 (-0.002105) 0.007451 / 0.004328 (0.003123) 0.090792 / 0.004250 (0.086542) 0.073416 / 0.037052 (0.036363) 0.455395 / 0.258489 (0.196906) 0.497572 / 0.293841 (0.203731) 0.053112 / 0.128546 (-0.075434) 0.014619 / 0.075646 (-0.061027) 0.388023 / 0.419271 (-0.031248) 0.074004 / 0.043533 (0.030471) 0.435319 / 0.255139 (0.180180) 0.465985 / 0.283200 (0.182785) 0.046991 / 0.141683 (-0.094692) 1.895717 / 1.452155 (0.443563) 2.086600 / 1.492716 (0.593884)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.334412 / 0.018006 (0.316406) 0.645510 / 0.000490 (0.645020) 0.019175 / 0.000200 (0.018975) 0.000429 / 0.000054 (0.000374)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.034385 / 0.037411 (-0.003026) 0.108939 / 0.014526 (0.094413) 0.125937 / 0.176557 (-0.050619) 0.205643 / 0.737135 (-0.531493) 0.127662 / 0.296338 (-0.168676)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.674093 / 0.215209 (0.458884) 6.646554 / 2.077655 (4.568900) 2.837698 / 1.504120 (1.333578) 2.397199 / 1.541195 (0.856004) 2.485856 / 1.468490 (1.017366) 0.955142 / 4.584777 (-3.629635) 5.667462 / 3.745712 (1.921750) 5.354129 / 5.269862 (0.084268) 3.301609 / 4.565676 (-1.264068) 0.106051 / 0.424275 (-0.318224) 0.009287 / 0.007607 (0.001680) 0.766678 / 0.226044 (0.540634) 7.786701 / 2.268929 (5.517772) 3.665463 / 55.444624 (-51.779161) 2.982912 / 6.876477 (-3.893564) 3.053363 / 2.142072 (0.911290) 1.141090 / 4.805227 (-3.664137) 0.223975 / 6.500664 (-6.276689) 0.093024 / 0.075469 (0.017555)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.728175 / 1.841788 (-0.113613) 25.640134 / 8.074308 (17.565826) 22.124769 / 10.191392 (11.933377) 0.237489 / 0.680424 (-0.442935) 0.030353 / 0.534201 (-0.503848) 0.509371 / 0.579283 (-0.069913) 0.642320 / 0.434364 (0.207956) 0.576889 / 0.540337 (0.036552) 0.899377 / 1.386936 (-0.487559)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.010846 / 0.011353 (-0.000507) 0.005876 / 0.011008 (-0.005132) 0.090810 / 0.038508 (0.052302) 0.106651 / 0.023109 (0.083542) 0.551064 / 0.275898 (0.275166) 0.608328 / 0.323480 (0.284848) 0.007563 / 0.007986 (-0.000423) 0.004595 / 0.004328 (0.000267) 0.089125 / 0.004250 (0.084874) 0.076577 / 0.037052 (0.039525) 0.579970 / 0.258489 (0.321481) 0.620214 / 0.293841 (0.326373) 0.052577 / 0.128546 (-0.075970) 0.013734 / 0.075646 (-0.061912) 0.099825 / 0.419271 (-0.319447) 0.068391 / 0.043533 (0.024858) 0.564733 / 0.255139 (0.309594) 0.593925 / 0.283200 (0.310726) 0.037201 / 0.141683 (-0.104482) 1.880969 / 1.452155 (0.428815) 2.065094 / 1.492716 (0.572377)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.426148 / 0.018006 (0.408141) 0.673935 / 0.000490 (0.673445) 0.124190 / 0.000200 (0.123990) 0.001219 / 0.000054 (0.001164)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.040280 / 0.037411 (0.002868) 0.122042 / 0.014526 (0.107516) 0.131333 / 0.176557 (-0.045223) 0.203039 / 0.737135 (-0.534096) 0.134851 / 0.296338 (-0.161487)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.684599 / 0.215209 (0.469390) 6.727529 / 2.077655 (4.649874) 3.255228 / 1.504120 (1.751108) 2.925865 / 1.541195 (1.384670) 2.978762 / 1.468490 (1.510272) 0.931769 / 4.584777 (-3.653008) 5.988956 / 3.745712 (2.243244) 5.228049 / 5.269862 (-0.041812) 3.341470 / 4.565676 (-1.224206) 0.106737 / 0.424275 (-0.317539) 0.009847 / 0.007607 (0.002240) 0.813954 / 0.226044 (0.587909) 8.137071 / 2.268929 (5.868143) 4.140725 / 55.444624 (-51.303899) 3.500579 / 6.876477 (-3.375898) 3.623120 / 2.142072 (1.481047) 1.096634 / 4.805227 (-3.708593) 0.236938 / 6.500664 (-6.263726) 0.083099 / 0.075469 (0.007630)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.856112 / 1.841788 (0.014324) 26.531325 / 8.074308 (18.457017) 24.435866 / 10.191392 (14.244474) 0.264093 / 0.680424 (-0.416331) 0.034872 / 0.534201 (-0.499329) 0.520682 / 0.579283 (-0.058601) 0.635010 / 0.434364 (0.200646) 0.645451 / 0.540337 (0.105113) 0.914616 / 1.386936 (-0.472320)

mariosasko changed the title from "Test single commit push_to_hub API" to "Single commit `push_to_hub" on Oct 8, 2023
mariosasko changed the title from "Single commit `push_to_hub" to "Single commit push_to_hub" on Oct 8, 2023

github-actions bot commented Oct 8, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.005928 / 0.011353 (-0.005425) 0.003633 / 0.011008 (-0.007375) 0.079554 / 0.038508 (0.041046) 0.057093 / 0.023109 (0.033984) 0.311374 / 0.275898 (0.035476) 0.343778 / 0.323480 (0.020298) 0.004634 / 0.007986 (-0.003352) 0.002886 / 0.004328 (-0.001443) 0.061888 / 0.004250 (0.057637) 0.045895 / 0.037052 (0.008843) 0.316447 / 0.258489 (0.057958) 0.358141 / 0.293841 (0.064300) 0.027247 / 0.128546 (-0.101300) 0.007947 / 0.075646 (-0.067699) 0.259070 / 0.419271 (-0.160201) 0.043802 / 0.043533 (0.000269) 0.315453 / 0.255139 (0.060314) 0.335282 / 0.283200 (0.052082) 0.021096 / 0.141683 (-0.120587) 1.443219 / 1.452155 (-0.008936) 1.523140 / 1.492716 (0.030423)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.222957 / 0.018006 (0.204951) 0.414611 / 0.000490 (0.414122) 0.008354 / 0.000200 (0.008154) 0.000249 / 0.000054 (0.000195)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.023880 / 0.037411 (-0.013532) 0.074523 / 0.014526 (0.059997) 0.084803 / 0.176557 (-0.091754) 0.146701 / 0.737135 (-0.590435) 0.084990 / 0.296338 (-0.211348)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.397736 / 0.215209 (0.182527) 3.961740 / 2.077655 (1.884086) 1.909014 / 1.504120 (0.404894) 1.823026 / 1.541195 (0.281831) 1.966235 / 1.468490 (0.497745) 0.498056 / 4.584777 (-4.086721) 3.041408 / 3.745712 (-0.704304) 2.998010 / 5.269862 (-2.271852) 1.887293 / 4.565676 (-2.678384) 0.057096 / 0.424275 (-0.367179) 0.006338 / 0.007607 (-0.001269) 0.465166 / 0.226044 (0.239122) 4.667710 / 2.268929 (2.398781) 2.480798 / 55.444624 (-52.963826) 2.270701 / 6.876477 (-4.605776) 2.376470 / 2.142072 (0.234397) 0.579873 / 4.805227 (-4.225355) 0.125032 / 6.500664 (-6.375632) 0.061057 / 0.075469 (-0.014412)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.229916 / 1.841788 (-0.611872) 17.829628 / 8.074308 (9.755320) 13.860184 / 10.191392 (3.668792) 0.143507 / 0.680424 (-0.536917) 0.016943 / 0.534201 (-0.517258) 0.350106 / 0.579283 (-0.229178) 0.364547 / 0.434364 (-0.069817) 0.398889 / 0.540337 (-0.141448) 0.557948 / 1.386936 (-0.828988)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006052 / 0.011353 (-0.005301) 0.003636 / 0.011008 (-0.007372) 0.062705 / 0.038508 (0.024197) 0.057753 / 0.023109 (0.034644) 0.453219 / 0.275898 (0.177321) 0.485179 / 0.323480 (0.161699) 0.004886 / 0.007986 (-0.003100) 0.002838 / 0.004328 (-0.001490) 0.062593 / 0.004250 (0.058343) 0.047476 / 0.037052 (0.010423) 0.454266 / 0.258489 (0.195777) 0.487939 / 0.293841 (0.194098) 0.028124 / 0.128546 (-0.100422) 0.008000 / 0.075646 (-0.067647) 0.068335 / 0.419271 (-0.350937) 0.040491 / 0.043533 (-0.003042) 0.457868 / 0.255139 (0.202729) 0.476355 / 0.283200 (0.193155) 0.019557 / 0.141683 (-0.122126) 1.507111 / 1.452155 (0.054956) 1.569720 / 1.492716 (0.077003)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.209205 / 0.018006 (0.191199) 0.411782 / 0.000490 (0.411292) 0.003544 / 0.000200 (0.003344) 0.000072 / 0.000054 (0.000018)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.026569 / 0.037411 (-0.010842) 0.081213 / 0.014526 (0.066687) 0.090971 / 0.176557 (-0.085585) 0.145287 / 0.737135 (-0.591849) 0.091792 / 0.296338 (-0.204546)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.458329 / 0.215209 (0.243120) 4.574463 / 2.077655 (2.496808) 2.516693 / 1.504120 (1.012573) 2.329463 / 1.541195 (0.788269) 2.386704 / 1.468490 (0.918214) 0.503526 / 4.584777 (-4.081251) 3.113382 / 3.745712 (-0.632331) 2.872538 / 5.269862 (-2.397323) 1.865483 / 4.565676 (-2.700194) 0.058292 / 0.424275 (-0.365983) 0.006434 / 0.007607 (-0.001173) 0.530804 / 0.226044 (0.304760) 5.312666 / 2.268929 (3.043738) 2.992569 / 55.444624 (-52.452055) 2.611524 / 6.876477 (-4.264953) 2.779569 / 2.142072 (0.637497) 0.595200 / 4.805227 (-4.210028) 0.123957 / 6.500664 (-6.376707) 0.060601 / 0.075469 (-0.014868)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.345536 / 1.841788 (-0.496252) 18.183827 / 8.074308 (10.109519) 14.814084 / 10.191392 (4.622692) 0.145305 / 0.680424 (-0.535119) 0.018812 / 0.534201 (-0.515389) 0.334793 / 0.579283 (-0.244490) 0.375331 / 0.434364 (-0.059033) 0.392499 / 0.540337 (-0.147839) 0.563286 / 1.386936 (-0.823650)

@github-actions
github-actions bot commented Oct 8, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008922 / 0.011353 (-0.002431) 0.005169 / 0.011008 (-0.005840) 0.106275 / 0.038508 (0.067767) 0.076446 / 0.023109 (0.053337) 0.400207 / 0.275898 (0.124309) 0.476262 / 0.323480 (0.152782) 0.006032 / 0.007986 (-0.001954) 0.004266 / 0.004328 (-0.000063) 0.083518 / 0.004250 (0.079267) 0.059644 / 0.037052 (0.022592) 0.409094 / 0.258489 (0.150605) 0.470400 / 0.293841 (0.176559) 0.050161 / 0.128546 (-0.078385) 0.013580 / 0.075646 (-0.062066) 0.375047 / 0.419271 (-0.044224) 0.068319 / 0.043533 (0.024786) 0.433765 / 0.255139 (0.178626) 0.449221 / 0.283200 (0.166021) 0.037636 / 0.141683 (-0.104047) 1.825855 / 1.452155 (0.373700) 1.889665 / 1.492716 (0.396948)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.319622 / 0.018006 (0.301616) 0.588878 / 0.000490 (0.588388) 0.017790 / 0.000200 (0.017590) 0.000532 / 0.000054 (0.000477)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.031152 / 0.037411 (-0.006259) 0.093808 / 0.014526 (0.079282) 0.119296 / 0.176557 (-0.057261) 0.181845 / 0.737135 (-0.555291) 0.108527 / 0.296338 (-0.187811)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.575106 / 0.215209 (0.359896) 5.776322 / 2.077655 (3.698668) 2.592913 / 1.504120 (1.088793) 2.389481 / 1.541195 (0.848286) 2.390117 / 1.468490 (0.921627) 0.852420 / 4.584777 (-3.732357) 5.474171 / 3.745712 (1.728459) 4.967188 / 5.269862 (-0.302674) 3.053712 / 4.565676 (-1.511965) 0.098128 / 0.424275 (-0.326147) 0.008722 / 0.007607 (0.001115) 0.699838 / 0.226044 (0.473794) 7.103622 / 2.268929 (4.834693) 3.359326 / 55.444624 (-52.085299) 2.733943 / 6.876477 (-4.142534) 2.770001 / 2.142072 (0.627929) 1.058217 / 4.805227 (-3.747011) 0.215845 / 6.500664 (-6.284820) 0.078532 / 0.075469 (0.003063)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.633173 / 1.841788 (-0.208614) 23.795045 / 8.074308 (15.720737) 21.094433 / 10.191392 (10.903041) 0.234522 / 0.680424 (-0.445902) 0.033632 / 0.534201 (-0.500569) 0.496701 / 0.579283 (-0.082582) 0.626861 / 0.434364 (0.192497) 0.558267 / 0.540337 (0.017930) 0.807461 / 1.386936 (-0.579475)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009136 / 0.011353 (-0.002217) 0.005425 / 0.011008 (-0.005584) 0.081478 / 0.038508 (0.042970) 0.077240 / 0.023109 (0.054130) 0.512156 / 0.275898 (0.236258) 0.561593 / 0.323480 (0.238113) 0.006499 / 0.007986 (-0.001486) 0.004080 / 0.004328 (-0.000248) 0.082121 / 0.004250 (0.077870) 0.063774 / 0.037052 (0.026722) 0.509801 / 0.258489 (0.251312) 0.572826 / 0.293841 (0.278985) 0.050969 / 0.128546 (-0.077578) 0.014876 / 0.075646 (-0.060771) 0.094815 / 0.419271 (-0.324456) 0.063904 / 0.043533 (0.020371) 0.530572 / 0.255139 (0.275433) 0.545940 / 0.283200 (0.262741) 0.036729 / 0.141683 (-0.104954) 1.799493 / 1.452155 (0.347339) 1.931955 / 1.492716 (0.439239)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.291405 / 0.018006 (0.273398) 0.590257 / 0.000490 (0.589767) 0.008394 / 0.000200 (0.008194) 0.000112 / 0.000054 (0.000058)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.037613 / 0.037411 (0.000201) 0.103136 / 0.014526 (0.088610) 0.121744 / 0.176557 (-0.054813) 0.198503 / 0.737135 (-0.538632) 0.120183 / 0.296338 (-0.176156)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.659872 / 0.215209 (0.444663) 6.616775 / 2.077655 (4.539120) 3.031679 / 1.504120 (1.527559) 2.743489 / 1.541195 (1.202294) 2.786786 / 1.468490 (1.318296) 0.866625 / 4.584777 (-3.718152) 5.637705 / 3.745712 (1.891993) 4.702563 / 5.269862 (-0.567298) 3.017797 / 4.565676 (-1.547879) 0.100107 / 0.424275 (-0.324169) 0.008443 / 0.007607 (0.000836) 0.791385 / 0.226044 (0.565341) 7.869504 / 2.268929 (5.600576) 3.856634 / 55.444624 (-51.587991) 3.140089 / 6.876477 (-3.736388) 3.489339 / 2.142072 (1.347267) 1.132170 / 4.805227 (-3.673058) 0.219630 / 6.500664 (-6.281034) 0.082289 / 0.075469 (0.006820)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.781902 / 1.841788 (-0.059885) 24.912604 / 8.074308 (16.838296) 21.626512 / 10.191392 (11.435120) 0.228194 / 0.680424 (-0.452230) 0.032799 / 0.534201 (-0.501402) 0.483683 / 0.579283 (-0.095600) 0.604966 / 0.434364 (0.170602) 0.617278 / 0.540337 (0.076940) 0.887337 / 1.386936 (-0.499599)

@mariosasko
Collaborator Author

I used this Colab to test the new push_to_hub on a large dataset (55 GB). It works great.

One thing that could be improved is the performance of dataset.data.nbytes: it takes ≈ 3 minutes to compute for the dataset in question (50k array chunks per column). It probably makes sense to store larger chunks locally, but this can be addressed in a subsequent PR.

@mariosasko mariosasko requested review from lhoestq and Wauplin October 9, 2023 18:31
Member

@lhoestq lhoestq left a comment


Awesome!

I just added some comments. My main concerns are:

  • a single commit can fail (time out) if there are too many operations, so we might have to do multiple commits anyway in that case
  • how to let users resume a push_to_hub that failed midway, for example because of a connection error
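
The first concern can be illustrated with a small sketch: if one commit with too many operations risks timing out, the additions can be split into fixed-size batches, with one commit per batch. This is only an illustration under stated assumptions, not the PR's implementation. The helper `batched` and the shard paths are hypothetical, and the batch size of 50 is the per-commit limit quoted in the PR description.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# Hypothetical shard paths standing in for the files added by one push.
shard_paths = [f"data/train-{i:05d}-of-00120.parquet" for i in range(120)]

# One commit per batch; 50 files per commit is the limit quoted in the PR description.
commit_batches = list(batched(shard_paths, batch_size=50))
for index, batch in enumerate(commit_batches):
    print(f"commit {index}: {len(batch)} files")
```

Keeping the batches in order means that a failure in a later commit leaves the earlier commits intact, which is also what makes a rerun cheap.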

.github/workflows/ci.yml (outdated)
.github/workflows/ci.yml (outdated)
setup.py (outdated)
src/datasets/arrow_dataset.py (outdated)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@github-actions
PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007190 / 0.011353 (-0.004163) 0.004394 / 0.011008 (-0.006614) 0.085506 / 0.038508 (0.046998) 0.092177 / 0.023109 (0.069068) 0.351636 / 0.275898 (0.075738) 0.389716 / 0.323480 (0.066236) 0.004443 / 0.007986 (-0.003543) 0.003641 / 0.004328 (-0.000687) 0.066578 / 0.004250 (0.062328) 0.061399 / 0.037052 (0.024346) 0.356008 / 0.258489 (0.097519) 0.398677 / 0.293841 (0.104836) 0.031958 / 0.128546 (-0.096588) 0.008857 / 0.075646 (-0.066789) 0.289613 / 0.419271 (-0.129659) 0.053555 / 0.043533 (0.010022) 0.349268 / 0.255139 (0.094129) 0.368666 / 0.283200 (0.085466) 0.028267 / 0.141683 (-0.113416) 1.502857 / 1.452155 (0.050702) 1.598422 / 1.492716 (0.105705)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.319938 / 0.018006 (0.301931) 0.566925 / 0.000490 (0.566435) 0.014625 / 0.000200 (0.014425) 0.000372 / 0.000054 (0.000318)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030156 / 0.037411 (-0.007255) 0.083128 / 0.014526 (0.068602) 0.101435 / 0.176557 (-0.075122) 0.158971 / 0.737135 (-0.578165) 0.101488 / 0.296338 (-0.194851)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.383904 / 0.215209 (0.168695) 3.829201 / 2.077655 (1.751546) 1.815224 / 1.504120 (0.311104) 1.647865 / 1.541195 (0.106670) 1.738411 / 1.468490 (0.269921) 0.484963 / 4.584777 (-4.099814) 3.494811 / 3.745712 (-0.250901) 3.505811 / 5.269862 (-1.764051) 2.115467 / 4.565676 (-2.450210) 0.057271 / 0.424275 (-0.367004) 0.007285 / 0.007607 (-0.000322) 0.467162 / 0.226044 (0.241118) 4.661572 / 2.268929 (2.392643) 2.330443 / 55.444624 (-53.114182) 1.986116 / 6.876477 (-4.890361) 2.055350 / 2.142072 (-0.086723) 0.580369 / 4.805227 (-4.224858) 0.132700 / 6.500664 (-6.367964) 0.061219 / 0.075469 (-0.014251)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.270843 / 1.841788 (-0.570945) 19.870723 / 8.074308 (11.796415) 14.368932 / 10.191392 (4.177540) 0.167345 / 0.680424 (-0.513079) 0.018358 / 0.534201 (-0.515843) 0.390833 / 0.579283 (-0.188450) 0.419884 / 0.434364 (-0.014480) 0.465683 / 0.540337 (-0.074655) 0.646101 / 1.386936 (-0.740835)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007027 / 0.011353 (-0.004326) 0.004578 / 0.011008 (-0.006430) 0.066468 / 0.038508 (0.027960) 0.081576 / 0.023109 (0.058466) 0.414928 / 0.275898 (0.139030) 0.452130 / 0.323480 (0.128651) 0.005861 / 0.007986 (-0.002124) 0.003740 / 0.004328 (-0.000588) 0.066943 / 0.004250 (0.062692) 0.060100 / 0.037052 (0.023048) 0.418697 / 0.258489 (0.160208) 0.466604 / 0.293841 (0.172764) 0.031887 / 0.128546 (-0.096660) 0.009119 / 0.075646 (-0.066527) 0.072285 / 0.419271 (-0.346986) 0.047599 / 0.043533 (0.004066) 0.410791 / 0.255139 (0.155652) 0.434182 / 0.283200 (0.150982) 0.024799 / 0.141683 (-0.116884) 1.500310 / 1.452155 (0.048155) 1.567151 / 1.492716 (0.074434)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.322482 / 0.018006 (0.304476) 0.550234 / 0.000490 (0.549744) 0.007796 / 0.000200 (0.007596) 0.000088 / 0.000054 (0.000033)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.036013 / 0.037411 (-0.001398) 0.098482 / 0.014526 (0.083956) 0.111641 / 0.176557 (-0.064916) 0.166251 / 0.737135 (-0.570884) 0.112426 / 0.296338 (-0.183912)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.429181 / 0.215209 (0.213972) 4.273126 / 2.077655 (2.195472) 2.277440 / 1.504120 (0.773321) 2.112567 / 1.541195 (0.571372) 2.224118 / 1.468490 (0.755628) 0.488876 / 4.584777 (-4.095901) 3.711638 / 3.745712 (-0.034074) 3.480995 / 5.269862 (-1.788867) 2.122114 / 4.565676 (-2.443563) 0.057538 / 0.424275 (-0.366737) 0.007416 / 0.007607 (-0.000191) 0.506881 / 0.226044 (0.280836) 5.067601 / 2.268929 (2.798672) 2.769216 / 55.444624 (-52.675408) 2.420448 / 6.876477 (-4.456029) 2.694225 / 2.142072 (0.552153) 0.588911 / 4.805227 (-4.216316) 0.133542 / 6.500664 (-6.367122) 0.061135 / 0.075469 (-0.014334)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.378029 / 1.841788 (-0.463758) 20.660942 / 8.074308 (12.586634) 15.725969 / 10.191392 (5.534577) 0.169078 / 0.680424 (-0.511346) 0.020540 / 0.534201 (-0.513661) 0.399409 / 0.579283 (-0.179874) 0.432572 / 0.434364 (-0.001792) 0.477106 / 0.540337 (-0.063231) 0.675593 / 1.386936 (-0.711343)

@mariosasko
Collaborator Author

mariosasko commented Oct 10, 2023

@lhoestq

> a single commit can fail (time out) if there are too many operations, so we might have to do multiple commits anyway in that case

Multiple commits complicate the logic significantly. Maybe we could keep things simple and emit a warning if there are more than 100 additions (we can suggest increasing max_shard_size in that case). Additionally, we can set the default max_shard_size to a higher value, e.g., 5GB. With 100 shards of 5GB each, that handles up to 500GB of data in the default case, which seems reasonable. In rare cases where this is a problem, one could increase max_shard_size even further (if RAM is not a limiting factor) or use to_parquet + huggingface_hub (we could have a docstring or a doc note that explains this).

Note that we split the dataset based on the Arrow data size, which means Parquet shards will be considerably smaller unless there are binary fields such as image JPEGs in the dataset, which are hard to compress efficiently.
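
The arithmetic behind the proposed threshold can be sketched as follows. This is a rough illustration, not code from the PR: `estimate_num_shards` and `check_num_additions` are hypothetical helpers, sizes are Arrow data sizes in bytes (so the resulting Parquet files would usually be smaller), and a decimal gigabyte is assumed.

```python
import warnings

GB = 1000**3  # decimal gigabyte, assumed here for simplicity


def estimate_num_shards(dataset_nbytes: int, max_shard_size: int) -> int:
    """Return how many shards are needed so that none exceeds `max_shard_size` bytes."""
    if max_shard_size < 1:
        raise ValueError("max_shard_size must be positive")
    # Integer ceiling division, with a minimum of one shard.
    return max(1, (dataset_nbytes + max_shard_size - 1) // max_shard_size)


def check_num_additions(num_shards: int, limit: int = 100) -> bool:
    """Warn and return True when one commit would carry more than `limit` additions."""
    if num_shards > limit:
        warnings.warn(
            f"{num_shards} files in one commit exceeds {limit}; "
            "consider a larger max_shard_size."
        )
        return True
    return False


# 500 GB at 5 GB per shard gives exactly 100 shards, the proposed threshold.
at_limit = estimate_num_shards(500 * GB, 5 * GB)
# The 55 GB dataset used for the Colab test would need 11 shards at 5 GB each.
colab_case = estimate_num_shards(55 * GB, 5 * GB)
print(at_limit, colab_case)
```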

> how to let users resume a push_to_hub that failed midway, for example because of a connection error

They can resume by rerunning the failed push_to_hub.

preupload_lfs_files will be instant for the files that were already uploaded, as explained in huggingface/huggingface_hub#1699 (comment)
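
Why a rerun can act as a resume can be shown with a toy model: if the remote side remembers which file contents it already holds, a second run only transfers what is missing. Everything below is hypothetical (`FakeRemote`, `preupload`, the shard contents) and does not call `huggingface_hub`; the actual behavior of preupload_lfs_files is the one described in the linked comment.

```python
import hashlib
from typing import Dict, List


class FakeRemote:
    """Toy stand-in for a content-addressed store; not the Hub API."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.transfers = 0  # number of uploads that actually sent bytes

    def has(self, digest: str) -> bool:
        return digest in self.blobs

    def put(self, digest: str, content: bytes) -> None:
        self.blobs[digest] = content
        self.transfers += 1


def preupload(remote: FakeRemote, files: List[bytes], fail_after: int = -1) -> None:
    """Upload each file unless the remote already has it; optionally fail midway."""
    for index, content in enumerate(files):
        if index == fail_after:
            raise ConnectionError("simulated network failure")
        digest = hashlib.sha256(content).hexdigest()
        if not remote.has(digest):
            remote.put(digest, content)


remote = FakeRemote()
shards = [f"shard-{i}".encode() for i in range(5)]

try:
    preupload(remote, shards, fail_after=3)  # first run fails before the fourth shard
except ConnectionError:
    pass
transfers_after_failure = remote.transfers

preupload(remote, shards)  # rerun: shards already stored are skipped
print(transfers_after_failure, remote.transfers)
```

In this model the rerun costs one hash and one lookup per shard that is already stored, so only the shards missing after the failure are sent again.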

@mariosasko mariosasko marked this pull request as ready for review October 10, 2023 13:21
@lhoestq
Member

lhoestq commented Oct 10, 2023

> Multiple commits complicate the logic significantly. Maybe we could keep things simple and emit a warning if there are more than 100 additions (we can suggest increasing max_shard_size in that case)

I don't think we can do that: many people are uploading datasets with 100+ files, and it would break their workflow.

@github-actions
PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006834 / 0.011353 (-0.004519) 0.004424 / 0.011008 (-0.006584) 0.085199 / 0.038508 (0.046691) 0.080237 / 0.023109 (0.057128) 0.308800 / 0.275898 (0.032902) 0.346314 / 0.323480 (0.022835) 0.004399 / 0.007986 (-0.003586) 0.003773 / 0.004328 (-0.000556) 0.065886 / 0.004250 (0.061636) 0.057830 / 0.037052 (0.020777) 0.312035 / 0.258489 (0.053546) 0.362646 / 0.293841 (0.068805) 0.031223 / 0.128546 (-0.097323) 0.008851 / 0.075646 (-0.066795) 0.288264 / 0.419271 (-0.131007) 0.052600 / 0.043533 (0.009067) 0.316127 / 0.255139 (0.060988) 0.328539 / 0.283200 (0.045340) 0.026068 / 0.141683 (-0.115615) 1.458928 / 1.452155 (0.006773) 1.547619 / 1.492716 (0.054902)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.274382 / 0.018006 (0.256375) 0.591192 / 0.000490 (0.590703) 0.009290 / 0.000200 (0.009090) 0.000327 / 0.000054 (0.000273)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.031428 / 0.037411 (-0.005983) 0.087523 / 0.014526 (0.072997) 0.101427 / 0.176557 (-0.075130) 0.159228 / 0.737135 (-0.577907) 0.101430 / 0.296338 (-0.194909)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.393914 / 0.215209 (0.178705) 3.917323 / 2.077655 (1.839668) 1.940577 / 1.504120 (0.436457) 1.760996 / 1.541195 (0.219801) 1.865858 / 1.468490 (0.397368) 0.488920 / 4.584777 (-4.095857) 3.513465 / 3.745712 (-0.232248) 3.506600 / 5.269862 (-1.763261) 2.072583 / 4.565676 (-2.493093) 0.058256 / 0.424275 (-0.366019) 0.007420 / 0.007607 (-0.000187) 0.467241 / 0.226044 (0.241197) 4.671470 / 2.268929 (2.402542) 2.422717 / 55.444624 (-53.021908) 2.069501 / 6.876477 (-4.806975) 2.159257 / 2.142072 (0.017184) 0.583808 / 4.805227 (-4.221419) 0.134160 / 6.500664 (-6.366504) 0.068855 / 0.075469 (-0.006614)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.305299 / 1.841788 (-0.536488) 19.913902 / 8.074308 (11.839593) 14.708057 / 10.191392 (4.516665) 0.160113 / 0.680424 (-0.520311) 0.018431 / 0.534201 (-0.515770) 0.396147 / 0.579283 (-0.183136) 0.411738 / 0.434364 (-0.022626) 0.459297 / 0.540337 (-0.081041) 0.636599 / 1.386936 (-0.750337)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006936 / 0.011353 (-0.004417) 0.004290 / 0.011008 (-0.006718) 0.065754 / 0.038508 (0.027246) 0.080655 / 0.023109 (0.057546) 0.399701 / 0.275898 (0.123803) 0.435999 / 0.323480 (0.112519) 0.005690 / 0.007986 (-0.002295) 0.003580 / 0.004328 (-0.000748) 0.065685 / 0.004250 (0.061434) 0.059299 / 0.037052 (0.022246) 0.404295 / 0.258489 (0.145806) 0.438745 / 0.293841 (0.144904) 0.032241 / 0.128546 (-0.096305) 0.008699 / 0.075646 (-0.066947) 0.072053 / 0.419271 (-0.347218) 0.047489 / 0.043533 (0.003956) 0.395638 / 0.255139 (0.140499) 0.417224 / 0.283200 (0.134025) 0.022734 / 0.141683 (-0.118949) 1.507519 / 1.452155 (0.055364) 1.570459 / 1.492716 (0.077743)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.260442 / 0.018006 (0.242435) 0.551933 / 0.000490 (0.551444) 0.005240 / 0.000200 (0.005040) 0.000097 / 0.000054 (0.000042)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.033718 / 0.037411 (-0.003694) 0.095710 / 0.014526 (0.081184) 0.109970 / 0.176557 (-0.066586) 0.167930 / 0.737135 (-0.569205) 0.109977 / 0.296338 (-0.186362)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.430067 / 0.215209 (0.214857) 4.292564 / 2.077655 (2.214910) 2.313511 / 1.504120 (0.809391) 2.158153 / 1.541195 (0.616959) 2.262486 / 1.468490 (0.793996) 0.492376 / 4.584777 (-4.092401) 3.622287 / 3.745712 (-0.123425) 3.380162 / 5.269862 (-1.889699) 2.111874 / 4.565676 (-2.453803) 0.057882 / 0.424275 (-0.366393) 0.007317 / 0.007607 (-0.000290) 0.504722 / 0.226044 (0.278678) 5.039009 / 2.268929 (2.770080) 2.772162 / 55.444624 (-52.672463) 2.430928 / 6.876477 (-4.445549) 2.666556 / 2.142072 (0.524484) 0.586722 / 4.805227 (-4.218505) 0.133780 / 6.500664 (-6.366884) 0.060269 / 0.075469 (-0.015200)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.339064 / 1.841788 (-0.502724) 20.743931 / 8.074308 (12.669623) 15.491066 / 10.191392 (5.299674) 0.159236 / 0.680424 (-0.521188) 0.020722 / 0.534201 (-0.513479) 0.399440 / 0.579283 (-0.179843) 0.424501 / 0.434364 (-0.009863) 0.474026 / 0.540337 (-0.066311) 0.685239 / 1.386936 (-0.701697)

@mariosasko mariosasko requested a review from lhoestq October 11, 2023 18:16
@github-actions
PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.005930 / 0.011353 (-0.005422) 0.003496 / 0.011008 (-0.007512) 0.079631 / 0.038508 (0.041123) 0.058250 / 0.023109 (0.035141) 0.310108 / 0.275898 (0.034210) 0.352747 / 0.323480 (0.029267) 0.005367 / 0.007986 (-0.002619) 0.002943 / 0.004328 (-0.001386) 0.062449 / 0.004250 (0.058199) 0.046433 / 0.037052 (0.009381) 0.311020 / 0.258489 (0.052531) 0.361033 / 0.293841 (0.067192) 0.027419 / 0.128546 (-0.101128) 0.008073 / 0.075646 (-0.067574) 0.261403 / 0.419271 (-0.157869) 0.045059 / 0.043533 (0.001527) 0.310622 / 0.255139 (0.055483) 0.344361 / 0.283200 (0.061161) 0.020561 / 0.141683 (-0.121122) 1.427409 / 1.452155 (-0.024746) 1.506612 / 1.492716 (0.013896)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.234095 / 0.018006 (0.216089) 0.432603 / 0.000490 (0.432113) 0.010283 / 0.000200 (0.010083) 0.000289 / 0.000054 (0.000235)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.024263 / 0.037411 (-0.013148) 0.073672 / 0.014526 (0.059146) 0.084080 / 0.176557 (-0.092476) 0.146679 / 0.737135 (-0.590457) 0.084337 / 0.296338 (-0.212001)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.434297 / 0.215209 (0.219088) 4.358287 / 2.077655 (2.280633) 2.268461 / 1.504120 (0.764341) 2.107924 / 1.541195 (0.566729) 2.165136 / 1.468490 (0.696646) 0.498421 / 4.584777 (-4.086356) 3.094414 / 3.745712 (-0.651298) 2.991511 / 5.269862 (-2.278351) 1.998052 / 4.565676 (-2.567624) 0.057363 / 0.424275 (-0.366912) 0.006405 / 0.007607 (-0.001203) 0.508396 / 0.226044 (0.282351) 5.104756 / 2.268929 (2.835828) 2.720462 / 55.444624 (-52.724163) 2.391840 / 6.876477 (-4.484637) 2.443063 / 2.142072 (0.300991) 0.590015 / 4.805227 (-4.215212) 0.125414 / 6.500664 (-6.375250) 0.061122 / 0.075469 (-0.014347)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.221883 / 1.841788 (-0.619904) 17.788248 / 8.074308 (9.713940) 13.753315 / 10.191392 (3.561923) 0.146388 / 0.680424 (-0.534036) 0.017038 / 0.534201 (-0.517163) 0.339162 / 0.579283 (-0.240121) 0.372054 / 0.434364 (-0.062309) 0.381507 / 0.540337 (-0.158830) 0.538603 / 1.386936 (-0.848333)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006044 / 0.011353 (-0.005309) 0.003654 / 0.011008 (-0.007354) 0.062956 / 0.038508 (0.024448) 0.061325 / 0.023109 (0.038216) 0.450006 / 0.275898 (0.174108) 0.474560 / 0.323480 (0.151080) 0.004846 / 0.007986 (-0.003140) 0.002904 / 0.004328 (-0.001425) 0.064206 / 0.004250 (0.059956) 0.047850 / 0.037052 (0.010798) 0.448431 / 0.258489 (0.189942) 0.481363 / 0.293841 (0.187523) 0.028622 / 0.128546 (-0.099925) 0.008255 / 0.075646 (-0.067391) 0.068461 / 0.419271 (-0.350810) 0.040234 / 0.043533 (-0.003299) 0.447396 / 0.255139 (0.192257) 0.465383 / 0.283200 (0.182184) 0.021864 / 0.141683 (-0.119819) 1.402197 / 1.452155 (-0.049957) 1.475337 / 1.492716 (-0.017379)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.227093 / 0.018006 (0.209087) 0.407908 / 0.000490 (0.407419) 0.006709 / 0.000200 (0.006509) 0.000076 / 0.000054 (0.000022)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.026560 / 0.037411 (-0.010851) 0.080926 / 0.014526 (0.066400) 0.091531 / 0.176557 (-0.085026) 0.145742 / 0.737135 (-0.591393) 0.092203 / 0.296338 (-0.204135)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.473029 / 0.215209 (0.257820) 4.703613 / 2.077655 (2.625958) 2.642622 / 1.504120 (1.138502) 2.465376 / 1.541195 (0.924181) 2.510125 / 1.468490 (1.041635) 0.512606 / 4.584777 (-4.072171) 3.132127 / 3.745712 (-0.613585) 2.890098 / 5.269862 (-2.379763) 1.908140 / 4.565676 (-2.657537) 0.058938 / 0.424275 (-0.365337) 0.006486 / 0.007607 (-0.001121) 0.542279 / 0.226044 (0.316235) 5.435621 / 2.268929 (3.166693) 3.083943 / 55.444624 (-52.360681) 2.761575 / 6.876477 (-4.114901) 2.919672 / 2.142072 (0.777599) 0.608022 / 4.805227 (-4.197205) 0.126821 / 6.500664 (-6.373843) 0.061374 / 0.075469 (-0.014095)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.348848 / 1.841788 (-0.492940) 18.323507 / 8.074308 (10.249199) 14.713411 / 10.191392 (4.522019) 0.155277 / 0.680424 (-0.525146) 0.017739 / 0.534201 (-0.516462) 0.337357 / 0.579283 (-0.241926) 0.376519 / 0.434364 (-0.057844) 0.398011 / 0.540337 (-0.142327) 0.589797 / 1.386936 (-0.797139)

Member

@lhoestq lhoestq left a comment

Awesome ! I love it :)

src/datasets/arrow_dataset.py (outdated, resolved)
src/datasets/dataset_dict.py (outdated, resolved)
tests/test_upstream_hub.py (resolved)
src/datasets/dataset_dict.py (outdated, resolved)
src/datasets/arrow_dataset.py (resolved)
src/datasets/dataset_dict.py (resolved)
repo_files = list(set(files) - set(data_files_to_delete))
shard_path_in_repo = f"{data_dir}/{split}-{index:05d}-of-{num_shards:05d}.parquet"
buffer = BytesIO()
shard.to_parquet(buffer)
Member

(maybe for another PR)

we could show the tqdm bar of the parquet conversion only if it takes more than 5 seconds, using the "delay" argument in tqdm

Collaborator Author

Makes sense. I think we can address this in a later PR (I think our entire logging requires a little overhaul)
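
For reference, a minimal sketch of the "delay" idea discussed in this thread, not the code from this PR. It assumes a tqdm version recent enough to accept the `delay` keyword. `fake_to_parquet` and `convert_shards` are hypothetical names made up for illustration, and the sketch wraps its own loop rather than the bar created inside `to_parquet`.

```python
from io import BytesIO

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm installed (no bar is shown).
    def tqdm(iterable, **kwargs):
        return iterable


def fake_to_parquet(shard, buffer):
    """Hypothetical stand-in for `shard.to_parquet(buffer)`; writes raw bytes."""
    buffer.write(bytes(shard))


def convert_shards(shards, delay=5.0):
    """Serialize each shard into an in-memory buffer.

    `delay` is forwarded to tqdm, which holds back the progress bar
    until that many seconds have elapsed.
    """
    buffers = []
    for shard in tqdm(shards, desc="Converting shards", delay=delay):
        buffer = BytesIO()
        fake_to_parquet(shard, buffer)
        buffers.append(buffer)
    return buffers


buffers = convert_shards([[1, 2], [3, 4, 5]])
print([buf.getvalue() for buf in buffers])
```

With a threshold like this, quick conversions should stay quiet while long ones still report progress, which is the behavior suggested above.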

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@github-actions

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007823 / 0.011353 (-0.003530) 0.004136 / 0.011008 (-0.006872) 0.087282 / 0.038508 (0.048774) 0.086352 / 0.023109 (0.063243) 0.328107 / 0.275898 (0.052209) 0.368717 / 0.323480 (0.045237) 0.005452 / 0.007986 (-0.002533) 0.003460 / 0.004328 (-0.000868) 0.064360 / 0.004250 (0.060110) 0.062215 / 0.037052 (0.025162) 0.334666 / 0.258489 (0.076177) 0.388688 / 0.293841 (0.094847) 0.031093 / 0.128546 (-0.097454) 0.008510 / 0.075646 (-0.067137) 0.295965 / 0.419271 (-0.123306) 0.052858 / 0.043533 (0.009325) 0.320104 / 0.255139 (0.064965) 0.346761 / 0.283200 (0.063562) 0.024864 / 0.141683 (-0.116819) 1.483164 / 1.452155 (0.031010) 1.580363 / 1.492716 (0.087647)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.243523 / 0.018006 (0.225516) 0.459741 / 0.000490 (0.459251) 0.010508 / 0.000200 (0.010308) 0.000384 / 0.000054 (0.000330)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.029896 / 0.037411 (-0.007515) 0.089150 / 0.014526 (0.074624) 0.098855 / 0.176557 (-0.077702) 0.154469 / 0.737135 (-0.582667) 0.099546 / 0.296338 (-0.196792)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.403547 / 0.215209 (0.188338) 4.036711 / 2.077655 (1.959056) 2.030882 / 1.504120 (0.526762) 1.850432 / 1.541195 (0.309238) 1.924248 / 1.468490 (0.455758) 0.493153 / 4.584777 (-4.091624) 3.634074 / 3.745712 (-0.111638) 3.546145 / 5.269862 (-1.723717) 2.120819 / 4.565676 (-2.444858) 0.057137 / 0.424275 (-0.367138) 0.007454 / 0.007607 (-0.000153) 0.481687 / 0.226044 (0.255642) 4.813203 / 2.268929 (2.544275) 2.481260 / 55.444624 (-52.963364) 2.194185 / 6.876477 (-4.682292) 2.255381 / 2.142072 (0.113308) 0.575160 / 4.805227 (-4.230068) 0.132310 / 6.500664 (-6.368355) 0.061917 / 0.075469 (-0.013553)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.265722 / 1.841788 (-0.576066) 19.949624 / 8.074308 (11.875315) 14.804356 / 10.191392 (4.612964) 0.170485 / 0.680424 (-0.509939) 0.018831 / 0.534201 (-0.515370) 0.407051 / 0.579283 (-0.172233) 0.420560 / 0.434364 (-0.013804) 0.470721 / 0.540337 (-0.069616) 0.651665 / 1.386936 (-0.735271)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007113 / 0.011353 (-0.004240) 0.004186 / 0.011008 (-0.006822) 0.065082 / 0.038508 (0.026574) 0.080275 / 0.023109 (0.057166) 0.393460 / 0.275898 (0.117562) 0.426702 / 0.323480 (0.103223) 0.005639 / 0.007986 (-0.002347) 0.003492 / 0.004328 (-0.000836) 0.065774 / 0.004250 (0.061523) 0.059708 / 0.037052 (0.022656) 0.395598 / 0.258489 (0.137109) 0.437088 / 0.293841 (0.143247) 0.033165 / 0.128546 (-0.095381) 0.008559 / 0.075646 (-0.067087) 0.071782 / 0.419271 (-0.347490) 0.048672 / 0.043533 (0.005139) 0.393883 / 0.255139 (0.138744) 0.412817 / 0.283200 (0.129617) 0.024115 / 0.141683 (-0.117568) 1.522752 / 1.452155 (0.070597) 1.577311 / 1.492716 (0.084595)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.225569 / 0.018006 (0.207563) 0.460310 / 0.000490 (0.459820) 0.004733 / 0.000200 (0.004533) 0.000115 / 0.000054 (0.000060)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.035241 / 0.037411 (-0.002170) 0.098092 / 0.014526 (0.083566) 0.108025 / 0.176557 (-0.068531) 0.162910 / 0.737135 (-0.574225) 0.108649 / 0.296338 (-0.187689)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.441723 / 0.215209 (0.226514) 4.400656 / 2.077655 (2.323001) 2.413588 / 1.504120 (0.909468) 2.261890 / 1.541195 (0.720696) 2.420878 / 1.468490 (0.952388) 0.496456 / 4.584777 (-4.088321) 3.679930 / 3.745712 (-0.065782) 3.390539 / 5.269862 (-1.879322) 2.109599 / 4.565676 (-2.456078) 0.058896 / 0.424275 (-0.365379) 0.007483 / 0.007607 (-0.000125) 0.521108 / 0.226044 (0.295064) 5.209468 / 2.268929 (2.940540) 2.948595 / 55.444624 (-52.496029) 2.658864 / 6.876477 (-4.217613) 2.913653 / 2.142072 (0.771580) 0.602776 / 4.805227 (-4.202451) 0.136166 / 6.500664 (-6.364498) 0.063812 / 0.075469 (-0.011657)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.350306 / 1.841788 (-0.491482) 20.453980 / 8.074308 (12.379672) 15.758719 / 10.191392 (5.567327) 0.165847 / 0.680424 (-0.514577) 0.020254 / 0.534201 (-0.513947) 0.400006 / 0.579283 (-0.179277) 0.440336 / 0.434364 (0.005972) 0.480122 / 0.540337 (-0.060215) 0.688994 / 1.386936 (-0.697942)

@github-actions

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008633 / 0.011353 (-0.002720) 0.004851 / 0.011008 (-0.006157) 0.100647 / 0.038508 (0.062139) 0.084701 / 0.023109 (0.061592) 0.410489 / 0.275898 (0.134590) 0.440231 / 0.323480 (0.116751) 0.004679 / 0.007986 (-0.003307) 0.004172 / 0.004328 (-0.000157) 0.079911 / 0.004250 (0.075661) 0.069537 / 0.037052 (0.032485) 0.423506 / 0.258489 (0.165017) 0.466098 / 0.293841 (0.172257) 0.048773 / 0.128546 (-0.079773) 0.014446 / 0.075646 (-0.061200) 0.342776 / 0.419271 (-0.076495) 0.065672 / 0.043533 (0.022139) 0.411845 / 0.255139 (0.156706) 0.466662 / 0.283200 (0.183462) 0.035752 / 0.141683 (-0.105931) 1.684956 / 1.452155 (0.232801) 1.832173 / 1.492716 (0.339456)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.250744 / 0.018006 (0.232738) 0.528860 / 0.000490 (0.528371) 0.013301 / 0.000200 (0.013101) 0.000413 / 0.000054 (0.000359)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.032376 / 0.037411 (-0.005035) 0.094630 / 0.014526 (0.080104) 0.107163 / 0.176557 (-0.069394) 0.172503 / 0.737135 (-0.564633) 0.108407 / 0.296338 (-0.187932)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.671251 / 0.215209 (0.456042) 6.235361 / 2.077655 (4.157706) 2.650328 / 1.504120 (1.146208) 2.341199 / 1.541195 (0.800004) 2.368803 / 1.468490 (0.900313) 0.841347 / 4.584777 (-3.743430) 5.042508 / 3.745712 (1.296796) 4.807565 / 5.269862 (-0.462296) 3.007420 / 4.565676 (-1.558257) 0.099953 / 0.424275 (-0.324322) 0.008412 / 0.007607 (0.000805) 0.747803 / 0.226044 (0.521759) 7.481245 / 2.268929 (5.212316) 3.416157 / 55.444624 (-52.028467) 2.724608 / 6.876477 (-4.151869) 2.832982 / 2.142072 (0.690910) 1.072423 / 4.805227 (-3.732804) 0.211314 / 6.500664 (-6.289351) 0.074098 / 0.075469 (-0.001371)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.566010 / 1.841788 (-0.275778) 23.137708 / 8.074308 (15.063400) 21.440132 / 10.191392 (11.248740) 0.230713 / 0.680424 (-0.449711) 0.028271 / 0.534201 (-0.505930) 0.450821 / 0.579283 (-0.128463) 0.548399 / 0.434364 (0.114035) 0.543588 / 0.540337 (0.003250) 0.805522 / 1.386936 (-0.581414)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008969 / 0.011353 (-0.002384) 0.004793 / 0.011008 (-0.006216) 0.075804 / 0.038508 (0.037296) 0.079893 / 0.023109 (0.056783) 0.464358 / 0.275898 (0.188460) 0.507243 / 0.323480 (0.183763) 0.005945 / 0.007986 (-0.002040) 0.005341 / 0.004328 (0.001012) 0.077952 / 0.004250 (0.073701) 0.059965 / 0.037052 (0.022913) 0.478947 / 0.258489 (0.220458) 0.528444 / 0.293841 (0.234603) 0.052878 / 0.128546 (-0.075668) 0.013939 / 0.075646 (-0.061707) 0.087351 / 0.419271 (-0.331920) 0.058448 / 0.043533 (0.014916) 0.478664 / 0.255139 (0.223525) 0.491239 / 0.283200 (0.208039) 0.032674 / 0.141683 (-0.109008) 1.753911 / 1.452155 (0.301756) 1.858923 / 1.492716 (0.366206)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.239278 / 0.018006 (0.221271) 0.507372 / 0.000490 (0.506882) 0.005489 / 0.000200 (0.005289) 0.000142 / 0.000054 (0.000087)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.032919 / 0.037411 (-0.004493) 0.097726 / 0.014526 (0.083200) 0.119159 / 0.176557 (-0.057398) 0.174545 / 0.737135 (-0.562590) 0.115319 / 0.296338 (-0.181020)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.627107 / 0.215209 (0.411898) 6.211925 / 2.077655 (4.134270) 2.731484 / 1.504120 (1.227365) 2.488847 / 1.541195 (0.947652) 2.372445 / 1.468490 (0.903955) 0.822663 / 4.584777 (-3.762114) 4.924001 / 3.745712 (1.178289) 4.371161 / 5.269862 (-0.898700) 2.850314 / 4.565676 (-1.715363) 0.099156 / 0.424275 (-0.325119) 0.007941 / 0.007607 (0.000334) 0.721539 / 0.226044 (0.495495) 7.260874 / 2.268929 (4.991946) 3.351072 / 55.444624 (-52.093552) 2.757115 / 6.876477 (-4.119362) 2.858899 / 2.142072 (0.716827) 0.994054 / 4.805227 (-3.811173) 0.209186 / 6.500664 (-6.291478) 0.072070 / 0.075469 (-0.003399)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.748073 / 1.841788 (-0.093714) 23.514638 / 8.074308 (15.440330) 20.372037 / 10.191392 (10.180645) 0.220020 / 0.680424 (-0.460404) 0.057130 / 0.534201 (-0.477071) 0.458204 / 0.579283 (-0.121079) 0.600509 / 0.434364 (0.166145) 0.557100 / 0.540337 (0.016762) 0.814360 / 1.386936 (-0.572576)

@github-actions

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007341 / 0.011353 (-0.004012) 0.004606 / 0.011008 (-0.006402) 0.087903 / 0.038508 (0.049395) 0.094090 / 0.023109 (0.070981) 0.322278 / 0.275898 (0.046380) 0.356770 / 0.323480 (0.033290) 0.005988 / 0.007986 (-0.001997) 0.003667 / 0.004328 (-0.000662) 0.066105 / 0.004250 (0.061854) 0.061220 / 0.037052 (0.024167) 0.331190 / 0.258489 (0.072701) 0.381402 / 0.293841 (0.087561) 0.032261 / 0.128546 (-0.096285) 0.009281 / 0.075646 (-0.066366) 0.293694 / 0.419271 (-0.125577) 0.055041 / 0.043533 (0.011508) 0.318080 / 0.255139 (0.062941) 0.348763 / 0.283200 (0.065563) 0.027379 / 0.141683 (-0.114304) 1.496294 / 1.452155 (0.044139) 1.581942 / 1.492716 (0.089226)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.307592 / 0.018006 (0.289586) 0.591805 / 0.000490 (0.591316) 0.017082 / 0.000200 (0.016882) 0.000721 / 0.000054 (0.000666)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.032157 / 0.037411 (-0.005254) 0.096249 / 0.014526 (0.081724) 0.106656 / 0.176557 (-0.069901) 0.162966 / 0.737135 (-0.574169) 0.107068 / 0.296338 (-0.189271)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.409083 / 0.215209 (0.193874) 4.044307 / 2.077655 (1.966652) 2.062887 / 1.504120 (0.558767) 1.900568 / 1.541195 (0.359373) 2.011862 / 1.468490 (0.543372) 0.489250 / 4.584777 (-4.095527) 3.519531 / 3.745712 (-0.226182) 3.631713 / 5.269862 (-1.638149) 2.163967 / 4.565676 (-2.401709) 0.057723 / 0.424275 (-0.366552) 0.007474 / 0.007607 (-0.000133) 0.479562 / 0.226044 (0.253517) 4.799825 / 2.268929 (2.530897) 2.530036 / 55.444624 (-52.914588) 2.195344 / 6.876477 (-4.681133) 2.341046 / 2.142072 (0.198974) 0.625105 / 4.805227 (-4.180122) 0.132823 / 6.500664 (-6.367841) 0.061721 / 0.075469 (-0.013748)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.301313 / 1.841788 (-0.540475) 21.218468 / 8.074308 (13.144159) 15.466347 / 10.191392 (5.274955) 0.166115 / 0.680424 (-0.514309) 0.018866 / 0.534201 (-0.515335) 0.399307 / 0.579283 (-0.179976) 0.430537 / 0.434364 (-0.003827) 0.467110 / 0.540337 (-0.073228) 0.645686 / 1.386936 (-0.741250)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007288 / 0.011353 (-0.004065) 0.004298 / 0.011008 (-0.006710) 0.065515 / 0.038508 (0.027007) 0.089948 / 0.023109 (0.066839) 0.410121 / 0.275898 (0.134223) 0.449312 / 0.323480 (0.125832) 0.006749 / 0.007986 (-0.001237) 0.003927 / 0.004328 (-0.000401) 0.065321 / 0.004250 (0.061071) 0.062480 / 0.037052 (0.025428) 0.410796 / 0.258489 (0.152307) 0.457356 / 0.293841 (0.163515) 0.032632 / 0.128546 (-0.095914) 0.008798 / 0.075646 (-0.066849) 0.075936 / 0.419271 (-0.343335) 0.048402 / 0.043533 (0.004869) 0.403385 / 0.255139 (0.148246) 0.426094 / 0.283200 (0.142895) 0.025326 / 0.141683 (-0.116357) 1.551550 / 1.452155 (0.099395) 1.628622 / 1.492716 (0.135905)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.279689 / 0.018006 (0.261682) 0.583754 / 0.000490 (0.583265) 0.006579 / 0.000200 (0.006379) 0.000096 / 0.000054 (0.000042)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.034906 / 0.037411 (-0.002505) 0.099232 / 0.014526 (0.084706) 0.113093 / 0.176557 (-0.063464) 0.165499 / 0.737135 (-0.571636) 0.113398 / 0.296338 (-0.182941)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.439154 / 0.215209 (0.223945) 4.377041 / 2.077655 (2.299387) 2.395058 / 1.504120 (0.890938) 2.233359 / 1.541195 (0.692164) 2.357281 / 1.468490 (0.888791) 0.486036 / 4.584777 (-4.098741) 3.568794 / 3.745712 (-0.176918) 3.485421 / 5.269862 (-1.784440) 2.174325 / 4.565676 (-2.391351) 0.057855 / 0.424275 (-0.366420) 0.007545 / 0.007607 (-0.000062) 0.516853 / 0.226044 (0.290808) 5.173340 / 2.268929 (2.904412) 2.931475 / 55.444624 (-52.513149) 2.566814 / 6.876477 (-4.309663) 2.873304 / 2.142072 (0.731232) 0.597072 / 4.805227 (-4.208155) 0.133589 / 6.500664 (-6.367075) 0.061882 / 0.075469 (-0.013587)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.382845 / 1.841788 (-0.458943) 21.608316 / 8.074308 (13.534008) 15.702152 / 10.191392 (5.510759) 0.190629 / 0.680424 (-0.489795) 0.020572 / 0.534201 (-0.513629) 0.396207 / 0.579283 (-0.183076) 0.421184 / 0.434364 (-0.013180) 0.477700 / 0.540337 (-0.062638) 0.690828 / 1.386936 (-0.696108)

mariosasko and others added 2 commits October 16, 2023 15:05
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@github-actions

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008450 / 0.011353 (-0.002903) 0.004958 / 0.011008 (-0.006051) 0.105397 / 0.038508 (0.066889) 0.079508 / 0.023109 (0.056399) 0.403050 / 0.275898 (0.127152) 0.443679 / 0.323480 (0.120199) 0.004654 / 0.007986 (-0.003332) 0.005629 / 0.004328 (0.001301) 0.078755 / 0.004250 (0.074505) 0.055694 / 0.037052 (0.018642) 0.409952 / 0.258489 (0.151463) 0.454931 / 0.293841 (0.161090) 0.045124 / 0.128546 (-0.083422) 0.014031 / 0.075646 (-0.061616) 0.347340 / 0.419271 (-0.071931) 0.064359 / 0.043533 (0.020826) 0.414158 / 0.255139 (0.159019) 0.428442 / 0.283200 (0.145243) 0.033726 / 0.141683 (-0.107957) 1.770483 / 1.452155 (0.318328) 1.795267 / 1.492716 (0.302551)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.251020 / 0.018006 (0.233014) 0.507066 / 0.000490 (0.506576) 0.015751 / 0.000200 (0.015551) 0.000531 / 0.000054 (0.000477)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.028897 / 0.037411 (-0.008515) 0.087393 / 0.014526 (0.072867) 0.097365 / 0.176557 (-0.079192) 0.164833 / 0.737135 (-0.572303) 0.101281 / 0.296338 (-0.195058)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.610806 / 0.215209 (0.395597) 6.011697 / 2.077655 (3.934042) 2.544268 / 1.504120 (1.040148) 2.127103 / 1.541195 (0.585908) 2.133330 / 1.468490 (0.664839) 0.860964 / 4.584777 (-3.723813) 4.982374 / 3.745712 (1.236662) 5.073026 / 5.269862 (-0.196836) 3.033056 / 4.565676 (-1.532621) 0.118835 / 0.424275 (-0.305440) 0.010122 / 0.007607 (0.002515) 0.805807 / 0.226044 (0.579763) 7.839166 / 2.268929 (5.570238) 3.512405 / 55.444624 (-51.932219) 2.767578 / 6.876477 (-4.108898) 2.936885 / 2.142072 (0.794813) 1.058533 / 4.805227 (-3.746695) 0.222260 / 6.500664 (-6.278404) 0.073890 / 0.075469 (-0.001580)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.628307 / 1.841788 (-0.213480) 22.827116 / 8.074308 (14.752808) 21.809759 / 10.191392 (11.618367) 0.220637 / 0.680424 (-0.459786) 0.028030 / 0.534201 (-0.506171) 0.448620 / 0.579283 (-0.130663) 0.540442 / 0.434364 (0.106078) 0.548601 / 0.540337 (0.008264) 0.770387 / 1.386936 (-0.616549)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009198 / 0.011353 (-0.002155) 0.004935 / 0.011008 (-0.006073) 0.079095 / 0.038508 (0.040587) 0.090490 / 0.023109 (0.067381) 0.453374 / 0.275898 (0.177476) 0.519483 / 0.323480 (0.196003) 0.006539 / 0.007986 (-0.001447) 0.004160 / 0.004328 (-0.000169) 0.078433 / 0.004250 (0.074182) 0.068022 / 0.037052 (0.030969) 0.467686 / 0.258489 (0.209197) 0.523863 / 0.293841 (0.230022) 0.050926 / 0.128546 (-0.077620) 0.013664 / 0.075646 (-0.061982) 0.088787 / 0.419271 (-0.330485) 0.060503 / 0.043533 (0.016971) 0.474692 / 0.255139 (0.219553) 0.516461 / 0.283200 (0.233261) 0.034482 / 0.141683 (-0.107200) 1.747939 / 1.452155 (0.295784) 1.915212 / 1.492716 (0.422496)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.247400 / 0.018006 (0.229394) 0.516829 / 0.000490 (0.516339) 0.005770 / 0.000200 (0.005570) 0.000121 / 0.000054 (0.000067)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.034334 / 0.037411 (-0.003077) 0.102397 / 0.014526 (0.087871) 0.114187 / 0.176557 (-0.062370) 0.171093 / 0.737135 (-0.566043) 0.117281 / 0.296338 (-0.179058)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.635710 / 0.215209 (0.420501) 6.400656 / 2.077655 (4.323002) 2.896896 / 1.504120 (1.392776) 2.682890 / 1.541195 (1.141696) 2.656445 / 1.468490 (1.187955) 1.044244 / 4.584777 (-3.540533) 5.393212 / 3.745712 (1.647500) 4.592928 / 5.269862 (-0.676934) 2.798525 / 4.565676 (-1.767151) 0.103720 / 0.424275 (-0.320555) 0.010196 / 0.007607 (0.002589) 0.762756 / 0.226044 (0.536711) 7.232939 / 2.268929 (4.964011) 3.714015 / 55.444624 (-51.730609) 3.050766 / 6.876477 (-3.825711) 3.149715 / 2.142072 (1.007643) 1.058827 / 4.805227 (-3.746400) 0.214079 / 6.500664 (-6.286585) 0.076712 / 0.075469 (0.001243)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.701032 / 1.841788 (-0.140755) 23.742023 / 8.074308 (15.667715) 22.486043 / 10.191392 (12.294651) 0.249757 / 0.680424 (-0.430667) 0.031714 / 0.534201 (-0.502486) 0.479914 / 0.579283 (-0.099369) 0.593315 / 0.434364 (0.158951) 0.562897 / 0.540337 (0.022560) 0.826636 / 1.386936 (-0.560300)

@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007816 / 0.011353 (-0.003537) 0.004541 / 0.011008 (-0.006467) 0.097256 / 0.038508 (0.058748) 0.081376 / 0.023109 (0.058267) 0.356635 / 0.275898 (0.080737) 0.394969 / 0.323480 (0.071489) 0.004670 / 0.007986 (-0.003316) 0.003537 / 0.004328 (-0.000791) 0.075564 / 0.004250 (0.071314) 0.063459 / 0.037052 (0.026407) 0.363846 / 0.258489 (0.105357) 0.416337 / 0.293841 (0.122496) 0.036690 / 0.128546 (-0.091857) 0.009653 / 0.075646 (-0.065993) 0.337265 / 0.419271 (-0.082007) 0.061446 / 0.043533 (0.017913) 0.359190 / 0.255139 (0.104051) 0.385866 / 0.283200 (0.102666) 0.030474 / 0.141683 (-0.111209) 1.796903 / 1.452155 (0.344748) 1.852332 / 1.492716 (0.359616)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.264008 / 0.018006 (0.246002) 0.507387 / 0.000490 (0.506897) 0.012309 / 0.000200 (0.012109) 0.000377 / 0.000054 (0.000323)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.033224 / 0.037411 (-0.004188) 0.097136 / 0.014526 (0.082610) 0.113035 / 0.176557 (-0.063522) 0.181778 / 0.737135 (-0.555357) 0.130511 / 0.296338 (-0.165827)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.444512 / 0.215209 (0.229303) 4.453285 / 2.077655 (2.375631) 2.154123 / 1.504120 (0.650003) 1.955451 / 1.541195 (0.414256) 2.015089 / 1.468490 (0.546599) 0.567824 / 4.584777 (-4.016953) 4.083084 / 3.745712 (0.337371) 3.912417 / 5.269862 (-1.357445) 2.366197 / 4.565676 (-2.199480) 0.066468 / 0.424275 (-0.357807) 0.008478 / 0.007607 (0.000870) 0.531196 / 0.226044 (0.305152) 5.311285 / 2.268929 (3.042356) 2.743252 / 55.444624 (-52.701372) 2.322353 / 6.876477 (-4.554124) 2.368168 / 2.142072 (0.226095) 0.679223 / 4.805227 (-4.126004) 0.152401 / 6.500664 (-6.348263) 0.071954 / 0.075469 (-0.003515)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.489114 / 1.841788 (-0.352674) 22.114956 / 8.074308 (14.040648) 16.072564 / 10.191392 (5.881172) 0.164303 / 0.680424 (-0.516121) 0.021317 / 0.534201 (-0.512884) 0.460250 / 0.579283 (-0.119033) 0.467554 / 0.434364 (0.033190) 0.539773 / 0.540337 (-0.000564) 0.751904 / 1.386936 (-0.635032)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007520 / 0.011353 (-0.003833) 0.004487 / 0.011008 (-0.006521) 0.075074 / 0.038508 (0.036566) 0.083135 / 0.023109 (0.060026) 0.474052 / 0.275898 (0.198154) 0.524051 / 0.323480 (0.200571) 0.006192 / 0.007986 (-0.001793) 0.003835 / 0.004328 (-0.000494) 0.074643 / 0.004250 (0.070392) 0.065334 / 0.037052 (0.028282) 0.507033 / 0.258489 (0.248544) 0.519846 / 0.293841 (0.226005) 0.036985 / 0.128546 (-0.091561) 0.009828 / 0.075646 (-0.065818) 0.082992 / 0.419271 (-0.336279) 0.055942 / 0.043533 (0.012409) 0.480652 / 0.255139 (0.225513) 0.503683 / 0.283200 (0.220483) 0.025560 / 0.141683 (-0.116123) 1.801390 / 1.452155 (0.349235) 1.892929 / 1.492716 (0.400213)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.246771 / 0.018006 (0.228765) 0.498901 / 0.000490 (0.498411) 0.008186 / 0.000200 (0.007986) 0.000166 / 0.000054 (0.000112)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.038666 / 0.037411 (0.001254) 0.110317 / 0.014526 (0.095791) 0.122995 / 0.176557 (-0.053562) 0.185355 / 0.737135 (-0.551781) 0.123720 / 0.296338 (-0.172619)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.508421 / 0.215209 (0.293212) 5.046464 / 2.077655 (2.968809) 2.660004 / 1.504120 (1.155884) 2.482841 / 1.541195 (0.941646) 2.573941 / 1.468490 (1.105451) 0.565702 / 4.584777 (-4.019075) 4.197895 / 3.745712 (0.452183) 3.755480 / 5.269862 (-1.514381) 2.308066 / 4.565676 (-2.257610) 0.066559 / 0.424275 (-0.357716) 0.008436 / 0.007607 (0.000829) 0.589858 / 0.226044 (0.363814) 5.873488 / 2.268929 (3.604559) 3.241810 / 55.444624 (-52.202814) 2.789831 / 6.876477 (-4.086645) 3.008989 / 2.142072 (0.866917) 0.679624 / 4.805227 (-4.125603) 0.150868 / 6.500664 (-6.349796) 0.068581 / 0.075469 (-0.006889)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.582955 / 1.841788 (-0.258833) 22.684969 / 8.074308 (14.610661) 16.829855 / 10.191392 (6.638463) 0.201599 / 0.680424 (-0.478825) 0.023261 / 0.534201 (-0.510940) 0.465009 / 0.579283 (-0.114274) 0.497701 / 0.434364 (0.063337) 0.557822 / 0.540337 (0.017485) 0.803234 / 1.386936 (-0.583702)

@mariosasko mariosasko merged commit e74f802 into main Oct 16, 2023
13 checks passed
@mariosasko mariosasko deleted the single-commit-push_to_hub branch October 16, 2023 13:30
@Wauplin
Contributor

Wauplin commented Oct 16, 2023

Well done! 👏 🔥

@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008866 / 0.011353 (-0.002487) 0.005910 / 0.011008 (-0.005098) 0.099916 / 0.038508 (0.061408) 0.085787 / 0.023109 (0.062678) 0.391028 / 0.275898 (0.115130) 0.412689 / 0.323480 (0.089209) 0.006527 / 0.007986 (-0.001459) 0.004629 / 0.004328 (0.000301) 0.084627 / 0.004250 (0.080377) 0.063404 / 0.037052 (0.026352) 0.408923 / 0.258489 (0.150434) 0.437130 / 0.293841 (0.143289) 0.050256 / 0.128546 (-0.078290) 0.013914 / 0.075646 (-0.061732) 0.350893 / 0.419271 (-0.068379) 0.067931 / 0.043533 (0.024398) 0.383807 / 0.255139 (0.128668) 0.424150 / 0.283200 (0.140950) 0.039978 / 0.141683 (-0.101705) 1.697631 / 1.452155 (0.245476) 1.925568 / 1.492716 (0.432851)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.315417 / 0.018006 (0.297410) 0.607050 / 0.000490 (0.606560) 0.017314 / 0.000200 (0.017114) 0.000514 / 0.000054 (0.000459)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.032994 / 0.037411 (-0.004417) 0.103993 / 0.014526 (0.089467) 0.125369 / 0.176557 (-0.051187) 0.185984 / 0.737135 (-0.551151) 0.139192 / 0.296338 (-0.157146)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.639769 / 0.215209 (0.424560) 6.236187 / 2.077655 (4.158532) 2.775777 / 1.504120 (1.271657) 2.599683 / 1.541195 (1.058488) 2.780064 / 1.468490 (1.311574) 1.107247 / 4.584777 (-3.477530) 5.724223 / 3.745712 (1.978511) 5.284786 / 5.269862 (0.014925) 3.342465 / 4.565676 (-1.223211) 0.107685 / 0.424275 (-0.316590) 0.009237 / 0.007607 (0.001630) 0.760282 / 0.226044 (0.534238) 7.570859 / 2.268929 (5.301930) 3.572498 / 55.444624 (-51.872126) 2.997482 / 6.876477 (-3.878995) 2.910001 / 2.142072 (0.767929) 1.249272 / 4.805227 (-3.555955) 0.229425 / 6.500664 (-6.271239) 0.091974 / 0.075469 (0.016505)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.663859 / 1.841788 (-0.177929) 25.283961 / 8.074308 (17.209653) 20.793389 / 10.191392 (10.601997) 0.239263 / 0.680424 (-0.441161) 0.028808 / 0.534201 (-0.505393) 0.521045 / 0.579283 (-0.058238) 0.602451 / 0.434364 (0.168087) 0.544536 / 0.540337 (0.004198) 0.819732 / 1.386936 (-0.567204)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008970 / 0.011353 (-0.002383) 0.009663 / 0.011008 (-0.001345) 0.083471 / 0.038508 (0.044963) 0.090695 / 0.023109 (0.067585) 0.562539 / 0.275898 (0.286641) 0.572092 / 0.323480 (0.248612) 0.007269 / 0.007986 (-0.000717) 0.004664 / 0.004328 (0.000335) 0.084212 / 0.004250 (0.079961) 0.072716 / 0.037052 (0.035664) 0.559810 / 0.258489 (0.301320) 0.574296 / 0.293841 (0.280455) 0.048555 / 0.128546 (-0.079991) 0.015901 / 0.075646 (-0.059746) 0.107815 / 0.419271 (-0.311456) 0.065404 / 0.043533 (0.021871) 0.544787 / 0.255139 (0.289648) 0.586993 / 0.283200 (0.303794) 0.042613 / 0.141683 (-0.099069) 1.919266 / 1.452155 (0.467111) 2.095189 / 1.492716 (0.602473)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.298512 / 0.018006 (0.280506) 0.597745 / 0.000490 (0.597256) 0.008806 / 0.000200 (0.008606) 0.000119 / 0.000054 (0.000064)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.039420 / 0.037411 (0.002009) 0.111378 / 0.014526 (0.096852) 0.136421 / 0.176557 (-0.040135) 0.192006 / 0.737135 (-0.545129) 0.130037 / 0.296338 (-0.166301)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.679169 / 0.215209 (0.463960) 6.750881 / 2.077655 (4.673226) 3.220411 / 1.504120 (1.716291) 2.851988 / 1.541195 (1.310794) 2.974247 / 1.468490 (1.505757) 0.892593 / 4.584777 (-3.692184) 5.659975 / 3.745712 (1.914263) 5.172641 / 5.269862 (-0.097220) 3.308429 / 4.565676 (-1.257248) 0.100580 / 0.424275 (-0.323695) 0.009320 / 0.007607 (0.001713) 0.833290 / 0.226044 (0.607245) 8.091847 / 2.268929 (5.822918) 4.023734 / 55.444624 (-51.420890) 3.441583 / 6.876477 (-3.434894) 3.763562 / 2.142072 (1.621489) 1.055105 / 4.805227 (-3.750122) 0.239218 / 6.500664 (-6.261446) 0.081922 / 0.075469 (0.006453)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.796495 / 1.841788 (-0.045293) 25.942492 / 8.074308 (17.868184) 23.211617 / 10.191392 (13.020225) 0.256054 / 0.680424 (-0.424370) 0.030491 / 0.534201 (-0.503710) 0.520474 / 0.579283 (-0.058809) 0.626331 / 0.434364 (0.191967) 0.619897 / 0.540337 (0.079560) 0.900833 / 1.386936 (-0.486103)

@ZachNagengast
Contributor

Congrats on merging this! 👏
