keep file opened in read_csv
lhoestq committed Jun 7, 2022
1 parent b84ae0e commit 1746712
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions src/datasets/download/streaming_download_manager.py
@@ -660,8 +660,7 @@ def xpandas_read_csv(filepath_or_buffer, use_auth_token: Optional[Union[str, boo
     if hasattr(filepath_or_buffer, "read"):
         return pd.read_csv(filepath_or_buffer, **kwargs)
     else:
-        with xopen(filepath_or_buffer, "rb", use_auth_token=use_auth_token) as f:
-            return pd.read_csv(f, **kwargs)
+        return pd.read_csv(xopen(filepath_or_buffer, "rb", use_auth_token=use_auth_token), **kwargs)


 def xpandas_read_excel(filepath_or_buffer, **kwargs):
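
The commit message does not say why the handle has to stay open. One plausible reason, offered here as an assumption, is that pd.read_csv can return a lazy reader (for example when chunksize or iterator=True is passed), and such a reader keeps pulling from the handle after the call has returned. The sketch below reproduces that pattern with standard-library stand-ins: csv.reader plays the role of the lazy pandas reader and the builtin open replaces xopen. The helper names are hypothetical and not part of datasets.

```python
import csv
import os
import tempfile


def lazy_reader_closing(path):
    # Shape of the removed lines: the handle is closed as soon as we return,
    # before the caller has consumed anything from the lazy reader.
    with open(path, newline="") as f:
        return csv.reader(f)


def lazy_reader_keeping_open(path):
    # Shape of the added line: the handle is left open for the reader.
    # Nothing closes it explicitly; cleanup is left to garbage collection.
    return csv.reader(open(path, newline=""))


def write_sample_csv():
    # Hypothetical helper: writes a three-line CSV to a temporary file.
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows([["a", "b"], ["1", "2"], ["3", "4"]])
    return path


path = write_sample_csv()

try:
    rows_closing = list(lazy_reader_closing(path))
    closing_error = None
except ValueError as err:
    # Reading from the already-closed handle fails.
    rows_closing = None
    closing_error = str(err)

rows_keeping_open = list(lazy_reader_keeping_open(path))
os.remove(path)

print("closing variant error:", closing_error)
print("keep-open variant rows:", rows_keeping_open)
```

The trade-off is that the second form has no explicit close; the handle is only released once the reader that holds it is garbage collected.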

1 comment on commit 1746712

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric: new / old (diff)
read_batch_formatted_as_numpy after write_array2d: 0.009274 / 0.011353 (-0.002079)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.008800 / 0.011008 (-0.002208)
read_batch_formatted_as_numpy after write_nested_sequence: 0.034832 / 0.038508 (-0.003676)
read_batch_unformated after write_array2d: 0.038478 / 0.023109 (0.015368)
read_batch_unformated after write_flattened_sequence: 0.350430 / 0.275898 (0.074532)
read_batch_unformated after write_nested_sequence: 0.372482 / 0.323480 (0.049002)
read_col_formatted_as_numpy after write_array2d: 0.006993 / 0.007986 (-0.000993)
read_col_formatted_as_numpy after write_flattened_sequence: 0.005640 / 0.004328 (0.001311)
read_col_formatted_as_numpy after write_nested_sequence: 0.011687 / 0.004250 (0.007437)
read_col_unformated after write_array2d: 0.046843 / 0.037052 (0.009791)
read_col_unformated after write_flattened_sequence: 0.336260 / 0.258489 (0.077771)
read_col_unformated after write_nested_sequence: 0.435942 / 0.293841 (0.142101)
read_formatted_as_numpy after write_array2d: 0.040279 / 0.128546 (-0.088268)
read_formatted_as_numpy after write_flattened_sequence: 0.009458 / 0.075646 (-0.066188)
read_formatted_as_numpy after write_nested_sequence: 0.295873 / 0.419271 (-0.123399)
read_unformated after write_array2d: 0.061954 / 0.043533 (0.018421)
read_unformated after write_flattened_sequence: 0.345114 / 0.255139 (0.089975)
read_unformated after write_nested_sequence: 0.377160 / 0.283200 (0.093961)
write_array2d: 0.116632 / 0.141683 (-0.025051)
write_flattened_sequence: 2.280613 / 1.452155 (0.828458)
write_nested_sequence: 2.310110 / 1.492716 (0.817393)

Benchmark: benchmark_getitem_100B.json

metric: new / old (diff)
get_batch_of_1024_random_rows: 0.024266 / 0.018006 (0.006260)
get_batch_of_1024_rows: 0.472049 / 0.000490 (0.471560)
get_first_row: 0.012994 / 0.000200 (0.012794)
get_last_row: 0.000327 / 0.000054 (0.000272)

Benchmark: benchmark_indices_mapping.json

metric: new / old (diff)
select: 0.027502 / 0.037411 (-0.009909)
shard: 0.123613 / 0.014526 (0.109087)
shuffle: 0.129829 / 0.176557 (-0.046727)
sort: 0.178786 / 0.737135 (-0.558350)
train_test_split: 0.130555 / 0.296338 (-0.165783)

Benchmark: benchmark_iterating.json

metric: new / old (diff)
read 5000: 0.471155 / 0.215209 (0.255946)
read 50000: 4.583808 / 2.077655 (2.506154)
read_batch 50000 10: 2.007651 / 1.504120 (0.503531)
read_batch 50000 100: 1.737326 / 1.541195 (0.196131)
read_batch 50000 1000: 1.831784 / 1.468490 (0.363294)
read_formatted numpy 5000: 0.467462 / 4.584777 (-4.117315)
read_formatted pandas 5000: 5.735218 / 3.745712 (1.989505)
read_formatted tensorflow 5000: 2.893965 / 5.269862 (-2.375896)
read_formatted torch 5000: 1.186308 / 4.565676 (-3.379368)
read_formatted_batch numpy 5000 10: 0.059237 / 0.424275 (-0.365038)
read_formatted_batch numpy 5000 1000: 0.017980 / 0.007607 (0.010373)
shuffled read 5000: 0.635817 / 0.226044 (0.409772)
shuffled read 50000: 6.528261 / 2.268929 (4.259332)
shuffled read_batch 50000 10: 2.479906 / 55.444624 (-52.964719)
shuffled read_batch 50000 100: 2.123491 / 6.876477 (-4.752985)
shuffled read_batch 50000 1000: 2.209833 / 2.142072 (0.067760)
shuffled read_formatted numpy 5000: 0.612565 / 4.805227 (-4.192663)
shuffled read_formatted_batch numpy 5000 10: 0.135299 / 6.500664 (-6.365365)
shuffled read_formatted_batch numpy 5000 1000: 0.069389 / 0.075469 (-0.006080)

Benchmark: benchmark_map_filter.json

metric: new / old (diff)
filter: 1.879050 / 1.841788 (0.037263)
map fast-tokenizer batched: 15.891371 / 8.074308 (7.817063)
map identity: 29.489145 / 10.191392 (19.297753)
map identity batched: 0.995630 / 0.680424 (0.315206)
map no-op batched: 0.605670 / 0.534201 (0.071469)
map no-op batched numpy: 0.536161 / 0.579283 (-0.043122)
map no-op batched pandas: 0.589115 / 0.434364 (0.154751)
map no-op batched pytorch: 0.360657 / 0.540337 (-0.179681)
map no-op batched tensorflow: 0.385190 / 1.386936 (-1.001746)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric: new / old (diff)
read_batch_formatted_as_numpy after write_array2d: 0.009472 / 0.011353 (-0.001881)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.004679 / 0.011008 (-0.006329)
read_batch_formatted_as_numpy after write_nested_sequence: 0.034065 / 0.038508 (-0.004443)
read_batch_unformated after write_array2d: 0.038062 / 0.023109 (0.014953)
read_batch_unformated after write_flattened_sequence: 0.350587 / 0.275898 (0.074689)
read_batch_unformated after write_nested_sequence: 0.359117 / 0.323480 (0.035637)
read_col_formatted_as_numpy after write_array2d: 0.007269 / 0.007986 (-0.000717)
read_col_formatted_as_numpy after write_flattened_sequence: 0.005350 / 0.004328 (0.001021)
read_col_formatted_as_numpy after write_nested_sequence: 0.008253 / 0.004250 (0.004003)
read_col_unformated after write_array2d: 0.045703 / 0.037052 (0.008651)
read_col_unformated after write_flattened_sequence: 0.327801 / 0.258489 (0.069312)
read_col_unformated after write_nested_sequence: 0.372480 / 0.293841 (0.078639)
read_formatted_as_numpy after write_array2d: 0.035062 / 0.128546 (-0.093484)
read_formatted_as_numpy after write_flattened_sequence: 0.011048 / 0.075646 (-0.064598)
read_formatted_as_numpy after write_nested_sequence: 0.279181 / 0.419271 (-0.140091)
read_unformated after write_array2d: 0.058070 / 0.043533 (0.014537)
read_unformated after write_flattened_sequence: 0.328938 / 0.255139 (0.073799)
read_unformated after write_nested_sequence: 0.342739 / 0.283200 (0.059539)
write_array2d: 0.103596 / 0.141683 (-0.038087)
write_flattened_sequence: 2.042932 / 1.452155 (0.590778)
write_nested_sequence: 2.064985 / 1.492716 (0.572269)

Benchmark: benchmark_getitem_100B.json

metric: new / old (diff)
get_batch_of_1024_random_rows: 0.344314 / 0.018006 (0.326308)
get_batch_of_1024_rows: 0.483667 / 0.000490 (0.483177)
get_first_row: 0.036085 / 0.000200 (0.035885)
get_last_row: 0.000822 / 0.000054 (0.000768)

Benchmark: benchmark_indices_mapping.json

metric: new / old (diff)
select: 0.027782 / 0.037411 (-0.009629)
shard: 0.120894 / 0.014526 (0.106368)
shuffle: 0.129637 / 0.176557 (-0.046919)
sort: 0.181322 / 0.737135 (-0.555814)
train_test_split: 0.127931 / 0.296338 (-0.168408)

Benchmark: benchmark_iterating.json

metric: new / old (diff)
read 5000: 0.518029 / 0.215209 (0.302820)
read 50000: 5.132365 / 2.077655 (3.054711)
read_batch 50000 10: 2.166123 / 1.504120 (0.662003)
read_batch 50000 100: 1.923806 / 1.541195 (0.382611)
read_batch 50000 1000: 1.981623 / 1.468490 (0.513133)
read_formatted numpy 5000: 0.490551 / 4.584777 (-4.094226)
read_formatted pandas 5000: 5.652222 / 3.745712 (1.906509)
read_formatted tensorflow 5000: 2.597597 / 5.269862 (-2.672265)
read_formatted torch 5000: 1.132088 / 4.565676 (-3.433589)
read_formatted_batch numpy 5000 10: 0.060151 / 0.424275 (-0.364124)
read_formatted_batch numpy 5000 1000: 0.017395 / 0.007607 (0.009788)
shuffled read 5000: 0.663155 / 0.226044 (0.437110)
shuffled read 50000: 6.475609 / 2.268929 (4.206681)
shuffled read_batch 50000 10: 2.598013 / 55.444624 (-52.846611)
shuffled read_batch 50000 100: 2.204710 / 6.876477 (-4.671767)
shuffled read_batch 50000 1000: 2.216567 / 2.142072 (0.074495)
shuffled read_formatted numpy 5000: 0.635622 / 4.805227 (-4.169606)
shuffled read_formatted_batch numpy 5000 10: 0.138144 / 6.500664 (-6.362520)
shuffled read_formatted_batch numpy 5000 1000: 0.067140 / 0.075469 (-0.008329)

Benchmark: benchmark_map_filter.json

metric: new / old (diff)
filter: 1.900859 / 1.841788 (0.059071)
map fast-tokenizer batched: 16.502254 / 8.074308 (8.427945)
map identity: 30.991487 / 10.191392 (20.800095)
map identity batched: 0.976211 / 0.680424 (0.295787)
map no-op batched: 0.606480 / 0.534201 (0.072279)
map no-op batched numpy: 0.556435 / 0.579283 (-0.022848)
map no-op batched pandas: 0.617897 / 0.434364 (0.183533)
map no-op batched pytorch: 0.401695 / 0.540337 (-0.138643)
map no-op batched tensorflow: 0.401635 / 1.386936 (-0.985301)
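
Each benchmark entry above has the form new / old (diff), where diff is new minus old. A minimal sketch of reading one entry and deriving a relative change follows; parse_cell is a hypothetical helper, not part of the benchmark tooling. The printed diff can differ from new - old in the last digit (for example 0.472049 / 0.000490 (0.471560)), presumably because the values were rounded after subtraction.

```python
import re

# Matches one entry such as "0.009274 / 0.011353 (-0.002079)".
CELL = re.compile(r"([\d.]+) / ([\d.]+) \((-?[\d.]+)\)")


def parse_cell(cell):
    """Split 'new / old (diff)' into floats and add the relative change."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = map(float, match.groups())
    return {"new": new, "old": old, "diff": diff, "relative": (new - old) / old}


cell = parse_cell("0.009274 / 0.011353 (-0.002079)")
print(cell)
```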
