Add repo_id to DatasetInfo #6268

Draft
wants to merge 3 commits into base: main

Conversation

lhoestq (Member) commented Sep 29, 2023

from datasets import load_dataset

ds = load_dataset("lhoestq/demo1", split="train")
ds = ds.map(lambda x: {}, num_proc=2).filter(lambda x: True).remove_columns(["id"])
print(ds.repo_id)
# lhoestq/demo1
  • repo_id is None when the dataset doesn't come from the Hub, e.g. from Dataset.from_dict
  • repo_id is set to None when concatenating datasets with different repo ids
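
The two rules above can be sketched without the library. This is a minimal illustration under assumptions, not the code in this PR: `merged_repo_id` is a hypothetical helper name, and it only models how a single value could be derived from several source datasets.

```python
from typing import Iterable, Optional


def merged_repo_id(repo_ids: Iterable[Optional[str]]) -> Optional[str]:
    """Return the shared repo id, or None if the inputs disagree.

    Hypothetical helper: it mirrors the rule described above, not the
    actual implementation in this PR.
    """
    unique = set(repo_ids)
    # Exactly one distinct value (possibly None) means every input agrees.
    return unique.pop() if len(unique) == 1 else None
```

A dataset that does not come from the Hub would contribute `None`, so mixing it with a Hub dataset also yields `None` under this sketch.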

related to #4129

TODO:

  • discuss if it's ok for now
  • tests

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

lhoestq (Member, Author) commented Sep 29, 2023

In #4129 we want to track the origin of a dataset, e.g. when it is built from multiple datasets.

I think that's out of scope for DatasetInfo alone, which holds info for one dataset only.
Therefore it makes sense to add repo_id, which also describes one dataset only.

IMO if we want to track multiple origins, we will need a new DatasetInfo with fields relevant to a mix of datasets (out of scope for this PR).

cc @mariosasko I'd like your opinion on this

github-actions bot commented

Benchmarks

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009009 / 0.011353 (-0.002344) 0.004169 / 0.011008 (-0.006840) 0.098634 / 0.038508 (0.060126) 0.069526 / 0.023109 (0.046417) 0.337963 / 0.275898 (0.062065) 0.379737 / 0.323480 (0.056257) 0.004318 / 0.007986 (-0.003668) 0.005347 / 0.004328 (0.001019) 0.069875 / 0.004250 (0.065624) 0.055964 / 0.037052 (0.018912) 0.340305 / 0.258489 (0.081816) 0.429718 / 0.293841 (0.135877) 0.045101 / 0.128546 (-0.083445) 0.012610 / 0.075646 (-0.063036) 0.312366 / 0.419271 (-0.106905) 0.064711 / 0.043533 (0.021178) 0.345216 / 0.255139 (0.090077) 0.367245 / 0.283200 (0.084046) 0.034638 / 0.141683 (-0.107045) 1.541947 / 1.452155 (0.089793) 1.645268 / 1.492716 (0.152551)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.233501 / 0.018006 (0.215495) 0.514207 / 0.000490 (0.513717) 0.014271 / 0.000200 (0.014072) 0.000366 / 0.000054 (0.000311)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.026288 / 0.037411 (-0.011124) 0.083206 / 0.014526 (0.068680) 0.098172 / 0.176557 (-0.078385) 0.158529 / 0.737135 (-0.578606) 0.095183 / 0.296338 (-0.201155)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.538300 / 0.215209 (0.323091) 5.486939 / 2.077655 (3.409285) 2.321812 / 1.504120 (0.817692) 2.002124 / 1.541195 (0.460929) 2.045043 / 1.468490 (0.576553) 0.852772 / 4.584777 (-3.732005) 5.014897 / 3.745712 (1.269185) 4.428115 / 5.269862 (-0.841746) 2.750126 / 4.565676 (-1.815550) 0.099028 / 0.424275 (-0.325247) 0.007678 / 0.007607 (0.000070) 0.664463 / 0.226044 (0.438418) 6.617811 / 2.268929 (4.348883) 2.888382 / 55.444624 (-52.556242) 2.190753 / 6.876477 (-4.685724) 2.414586 / 2.142072 (0.272513) 1.010302 / 4.805227 (-3.794925) 0.194925 / 6.500664 (-6.305739) 0.063490 / 0.075469 (-0.011979)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.543464 / 1.841788 (-0.298323) 20.566666 / 8.074308 (12.492358) 19.410745 / 10.191392 (9.219353) 0.207077 / 0.680424 (-0.473347) 0.028895 / 0.534201 (-0.505306) 0.427525 / 0.579283 (-0.151758) 0.535450 / 0.434364 (0.101086) 0.494632 / 0.540337 (-0.045705) 0.723705 / 1.386936 (-0.663231)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008209 / 0.011353 (-0.003144) 0.004184 / 0.011008 (-0.006824) 0.072420 / 0.038508 (0.033912) 0.066851 / 0.023109 (0.043742) 0.424137 / 0.275898 (0.148239) 0.473156 / 0.323480 (0.149676) 0.005394 / 0.007986 (-0.002591) 0.003898 / 0.004328 (-0.000430) 0.069996 / 0.004250 (0.065746) 0.053113 / 0.037052 (0.016061) 0.453214 / 0.258489 (0.194725) 0.495921 / 0.293841 (0.202080) 0.043028 / 0.128546 (-0.085519) 0.012320 / 0.075646 (-0.063326) 0.080270 / 0.419271 (-0.339002) 0.053337 / 0.043533 (0.009804) 0.436604 / 0.255139 (0.181465) 0.463422 / 0.283200 (0.180223) 0.030277 / 0.141683 (-0.111406) 1.560261 / 1.452155 (0.108106) 1.647209 / 1.492716 (0.154493)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.232556 / 0.018006 (0.214550) 0.502387 / 0.000490 (0.501897) 0.006688 / 0.000200 (0.006488) 0.000118 / 0.000054 (0.000064)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030204 / 0.037411 (-0.007207) 0.089438 / 0.014526 (0.074912) 0.118939 / 0.176557 (-0.057617) 0.160537 / 0.737135 (-0.576598) 0.113432 / 0.296338 (-0.182906)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.586469 / 0.215209 (0.371260) 5.916156 / 2.077655 (3.838502) 2.904960 / 1.504120 (1.400840) 2.346838 / 1.541195 (0.805644) 2.373688 / 1.468490 (0.905198) 0.829917 / 4.584777 (-3.754860) 4.851283 / 3.745712 (1.105571) 4.220103 / 5.269862 (-1.049758) 2.706139 / 4.565676 (-1.859538) 0.094095 / 0.424275 (-0.330180) 0.008201 / 0.007607 (0.000594) 0.699099 / 0.226044 (0.473054) 7.046940 / 2.268929 (4.778011) 3.374837 / 55.444624 (-52.069788) 2.690839 / 6.876477 (-4.185638) 2.845717 / 2.142072 (0.703645) 0.989698 / 4.805227 (-3.815529) 0.190413 / 6.500664 (-6.310251) 0.066233 / 0.075469 (-0.009236)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.513607 / 1.841788 (-0.328180) 21.544200 / 8.074308 (13.469892) 20.297337 / 10.191392 (10.105945) 0.216390 / 0.680424 (-0.464034) 0.029962 / 0.534201 (-0.504239) 0.451531 / 0.579283 (-0.127752) 0.530147 / 0.434364 (0.095783) 0.520739 / 0.540337 (-0.019598) 0.716431 / 1.386936 (-0.670505)

github-actions bot commented

Benchmarks

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006509 / 0.011353 (-0.004844) 0.003987 / 0.011008 (-0.007022) 0.085233 / 0.038508 (0.046725) 0.077765 / 0.023109 (0.054656) 0.310467 / 0.275898 (0.034569) 0.343363 / 0.323480 (0.019883) 0.005557 / 0.007986 (-0.002429) 0.003430 / 0.004328 (-0.000898) 0.064948 / 0.004250 (0.060697) 0.056864 / 0.037052 (0.019812) 0.314005 / 0.258489 (0.055516) 0.360638 / 0.293841 (0.066798) 0.031134 / 0.128546 (-0.097412) 0.008869 / 0.075646 (-0.066777) 0.286409 / 0.419271 (-0.132862) 0.051338 / 0.043533 (0.007805) 0.311329 / 0.255139 (0.056190) 0.334373 / 0.283200 (0.051174) 0.024816 / 0.141683 (-0.116867) 1.502872 / 1.452155 (0.050718) 1.569941 / 1.492716 (0.077224)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.269639 / 0.018006 (0.251633) 0.558510 / 0.000490 (0.558020) 0.011748 / 0.000200 (0.011548) 0.000234 / 0.000054 (0.000180)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.029139 / 0.037411 (-0.008272) 0.083586 / 0.014526 (0.069060) 0.102426 / 0.176557 (-0.074131) 0.162398 / 0.737135 (-0.574737) 0.101364 / 0.296338 (-0.194975)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.382281 / 0.215209 (0.167072) 3.826412 / 2.077655 (1.748758) 1.815911 / 1.504120 (0.311791) 1.644539 / 1.541195 (0.103344) 1.688487 / 1.468490 (0.219996) 0.482115 / 4.584777 (-4.102662) 3.574773 / 3.745712 (-0.170939) 3.262733 / 5.269862 (-2.007129) 2.058115 / 4.565676 (-2.507562) 0.056367 / 0.424275 (-0.367908) 0.007233 / 0.007607 (-0.000374) 0.456859 / 0.226044 (0.230815) 4.565935 / 2.268929 (2.297006) 2.311802 / 55.444624 (-53.132823) 1.943936 / 6.876477 (-4.932541) 2.129811 / 2.142072 (-0.012261) 0.575098 / 4.805227 (-4.230129) 0.130495 / 6.500664 (-6.370169) 0.059757 / 0.075469 (-0.015712)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.238495 / 1.841788 (-0.603293) 18.940000 / 8.074308 (10.865692) 14.034240 / 10.191392 (3.842848) 0.166418 / 0.680424 (-0.514006) 0.018420 / 0.534201 (-0.515781) 0.395330 / 0.579283 (-0.183953) 0.413518 / 0.434364 (-0.020846) 0.461499 / 0.540337 (-0.078838) 0.661371 / 1.386936 (-0.725565)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006673 / 0.011353 (-0.004680) 0.004335 / 0.011008 (-0.006673) 0.064568 / 0.038508 (0.026060) 0.072763 / 0.023109 (0.049653) 0.429488 / 0.275898 (0.153590) 0.456900 / 0.323480 (0.133420) 0.005481 / 0.007986 (-0.002505) 0.003649 / 0.004328 (-0.000680) 0.064975 / 0.004250 (0.060724) 0.056839 / 0.037052 (0.019786) 0.439451 / 0.258489 (0.180962) 0.461691 / 0.293841 (0.167850) 0.031455 / 0.128546 (-0.097092) 0.008848 / 0.075646 (-0.066798) 0.071719 / 0.419271 (-0.347553) 0.047116 / 0.043533 (0.003583) 0.429055 / 0.255139 (0.173916) 0.434204 / 0.283200 (0.151004) 0.022594 / 0.141683 (-0.119089) 1.539231 / 1.452155 (0.087077) 1.568111 / 1.492716 (0.075394)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.267374 / 0.018006 (0.249368) 0.553202 / 0.000490 (0.552712) 0.005410 / 0.000200 (0.005210) 0.000101 / 0.000054 (0.000046)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.031478 / 0.037411 (-0.005933) 0.092438 / 0.014526 (0.077912) 0.103874 / 0.176557 (-0.072682) 0.158428 / 0.737135 (-0.578708) 0.111617 / 0.296338 (-0.184721)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.434783 / 0.215209 (0.219574) 4.332536 / 2.077655 (2.254881) 2.354522 / 1.504120 (0.850402) 2.220271 / 1.541195 (0.679076) 2.338524 / 1.468490 (0.870034) 0.494508 / 4.584777 (-4.090269) 3.619592 / 3.745712 (-0.126120) 3.320897 / 5.269862 (-1.948964) 2.107475 / 4.565676 (-2.458202) 0.058479 / 0.424275 (-0.365796) 0.007427 / 0.007607 (-0.000180) 0.509298 / 0.226044 (0.283254) 5.067940 / 2.268929 (2.799012) 2.815336 / 55.444624 (-52.629288) 2.470958 / 6.876477 (-4.405519) 2.672801 / 2.142072 (0.530728) 0.588199 / 4.805227 (-4.217028) 0.134062 / 6.500664 (-6.366602) 0.060951 / 0.075469 (-0.014518)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.353955 / 1.841788 (-0.487832) 20.386012 / 8.074308 (12.311704) 15.032463 / 10.191392 (4.841071) 0.167243 / 0.680424 (-0.513181) 0.020426 / 0.534201 (-0.513775) 0.396815 / 0.579283 (-0.182469) 0.421806 / 0.434364 (-0.012558) 0.471866 / 0.540337 (-0.068471) 0.667206 / 1.386936 (-0.719730)

Review thread on src/datasets/info.py (outdated, resolved)

davanstrien (Member) commented

Really happy to see this! In the future, it could also be helpful to track other metadata about how the dataset was built, e.g. for The Stack loaded like this:

ds = load_dataset("bigcode/the-stack", data_dir="data/dockerfile", split="train")

It could be helpful to have easy access to the data_dir argument used during loading, since that changes the training data quite a bit vs. loading the full dataset. You can also recover this from download_checksums, but that seems a bit hacky. None of this is necessary for this PR, though.
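
To illustrate why recovering the directory from download_checksums feels hacky, here is a rough sketch. It assumes the mapping's keys are file URLs whose paths share the loaded directory as a common prefix; `common_data_dir` is a hypothetical helper, not part of the library, and the result is the shared URL path rather than the exact data_dir value passed to load_dataset.

```python
import posixpath
from typing import Mapping, Optional
from urllib.parse import urlparse


def common_data_dir(download_checksums: Mapping[str, object]) -> Optional[str]:
    """Guess the directory shared by all downloaded files.

    Assumption: every key is a URL with an absolute path component.
    """
    # Keep only the directory part of each URL path.
    dirs = [posixpath.dirname(urlparse(url).path) for url in download_checksums]
    if not dirs:
        return None
    # The longest common directory prefix is the best available guess.
    return posixpath.commonpath(dirs)
```

The guess degrades as soon as the files span several directories, which is one reason a dedicated field would be easier to rely on.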

tomaarsen (Member) commented Sep 29, 2023

Perhaps it is also interesting to track the revision? I suppose the version also kind of covers that.

That said, this is looking great already! I'm quite excited about this. Losing the repo_id after merging (different) datasets also makes sense to me, well done.

davanstrien (Member) commented

One other thought: is it worth tracking whether a token was passed during loading?

The Hub ID for private datasets could in some cases contain information someone wouldn't want to make public, e.g. davanstrien/super_secret_dataset_using_GPT_created_data.

Adding a bool like is_private could then be used by another library to determine whether the dataset ID should be shared (or to default to not sharing the ID for private datasets), e.g. in SpanMarker, @tomaarsen might do a check like:

if ds.is_private and not push_hub_id_for_private_ds:
    ds_name = None

Potentially this is overkill, but it could be useful for downstream libraries that use this information to create automatic model cards.
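
A slightly fuller version of that check, as a self-contained sketch. Everything here is an assumption for illustration: `DatasetStub` stands in for a loaded dataset, `is_private` is the flag proposed in this comment (not an existing attribute), and `dataset_name_for_card` and `share_private_id` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetStub:
    """Stand-in for a loaded dataset; both fields are assumptions."""

    repo_id: Optional[str]
    is_private: bool = False  # proposed in this thread, not an existing attribute


def dataset_name_for_card(
    ds: DatasetStub, share_private_id: bool = False
) -> Optional[str]:
    """Return the Hub ID to publish in a model card, or None to withhold it."""
    # Default to withholding the ID of a private dataset unless the caller opts in.
    if ds.is_private and not share_private_id:
        return None
    return ds.repo_id
```

Defaulting `share_private_id` to False keeps the private ID out of generated cards unless the caller explicitly asks for it.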

mariosasko (Collaborator) commented

We should probably find a way to remove DatasetInfo, as (most of) its attributes are outdated (homepage, description, etc.), not introduce new ones :). But I guess storing repo_id there is the simplest solution for now, so I'm OK with it.

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

github-actions bot commented Oct 1, 2023

Benchmarks

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007757 / 0.011353 (-0.003595) 0.004543 / 0.011008 (-0.006465) 0.100193 / 0.038508 (0.061685) 0.082333 / 0.023109 (0.059224) 0.374586 / 0.275898 (0.098688) 0.412617 / 0.323480 (0.089137) 0.006148 / 0.007986 (-0.001838) 0.003826 / 0.004328 (-0.000503) 0.077077 / 0.004250 (0.072827) 0.064057 / 0.037052 (0.027005) 0.391435 / 0.258489 (0.132946) 0.436439 / 0.293841 (0.142599) 0.036534 / 0.128546 (-0.092012) 0.009986 / 0.075646 (-0.065660) 0.344243 / 0.419271 (-0.075028) 0.062013 / 0.043533 (0.018480) 0.378113 / 0.255139 (0.122974) 0.398476 / 0.283200 (0.115276) 0.026552 / 0.141683 (-0.115131) 1.740505 / 1.452155 (0.288350) 1.835684 / 1.492716 (0.342968)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.267917 / 0.018006 (0.249911) 0.510676 / 0.000490 (0.510186) 0.010810 / 0.000200 (0.010610) 0.000383 / 0.000054 (0.000328)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.032113 / 0.037411 (-0.005299) 0.097679 / 0.014526 (0.083153) 0.113213 / 0.176557 (-0.063344) 0.177897 / 0.737135 (-0.559238) 0.111761 / 0.296338 (-0.184577)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.450544 / 0.215209 (0.235335) 4.476746 / 2.077655 (2.399091) 2.205391 / 1.504120 (0.701271) 2.006457 / 1.541195 (0.465262) 2.058859 / 1.468490 (0.590369) 0.571549 / 4.584777 (-4.013228) 4.175039 / 3.745712 (0.429327) 3.815445 / 5.269862 (-1.454416) 2.376673 / 4.565676 (-2.189004) 0.067048 / 0.424275 (-0.357227) 0.008544 / 0.007607 (0.000937) 0.536384 / 0.226044 (0.310340) 5.386232 / 2.268929 (3.117304) 2.825620 / 55.444624 (-52.619004) 2.339821 / 6.876477 (-4.536656) 2.535736 / 2.142072 (0.393663) 0.679572 / 4.805227 (-4.125655) 0.156799 / 6.500664 (-6.343865) 0.071667 / 0.075469 (-0.003802)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.512198 / 1.841788 (-0.329590) 21.786760 / 8.074308 (13.712452) 16.386274 / 10.191392 (6.194882) 0.169108 / 0.680424 (-0.511316) 0.021312 / 0.534201 (-0.512889) 0.466153 / 0.579283 (-0.113130) 0.496192 / 0.434364 (0.061829) 0.549420 / 0.540337 (0.009082) 0.780974 / 1.386936 (-0.605962)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007644 / 0.011353 (-0.003709) 0.004654 / 0.011008 (-0.006354) 0.075280 / 0.038508 (0.036772) 0.083044 / 0.023109 (0.059935) 0.481704 / 0.275898 (0.205805) 0.514828 / 0.323480 (0.191348) 0.006245 / 0.007986 (-0.001740) 0.003715 / 0.004328 (-0.000614) 0.074498 / 0.004250 (0.070248) 0.064406 / 0.037052 (0.027353) 0.481874 / 0.258489 (0.223385) 0.518527 / 0.293841 (0.224686) 0.037549 / 0.128546 (-0.090997) 0.010106 / 0.075646 (-0.065541) 0.084266 / 0.419271 (-0.335006) 0.056659 / 0.043533 (0.013126) 0.497707 / 0.255139 (0.242568) 0.503201 / 0.283200 (0.220001) 0.027086 / 0.141683 (-0.114597) 1.834715 / 1.452155 (0.382560) 1.919927 / 1.492716 (0.427210)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.249288 / 0.018006 (0.231282) 0.500950 / 0.000490 (0.500460) 0.005856 / 0.000200 (0.005656) 0.000120 / 0.000054 (0.000065)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.037674 / 0.037411 (0.000263) 0.111141 / 0.014526 (0.096615) 0.123408 / 0.176557 (-0.053149) 0.186604 / 0.737135 (-0.550531) 0.125360 / 0.296338 (-0.170979)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.520480 / 0.215209 (0.305271) 5.171108 / 2.077655 (3.093453) 2.812746 / 1.504120 (1.308626) 2.602941 / 1.541195 (1.061746) 2.666196 / 1.468490 (1.197706) 0.578684 / 4.584777 (-4.006092) 4.238722 / 3.745712 (0.493010) 3.844361 / 5.269862 (-1.425501) 2.369214 / 4.565676 (-2.196462) 0.068543 / 0.424275 (-0.355732) 0.008695 / 0.007607 (0.001088) 0.621869 / 0.226044 (0.395825) 6.200566 / 2.268929 (3.931637) 3.340846 / 55.444624 (-52.103779) 2.920691 / 6.876477 (-3.955786) 3.132438 / 2.142072 (0.990366) 0.697394 / 4.805227 (-4.107834) 0.158385 / 6.500664 (-6.342280) 0.072566 / 0.075469 (-0.002903)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.599070 / 1.841788 (-0.242717) 22.767139 / 8.074308 (14.692831) 17.053988 / 10.191392 (6.862596) 0.188414 / 0.680424 (-0.492009) 0.023409 / 0.534201 (-0.510792) 0.472092 / 0.579283 (-0.107191) 0.486107 / 0.434364 (0.051743) 0.562190 / 0.540337 (0.021852) 0.791606 / 1.386936 (-0.595330)

5 participants