Add repo_id to DatasetInfo #6268

lhoestq · 2023-09-29T10:24:55Z

from datasets import load_dataset

ds = load_dataset("lhoestq/demo1", split="train")
ds = ds.map(lambda x: {}, num_proc=2).filter(lambda x: True).remove_columns(["id"])
print(ds.repo_id)
# lhoestq/demo1

repo_id is None when the dataset doesn't come from the Hub, e.g. from Dataset.from_dict
repo_id is set to None when concatenating datasets with different repo ids
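
The two rules above can be illustrated with a small, self-contained sketch. It does not use the datasets library: `InfoSketch` and `merge_repo_ids` are hypothetical names invented for illustration, and `lhoestq/demo2` is a made-up second repository. Treating "Hub dataset plus local dataset" as a mismatch is an assumption here, not something the PR description states.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class InfoSketch:
    """Hypothetical stand-in for DatasetInfo, reduced to the field discussed here."""

    # None models a dataset that does not come from the Hub (e.g. Dataset.from_dict).
    repo_id: Optional[str] = None


def merge_repo_ids(infos: Sequence[InfoSketch]) -> Optional[str]:
    """Return the repo_id shared by all inputs, or None if they disagree."""
    ids = {info.repo_id for info in infos}
    # Exactly one distinct value means every input agrees on it (possibly on None).
    return ids.pop() if len(ids) == 1 else None


hub_a = InfoSketch(repo_id="lhoestq/demo1")
hub_b = InfoSketch(repo_id="lhoestq/demo2")  # hypothetical second repo
local = InfoSketch()  # not from the Hub

print(merge_repo_ids([hub_a, hub_a]))  # same origin: the id is kept
print(merge_repo_ids([hub_a, hub_b]))  # different origins: None
print(merge_repo_ids([hub_a, local]))  # assumption: a local dataset also resets it
```

Collapsing to None on any disagreement keeps the field honest: a single repo_id can only describe a dataset with a single origin.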

related to #4129

TODO:

discuss if it's ok for now
tests

HuggingFaceDocBuilderDev · 2023-09-29T10:31:50Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

lhoestq · 2023-09-29T10:32:29Z

In #4129 we want to track the origin of a dataset, e.g. if it comes from multiple datasets.

I think it's out of scope of DatasetInfo alone, which has info for one dataset only.
Therefore it makes sense to add repo_id, which is for one dataset only.

IMO if we want to track multiple origins we will need a new DatasetInfo that would have fields relevant to a mix of datasets (out of scope of this PR)

cc @mariosasko I'd like your opinion on this

github-actions · 2023-09-29T10:34:13Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009009 / 0.011353 (-0.002344)	0.004169 / 0.011008 (-0.006840)	0.098634 / 0.038508 (0.060126)	0.069526 / 0.023109 (0.046417)	0.337963 / 0.275898 (0.062065)	0.379737 / 0.323480 (0.056257)	0.004318 / 0.007986 (-0.003668)	0.005347 / 0.004328 (0.001019)	0.069875 / 0.004250 (0.065624)	0.055964 / 0.037052 (0.018912)	0.340305 / 0.258489 (0.081816)	0.429718 / 0.293841 (0.135877)	0.045101 / 0.128546 (-0.083445)	0.012610 / 0.075646 (-0.063036)	0.312366 / 0.419271 (-0.106905)	0.064711 / 0.043533 (0.021178)	0.345216 / 0.255139 (0.090077)	0.367245 / 0.283200 (0.084046)	0.034638 / 0.141683 (-0.107045)	1.541947 / 1.452155 (0.089793)	1.645268 / 1.492716 (0.152551)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.233501 / 0.018006 (0.215495)	0.514207 / 0.000490 (0.513717)	0.014271 / 0.000200 (0.014072)	0.000366 / 0.000054 (0.000311)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026288 / 0.037411 (-0.011124)	0.083206 / 0.014526 (0.068680)	0.098172 / 0.176557 (-0.078385)	0.158529 / 0.737135 (-0.578606)	0.095183 / 0.296338 (-0.201155)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.538300 / 0.215209 (0.323091)	5.486939 / 2.077655 (3.409285)	2.321812 / 1.504120 (0.817692)	2.002124 / 1.541195 (0.460929)	2.045043 / 1.468490 (0.576553)	0.852772 / 4.584777 (-3.732005)	5.014897 / 3.745712 (1.269185)	4.428115 / 5.269862 (-0.841746)	2.750126 / 4.565676 (-1.815550)	0.099028 / 0.424275 (-0.325247)	0.007678 / 0.007607 (0.000070)	0.664463 / 0.226044 (0.438418)	6.617811 / 2.268929 (4.348883)	2.888382 / 55.444624 (-52.556242)	2.190753 / 6.876477 (-4.685724)	2.414586 / 2.142072 (0.272513)	1.010302 / 4.805227 (-3.794925)	0.194925 / 6.500664 (-6.305739)	0.063490 / 0.075469 (-0.011979)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.543464 / 1.841788 (-0.298323)	20.566666 / 8.074308 (12.492358)	19.410745 / 10.191392 (9.219353)	0.207077 / 0.680424 (-0.473347)	0.028895 / 0.534201 (-0.505306)	0.427525 / 0.579283 (-0.151758)	0.535450 / 0.434364 (0.101086)	0.494632 / 0.540337 (-0.045705)	0.723705 / 1.386936 (-0.663231)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008209 / 0.011353 (-0.003144)	0.004184 / 0.011008 (-0.006824)	0.072420 / 0.038508 (0.033912)	0.066851 / 0.023109 (0.043742)	0.424137 / 0.275898 (0.148239)	0.473156 / 0.323480 (0.149676)	0.005394 / 0.007986 (-0.002591)	0.003898 / 0.004328 (-0.000430)	0.069996 / 0.004250 (0.065746)	0.053113 / 0.037052 (0.016061)	0.453214 / 0.258489 (0.194725)	0.495921 / 0.293841 (0.202080)	0.043028 / 0.128546 (-0.085519)	0.012320 / 0.075646 (-0.063326)	0.080270 / 0.419271 (-0.339002)	0.053337 / 0.043533 (0.009804)	0.436604 / 0.255139 (0.181465)	0.463422 / 0.283200 (0.180223)	0.030277 / 0.141683 (-0.111406)	1.560261 / 1.452155 (0.108106)	1.647209 / 1.492716 (0.154493)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.232556 / 0.018006 (0.214550)	0.502387 / 0.000490 (0.501897)	0.006688 / 0.000200 (0.006488)	0.000118 / 0.000054 (0.000064)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030204 / 0.037411 (-0.007207)	0.089438 / 0.014526 (0.074912)	0.118939 / 0.176557 (-0.057617)	0.160537 / 0.737135 (-0.576598)	0.113432 / 0.296338 (-0.182906)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.586469 / 0.215209 (0.371260)	5.916156 / 2.077655 (3.838502)	2.904960 / 1.504120 (1.400840)	2.346838 / 1.541195 (0.805644)	2.373688 / 1.468490 (0.905198)	0.829917 / 4.584777 (-3.754860)	4.851283 / 3.745712 (1.105571)	4.220103 / 5.269862 (-1.049758)	2.706139 / 4.565676 (-1.859538)	0.094095 / 0.424275 (-0.330180)	0.008201 / 0.007607 (0.000594)	0.699099 / 0.226044 (0.473054)	7.046940 / 2.268929 (4.778011)	3.374837 / 55.444624 (-52.069788)	2.690839 / 6.876477 (-4.185638)	2.845717 / 2.142072 (0.703645)	0.989698 / 4.805227 (-3.815529)	0.190413 / 6.500664 (-6.310251)	0.066233 / 0.075469 (-0.009236)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.513607 / 1.841788 (-0.328180)	21.544200 / 8.074308 (13.469892)	20.297337 / 10.191392 (10.105945)	0.216390 / 0.680424 (-0.464034)	0.029962 / 0.534201 (-0.504239)	0.451531 / 0.579283 (-0.127752)	0.530147 / 0.434364 (0.095783)	0.520739 / 0.540337 (-0.019598)	0.716431 / 1.386936 (-0.670505)

github-actions · 2023-09-29T10:42:52Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006509 / 0.011353 (-0.004844)	0.003987 / 0.011008 (-0.007022)	0.085233 / 0.038508 (0.046725)	0.077765 / 0.023109 (0.054656)	0.310467 / 0.275898 (0.034569)	0.343363 / 0.323480 (0.019883)	0.005557 / 0.007986 (-0.002429)	0.003430 / 0.004328 (-0.000898)	0.064948 / 0.004250 (0.060697)	0.056864 / 0.037052 (0.019812)	0.314005 / 0.258489 (0.055516)	0.360638 / 0.293841 (0.066798)	0.031134 / 0.128546 (-0.097412)	0.008869 / 0.075646 (-0.066777)	0.286409 / 0.419271 (-0.132862)	0.051338 / 0.043533 (0.007805)	0.311329 / 0.255139 (0.056190)	0.334373 / 0.283200 (0.051174)	0.024816 / 0.141683 (-0.116867)	1.502872 / 1.452155 (0.050718)	1.569941 / 1.492716 (0.077224)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.269639 / 0.018006 (0.251633)	0.558510 / 0.000490 (0.558020)	0.011748 / 0.000200 (0.011548)	0.000234 / 0.000054 (0.000180)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029139 / 0.037411 (-0.008272)	0.083586 / 0.014526 (0.069060)	0.102426 / 0.176557 (-0.074131)	0.162398 / 0.737135 (-0.574737)	0.101364 / 0.296338 (-0.194975)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.382281 / 0.215209 (0.167072)	3.826412 / 2.077655 (1.748758)	1.815911 / 1.504120 (0.311791)	1.644539 / 1.541195 (0.103344)	1.688487 / 1.468490 (0.219996)	0.482115 / 4.584777 (-4.102662)	3.574773 / 3.745712 (-0.170939)	3.262733 / 5.269862 (-2.007129)	2.058115 / 4.565676 (-2.507562)	0.056367 / 0.424275 (-0.367908)	0.007233 / 0.007607 (-0.000374)	0.456859 / 0.226044 (0.230815)	4.565935 / 2.268929 (2.297006)	2.311802 / 55.444624 (-53.132823)	1.943936 / 6.876477 (-4.932541)	2.129811 / 2.142072 (-0.012261)	0.575098 / 4.805227 (-4.230129)	0.130495 / 6.500664 (-6.370169)	0.059757 / 0.075469 (-0.015712)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.238495 / 1.841788 (-0.603293)	18.940000 / 8.074308 (10.865692)	14.034240 / 10.191392 (3.842848)	0.166418 / 0.680424 (-0.514006)	0.018420 / 0.534201 (-0.515781)	0.395330 / 0.579283 (-0.183953)	0.413518 / 0.434364 (-0.020846)	0.461499 / 0.540337 (-0.078838)	0.661371 / 1.386936 (-0.725565)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006673 / 0.011353 (-0.004680)	0.004335 / 0.011008 (-0.006673)	0.064568 / 0.038508 (0.026060)	0.072763 / 0.023109 (0.049653)	0.429488 / 0.275898 (0.153590)	0.456900 / 0.323480 (0.133420)	0.005481 / 0.007986 (-0.002505)	0.003649 / 0.004328 (-0.000680)	0.064975 / 0.004250 (0.060724)	0.056839 / 0.037052 (0.019786)	0.439451 / 0.258489 (0.180962)	0.461691 / 0.293841 (0.167850)	0.031455 / 0.128546 (-0.097092)	0.008848 / 0.075646 (-0.066798)	0.071719 / 0.419271 (-0.347553)	0.047116 / 0.043533 (0.003583)	0.429055 / 0.255139 (0.173916)	0.434204 / 0.283200 (0.151004)	0.022594 / 0.141683 (-0.119089)	1.539231 / 1.452155 (0.087077)	1.568111 / 1.492716 (0.075394)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.267374 / 0.018006 (0.249368)	0.553202 / 0.000490 (0.552712)	0.005410 / 0.000200 (0.005210)	0.000101 / 0.000054 (0.000046)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031478 / 0.037411 (-0.005933)	0.092438 / 0.014526 (0.077912)	0.103874 / 0.176557 (-0.072682)	0.158428 / 0.737135 (-0.578708)	0.111617 / 0.296338 (-0.184721)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434783 / 0.215209 (0.219574)	4.332536 / 2.077655 (2.254881)	2.354522 / 1.504120 (0.850402)	2.220271 / 1.541195 (0.679076)	2.338524 / 1.468490 (0.870034)	0.494508 / 4.584777 (-4.090269)	3.619592 / 3.745712 (-0.126120)	3.320897 / 5.269862 (-1.948964)	2.107475 / 4.565676 (-2.458202)	0.058479 / 0.424275 (-0.365796)	0.007427 / 0.007607 (-0.000180)	0.509298 / 0.226044 (0.283254)	5.067940 / 2.268929 (2.799012)	2.815336 / 55.444624 (-52.629288)	2.470958 / 6.876477 (-4.405519)	2.672801 / 2.142072 (0.530728)	0.588199 / 4.805227 (-4.217028)	0.134062 / 6.500664 (-6.366602)	0.060951 / 0.075469 (-0.014518)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.353955 / 1.841788 (-0.487832)	20.386012 / 8.074308 (12.311704)	15.032463 / 10.191392 (4.841071)	0.167243 / 0.680424 (-0.513181)	0.020426 / 0.534201 (-0.513775)	0.396815 / 0.579283 (-0.182469)	0.421806 / 0.434364 (-0.012558)	0.471866 / 0.540337 (-0.068471)	0.667206 / 1.386936 (-0.719730)

src/datasets/info.py

davanstrien · 2023-09-29T10:47:55Z

Really happy to see this! In the future, it could also be helpful to track other metadata about how the dataset was built, e.g. for The Stack loaded like this:

ds = load_dataset("bigcode/the-stack", data_dir="data/dockerfile", split="train")

It could be helpful to have easy access to the data_dir argument used during loading, since that changes the training data quite a bit compared with loading the full dataset. You can also recover this from download_checksums, but that seems a bit hacky. None of this is necessary for this PR, though.

tomaarsen · 2023-09-29T10:53:20Z

Perhaps it is also interesting to track the revision? I suppose the version also kind of covers that.

That said, this is looking great already! I'm quite excited about this. Losing the repo_id after merging (different) datasets also makes sense to me, well done.

davanstrien · 2023-09-29T11:40:22Z

One other thought. Is it worth tracking if a token was passed during loading?

The Hub ID for private datasets could in some cases contain information someone wouldn't want to make public, e.g. davanstrien/super_secret_dataset_using_GPT_created_data.

Adding a bool like is_private would let another library decide whether the dataset ID should be shared (or default to not sharing the ID for private datasets). For example, in SpanMarker @tomaarsen might do a check like

if ds.is_private and not push_hub_id_for_private_ds:
    ds_name = None

Potentially this is overkill, but it could be useful for downstream libraries that use this information to create automatic model cards.
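
As a rough sketch of the check described above, a downstream library could wrap the decision in one helper. Everything here is hypothetical: `is_private` is only a proposal in this thread and does not exist on datasets objects, and the function and parameter names (`shareable_dataset_id`, `share_private_ids`) and the example repo ids are invented for illustration.

```python
from typing import Optional


def shareable_dataset_id(
    repo_id: Optional[str],
    is_private: bool,
    share_private_ids: bool = False,
) -> Optional[str]:
    """Return the dataset ID to publish in a model card, or None to omit it."""
    if repo_id is None:
        # Nothing to share: the dataset did not come from the Hub.
        return None
    if is_private and not share_private_ids:
        # Default to not leaking the name of a private dataset.
        return None
    return repo_id


print(shareable_dataset_id("user/public_ds", is_private=False))
print(shareable_dataset_id("user/private_ds", is_private=True))
print(shareable_dataset_id("user/private_ds", is_private=True, share_private_ids=True))
```

Defaulting `share_private_ids` to False means a private dataset's name is only published when the user opts in explicitly.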

mariosasko · 2023-09-29T18:41:52Z

We should probably find a way to remove DatasetInfo, as (most of) its attributes are outdated (homepage, description, etc.), not introduce new ones :). But I guess storing repo_id there is the simplest solution for now, so I'm OK with it.

github-actions · 2023-10-01T15:29:45Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007757 / 0.011353 (-0.003595)	0.004543 / 0.011008 (-0.006465)	0.100193 / 0.038508 (0.061685)	0.082333 / 0.023109 (0.059224)	0.374586 / 0.275898 (0.098688)	0.412617 / 0.323480 (0.089137)	0.006148 / 0.007986 (-0.001838)	0.003826 / 0.004328 (-0.000503)	0.077077 / 0.004250 (0.072827)	0.064057 / 0.037052 (0.027005)	0.391435 / 0.258489 (0.132946)	0.436439 / 0.293841 (0.142599)	0.036534 / 0.128546 (-0.092012)	0.009986 / 0.075646 (-0.065660)	0.344243 / 0.419271 (-0.075028)	0.062013 / 0.043533 (0.018480)	0.378113 / 0.255139 (0.122974)	0.398476 / 0.283200 (0.115276)	0.026552 / 0.141683 (-0.115131)	1.740505 / 1.452155 (0.288350)	1.835684 / 1.492716 (0.342968)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.267917 / 0.018006 (0.249911)	0.510676 / 0.000490 (0.510186)	0.010810 / 0.000200 (0.010610)	0.000383 / 0.000054 (0.000328)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032113 / 0.037411 (-0.005299)	0.097679 / 0.014526 (0.083153)	0.113213 / 0.176557 (-0.063344)	0.177897 / 0.737135 (-0.559238)	0.111761 / 0.296338 (-0.184577)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.450544 / 0.215209 (0.235335)	4.476746 / 2.077655 (2.399091)	2.205391 / 1.504120 (0.701271)	2.006457 / 1.541195 (0.465262)	2.058859 / 1.468490 (0.590369)	0.571549 / 4.584777 (-4.013228)	4.175039 / 3.745712 (0.429327)	3.815445 / 5.269862 (-1.454416)	2.376673 / 4.565676 (-2.189004)	0.067048 / 0.424275 (-0.357227)	0.008544 / 0.007607 (0.000937)	0.536384 / 0.226044 (0.310340)	5.386232 / 2.268929 (3.117304)	2.825620 / 55.444624 (-52.619004)	2.339821 / 6.876477 (-4.536656)	2.535736 / 2.142072 (0.393663)	0.679572 / 4.805227 (-4.125655)	0.156799 / 6.500664 (-6.343865)	0.071667 / 0.075469 (-0.003802)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.512198 / 1.841788 (-0.329590)	21.786760 / 8.074308 (13.712452)	16.386274 / 10.191392 (6.194882)	0.169108 / 0.680424 (-0.511316)	0.021312 / 0.534201 (-0.512889)	0.466153 / 0.579283 (-0.113130)	0.496192 / 0.434364 (0.061829)	0.549420 / 0.540337 (0.009082)	0.780974 / 1.386936 (-0.605962)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007644 / 0.011353 (-0.003709)	0.004654 / 0.011008 (-0.006354)	0.075280 / 0.038508 (0.036772)	0.083044 / 0.023109 (0.059935)	0.481704 / 0.275898 (0.205805)	0.514828 / 0.323480 (0.191348)	0.006245 / 0.007986 (-0.001740)	0.003715 / 0.004328 (-0.000614)	0.074498 / 0.004250 (0.070248)	0.064406 / 0.037052 (0.027353)	0.481874 / 0.258489 (0.223385)	0.518527 / 0.293841 (0.224686)	0.037549 / 0.128546 (-0.090997)	0.010106 / 0.075646 (-0.065541)	0.084266 / 0.419271 (-0.335006)	0.056659 / 0.043533 (0.013126)	0.497707 / 0.255139 (0.242568)	0.503201 / 0.283200 (0.220001)	0.027086 / 0.141683 (-0.114597)	1.834715 / 1.452155 (0.382560)	1.919927 / 1.492716 (0.427210)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.249288 / 0.018006 (0.231282)	0.500950 / 0.000490 (0.500460)	0.005856 / 0.000200 (0.005656)	0.000120 / 0.000054 (0.000065)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037674 / 0.037411 (0.000263)	0.111141 / 0.014526 (0.096615)	0.123408 / 0.176557 (-0.053149)	0.186604 / 0.737135 (-0.550531)	0.125360 / 0.296338 (-0.170979)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.520480 / 0.215209 (0.305271)	5.171108 / 2.077655 (3.093453)	2.812746 / 1.504120 (1.308626)	2.602941 / 1.541195 (1.061746)	2.666196 / 1.468490 (1.197706)	0.578684 / 4.584777 (-4.006092)	4.238722 / 3.745712 (0.493010)	3.844361 / 5.269862 (-1.425501)	2.369214 / 4.565676 (-2.196462)	0.068543 / 0.424275 (-0.355732)	0.008695 / 0.007607 (0.001088)	0.621869 / 0.226044 (0.395825)	6.200566 / 2.268929 (3.931637)	3.340846 / 55.444624 (-52.103779)	2.920691 / 6.876477 (-3.955786)	3.132438 / 2.142072 (0.990366)	0.697394 / 4.805227 (-4.107834)	0.158385 / 6.500664 (-6.342280)	0.072566 / 0.075469 (-0.002903)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.599070 / 1.841788 (-0.242717)	22.767139 / 8.074308 (14.692831)	17.053988 / 10.191392 (6.862596)	0.188414 / 0.680424 (-0.492009)	0.023409 / 0.534201 (-0.510792)	0.472092 / 0.579283 (-0.107191)	0.486107 / 0.434364 (0.051743)	0.562190 / 0.540337 (0.021852)	0.791606 / 1.386936 (-0.595330)

add repo_id info

fcaa9f2

nit

aade5a0

tomaarsen reviewed Sep 29, 2023

Update src/datasets/info.py

aacbaf4

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
