Fix ConnectionError for gated datasets and unauthenticated users #7110

albertvillanova · 2024-08-20T05:26:54Z

Fix ConnectionError for gated datasets and unauthenticated users. See:

CI worker tests are broken for gated datasets: ConnectionError dataset-viewer#3025

Note that, after a recent change, the Hub returns dataset info for gated datasets to unauthenticated users instead of raising a GatedRepoError as before. See:

Gated dataset info is leaked huggingface_hub#2457

This PR adds an additional check (/auth-check) for gated datasets and raises DatasetNotFoundError for unauthenticated users, as was the case before the change in the Hub.
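For illustration only, here is a minimal sketch of the idea, not the actual diff in this PR. It assumes the endpoint lives at `/api/datasets/{repo_id}/auth-check` on the Hub and that it answers 401/403 when access is denied. The names `fetch_status` and `raise_if_gated_and_unauthorized` are hypothetical, and `DatasetNotFoundError` is defined locally as a stand-in for the library's exception.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

# Assumption: default Hub endpoint; the real code may read it from configuration.
HUB_ENDPOINT = "https://huggingface.co"


class DatasetNotFoundError(FileNotFoundError):
    """Local stand-in for the library's exception of the same name."""


def fetch_status(url: str, token: Optional[str] = None) -> int:
    """Hypothetical helper: GET `url` and return the HTTP status code."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what matters here.
        return err.code


def raise_if_gated_and_unauthorized(
    repo_id: str,
    gated: bool,
    token: Optional[str] = None,
    get_status: Callable[[str, Optional[str]], int] = fetch_status,
) -> None:
    """Raise DatasetNotFoundError if a gated dataset is not accessible.

    `gated` is expected to come from the dataset info that the Hub now
    returns even to unauthenticated users.
    """
    if not gated:
        return  # Public datasets need no extra check.
    # Assumption: URL layout of the auth-check endpoint.
    url = f"{HUB_ENDPOINT}/api/datasets/{repo_id}/auth-check"
    status = get_status(url, token)
    if status in (401, 403):
        raise DatasetNotFoundError(
            f"Dataset '{repo_id}' is a gated dataset on the Hub. "
            "You must be authenticated to access it."
        )
```

The status fetcher is passed in as a parameter so the check can be exercised in tests without network access, for example with `lambda url, token: 401`.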

Fix suggested by @Pierrci (thanks @Wauplin for pointing it out).

Fix #7109.

HuggingFaceDocBuilderDev · 2024-08-20T05:30:52Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

albertvillanova · 2024-08-20T05:49:44Z

Note that the CI error is unrelated to this PR and should be addressed in another PR. See:

CI is broken for numpy-2: Failed to fetch wheel: llvmlite==0.34.0 #7111

Wauplin

Looks good to me! We might add support for the /auth-check endpoint in huggingface_hub directly in the future.

github-actions · 2024-08-20T09:20:36Z

Show benchmarks

PyArrow==8.0.0


Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005354 / 0.011353 (-0.005999)	0.004031 / 0.011008 (-0.006977)	0.062470 / 0.038508 (0.023962)	0.030882 / 0.023109 (0.007773)	0.244816 / 0.275898 (-0.031082)	0.264324 / 0.323480 (-0.059156)	0.004164 / 0.007986 (-0.003822)	0.002858 / 0.004328 (-0.001471)	0.049008 / 0.004250 (0.044758)	0.042139 / 0.037052 (0.005086)	0.279496 / 0.258489 (0.021007)	0.279408 / 0.293841 (-0.014433)	0.029701 / 0.128546 (-0.098845)	0.012501 / 0.075646 (-0.063145)	0.203267 / 0.419271 (-0.216004)	0.035964 / 0.043533 (-0.007569)	0.239361 / 0.255139 (-0.015778)	0.258942 / 0.283200 (-0.024257)	0.017956 / 0.141683 (-0.123727)	1.160468 / 1.452155 (-0.291687)	1.203475 / 1.492716 (-0.289242)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.004639 / 0.018006 (-0.013367)	0.298020 / 0.000490 (0.297530)	0.000212 / 0.000200 (0.000012)	0.000043 / 0.000054 (-0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019371 / 0.037411 (-0.018040)	0.063311 / 0.014526 (0.048785)	0.076412 / 0.176557 (-0.100145)	0.122574 / 0.737135 (-0.614561)	0.078076 / 0.296338 (-0.218263)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.275381 / 0.215209 (0.060172)	2.713220 / 2.077655 (0.635565)	1.441940 / 1.504120 (-0.062179)	1.325545 / 1.541195 (-0.215650)	1.363859 / 1.468490 (-0.104631)	0.715147 / 4.584777 (-3.869630)	2.356482 / 3.745712 (-1.389230)	2.882792 / 5.269862 (-2.387069)	1.833399 / 4.565676 (-2.732278)	0.077872 / 0.424275 (-0.346403)	0.005172 / 0.007607 (-0.002435)	0.326361 / 0.226044 (0.100316)	3.239202 / 2.268929 (0.970273)	1.837745 / 55.444624 (-53.606879)	1.517299 / 6.876477 (-5.359178)	1.552938 / 2.142072 (-0.589134)	0.801496 / 4.805227 (-4.003731)	0.133351 / 6.500664 (-6.367314)	0.042052 / 0.075469 (-0.033418)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.957887 / 1.841788 (-0.883901)	11.625291 / 8.074308 (3.550983)	9.679413 / 10.191392 (-0.511979)	0.140271 / 0.680424 (-0.540153)	0.013991 / 0.534201 (-0.520210)	0.299874 / 0.579283 (-0.279409)	0.267164 / 0.434364 (-0.167200)	0.338143 / 0.540337 (-0.202194)	0.434105 / 1.386936 (-0.952831)

PyArrow==latest


Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005833 / 0.011353 (-0.005520)	0.003761 / 0.011008 (-0.007247)	0.049699 / 0.038508 (0.011191)	0.032786 / 0.023109 (0.009677)	0.265100 / 0.275898 (-0.010798)	0.291045 / 0.323480 (-0.032435)	0.004281 / 0.007986 (-0.003705)	0.002737 / 0.004328 (-0.001591)	0.048524 / 0.004250 (0.044274)	0.040783 / 0.037052 (0.003731)	0.281122 / 0.258489 (0.022633)	0.311349 / 0.293841 (0.017508)	0.032143 / 0.128546 (-0.096403)	0.011747 / 0.075646 (-0.063899)	0.059432 / 0.419271 (-0.359840)	0.034362 / 0.043533 (-0.009171)	0.261061 / 0.255139 (0.005922)	0.279536 / 0.283200 (-0.003663)	0.019172 / 0.141683 (-0.122510)	1.160069 / 1.452155 (-0.292086)	1.224160 / 1.492716 (-0.268556)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093596 / 0.018006 (0.075590)	0.302862 / 0.000490 (0.302372)	0.000208 / 0.000200 (0.000008)	0.000047 / 0.000054 (-0.000007)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022785 / 0.037411 (-0.014626)	0.079263 / 0.014526 (0.064737)	0.091340 / 0.176557 (-0.085216)	0.129453 / 0.737135 (-0.607682)	0.091349 / 0.296338 (-0.204989)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.298166 / 0.215209 (0.082957)	3.003146 / 2.077655 (0.925491)	1.575903 / 1.504120 (0.071783)	1.445231 / 1.541195 (-0.095963)	1.477116 / 1.468490 (0.008625)	0.726496 / 4.584777 (-3.858281)	0.959827 / 3.745712 (-2.785885)	2.941142 / 5.269862 (-2.328720)	1.878581 / 4.565676 (-2.687096)	0.078475 / 0.424275 (-0.345800)	0.005137 / 0.007607 (-0.002470)	0.352078 / 0.226044 (0.126034)	3.486113 / 2.268929 (1.217184)	1.965024 / 55.444624 (-53.479600)	1.667223 / 6.876477 (-5.209254)	1.665254 / 2.142072 (-0.476819)	0.803543 / 4.805227 (-4.001684)	0.133003 / 6.500664 (-6.367661)	0.041462 / 0.075469 (-0.034008)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.045534 / 1.841788 (-0.796254)	12.124988 / 8.074308 (4.050680)	10.418723 / 10.191392 (0.227331)	0.142453 / 0.680424 (-0.537971)	0.015686 / 0.534201 (-0.518515)	0.300557 / 0.579283 (-0.278726)	0.119851 / 0.434364 (-0.314512)	0.342297 / 0.540337 (-0.198040)	0.441263 / 1.386936 (-0.945673)

Pierrci · 2024-08-20T15:11:34Z

lgtm!

albertvillanova added 3 commits August 19, 2024 17:56

Test load_dataset raises DatasetNotFoundError for unauthenticated user

a6ec5b3

Raise DatasetNotFoundError for gated and unauthenticated

34b37c4

Rename function

e868018

Wauplin approved these changes Aug 20, 2024


Merge branch 'main' into fix-7109

89c5b66

Wauplin mentioned this pull request Aug 20, 2024

Add support for /auth-check huggingface/huggingface_hub#2466

Closed

albertvillanova merged commit 90b1d94 into main Aug 20, 2024
15 checks passed

albertvillanova deleted the fix-7109 branch August 20, 2024 09:14
