
Getting a datasets.utils.info_utils.NonMatchingSplitsSizesError when downloading the openwebtext dataset from Hugging Face #10

Open
lovodkin93 opened this issue Mar 4, 2023 · 0 comments


Hello,
I am trying to download the openwebtext dataset from Hugging Face, but I keep getting the following error:

Downloading data: 100%|██████████| 12.9G/12.9G [25:43<00:00, 8.35MB/s]
/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/download/download_manager.py:527: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
  warnings.warn(
Extracting data files: 100%|██████████| 20610/20610 [9:43:42<00:00,  1.70s/it]
Traceback (most recent call last):
  File "ssd_process_data.py", line 485, in <module>
    main()
  File "ssd_process_data.py", line 369, in main
    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 100, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=39769494896, num_examples=8013769, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=39769065791, num_examples=8013740, shard_lengths=[101000, 100000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 101000, 101000, 101000, 101000, 102000, 102000, 100000, 101000, 100000, 101000, 102000, 101000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 101000, 101000, 102000, 101000, 102000, 101000, 101000, 100000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 100000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 102000, 102000, 101000, 101000, 102000, 102000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 102000, 101000, 13740], dataset_name='openwebtext')}]

I have tried forcing a re-download of the dataset by passing the download_mode="force_redownload" parameter, but it yielded the same error.
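For reference, the relevant call is roughly the following (a minimal sketch; the rest of ssd_process_data.py is omitted and the split handling is simplified):

```python
from datasets import load_dataset

# Minimal sketch of the failing call; everything else in
# ssd_process_data.py is omitted. "force_redownload" discards the
# cached files and downloads the dataset again from scratch, but the
# split-size verification still fails afterwards.
raw_datasets = {}
raw_datasets["train"] = load_dataset(
    "openwebtext",
    split="train",
    download_mode="force_redownload",
)
```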

I have also tried passing the ignore_verifications=True parameter, but this in turn yielded the following error:

    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1754, in load_dataset
    verification_mode = VerificationMode(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 663, in __new__
    raise ve_exc
ValueError: 'none' is not a valid VerificationMode
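The second attempt was roughly this (again a sketch, same script as above):

```python
from datasets import load_dataset

# Sketch of the second attempt. ignore_verifications=True appears to be
# translated internally into a VerificationMode value of "none", which
# the enum in my installed datasets version does not accept, producing
# the ValueError above instead of skipping the split-size check.
raw_datasets = {}
raw_datasets["train"] = load_dataset(
    "openwebtext",
    split="train",
    ignore_verifications=True,
)
```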

Has anyone encountered this problem, or does anyone know what I can do?
