imdb dataset cannot be loaded #876

Closed
rabeehk opened this issue Nov 22, 2020 · 6 comments
Comments

@rabeehk
Contributor

rabeehk commented Nov 22, 2020

Hi,
I am trying to load the imdb train split:

dataset = datasets.load_dataset("imdb", split="train")

but I get the following error. Thanks for your help.

Traceback (most recent call last):        
  File "<stdin>", line 1, in <module>
  File "/idiap/user/rkarimi/libs/anaconda3/envs/internship/lib/python3.7/site-packages/datasets/load.py", line 611, in load_dataset
    ignore_verifications=ignore_verifications,
  File "/idiap/user/rkarimi/libs/anaconda3/envs/internship/lib/python3.7/site-packages/datasets/builder.py", line 476, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/idiap/user/rkarimi/libs/anaconda3/envs/internship/lib/python3.7/site-packages/datasets/builder.py", line 558, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/internship/lib/python3.7/site-packages/datasets/utils/info_utils.py", line 73, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='test', num_bytes=32660064, num_examples=25000, dataset_name='imdb'), 'recorded': SplitInfo(name='test', num_bytes=26476338, num_examples=20316, dataset_name='imdb')}, {'expected': SplitInfo(name='train', num_bytes=33442202, num_examples=25000, dataset_name='imdb'), 'recorded': SplitInfo(name='train', num_bytes=0, num_examples=0, dataset_name='imdb')}, {'expected': SplitInfo(name='unsupervised', num_bytes=67125548, num_examples=50000, dataset_name='imdb'), 'recorded': SplitInfo(name='unsupervised', num_bytes=0, num_examples=0, dataset_name='imdb')}]
>>> dataset = datasets.load_dataset("imdb", split="train")
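
For anyone reading the error above: it lists, for each split, the size stored in the dataset metadata ('expected') next to what was actually built locally ('recorded'). Below is a minimal sketch of that kind of comparison. `SplitSize` and `find_mismatched_splits` are hypothetical names used for illustration; they are not part of `datasets`, and this is not the library's actual `verify_splits` code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitSize:
    """Hypothetical stand-in for the SplitInfo objects shown in the error."""
    name: str
    num_examples: int


def find_mismatched_splits(expected, recorded):
    """Return (expected, recorded) pairs whose example counts differ.

    Both arguments map split name -> SplitSize. A split missing from
    `recorded` is treated as having zero examples.
    """
    mismatches = []
    for name, exp in expected.items():
        rec = recorded.get(name, SplitSize(name, 0))
        if exp.num_examples != rec.num_examples:
            mismatches.append((exp, rec))
    return mismatches


# Example counts copied from the error message in this issue.
expected = {
    "train": SplitSize("train", 25000),
    "test": SplitSize("test", 25000),
    "unsupervised": SplitSize("unsupervised", 50000),
}
recorded = {
    "train": SplitSize("train", 0),
    "test": SplitSize("test", 20316),
    "unsupervised": SplitSize("unsupervised", 0),
}

for exp, rec in find_mismatched_splits(expected, recorded):
    print(f"{exp.name}: expected {exp.num_examples}, got {rec.num_examples}")
```

With these numbers all three splits come up short and two are empty, which would be consistent with an incomplete build rather than a mistake in the calling code.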

@lhoestq
Member

lhoestq commented Nov 24, 2020

It looks like there was an issue while building the imdb dataset.
Could you provide more information about your OS and your versions of Python and datasets?

Also could you try again with

dataset = datasets.load_dataset("imdb", split="train", download_mode="force_redownload")

to make sure it's not a corrupted-file issue?
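
If this needs to happen in a script rather than by hand, one possible pattern is to try the normal load first and only force a re-download when it fails. This is a rough sketch, not something the library provides: `load_with_redownload` is a hypothetical helper, and the loader is passed in as an argument (it is assumed to be `datasets.load_dataset`) so the snippet itself has no dependencies.

```python
def load_with_redownload(load_fn, *args, **kwargs):
    """Call load_fn; on failure, retry once with download_mode="force_redownload".

    load_fn is assumed to be datasets.load_dataset. It is injected so this
    sketch runs without the library installed.
    """
    try:
        return load_fn(*args, **kwargs)
    except Exception as err:
        # Broad catch on purpose: the exact exception class may vary by version.
        print(f"First attempt failed ({type(err).__name__}); forcing a fresh download.")
        retry_kwargs = dict(kwargs, download_mode="force_redownload")
        return load_fn(*args, **retry_kwargs)


# Intended usage (requires the datasets package and network access):
#   import datasets
#   dataset = load_with_redownload(datasets.load_dataset, "imdb", split="train")
```

The retry happens only once; if the second attempt also fails, the error propagates, which usually means the problem is not a stale cache.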

@rabeehk
Contributor Author

rabeehk commented Dec 24, 2020

I was using version 1.1.2, and this was resolved in version 1.1.3. Thanks.

@rabeehk rabeehk closed this as completed Dec 24, 2020
@PierreColombo
Contributor

Hello,
I have the same problem with 1.8.0.

@lhoestq
Member

lhoestq commented Nov 25, 2021

Hi! I just tried with 1.8.0 and it worked fine. Can you try again? Maybe the dataset host had some issues that are fixed now.

@PierreColombo
Contributor

Hello,
It works fine now :) !
Thanks !

@xianbaoqian

I ran into the same issue on a different dataset. I worked around it by passing

    verification_mode='no_checks',

to the load_dataset method. Ref:

if verification_mode == VerificationMode.BASIC_CHECKS or verification_mode == VerificationMode.ALL_CHECKS:

Note that this is only a workaround until the root cause is fixed.
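
Since 'no_checks' skips the size verification entirely, it may be worth checking the row count by hand after loading. A minimal sketch of that idea follows. `load_unchecked_then_verify` is a hypothetical helper, the loader is injected (it is assumed to be `datasets.load_dataset`), and the `verification_mode` argument is taken from this comment. The older release in the original traceback passes `ignore_verifications` instead, so the keyword likely depends on the installed version.

```python
def load_unchecked_then_verify(load_fn, name, split, expected_num_examples):
    """Load with verification disabled, then compare the row count manually.

    load_fn is assumed to be datasets.load_dataset and to accept
    verification_mode="no_checks" (an assumption about the installed version).
    """
    dataset = load_fn(name, split=split, verification_mode="no_checks")
    actual = len(dataset)
    if actual != expected_num_examples:
        raise ValueError(
            f"{name}/{split}: expected {expected_num_examples} examples, got {actual}"
        )
    return dataset


# Intended usage (requires the datasets package and network access);
# 25000 is the expected train size reported in the error at the top of this issue:
#   import datasets
#   train = load_unchecked_then_verify(datasets.load_dataset, "imdb", "train", 25000)
```

This keeps the workaround from silently hiding a truncated split, such as the 20316-example test split reported in the original error.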
