NotADirectoryError while loading the CNN/Dailymail dataset #996

Closed
arc-bu opened this issue Dec 2, 2020 · 12 comments

Comments

@arc-bu

arc-bu commented Dec 2, 2020

Downloading and preparing dataset cnn_dailymail/3.0.0 (download: 558.32 MiB, generated: 1.28 GiB, post-processed: Unknown size, total: 1.82 GiB) to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602...


NotADirectoryError Traceback (most recent call last)

in ()
22
23
---> 24 train = load_dataset('cnn_dailymail', '3.0.0', split='train')
25 validation = load_dataset('cnn_dailymail', '3.0.0', split='validation')
26 test = load_dataset('cnn_dailymail', '3.0.0', split='test')

5 frames

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602/cnn_dailymail.py in _find_files(dl_paths, publisher, url_dict)
132 else:
133 logging.fatal("Unsupported publisher: %s", publisher)
--> 134 files = sorted(os.listdir(top_dir))
135
136 ret_files = []

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'
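
One way to see why `os.listdir` fails here is to walk the path in the error message and find the first component that exists but is not a directory. The sketch below is only a diagnostic aid and is not part of the `datasets` library; the helper name `first_non_directory` is hypothetical.

```python
from pathlib import Path


def first_non_directory(path):
    """Return the shortest existing prefix of `path` that is not a directory.

    Returns None if every existing prefix is a directory, or if the path
    stops existing before a non-directory component is reached.
    Hypothetical helper for diagnosis only.
    """
    target = Path(path).absolute()
    # Walk from the filesystem root down to the full path.
    for candidate in reversed([target, *target.parents]):
        if not candidate.exists():
            return None
        if not candidate.is_dir():
            return candidate
    return None


# Usage (path copied from the error message above):
# first_non_directory(
#     "/root/.cache/huggingface/datasets/downloads/"
#     "1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories"
# )
```

If the helper returns the hashed entry under `downloads/`, the cached download is a plain file rather than an extracted directory.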

@lhoestq
Member

lhoestq commented Dec 2, 2020

Looks like the Google Drive download failed.
I'm getting a "Google Drive - Quota exceeded" error when looking at the downloaded file.

We should consider finding a better host than Google Drive for this dataset, imo.
Related: #873 #864
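
A quick way to confirm this kind of failure is to look at the first bytes of the cached download. The sketch below assumes the expected file is a gzip-compressed archive and that a quota error arrives as an HTML page; both are assumptions, and `classify_download` is a hypothetical helper, not a `datasets` API.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def classify_download(path):
    """Return 'gzip', 'html', or 'unknown' based on the first bytes of a file.

    Hypothetical helper: assumes the real archive is gzip-compressed and
    that a failed download was saved as an HTML error page.
    """
    with open(path, "rb") as f:
        head = f.read(512)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<!doctype html") or stripped.startswith(b"<html"):
        return "html"
    return "unknown"
```

A result of `'html'` for a file under the `downloads/` cache suggests the error page was saved in place of the archive. In that case, deleting the cached file and retrying later is the likely remedy.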

@arc-bu
Author

arc-bu commented Dec 3, 2020

It is working now, thank you.

Should I leave this issue open to address the Quota-exceeded error?

@lhoestq
Member

lhoestq commented Dec 3, 2020

Yes please. This has happened several times, so we definitely need to address it.

@gchhablani
Contributor

Any updates on this one? I'm facing a similar issue trying to add CelebA.

@lhoestq
Member

lhoestq commented Feb 5, 2021

I've looked into it and couldn't find a solution. This looks like a Google Drive limitation.
Please use other hosts when possible.

@gchhablani
Contributor

gchhablani commented Feb 5, 2021

The original links are Google Drive links. Would it be feasible for HF to maintain their own servers for this? I think the same issue must also exist with TFDS.

@lhoestq
Member

lhoestq commented Feb 5, 2021

It's possible to host data on our side, but we should ask the authors. TFDS has the same issue and doesn't have a solution either, afaik.
Otherwise you can use the Google Drive link, but it's not that convenient because of this quota issue.

@gchhablani
Contributor

Okay. I imagine asking every author who shares their dataset on Google Drive will also be cumbersome.

@griff4692

I am getting this error as well. Is there a fix?

@lhoestq
Member

lhoestq commented Apr 7, 2021

Not as long as the data is stored on Google Drive, unfortunately.
Maybe we can ask if there's a mirror?

Hi @JafferWilson, is there a download link to get CNN/DailyMail from a host other than Google Drive?

To give you some context, this library provides tools to download and process datasets. For CNN/DailyMail, the data are downloaded from the link you provide in your GitHub repository. Unfortunately, because of Google Drive quotas, many users are not able to load this dataset.

@mrazizi

mrazizi commented Dec 19, 2021

The following copy of the CNN/DM dataset fixed the problem for me:
https://huggingface.co/datasets/ccdv/cnn_dailymail
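
A minimal sketch of how that copy could be used as a fallback when the original download fails. It assumes the mirror exposes the same `3.0.0` configuration and split names as the original, which should be checked on the dataset page. `load_with_fallback` is a hypothetical helper, and the loader is passed in as an argument so the logic can be tried without network access.

```python
def load_with_fallback(loader, primary="cnn_dailymail",
                       mirror="ccdv/cnn_dailymail",
                       config="3.0.0", split="train"):
    """Try the original dataset first, then the mirror.

    `loader` is expected to behave like `datasets.load_dataset`.
    The fallback only triggers on NotADirectoryError, the error
    reported in this issue; other failures are re-raised unchanged.
    Assumption: the mirror accepts the same config and split names.
    """
    try:
        return loader(primary, config, split=split)
    except NotADirectoryError:
        return loader(mirror, config, split=split)


# Usage (requires the `datasets` package and network access):
# from datasets import load_dataset
# train = load_with_fallback(load_dataset, split="train")
```

Calling `load_dataset("ccdv/cnn_dailymail", "3.0.0", split="train")` directly is simpler if the original host is not needed at all.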

@lhoestq
Member

lhoestq commented Dec 21, 2021

Thanks for the link, @mrazizi!

Apparently the original authors don't host the dataset themselves ("for legal reasons", source here).
