NotADirectoryError while loading the CNN/Dailymail dataset #996

arc-bu · 2020-12-02T11:07:56Z

Downloading and preparing dataset cnn_dailymail/3.0.0 (download: 558.32 MiB, generated: 1.28 GiB, post-processed: Unknown size, total: 1.82 GiB) to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602...

NotADirectoryError Traceback (most recent call last)

in ()
22
23
---> 24 train = load_dataset('cnn_dailymail', '3.0.0', split='train')
25 validation = load_dataset('cnn_dailymail', '3.0.0', split='validation')
26 test = load_dataset('cnn_dailymail', '3.0.0', split='test')

5 frames

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602/cnn_dailymail.py in _find_files(dl_paths, publisher, url_dict)
132 else:
133 logging.fatal("Unsupported publisher: %s", publisher)
--> 134 files = sorted(os.listdir(top_dir))
135
136 ret_files = []

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'

lhoestq · 2020-12-02T11:15:04Z

It looks like the Google Drive download failed.
I'm getting a "Google Drive - Quota exceeded" error when looking at the downloaded file.

We should consider finding a better host than Google Drive for this dataset, imo.
Related: #873 #864
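
For anyone who wants to confirm the same diagnosis locally, here is a minimal sketch that classifies a cached download. It assumes the quota error is saved as an HTML page in place of the archive; `describe_download` is a hypothetical helper name and is not part of the `datasets` library.

```python
import tarfile
from pathlib import Path


def describe_download(path):
    """Classify a cached download (hypothetical helper, not a datasets API).

    Returns one of: 'missing', 'directory', 'tar archive', 'html page',
    'unknown file'.
    """
    p = Path(path)
    if not p.exists():
        return "missing"
    if p.is_dir():
        return "directory"
    if tarfile.is_tarfile(str(p)):
        return "tar archive"
    # Assumption: a quota error is served as an HTML page, so peek at the
    # first bytes of the file and look for an HTML prologue.
    with open(p, "rb") as fh:
        head = fh.read(512).lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html page"
    return "unknown file"
```

Point it at the hashed file under the `downloads` cache folder shown in the traceback. A result of `html page` would be consistent with the failed download described above, since `os.listdir` cannot descend into a path whose parent is a regular file.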

arc-bu · 2020-12-03T11:05:28Z

It is working now, thank you.

Should I leave this issue open to address the Quota-exceeded error?

lhoestq · 2020-12-03T13:17:42Z

Yes please. This has happened several times, so we definitely need to address it.

gchhablani · 2021-02-05T14:12:18Z

Any updates on this one? I'm facing a similar issue trying to add CelebA.

lhoestq · 2021-02-05T15:28:52Z

I've looked into it and couldn't find a solution. This looks like a Google Drive limitation.
Please try to use other hosts when possible.

gchhablani · 2021-02-05T15:57:16Z

The original links are Google Drive links. Would it be feasible for HF to maintain their own servers for this? Also, I think the same issue must exist with TFDS.

lhoestq · 2021-02-05T16:05:11Z

It's possible to host data on our side, but we should ask the authors first. TFDS has the same issue and doesn't have a solution either, afaik.
Otherwise you can use the Google Drive link, but it's not that convenient because of this quota issue.

gchhablani · 2021-02-06T16:40:40Z

Okay. I imagine asking every author who shares their dataset on Google Drive will also be cumbersome.

griff4692 · 2021-04-05T15:28:48Z

I am getting this error as well. Is there a fix?

lhoestq · 2021-04-07T14:49:30Z

Not as long as the data is stored on Google Drive, unfortunately.
Maybe we can ask if there's a mirror?

Hi @JafferWilson, is there a download link to get CNN/DailyMail from a host other than Google Drive?

To give you some context, this library provides tools to download and process datasets. For CNN/DailyMail, the data are downloaded from the link you provide in your GitHub repository. Unfortunately, because of Google Drive quotas, many users are not able to load this dataset.

mrazizi · 2021-12-19T07:27:56Z

The following copy of the CNN/DM dataset fixed the problem for me:
https://huggingface.co/datasets/ccdv/cnn_dailymail
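
A minimal sketch of how one might fall back to such a copy when the default source fails. `load_first_available` is a hypothetical helper, not part of `datasets`; the loader is passed in as an argument so the sketch does not depend on any particular library version.

```python
def load_first_available(candidates, loader, **kwargs):
    """Try each dataset id in order; return (dataset_id, dataset) for the first success.

    Hypothetical helper: `loader` is any callable that takes a dataset id
    plus keyword arguments, for example `datasets.load_dataset`.
    """
    errors = {}
    for dataset_id in candidates:
        try:
            return dataset_id, loader(dataset_id, **kwargs)
        except (OSError, ValueError) as exc:
            # NotADirectoryError is a subclass of OSError, so the failure
            # reported in this issue is caught here and the next id is tried.
            errors[dataset_id] = exc
    raise RuntimeError(f"no candidate could be loaded: {errors}")


# Possible usage, assuming the `datasets` package is installed and that the
# copy exposes the same "3.0.0" configuration as the original:
#
#   from datasets import load_dataset
#   source, train = load_first_available(
#       ["cnn_dailymail", "ccdv/cnn_dailymail"],
#       load_dataset,
#       name="3.0.0",
#       split="train",
#   )
```

Catching only `OSError` and `ValueError` is a design choice for this sketch; other exception types raised by a loader would still propagate.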

lhoestq · 2021-12-21T10:20:03Z

Thanks for the link, @mrazizi!

Apparently the original authors don't host the dataset themselves ("for legal reasons", source here).

lhoestq mentioned this issue Feb 10, 2021

load_dataset("amazon_polarity") NonMatchingChecksumError #1856

Closed

lhoestq mentioned this issue Jul 15, 2021

downloading of yahoo_answers_topics dataset failed #2646

Closed

lhoestq mentioned this issue Dec 21, 2021

Unable to load 'cnn_dailymail' dataset #3465

Closed

albertvillanova closed this as completed Feb 17, 2022

davidshinn mentioned this issue Feb 20, 2022

load_dataset('cnn_dalymail', '3.0.0') gives a 'Not a directory' error #873

Closed

gcmsrc mentioned this issue Oct 3, 2022

Chapter 6 notebook cannot be run even with datasets 2.0.0 due to datasets load_dataset error may be related to Google Virus scan nlp-with-transformers/notebooks#62

Closed
