Unable to Download CNN-Dailymail Dataset #3784

Closed
AngadSethi opened this issue Feb 25, 2022 · 4 comments · Fixed by #3787
Labels: bug (Something isn't working)

Comments

@AngadSethi

Describe the bug

I am unable to download the CNN-Dailymail dataset. Upon closer investigation, I realised why this was happening:

  • The dataset is hosted on Google Drive, and both the CNN and DM archives are large.
  • Google is unable to scan them for viruses, so the link that originally downloaded the dataset now downloads the source code of this web page instead:
    image
  • This leads to the following error:
NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'
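A quick way to confirm this diagnosis is to inspect the cached download directly. The sketch below is not part of the `datasets` library; it assumes the expected download is a tar archive, and the helper names `looks_like_html` and `describe_download` are hypothetical.

```python
import tarfile


def looks_like_html(path, probe_bytes=512):
    """Return True if the first bytes of the file look like an HTML page."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


def describe_download(path):
    """Classify a downloaded file as 'tar', 'html', or 'unknown'.

    Assumption: a healthy download is a (possibly compressed) tar archive.
    """
    if tarfile.is_tarfile(path):
        return "tar"
    if looks_like_html(path):
        return "html"
    return "unknown"
```

Calling `describe_download` on the hashed file under the `downloads` cache directory should return `"html"` if the warning page was saved in place of the archive.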

Steps to reproduce the bug

import datasets
dataset = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")

Expected results

The dataset is downloaded and processed just like other datasets.

Actual results

The call fails with this error:

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'

Environment info

  • datasets version: 1.18.3
  • Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyArrow version: 6.0.1
@AngadSethi AngadSethi added the bug Something isn't working label Feb 25, 2022
@AngadSethi
Author

#self-assign

AngadSethi added a commit to AngadSethi/datasets that referenced this issue Feb 25, 2022
This commit fixes the issue described in huggingface#3784. By adding an extra parameter to the end of Google Drive links, we are able to bypass the virus check and download the datasets.
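The idea in the commit message can be sketched as follows. The commit text does not name the parameter, so `confirm=t` here is an assumption, as are the URL shape and the `FILE_ID` placeholder; `add_query_param` is a hypothetical helper, not code from the PR.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_query_param(url, key, value):
    """Return url with key=value appended to its query string.

    Any existing occurrence of key is dropped first, so the call is idempotent.
    """
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != key
    ]
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Assumed URL shape and parameter name; FILE_ID is a placeholder.
url = "https://drive.google.com/uc?export=download&id=FILE_ID"
print(add_query_param(url, "confirm", "t"))
# prints https://drive.google.com/uc?export=download&id=FILE_ID&confirm=t
```

Rebuilding the query with `urllib.parse` rather than concatenating strings keeps the result well-formed whether or not the original link already has a query string.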
@albertvillanova
Member

@AngadSethi thanks for reporting and thanks for your PR!

@AngadSethi
Author

Glad to help @albertvillanova! Just fine-tuning the PR, will comment once I am able to get it up and running 😀

@albertvillanova
Member

Fixed by:
