cnn_dailymail is broken in pinned version of datasets #28

Closed
1 of 11 tasks
osanseviero opened this issue Mar 13, 2022 · 6 comments · Fixed by #57

Comments

@osanseviero
Contributor

osanseviero commented Mar 13, 2022

Information

The problem arises in chapter:

  • Introduction
  • Text Classification
  • Transformer Anatomy
  • Multilingual Named Entity Recognition
  • Text Generation
  • Summarization
  • Question Answering
  • Making Transformers Efficient in Production
  • Dealing with Few to No Labels
  • Training Transformers from Scratch
  • Future Directions

Describe the bug

As reported in huggingface/datasets#3830, loading the dataset fails. The fix has not yet landed in the latest datasets release, and the book will likely need an update once it does.

To Reproduce

Steps to reproduce the behavior:

Just run

load_dataset("cnn_dailymail", '3.0.0')

For the exact error message, see the linked issue.

@JingxinLee

I ran into the same problem.

You could try

dataset = load_dataset('ccdv/cnn_dailymail', '3.0.0')

instead.
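
The workaround above can be wrapped in a small fallback helper. This is a minimal sketch, not code from the book: `load_first_available` is a hypothetical name, and the loader is passed in as an argument so the idea can be tried without network access. In practice one would pass `datasets.load_dataset`.

```python
from typing import Any, Callable, Sequence


def load_first_available(
    loader: Callable[..., Any],
    names: Sequence[str],
    config: str,
) -> Any:
    """Try each dataset name in order and return the first that loads.

    `loader` is expected to behave like `datasets.load_dataset(name, config)`.
    """
    errors = []
    for name in names:
        try:
            return loader(name, config)
        except Exception as exc:  # broad on purpose: the failure type varies by release
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("No dataset could be loaded:\n" + "\n".join(errors))


# Usage (requires the `datasets` package and network access):
# from datasets import load_dataset
# dataset = load_first_available(
#     load_dataset, ["cnn_dailymail", "ccdv/cnn_dailymail"], "3.0.0"
# )
```

Trying the canonical name first means the helper keeps working unchanged once the upstream fix is installed.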

@osanseviero
Contributor Author

Doing

pip install git+https://github.com/huggingface/datasets#egg=datasets

and load_dataset("...", download_mode="force_redownload") should fix the issue, but it does require installing datasets from HEAD (the unreleased main branch).

@lewtun
Member

lewtun commented Mar 14, 2022

Thanks for reporting the bug @osanseviero!

@lvwerra we should probably consider bumping datasets to the next release and checking that nothing breaks in the other chapters. One advantage of bumping to a later version is that we can then stream the audio dataset in the final chapter :)

@lvwerra
Member

lvwerra commented Mar 22, 2022

I feel like we should avoid changing library versions globally as it requires a significant amount of work to make sure we don't break anything elsewhere. I'd rather fix it in the installation function for that chapter specifically.
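
One way to picture a chapter-specific fix is a requirements builder that applies overrides for a single chapter and leaves the global pins alone. The sketch below is hypothetical throughout: `build_requirements`, `CHAPTER_OVERRIDES`, and the `"summarization"` key are illustrative names, not the book's actual install code. The only value taken from this thread is the git install target quoted earlier.

```python
from typing import Dict, List, Optional

# Hypothetical per-chapter overrides; the key and structure are illustrative.
# The URL is the install target quoted earlier in this thread.
CHAPTER_OVERRIDES: Dict[str, Dict[str, str]] = {
    "summarization": {
        "datasets": "git+https://github.com/huggingface/datasets#egg=datasets",
    },
}


def build_requirements(
    base_pins: Dict[str, str], chapter: Optional[str] = None
) -> List[str]:
    """Return pip requirement strings, applying overrides for one chapter only."""
    pins = dict(base_pins)  # copy so the global pins stay untouched
    pins.update(CHAPTER_OVERRIDES.get(chapter or "", {}))
    return [pins[name] for name in sorted(pins)]
```

The resulting list could be handed to `pip install`. Chapters without an entry in the override table get the global pins unchanged, which matches the goal of not touching library versions elsewhere.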

@lewtun
Member

lewtun commented Mar 23, 2022

Sounds good @lvwerra! I'll take care of this :)

@albertvillanova

albertvillanova commented Mar 30, 2022

@lewtun, this issue was fixed in datasets version 2.0.0.
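
A quick way to check whether an environment already has the fix is to compare the installed version against 2.0.0. This is a minimal sketch, assuming a plain numeric version string: `has_cnn_dailymail_fix` and `parse_release` are hypothetical helpers, and pre-release or dev suffixes are simply ignored, so a `2.0.0.dev0` build would be treated like `2.0.0`.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

FIXED_IN = (2, 0, 0)  # release named in the comment above


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, padded to at least three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version: {version_string!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    return parts + (0,) * max(0, 3 - len(parts))


def has_cnn_dailymail_fix(installed: Optional[str] = None) -> bool:
    """Return True if `datasets` is at version 2.0.0 or later.

    When `installed` is None, the version is read from the current environment.
    """
    if installed is None:
        try:
            installed = version("datasets")
        except PackageNotFoundError:
            return False
    return parse_release(installed) >= FIXED_IN
```

Calling `has_cnn_dailymail_fix()` with no argument inspects the current environment; passing a version string makes it easy to check a pinned value before installing.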
