
fsspec 2022.8.2 breaks xopen in streaming mode #4961

Closed
DCNemesis opened this issue Sep 9, 2022 · 6 comments
Labels
bug Something isn't working


DCNemesis commented Sep 9, 2022

Describe the bug

When fsspec 2022.8.2 is installed in your environment, xopen will prematurely close files, making streaming mode inoperable.

Steps to reproduce the bug

import datasets

data = datasets.load_dataset('MLCommons/ml_spoken_words', 'id_wav', split='train', streaming=True)

Expected results

Dataset should load as iterator.

Actual results

/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1737     # Return iterable dataset in case of streaming
   1738     if streaming:
-> 1739         return builder_instance.as_streaming_dataset(split=split)
   1740 
   1741     # Some datasets are already processed on the HF google storage

/usr/local/lib/python3.7/dist-packages/datasets/builder.py in as_streaming_dataset(self, split, base_path)
   1023         )
   1024         self._check_manual_download(dl_manager)
-> 1025         splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
   1026         # By default, return all splits
   1027         if split is None:

~/.cache/huggingface/modules/datasets_modules/datasets/MLCommons--ml_spoken_words/321ea853cf0a05abb7a2d7efea900692a3d8622af65a2f3ce98adb7800a5d57b/ml_spoken_words.py in _split_generators(self, dl_manager)
    182                 name=datasets.Split.TRAIN,
    183                 gen_kwargs={
--> 184                     "audio_archives": [download_audio(split="train", lang=lang) for lang in self.config.languages],
    185                     "local_audio_archives_paths": [download_extract_audio(split="train", lang=lang) for lang in
    186                                                    self.config.languages] if not dl_manager.is_streaming else None,

~/.cache/huggingface/modules/datasets_modules/datasets/MLCommons--ml_spoken_words/321ea853cf0a05abb7a2d7efea900692a3d8622af65a2f3ce98adb7800a5d57b/ml_spoken_words.py in <listcomp>(.0)
    182                 name=datasets.Split.TRAIN,
    183                 gen_kwargs={
--> 184                     "audio_archives": [download_audio(split="train", lang=lang) for lang in self.config.languages],
    185                     "local_audio_archives_paths": [download_extract_audio(split="train", lang=lang) for lang in
    186                                                    self.config.languages] if not dl_manager.is_streaming else None,

~/.cache/huggingface/modules/datasets_modules/datasets/MLCommons--ml_spoken_words/321ea853cf0a05abb7a2d7efea900692a3d8622af65a2f3ce98adb7800a5d57b/ml_spoken_words.py in _download_audio_archives(dl_manager, lang, format, split)
    267 # for streaming case
    268 def _download_audio_archives(dl_manager, lang, format, split):
--> 269     archives_paths = _download_audio_archives_paths(dl_manager, lang, format, split)
    270     return [dl_manager.iter_archive(archive_path) for archive_path in archives_paths]

~/.cache/huggingface/modules/datasets_modules/datasets/MLCommons--ml_spoken_words/321ea853cf0a05abb7a2d7efea900692a3d8622af65a2f3ce98adb7800a5d57b/ml_spoken_words.py in _download_audio_archives_paths(dl_manager, lang, format, split)
    251     n_files_path = dl_manager.download(n_files_url)
    252 
--> 253     with open(n_files_path, "r", encoding="utf-8") as file:
    254         n_files = int(file.read().strip())  # the file contains a number of archives
    255 

ValueError: I/O operation on closed file.
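The failing pattern can be sketched in isolation: the `n_files` manifest is downloaded, but its file handle has already been closed by the time the dataset script reads it. In this minimal sketch a plain `io.StringIO` stands in for the fsspec-backed file, and the early `close()` plays the role of the buggy caching layer (names and contents are illustrative):

```python
import io

# Stand-in for the downloaded n_files manifest
buf = io.StringIO("42\n")
buf.close()  # simulates the handle being closed prematurely

try:
    n_files = int(buf.read().strip())  # same read pattern as the dataset script
except ValueError as e:
    print(e)  # I/O operation on closed file.
```

Any read on a closed Python file object raises `ValueError`, which matches the traceback above.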

Environment info

  • datasets version: 2.4.0
  • Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.13
  • PyArrow version: 6.0.1
  • Pandas version: 1.3.5
@DCNemesis DCNemesis added the bug Something isn't working label Sep 9, 2022
@DCNemesis (Author)

Pinning fsspec==2022.7.1 fixes this issue; setup.py would need to be changed to prevent users from using the latest version of fsspec.
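A hypothetical sketch of such a pin in `setup.py` (the specifier below is illustrative, not the one from the actual PR):

```python
# Illustrative install_requires entry excluding the broken fsspec releases
# (2022.8.0 and 2022.8.1); the exact bounds in the real setup.py may differ.
install_requires = [
    "fsspec[http]!=2022.8.0,!=2022.8.1",
]

# This list would then be passed to setuptools:
# setup(..., install_requires=install_requires, ...)
```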

@DCNemesis (Author)

Opened PR to address this.


albertvillanova commented Sep 12, 2022

Hi @DCNemesis, thanks for reporting.

That was a temporary issue in fsspec releases 2022.8.0 and 2022.8.1, fixed in their patch release 2022.8.2 (both previous versions were yanked).

Are you sure you have version 2022.8.2 installed?

pip install -U fsspec
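A quick way to tell whether an environment is affected is to compare the installed version string against the two yanked releases. `is_yanked_fsspec` below is an illustrative helper, not part of any library:

```python
def is_yanked_fsspec(version: str) -> bool:
    # fsspec 2022.8.0 and 2022.8.1 were yanked; 2022.8.2 carries the fix
    return version in ("2022.8.0", "2022.8.1")

# In a live environment one would check, e.g.:
# import fsspec; print(is_yanked_fsspec(fsspec.__version__))
print(is_yanked_fsspec("2022.8.1"))  # True
print(is_yanked_fsspec("2022.8.2"))  # False
```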

@DCNemesis (Author)

@albertvillanova I was using a temporary Google Colab instance, but checking it again today it seems it was loading 2022.8.1 rather than 2022.8.2. It's surprising that Colab was using a version that was replaced the same day it was released. Testing with 2022.8.2 did work. It appears Colab will be fixing it on their end too.

@albertvillanova (Member)

Thanks for the additional information.

Now that we know 2022.8.2 works, I'm closing this issue. Feel free to reopen it if necessary.

@albertvillanova albertvillanova linked a pull request Sep 12, 2022 that will close this issue
@albertvillanova (Member)

Colab just upgraded their default fsspec version to 2022.8.2.
