
fsspec 2022.8.2 breaks xopen in streaming mode #4961

Closed
DCNemesis opened this issue Sep 9, 2022 · 6 comments
Labels
bug Something isn't working


DCNemesis commented Sep 9, 2022

Describe the bug

When fsspec 2022.8.2 is installed in your environment, xopen will prematurely close files, making streaming mode inoperable.

Steps to reproduce the bug

import datasets

data = datasets.load_dataset('MLCommons/ml_spoken_words', 'id_wav', split='train', streaming=True)

Expected results

Dataset should load as iterator.

Actual results

/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1737     # Return iterable dataset in case of streaming
   1738     if streaming:
-> 1739         return builder_instance.as_streaming_dataset(split=split)
   1740 
   1741     # Some datasets are already processed on the HF google storage

/usr/local/lib/python3.7/dist-packages/datasets/builder.py in as_streaming_dataset(self, split, base_path)
   1023         )
   1024         self._check_manual_download(dl_manager)
-> 1025         splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
   1026         # By default, return all splits
   1027         if split is None:

~/.cache/huggingface/modules/datasets_modules/datasets/MLCommons--ml_spoken_words/321ea853cf0a05abb7a2d7efea900692a3d8622af65a2f3ce98adb7800a5d57b/ml_spoken_words.py in _split_generators(self, dl_manager)
    182                 name=datasets.Split.TRAIN,
    183                 gen_kwargs={
--> 184                     "audio_archives": [download_audio(split="train", lang=lang) for lang in self.config.languages],
    185                     "local_audio_archives_paths": [download_extract_audio(split="train", lang=lang) for lang in
    186                                                    self.config.languages] if not dl_manager.is_streaming else None,

~/.cache/huggingface/modules/datasets_modules/datasets/MLCommons--ml_spoken_words/321ea853cf0a05abb7a2d7efea900692a3d8622af65a2f3ce98adb7800a5d57b/ml_spoken_words.py in <listcomp>(.0)
    182                 name=datasets.Split.TRAIN,
    183                 gen_kwargs={
--> 184                     "audio_archives": [download_audio(split="train", lang=lang) for lang in self.config.languages],
    185                     "local_audio_archives_paths": [download_extract_audio(split="train", lang=lang) for lang in
    186                                                    self.config.languages] if not dl_manager.is_streaming else None,

~/.cache/huggingface/modules/datasets_modules/datasets/MLCommons--ml_spoken_words/321ea853cf0a05abb7a2d7efea900692a3d8622af65a2f3ce98adb7800a5d57b/ml_spoken_words.py in _download_audio_archives(dl_manager, lang, format, split)
    267 # for streaming case
    268 def _download_audio_archives(dl_manager, lang, format, split):
--> 269     archives_paths = _download_audio_archives_paths(dl_manager, lang, format, split)
    270     return [dl_manager.iter_archive(archive_path) for archive_path in archives_paths]

~/.cache/huggingface/modules/datasets_modules/datasets/MLCommons--ml_spoken_words/321ea853cf0a05abb7a2d7efea900692a3d8622af65a2f3ce98adb7800a5d57b/ml_spoken_words.py in _download_audio_archives_paths(dl_manager, lang, format, split)
    251     n_files_path = dl_manager.download(n_files_url)
    252 
--> 253     with open(n_files_path, "r", encoding="utf-8") as file:
    254         n_files = int(file.read().strip())  # the file contains a number of archives
    255 

ValueError: I/O operation on closed file.
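The failing pattern can be sketched in isolation: the `n_files` manifest is downloaded, but its file handle has already been closed by the time the dataset script reads it. In this minimal sketch a plain `io.StringIO` stands in for the fsspec-backed file, and the early `close()` plays the role of the buggy caching layer (names and contents are illustrative):

```python
import io

# Stand-in for the downloaded n_files manifest
buf = io.StringIO("42\n")
buf.close()  # simulates the handle being closed prematurely

try:
    n_files = int(buf.read().strip())  # same read pattern as the dataset script
except ValueError as e:
    print(e)  # I/O operation on closed file.
```

Any read on a closed Python file object raises `ValueError`, which matches the traceback above.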

Environment info

  • datasets version: 2.4.0
  • Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.13
  • PyArrow version: 6.0.1
  • Pandas version: 1.3.5
@DCNemesis DCNemesis added the bug Something isn't working label Sep 9, 2022
@DCNemesis (Author)

Pinning fsspec==2022.7.1 fixes this issue; setup.py would need to be changed to prevent users from using the latest version of fsspec.
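A hypothetical sketch of such a pin in `setup.py` (the specifier below is illustrative, not the one from the actual PR):

```python
# Illustrative install_requires entry excluding the broken fsspec releases
# (2022.8.0 and 2022.8.1); the exact bounds in the real setup.py may differ.
install_requires = [
    "fsspec[http]!=2022.8.0,!=2022.8.1",
]

# This list would then be passed to setuptools:
# setup(..., install_requires=install_requires, ...)
```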

@DCNemesis (Author)

Opened PR to address this.


albertvillanova commented Sep 12, 2022

Hi @DCNemesis, thanks for reporting.

That was a temporary issue in fsspec releases 2022.8.0 and 2022.8.1, fixed in their patch release 2022.8.2 (both previous versions were yanked).

Are you sure you have version 2022.8.2 installed?

pip install -U fsspec
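A quick way to tell whether an environment is affected is to compare the installed version string against the two yanked releases. `is_yanked_fsspec` below is an illustrative helper, not part of any library:

```python
def is_yanked_fsspec(version: str) -> bool:
    # fsspec 2022.8.0 and 2022.8.1 were yanked; 2022.8.2 carries the fix
    return version in ("2022.8.0", "2022.8.1")

# In a live environment one would check, e.g.:
# import fsspec; print(is_yanked_fsspec(fsspec.__version__))
print(is_yanked_fsspec("2022.8.1"))  # True
print(is_yanked_fsspec("2022.8.2"))  # False
```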

@DCNemesis (Author)

@albertvillanova I was using a temporary Google Colab instance, but checking it again today it seems it was loading 2022.8.1 rather than 2022.8.2. It's surprising that Colab was using a version that was replaced the same day it was released. Testing with 2022.8.2 did work. It appears Colab will be fixing it on their end too.

@albertvillanova (Member)

Thanks for the additional information.

Now that we know 2022.8.2 works, I'm closing this issue. Feel free to reopen it if necessary.

@albertvillanova albertvillanova linked a pull request Sep 12, 2022 that will close this issue
@albertvillanova (Member)

Colab just upgraded their default fsspec version to 2022.8.2.
