
Error when saving to disk a dataset of images #5717

Open
jplu opened this issue Apr 7, 2023 · 19 comments

@jplu
Contributor

jplu commented Apr 7, 2023

Describe the bug

Hello!

I have an issue when I try to save my dataset of images to disk. The error I get is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1442, in save_to_disk
    for job_id, done, content in Dataset._save_to_disk_single(**kwargs):
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1473, in _save_to_disk_single
    writer.write_table(pa_table)
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_writer.py", line 570, in write_table
    pa_table = embed_table_storage(pa_table)
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2268, in embed_table_storage
    arrays = [
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2269, in <listcomp>
    embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 1817, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 1817, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2142, in embed_array_storage
    return feature.embed_storage(array)
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/features/image.py", line 269, in embed_storage
    storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
  File "pyarrow/array.pxi", line 2766, in pyarrow.lib.StructArray.from_arrays
  File "pyarrow/array.pxi", line 2961, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean

My dataset is around 50K images; could this error be due to a bad image?

Thanks for the help.

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="/path/to/dataset")
dataset["train"].save_to_disk("./myds", num_shards=40)

Expected behavior

Having my dataset properly saved to disk.

Environment info

  • datasets version: 2.11.0
  • Platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
  • Python version: 3.10.10
  • Huggingface_hub version: 0.13.3
  • PyArrow version: 11.0.0
  • Pandas version: 2.0.0
@jplu
Contributor Author

jplu commented Apr 7, 2023

Looks like it works as long as the number of shards keeps each batch below 1000 images. My training set has 40K images: with num_shards=40 (batches of 1000 images) I get the error, but with num_shards=50 (batches of 800 images) it works.

I will be happy to share my dataset privately if it can help to better debug.

@mariosasko mariosasko self-assigned this Apr 7, 2023
@mariosasko
Collaborator

Hi! I didn't manage to reproduce this behavior, so sharing the dataset with us would help a lot.

My dataset is around 50K images; could this error be due to a bad image?

This shouldn't be the case as we save raw data to disk without decoding it.

@jplu
Contributor Author

jplu commented Apr 14, 2023

OK, thanks! The dataset is currently hosted on a gcs bucket. How would you like to proceed for sharing the link?

@mariosasko
Collaborator

You could follow this procedure or upload the dataset to Google Drive (50K images is not that much unless high-res) and send me an email with the link.

@jplu
Contributor Author

jplu commented Apr 17, 2023

Thanks @mariosasko. I just sent you the GDrive link.

@mariosasko
Collaborator

Thanks @jplu! I managed to reproduce the TypeError - it stems from this line, which can return a ChunkedArray (whose mask is then also chunked, which Arrow does not allow) when the embedded data is too big to fit in a standard Array.

I'm working on a fix.
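
For anyone who wants to see the failure mode in isolation, here is a minimal, hypothetical sketch in plain pyarrow (not the library's actual code path): when the bytes column overflows into a ChunkedArray, the mask produced by is_null() is chunked as well, and StructArray.from_arrays rejects it.

import pyarrow as pa

# Simulate the bytes column having overflowed into multiple chunks.
bytes_array = pa.chunked_array([pa.array([b"abc", None]), pa.array([b"def"])])
path_array = pa.array(["a.png", "b.png", "c.png"])

try:
    pa.StructArray.from_arrays(
        [bytes_array.combine_chunks(), path_array],  # the values can be combined into one Array...
        ["bytes", "path"],
        mask=bytes_array.is_null(),  # ...but this mask is a ChunkedArray, not an Array
    )
except TypeError as err:
    print(err)  # Mask must be a pyarrow.Array of type boolean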

@mariosasko
Collaborator

@yairl-dn You should be able to bypass this issue by reducing datasets.config.DEFAULT_MAX_BATCH_SIZE (1000 by default)

In Datasets 3.0, the Image storage format will be simplified, so this should be easier to fix then.
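
A minimal sketch of applying that workaround to the original reproduction (the value 100 is an arbitrary illustration, not a recommendation):

import datasets
from datasets import load_dataset

datasets.config.DEFAULT_MAX_BATCH_SIZE = 100  # default is 1000

dataset = load_dataset("imagefolder", data_dir="/path/to/dataset")
dataset["train"].save_to_disk("./myds", num_shards=40)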

@jaggzh

jaggzh commented Nov 8, 2023

The same error occurs when calling save_to_disk() on a dataset of Audio() items. I still get it with:

import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 35
from datasets import Features, Array2D, Value, Dataset, Sequence, Audio
Saving the dataset (41/47 shards):  88%|██████████████████████████████████████████▉      | 297/339 [01:21<00:11,  3.65 examples/s]
Traceback (most recent call last):
  File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 155, in <module>
    create_dataset(args)
  File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 137, in create_dataset
    hf_dataset.save_to_disk(args.outds)
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1532, in save_to_disk
    for job_id, done, content in Dataset._save_to_disk_single(**kwargs):
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1563, in _save_to_disk_single
    writer.write_table(pa_table)
  File "/home/j/src/py/datasets/src/datasets/arrow_writer.py", line 574, in write_table
    pa_table = embed_table_storage(pa_table)
  File "/home/j/src/py/datasets/src/datasets/table.py", line 2307, in embed_table_storage
    arrays = [
  File "/home/j/src/py/datasets/src/datasets/table.py", line 2308, in <listcomp>
    embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
  File "/home/j/src/py/datasets/src/datasets/table.py", line 1831, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/j/src/py/datasets/src/datasets/table.py", line 1831, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/j/src/py/datasets/src/datasets/table.py", line 2177, in embed_array_storage
    return feature.embed_storage(array)
  File "/home/j/src/py/datasets/src/datasets/features/audio.py", line 276, in embed_storage
    storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
  File "pyarrow/array.pxi", line 2850, in pyarrow.lib.StructArray.from_arrays
  File "pyarrow/array.pxi", line 3290, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean

@StevenSong

StevenSong commented Mar 12, 2024

Similar to @jaggzh, setting datasets.config.DEFAULT_MAX_BATCH_SIZE did not help in my case (same error, but for a different dataset: Stanford-AIMI/RRG24#2).

This is also reproducible with this open dataset: https://huggingface.co/datasets/nlphuji/winogavil/discussions/1

Here's some code to do so:

import datasets

datasets.config.DEFAULT_MAX_BATCH_SIZE = 1

from datasets import load_dataset

ds = load_dataset("nlphuji/winogavil")
ds.save_to_disk("temp")

I've done some more debugging with datasets==2.18.0 (which incorporates PR #6283 as suggested by @lhoestq in the above dataset discussion), and it seems like the culprit might now be these lines:

datasets/src/datasets/table.py

Lines 2111 to 2115 in ca8409a

array_offsets = _combine_list_array_offsets_with_mask(array)
if isinstance(feature, list):
    return pa.ListArray.from_arrays(array_offsets, _e(array.values, feature[0]))
if isinstance(feature, Sequence) and feature.length == -1:
    return pa.ListArray.from_arrays(array_offsets, _e(array.values, feature.feature))

From what I understand (and apologies, I'm new to pyarrow), for an Image or Audio feature these lines recursively call embed_array_storage on a list of either feature, eventually ending up in the feature's embed_storage function. For all values in the list, embed_storage reads the bytes if they're not already loaded. The issue is that the list passed to the first recursive call is array.values, i.e. the underlying values of array regardless of its slicing (as influenced by parameters such as datasets.config.DEFAULT_MAX_BATCH_SIZE). This produces the same overflowing list of bytes that results in the ChunkedArray being returned from embed_storage. Even if the array didn't overflow and this code ran without throwing an exception, it still seems incorrect to load all values when you ultimately keep only a subset via ListArray.from_arrays(offsets, values); the values that are thrown out will just be loaded again in later batches, and vice versa.

Maybe there's a fix where you could pass a mask to embed_storage so that it only loads the values you ultimately want for the current batch? Curious whether you agree with this diagnosis of the problem and whether you think this fix is viable, @mariosasko?
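
As a standalone illustration of the point about array.values (plain pyarrow, independent of the datasets code): slicing a ListArray does not restrict .values, whereas .flatten() respects the slice.

import pyarrow as pa

arr = pa.array([[1, 2], [3, 4], [5, 6]])
batch = arr.slice(0, 1)      # pretend this is one batch of the table
print(len(batch.values))     # 6 -> the full underlying child array, slice ignored
print(len(batch.flatten()))  # 2 -> only the values belonging to the sliced batch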

@yairl

yairl commented Mar 12, 2024

Would be nice if they had something similar to Dagshub's S3 sync; it worked like a charm for my bigger datasets.

@StevenSong

StevenSong commented Mar 12, 2024

I should also note that the proposed masking solution simply makes datasets.config.DEFAULT_MAX_BATCH_SIZE effective by reducing the number of elements loaded per batch; it does not address the underlying problem of trying to load all the images as bytes into a single pyarrow array.

I'm happy to turn this into an actual PR but here's what I've implemented locally at tables.py:embed_array_storage to fix the above test case (nlphuji/winogavil) and my own use case:

    elif pa.types.is_list(array.type):
        # feature must be either [subfeature] or Sequence(subfeature)
        # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
        array_offsets = _combine_list_array_offsets_with_mask(array)

        # mask underlying struct array so array_values.to_pylist()
        # fills None (see feature.embed_storage)
        idxs = np.arange(len(array.values))
        idxs = pa.ListArray.from_arrays(array_offsets, idxs).flatten()
        mask = np.ones(len(array.values)).astype(bool)
        mask[idxs] = False
        mask = pa.array(mask)
        # indexing 0 might be problematic but not sure
        # how else to get arbitrary keys from a struct array
        array_keys = array.values[0].keys()
        # is array.values always a struct array?
        array_values = pa.StructArray.from_arrays(
            arrays=[array.values.field(k) for k in array_keys],
            names=array_keys,
            mask=mask,
        )
        if isinstance(feature, list):
            return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature[0]))
        if isinstance(feature, Sequence) and feature.length == -1:
            return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature.feature))

Again, though, I'm new to pyarrow, so this might not be the cleanest implementation, and I'm not sure whether there are other cases where this solution doesn't work. Would love to get some feedback from the HF folks!

@AJDERS

AJDERS commented Mar 18, 2024

I have the same issue, with an audio dataset where file sizes vary significantly (~0.2-200 MB). Reducing datasets.config.DEFAULT_MAX_BATCH_SIZE doesn't help.

@20141888

20141888 commented Jul 4, 2024

The problem still occurs.
Huggingface sucks 🤮🤮🤮🤮

@painebenjamin

painebenjamin commented Sep 5, 2024

Came across this issue myself, with the same symptoms and reasons as everyone else: pa.array is returning a ChunkedArray in features.audio.Audio.embed_storage for my audio files, which vary between ~1 MB and ~10 MB in size.

I would rather remove a troublesome file from my dataset than have to switch away from this library, but it would be difficult to identify which file(s) caused the issue, and it may just shift the issue down to another shard or another file anyway. So, I took the path of least resistance and simply dropped anything beyond the first chunk when this issue occurred, and added a warning to indicate what was dropped.

In the end I lost one file out of 105,024 samples and was able to complete the 1,479 shard dataset after only the one issue on shard 228.

While this is certainly not an ideal solution, it does represent a much better user experience, and was acceptable for my use case. I'm going to test the Image portion and then open a pull request to propose this "lossy" behavior become the way these edge cases are handled (maybe behind an environment flag?) until someone like @mariosasko or others can formulate a more holistic solution.

My work-in-progress "fix": main...painebenjamin:datasets:main (https://github.com/painebenjamin/datasets)
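
For concreteness, here is a hypothetical sketch of that "lossy" behavior (not the code from the linked branch; coerce_to_single_array is an illustrative helper name):

import warnings
import pyarrow as pa

def coerce_to_single_array(maybe_chunked):
    # If embedding produced a ChunkedArray, keep only the first chunk and
    # warn about how many rows were dropped.
    if isinstance(maybe_chunked, pa.ChunkedArray):
        if maybe_chunked.num_chunks > 1:
            dropped = len(maybe_chunked) - len(maybe_chunked.chunk(0))
            warnings.warn(f"Embedded bytes overflowed a single Arrow array; dropping {dropped} row(s).")
        return maybe_chunked.chunk(0)
    return maybe_chunked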

@lhoestq
Member

lhoestq commented Sep 6, 2024

Another option could be to use pa.large_binary instead of pa.binary in certain cases?

@fdschmidt93
Contributor

fdschmidt93 commented Nov 6, 2024

For my large audio dataset, what seems to work for me is to locally change pa.binary() to pa.large_binary() in both

pa_type: ClassVar[Any] = pa.struct({"bytes": pa.binary(), "path": pa.string()})

and

type=pa.binary(),

prior to uploading the dataset. Before downloading it again, I just revert both changes to make sure any user with the latest datasets release can use it.
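
A hypothetical sketch of that local change (paraphrasing the two quoted lines, not a diff of the actual library source); large_binary uses 64-bit offsets, which is presumably why the embedded bytes no longer overflow into a ChunkedArray:

import pyarrow as pa

# In datasets/features/audio.py (same idea for image.py), roughly:
# before: pa.struct({"bytes": pa.binary(), "path": pa.string()})
pa_type = pa.struct({"bytes": pa.large_binary(), "path": pa.string()})

# before: type=pa.binary(),
bytes_type = pa.large_binary()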

As a side note, the other proposed workarounds did not work for me.

@TParcollet

Hey @fdschmidt93, I am not sure I follow. Can users downloading your dataset from the Hub read it if you created the files with large_binary? It sounds like it would not be cast properly for them?

@fdschmidt93
Contributor

fdschmidt93 commented Dec 7, 2024

Yes, that should work. In full detail:

I have two separate conda environments.

  1. The one I prepare the data with, for which I apply the above changes.
  2. Another one I actually run my experiments with, which uses the latest datasets release from pip.

This seems to work just fine. More concretely, in environment 2 I download my own data, uploaded with environment 1, from a private HF datasets repo (i.e., load_dataset(...)) and run the experiments.

Edit: The private HF repo I was referring to is now public: https://huggingface.co/datasets/WueNLP/belebele-fleurs

@TParcollet

Interesting... so it's not even relevant at reading time, only at writing time. Thanks, I'll try this out.
