Error when saving to disk a dataset of images #5717
Looks like it works as long as the number of shards keeps each batch below 1000 images. My training set has 40K images. I will be happy to share my dataset privately if it can help to better debug this.
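For anyone hitting the same threshold, here is a minimal sketch of forcing smaller shards through `save_to_disk`'s `num_shards` / `max_shard_size` parameters; the dataset, paths, and shard counts below are placeholders, not the reporter's actual setup:

```python
from datasets import load_dataset

# Placeholder image dataset loaded from a local folder.
ds = load_dataset("imagefolder", data_dir="path/to/images", split="train")

# More shards means fewer images written per shard; 64 is an arbitrary example value.
ds.save_to_disk("my_dataset_on_disk", num_shards=64)

# Alternatively, cap the shard size directly instead of picking a shard count.
ds.save_to_disk("my_dataset_on_disk_small", max_shard_size="200MB")
```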
Hi! I didn't manage to reproduce this behavior, so sharing the dataset with us would help a lot.
This shouldn't be the case, as we save raw data to disk without decoding it.
OK, thanks! The dataset is currently hosted on a GCS bucket. How would you like to proceed for sharing the link?
You could follow this procedure or upload the dataset to Google Drive (50K images is not that much unless high-res) and send me an email with the link.
Thanks @mariosasko. I just sent you the GDrive link.
@yairl-dn You should be able to bypass this issue by reducing `datasets.config.DEFAULT_MAX_BATCH_SIZE`. In Datasets 3.0, the Image storage format will be simplified, so this should be easier to fix then.
The same error occurs with my `save_to_disk()` of `Audio()` items. I still get it with:

```python
import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 35

from datasets import Features, Array2D, Value, Dataset, Sequence, Audio
```
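Since the snippet above stops after the imports, here is a self-contained sketch of the kind of audio setup being described, under the assumption that a tiny generated WAV file can stand in for the real data (the file, directory names, and the batch-size value are placeholders, not the commenter's actual ones):

```python
import os
import tempfile
import wave

import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 35  # must be set before save_to_disk is called

from datasets import Audio, Dataset

# Generate a tiny placeholder WAV file (one second of silence, 16 kHz, mono, 16-bit).
tmpdir = tempfile.mkdtemp()
wav_path = os.path.join(tmpdir, "tone.wav")
with wave.open(wav_path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(16000)
    f.writeframes(b"\x00\x00" * 16000)

# Standard pattern for building an audio dataset from file paths, then saving it.
ds = Dataset.from_dict({"audio": [wav_path]}).cast_column("audio", Audio())
ds.save_to_disk(os.path.join(tmpdir, "saved"))
```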
Similar to @jaggzh, setting `datasets.config.DEFAULT_MAX_BATCH_SIZE` lower does not help in my case. This is also reproducible with this open dataset: https://huggingface.co/datasets/nlphuji/winogavil/discussions/1 Here's some code to do so:

```python
import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 1

from datasets import load_dataset

ds = load_dataset("nlphuji/winogavil")
ds.save_to_disk("temp")
```

I've done some more debugging with datasets/src/datasets/table.py (lines 2111 to 2115 at ca8409a). From what I understand (and apologies, I'm new to pyarrow), for an Image or Audio feature, these lines recursively call the embed function on `array.values`. Maybe there's a fix where you could pass a mask to `pa.StructArray.from_arrays`.
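To illustrate the masking idea in isolation (toy values below, not the library's actual arrays): when `pa.StructArray.from_arrays` is given a boolean `mask`, the masked rows come back as `None` from `to_pylist()`, which is the behavior an embed step would rely on to skip rows it shouldn't touch.

```python
import pyarrow as pa

bytes_col = pa.array([b"img0", b"img1", b"img2"])
path_col = pa.array(["a.png", "b.png", "c.png"])

# mask=True marks a row as null; here the middle row is masked out.
struct_arr = pa.StructArray.from_arrays(
    arrays=[bytes_col, path_col],
    names=["bytes", "path"],
    mask=pa.array([False, True, False]),
)
print(struct_arr.to_pylist())
# [{'bytes': b'img0', 'path': 'a.png'}, None, {'bytes': b'img2', 'path': 'c.png'}]
```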
Would be nice if they had something similar to DagsHub's S3 sync; it worked like a charm for my bigger datasets.
I guess the proposed masking solution also simply enables […]. I'm happy to turn this into an actual PR, but here's what I've implemented locally in the `elif pa.types.is_list(array.type)` branch:

```python
elif pa.types.is_list(array.type):
    # feature must be either [subfeature] or Sequence(subfeature)
    # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
    array_offsets = _combine_list_array_offsets_with_mask(array)
    # mask the underlying struct array so array_values.to_pylist()
    # fills None (see feature.embed_storage)
    idxs = np.arange(len(array.values))
    idxs = pa.ListArray.from_arrays(array_offsets, idxs).flatten()
    mask = np.ones(len(array.values)).astype(bool)
    mask[idxs] = False
    mask = pa.array(mask)
    # indexing 0 might be problematic, but not sure
    # how else to get arbitrary keys from a struct array
    array_keys = array.values[0].keys()
    # is array.values always a struct array?
    array_values = pa.StructArray.from_arrays(
        arrays=[array.values.field(k) for k in array_keys],
        names=array_keys,
        mask=mask,
    )
    if isinstance(feature, list):
        return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature[0]))
    if isinstance(feature, Sequence) and feature.length == -1:
        return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature.feature))
```

Again, though, I'm new to pyarrow, so this might not be the cleanest implementation, and I'm really not sure whether there are other cases where this solution doesn't work. Would love to get some feedback from the HF folks!
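To make the offsets-to-mask step above concrete, here is the same derivation on its own with toy values (standalone numbers, not the library's actual data): offsets `[0, 2, 2, 3]` over a child array of length 5 reference only child rows 0-2, so rows 3 and 4 end up masked.

```python
import numpy as np
import pyarrow as pa

offsets = pa.array([0, 2, 2, 3], type=pa.int32())  # three lists: [0:2), [2:2), [2:3)
child_len = 5                                      # child array has 5 rows, only 3 are referenced

# flatten() keeps exactly the child indices covered by the offsets
referenced = pa.ListArray.from_arrays(offsets, np.arange(child_len)).flatten().to_numpy()

mask = np.ones(child_len, dtype=bool)
mask[referenced] = False

print(referenced.tolist())  # [0, 1, 2]
print(mask.tolist())        # [False, False, False, True, True]
```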
I have the same issue, with an audio dataset where file sizes vary significantly (~0.2-200 MB).
The problem still occurs.
Came across this issue myself, with the same symptoms and causes as everyone else. I would rather remove a troublesome file from my dataset than have to switch away from this library, but it would be difficult to identify which file(s) caused the issue, and removal may just shift the problem down to another shard or another file anyway.

So, I took the path of least resistance and simply dropped anything beyond the first chunk when this issue occurred, and added a warning to indicate what was dropped. In the end I lost one file out of 105,024 samples and was able to complete the 1,479-shard dataset after only the one issue on shard 228. While this is certainly not an ideal solution, it does represent a much better user experience, and it was acceptable for my use case.

I'm going to test the Image portion and then open a pull request to propose that this "lossy" behavior becomes the way these edge cases are handled (maybe behind an environment flag?) until someone like @mariosasko or others can formulate a more holistic solution.

My work-in-progress "fix": main...painebenjamin:datasets:main (https://github.com/painebenjamin/datasets)
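Not the linked fork itself, but a user-level sketch of how one could narrow down which shard (and then which file) triggers the failure by saving shards one at a time; the function name, output layout, and shard count are illustrative only:

```python
from datasets import Dataset

def save_shards_individually(ds: Dataset, out_dir: str, num_shards: int) -> list[int]:
    """Save each shard separately and report which shard indices fail to save."""
    failed = []
    for i in range(num_shards):
        shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)
        try:
            shard.save_to_disk(f"{out_dir}/shard-{i:05d}")
        except Exception as err:  # e.g. pyarrow.lib.ArrowNotImplementedError
            print(f"shard {i} failed: {err}")
            failed.append(i)
    return failed
```

Once a failing shard index is known, the same bisection can be repeated inside that shard with `ds.select(...)` ranges to pinpoint the individual sample.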
Another option could be to use […].
For my large audio dataset, what seems to work for me is to locally change datasets/src/datasets/features/audio.py line 71 and line 270 (at 01f91ba) to use `large_binary` prior to uploading the dataset. Before downloading it, I just remove both changes to make sure any user with the latest `datasets` release can load it.

As a side note, the other proposed workarounds did not work for me.
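The patched lines themselves aren't quoted above, so this is an assumption based on the `large_binary` question in the next comment: the local edit presumably swaps `pa.binary()` for `pa.large_binary()` in the Audio storage type. A standalone pyarrow sketch of why that helps: plain `binary()` addresses its payload with 32-bit offsets, capping a single non-chunked array at roughly 2 GiB of bytes, while `large_binary()` uses 64-bit offsets.

```python
import pyarrow as pa

# Storage type analogous to an Audio/Image feature: a struct of raw bytes plus a path.
# With pa.binary(), all byte values of one array share a buffer indexed by int32 offsets,
# so very large batches of big files can overflow it; pa.large_binary() uses int64 offsets.
small_offsets = pa.struct({"bytes": pa.binary(), "path": pa.string()})
large_offsets = pa.struct({"bytes": pa.large_binary(), "path": pa.string()})

print(small_offsets)  # struct<bytes: binary, path: string>
print(large_offsets)  # struct<bytes: large_binary, path: string>
```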
Hey @fdschmidt93, I am not sure I follow. Can users downloading your dataset from the Hub read it if you created the files with `large_binary`? It sounds like it would not be cast properly for them?
Yes, that should work. In full detail, I have two separate conda environments: one with the locally patched `datasets` that I use for creating and uploading the dataset, and one with an unmodified `datasets` install that I use for downloading it.

This seems to work just fine. More concretely, I'm downloading my own data, uploaded with environment 1, from a private HF datasets repo in environment 2 (i.e., with the unmodified `datasets`).

Edit: The private HF repo I was referring to is now public: https://huggingface.co/datasets/WueNLP/belebele-fleurs
Interesting... so it's not even relevant at reading time, only at writing time. Thanks, I'll try this out.
Describe the bug
Hello!
I have an issue when I try to save my dataset of images to disk. The error I get is: […]

My dataset is around 50K images; might this error be due to a bad image?
Thanks for the help.
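Not part of the original report, but a quick way to check the "bad image" hypothesis, assuming the images live in a local folder (the folder path and extension list are placeholders):

```python
from pathlib import Path
from PIL import Image

def find_bad_images(folder: str) -> list[Path]:
    """Return image files that PIL cannot open/verify."""
    bad = []
    for path in Path(folder).rglob("*"):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
            continue
        try:
            with Image.open(path) as img:
                img.verify()  # cheap integrity check, does not decode full pixel data
        except Exception as err:
            print(f"{path}: {err}")
            bad.append(path)
    return bad

print(len(find_bad_images("path/to/images")))
```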
Steps to reproduce the bug
Expected behavior
Having my dataset properly saved to disk.
Environment info
datasets version: 2.11.0