Error when saving to disk a dataset of images #5717
Looks like it works as long as the number of shards keeps each batch below 1000 images. My training set has 40K images. I will be happy to share my dataset privately if it can help to better debug this.
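For anyone hitting the same threshold, here is a minimal sketch of forcing smaller shards through `save_to_disk`'s `num_shards` / `max_shard_size` parameters; the dataset, paths, and shard counts below are placeholders, not the reporter's actual setup:

```python
from datasets import load_dataset

# Placeholder image dataset loaded from a local folder.
ds = load_dataset("imagefolder", data_dir="path/to/images", split="train")

# More shards means fewer images written per shard; 64 is an arbitrary example value.
ds.save_to_disk("my_dataset_on_disk", num_shards=64)

# Alternatively, cap the shard size directly instead of picking a shard count.
ds.save_to_disk("my_dataset_on_disk_small", max_shard_size="200MB")
```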
Hi! I didn't manage to reproduce this behavior, so sharing the dataset with us would help a lot.
This shouldn't be the case, as we save raw data to disk without decoding it.
OK, thanks! The dataset is currently hosted on a GCS bucket. How would you like to proceed for sharing the link?
You could follow this procedure or upload the dataset to Google Drive (50K images is not that much unless high-res) and send me an email with the link.
Thanks @mariosasko. I just sent you the GDrive link.
@yairl-dn You should be able to bypass this issue by reducing `datasets.config.DEFAULT_MAX_BATCH_SIZE`. In Datasets 3.0, the Image storage format will be simplified, so this should be easier to fix then.
The same error occurs with my `save_to_disk()` of `Audio()` items. I still get it with:

```python
import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 35

from datasets import Features, Array2D, Value, Dataset, Sequence, Audio
```
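Since the snippet above stops after the imports, here is a self-contained sketch of the kind of audio setup being described, under the assumption that a tiny generated WAV file can stand in for the real data (the file, directory names, and the batch-size value are placeholders, not the commenter's actual ones):

```python
import os
import tempfile
import wave

import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 35  # must be set before save_to_disk is called

from datasets import Audio, Dataset

# Generate a tiny placeholder WAV file (one second of silence, 16 kHz, mono, 16-bit).
tmpdir = tempfile.mkdtemp()
wav_path = os.path.join(tmpdir, "tone.wav")
with wave.open(wav_path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(16000)
    f.writeframes(b"\x00\x00" * 16000)

# Standard pattern for building an audio dataset from file paths, then saving it.
ds = Dataset.from_dict({"audio": [wav_path]}).cast_column("audio", Audio())
ds.save_to_disk(os.path.join(tmpdir, "saved"))
```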
Similar to @jaggzh, setting `datasets.config.DEFAULT_MAX_BATCH_SIZE` lower does not help in my case. This is also reproducible with this open dataset: https://huggingface.co/datasets/nlphuji/winogavil/discussions/1 Here's some code to do so:

```python
import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 1

from datasets import load_dataset

ds = load_dataset("nlphuji/winogavil")
ds.save_to_disk("temp")
```

I've done some more debugging with datasets/src/datasets/table.py (lines 2111 to 2115 at ca8409a). From what I understand (and apologies, I'm new to pyarrow), for an Image or Audio feature, these lines recursively call the embed function on `array.values`. Maybe there's a fix where you could pass a mask to `pa.StructArray.from_arrays`.
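To illustrate the masking idea in isolation (toy values below, not the library's actual arrays): when `pa.StructArray.from_arrays` is given a boolean `mask`, the masked rows come back as `None` from `to_pylist()`, which is the behavior an embed step would rely on to skip rows it shouldn't touch.

```python
import pyarrow as pa

bytes_col = pa.array([b"img0", b"img1", b"img2"])
path_col = pa.array(["a.png", "b.png", "c.png"])

# mask=True marks a row as null; here the middle row is masked out.
struct_arr = pa.StructArray.from_arrays(
    arrays=[bytes_col, path_col],
    names=["bytes", "path"],
    mask=pa.array([False, True, False]),
)
print(struct_arr.to_pylist())
# [{'bytes': b'img0', 'path': 'a.png'}, None, {'bytes': b'img2', 'path': 'c.png'}]
```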
Would be nice if they had something similar to DagsHub's S3 sync; it worked like a charm for my bigger datasets.
I guess the proposed masking solution also simply enables […]. I'm happy to turn this into an actual PR, but here's what I've implemented locally in the `elif pa.types.is_list(array.type)` branch:

```python
elif pa.types.is_list(array.type):
    # feature must be either [subfeature] or Sequence(subfeature)
    # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
    array_offsets = _combine_list_array_offsets_with_mask(array)
    # mask the underlying struct array so array_values.to_pylist()
    # fills None (see feature.embed_storage)
    idxs = np.arange(len(array.values))
    idxs = pa.ListArray.from_arrays(array_offsets, idxs).flatten()
    mask = np.ones(len(array.values)).astype(bool)
    mask[idxs] = False
    mask = pa.array(mask)
    # indexing 0 might be problematic, but not sure
    # how else to get arbitrary keys from a struct array
    array_keys = array.values[0].keys()
    # is array.values always a struct array?
    array_values = pa.StructArray.from_arrays(
        arrays=[array.values.field(k) for k in array_keys],
        names=array_keys,
        mask=mask,
    )
    if isinstance(feature, list):
        return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature[0]))
    if isinstance(feature, Sequence) and feature.length == -1:
        return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature.feature))
```

Again, though, I'm new to pyarrow, so this might not be the cleanest implementation, and I'm really not sure whether there are other cases where this solution doesn't work. Would love to get some feedback from the HF folks!
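To make the offsets-to-mask step above concrete, here is the same derivation on its own with toy values (standalone numbers, not the library's actual data): offsets `[0, 2, 2, 3]` over a child array of length 5 reference only child rows 0-2, so rows 3 and 4 end up masked.

```python
import numpy as np
import pyarrow as pa

offsets = pa.array([0, 2, 2, 3], type=pa.int32())  # three lists: [0:2), [2:2), [2:3)
child_len = 5                                      # child array has 5 rows, only 3 are referenced

# flatten() keeps exactly the child indices covered by the offsets
referenced = pa.ListArray.from_arrays(offsets, np.arange(child_len)).flatten().to_numpy()

mask = np.ones(child_len, dtype=bool)
mask[referenced] = False

print(referenced.tolist())  # [0, 1, 2]
print(mask.tolist())        # [False, False, False, True, True]
```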
I have the same issue, with an audio dataset where file sizes vary significantly (~0.2-200 MB).
The problem still occurs.
Came across this issue myself, with the same symptoms and causes as everyone else. I would rather remove a troublesome file from my dataset than have to switch away from this library, but it would be difficult to identify which file(s) caused the issue, and removal may just shift the problem down to another shard or another file anyway.

So, I took the path of least resistance and simply dropped anything beyond the first chunk when this issue occurred, and added a warning to indicate what was dropped. In the end I lost one file out of 105,024 samples and was able to complete the 1,479-shard dataset after only the one issue on shard 228. While this is certainly not an ideal solution, it does represent a much better user experience, and it was acceptable for my use case.

I'm going to test the Image portion and then open a pull request to propose that this "lossy" behavior becomes the way these edge cases are handled (maybe behind an environment flag?) until someone like @mariosasko or others can formulate a more holistic solution.

My work-in-progress "fix": main...painebenjamin:datasets:main (https://github.com/painebenjamin/datasets)
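Not the linked fork itself, but a user-level sketch of how one could narrow down which shard (and then which file) triggers the failure by saving shards one at a time; the function name, output layout, and shard count are illustrative only:

```python
from datasets import Dataset

def save_shards_individually(ds: Dataset, out_dir: str, num_shards: int) -> list[int]:
    """Save each shard separately and report which shard indices fail to save."""
    failed = []
    for i in range(num_shards):
        shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)
        try:
            shard.save_to_disk(f"{out_dir}/shard-{i:05d}")
        except Exception as err:  # e.g. pyarrow.lib.ArrowNotImplementedError
            print(f"shard {i} failed: {err}")
            failed.append(i)
    return failed
```

Once a failing shard index is known, the same bisection can be repeated inside that shard with `ds.select(...)` ranges to pinpoint the individual sample.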
Another option could be to use […].
For my large audio dataset, what seems to work for me is to locally change datasets/src/datasets/features/audio.py line 71 and line 270 (at 01f91ba) to use `large_binary` prior to uploading the dataset. Before downloading it, I just remove both changes to make sure any user with the latest `datasets` release can load it.

As a side note, the other proposed workarounds did not work for me.
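The patched lines themselves aren't quoted above, so this is an assumption based on the `large_binary` question in the next comment: the local edit presumably swaps `pa.binary()` for `pa.large_binary()` in the Audio storage type. A standalone pyarrow sketch of why that helps: plain `binary()` addresses its payload with 32-bit offsets, capping a single non-chunked array at roughly 2 GiB of bytes, while `large_binary()` uses 64-bit offsets.

```python
import pyarrow as pa

# Storage type analogous to an Audio/Image feature: a struct of raw bytes plus a path.
# With pa.binary(), all byte values of one array share a buffer indexed by int32 offsets,
# so very large batches of big files can overflow it; pa.large_binary() uses int64 offsets.
small_offsets = pa.struct({"bytes": pa.binary(), "path": pa.string()})
large_offsets = pa.struct({"bytes": pa.large_binary(), "path": pa.string()})

print(small_offsets)  # struct<bytes: binary, path: string>
print(large_offsets)  # struct<bytes: large_binary, path: string>
```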
Hey @fdschmidt93, I am not sure I follow. Can users downloading your dataset from the Hub read it if you created the files with `large_binary`? It sounds like it would not be cast properly for them?
Yes, that should work. In full detail, I have two separate conda environments: one with the locally patched `datasets` that I use for creating and uploading the dataset, and one with an unmodified `datasets` install that I use for downloading it.

This seems to work just fine. More concretely, I'm downloading my own data, uploaded with environment 1, from a private HF datasets repo in environment 2 (i.e., with the unmodified `datasets`).

Edit: The private HF repo I was referring to is now public: https://huggingface.co/datasets/WueNLP/belebele-fleurs
Interesting... so it's not even relevant at reading time, only at writing time. Thanks, I'll try this out.
Describe the bug
Hello!
I have an issue when I try to save my dataset of images to disk. The error I get is: […]

My dataset is around 50K images; might this error be due to a bad image?
Thanks for the help.
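Not part of the original report, but a quick way to check the "bad image" hypothesis, assuming the images live in a local folder (the folder path and extension list are placeholders):

```python
from pathlib import Path
from PIL import Image

def find_bad_images(folder: str) -> list[Path]:
    """Return image files that PIL cannot open/verify."""
    bad = []
    for path in Path(folder).rglob("*"):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
            continue
        try:
            with Image.open(path) as img:
                img.verify()  # cheap integrity check, does not decode full pixel data
        except Exception as err:
            print(f"{path}: {err}")
            bad.append(path)
    return bad

print(len(find_bad_images("path/to/images")))
```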
Steps to reproduce the bug
Expected behavior
Having my dataset properly saved to disk.
Environment info
datasets version: 2.11.0