Add support for Sequence(Audio/Image) feature in push_to_hub #6360

Closed
Laurent2916 opened this issue Oct 27, 2023 · 1 comment · Fixed by #6283
Labels: enhancement (New feature or request)

@Laurent2916
Contributor

Feature request

Allow for Sequence of Image (or Audio) to be embedded inside the shards.

Motivation

Currently, thanks to #3685, when embed_external_files is set to True (which is the default) in push_to_hub, features of type Image and Audio are embedded inside the arrow/parquet shards, instead of only storing paths to the files.

While working with a dataset of timelapse images, I noticed that this behavior does not extend to a Sequence of Image.

Your contribution

I'll submit a PR if I find a way to add this feature.

Laurent2916 added the enhancement (New feature or request) label on Oct 27, 2023
@mariosasko
Collaborator

This issue stems from datasets/src/datasets/table.py, lines 2203 to 2205 in 6d2f2a5:

casted_values = _e(array.values, feature.feature)
if casted_values.type == array.values.type:
    return array

I'll address it as part of #6283.

In the meantime, this should work:

import pyarrow as pa
from datasets import Image

# Expose batches to map() as pyarrow Tables
dataset = dataset.with_format("arrow")

def embed_images(pa_table):
    # Rebuild each chunk of the "images" list column with the image bytes embedded
    images_arr = pa.chunked_array(
        [
            pa.ListArray.from_arrays(chunk.offsets, Image().embed_storage(chunk.values), mask=chunk.is_null())
            for chunk in pa_table["images"].chunks
        ]
    )
    return pa_table.set_column(pa_table.schema.get_field_index("images"), "images", images_arr)

dataset = dataset.map(embed_images, batched=True)

# Switch back to the Python format before uploading
dataset = dataset.with_format("python")

dataset.push_to_hub(...)
