Make Image cast storage faster #6786

Open
wants to merge 11 commits into
base: main

Modexus
Contributor

@Modexus Modexus commented Apr 5, 2024

PR for issue #6782.
Makes cast_storage of the Image class faster by removing the slow call to .to_pylist().
Instead, each ListArray item is converted directly to either Array2DExtensionType or Array3DExtensionType.

This also preserves the dtype, which removes the warning when the array is already uint8.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq
Member

lhoestq commented Jun 26, 2024

Hi! Thanks for diving into this. The conversion to Python lists is indeed quite slow.

Array2DExtensionType and Array3DExtensionType currently rely on pyarrow lists, but we will soon modify them to use FixedShapeTensorArray instead, which is more efficient (e.g. it doesn't need to store an offset for each value). So ideally we would speed this code up without using those extension types; otherwise it will block improving Array2DExtensionType and Array3DExtensionType.

If I understand correctly, you just need the logic from ArrayExtensionArray.to_numpy? If so, feel free to move it into a separate function that ArrayExtensionArray.to_numpy can call.

@Modexus
Contributor Author

Modexus commented Aug 21, 2024

Hey! I didn't have time to look into this, but I just stumbled upon another problem.
While my fix made it more or less usable, I have now pre-embedded the images, and even as Array3D they are really slow to load.
I don't think this can be resolved by using ArrayExtensionArray.to_numpy.

I think actually making the Array3DExtensionType faster would probably resolve both issues as you mentioned.
Is there an update on using FixedShapeTensorArray?
I'd gladly help implement/test it if there is an outline of how to do it.

@lhoestq
Member

lhoestq commented Aug 21, 2024

No one is working on this atm afaik (and actually we don't have any ETA unfortunately).

To do this change I think we need to:

  • update the _ArrayXD parent class of all the Array2D, Array3D types to use pa.fixed_shape_tensor type
    - pa_type = globals()[self.__class__.__name__ + "ExtensionType"](self.shape, self.dtype)
    + pa_type = pa.fixed_shape_tensor(string_to_arrow(self.dtype), self.shape)
  • remove the old extension type _ArrayXDExtensionType and extension array ArrayExtensionArray
  • probably update some functions in features.py that were using those types and use the new ones instead

@Modexus
Contributor Author

Modexus commented Sep 5, 2024

Thanks, I have looked into this and have a working solution at least for my specific case.
But I had quite a few issues along the way that are not solved nicely.
It follows your suggestion, though internally it is then just a fixed_shape_tensor, as there is no ExtensionType anymore.

Hopefully, I can create a separate PR with these changes soon.

@lhoestq
Member

lhoestq commented Sep 5, 2024

Nice, thanks @Modexus !

3 participants