-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[Data] (De)serialization of PyArrow Extension Arrays #51972
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, would we need to expose this as a user API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, if we'd want to allow a user to use their own ExtensionTypes. If the public api should live elsewhere/look different, please let me know and I'll change it.
|
@pimdh thank you for your contribution. We have an API for registering custom serializers and deserializers (I think it's what is used internally to register the arrow serializer/deserializer too). Can that be used to add serializers and deserializers for PyArrow Extension Arrays? See Example 2 in https://docs.ray.io/en/latest/ray-core/objects/serialization.html#customized-serialization for more details. |
|
Hi, |
|
@pimdh I see your point. I'm adding @alexeykudinkin from the Ray Data team to shepherd this. |
alexeykudinkin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution @pimdh!
Can you please help me understand what kind of extensions you folks are using?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's actually lift this support to happen outside of the extension mechanism -- Ray should just support FixedShapeTensorArray not requiring any extensions for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using the built-in FixedShapeTensorArray, and a custom variable shape tensor array, which seems equivalent to the Ray one (and the one in the PR in the arrow repo).
I'm reading and writing with Lance, so the array type affects the storage format. Hence I'm not so keen to migrate all my code to Ray's internal extension type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another representative example is this one for encoding bfloat16 arrays: https://github.com/lancedb/lance/blob/main/python/python/lance/_arrow/bf16.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason you're adding support for FixedShapeTensorType?
It should just work out of the box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just tested it and it works out of the box:
import ray
import numpy as np
import pyarrow as pa
from pyarrow import FixedShapeTensorType
@ray.remote
def _gen_tensor():
shape = (16, 16, 8) # Shape per tensor
num_tensors = 128 # Number of tensors
tensors = [np.ones(shape, dtype=np.int8).flatten() for _ in range(num_tensors)]
# Create array of fixed shape tensors
return pa.array(tensors, type=pa.fixed_shape_tensor(pa.int8(), shape=shape))
tensors = ray.get(_gen_tensor.remote())
print(tensors)
print(tensors.type)
|
Anyone able to help me land this? Thanks! |
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
iamjustinhsu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @pimdh, looks fine to me. I'll await @alexeykudinkin response since he's the expert on this
| elif pa.types.is_map(a.type): | ||
| return _map_array_to_array_payload(a) | ||
| elif isinstance(a.type, tensor_extension_types): | ||
| return _tensor_array_to_array_payload(a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_tensor_array_to_array_payload is this still being used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it isn't, thanks. Removed
|
Thanks for making the changes @pimdh! And special thanks to @mdekstrand for chiming in and getting everyone aligned! |
|
It looks like the CI failed due to an unrelated reason. @alexeykudinkin , would you mind restarting CI? Thanks |
a0d5114 to
eb71368
Compare
Sparks0219
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…tests Signed-off-by: Pim de Haan <pim@cusp.ai>
Signed-off-by: Pim de Haan <pim@cusp.ai>
Signed-off-by: Pim de Haan <pim@cusp.ai>
eb71368 to
4bbcdbe
Compare
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? This feature adds the ability to (de)serialize arbitrary PyArrow extension arrays. This is needed to use Ray in code bases that use extension arrays. ~The serialization already seemed sufficiently general, but as far as I can tell, the deserialization can not be done in generality. Hence, this setup allows registration of custom deserializers for extension types.~ ~For serialization, the selector has been changed from `ExtensionType` to `BaseExtensionType` to accommodate for non-Python ExtensionArrays, like `pyarrow.FixedShapeTensorArray`.~ ~This is at the moment a proof-of-concept. If you like the idea, I suppose the registration function may need to move to a better place, and docs need adding.~ The implementation now works without registration on any extension type. ## Related issue number Closes ray-project#51959 ## Checks - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Generalizes Arrow array (de)serialization to any `pyarrow.BaseExtensionType`, removing tensor-specific handling and adding tests for fixed/variable-shape tensors. > > - **Arrow (De)serialization**: > - Switch from tensor-specific checks to generic `pyarrow.BaseExtensionType` handling. > - Reconstruct extension arrays via `type.wrap_array(storage)`; serialize via storage payload wrapped with extension metadata. > - Remove `ray.air.util.tensor_extensions.arrow` dependencies and special-casing. > - **Tests**: > - Add roundtrip tests for `pa.FixedShapeTensorArray` and a custom variable-shape `ExtensionType`. > - Import `PicklableArrayPayload` in tests for constructing payloads. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 4bbcdbe. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Pim de Haan <pim@cusp.ai> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? This feature adds the ability to (de)serialize arbitrary PyArrow extension arrays. This is needed to use Ray in code bases that use extension arrays. ~The serialization already seemed sufficiently general, but as far as I can tell, the deserialization can not be done in generality. Hence, this setup allows registration of custom deserializers for extension types.~ ~For serialization, the selector has been changed from `ExtensionType` to `BaseExtensionType` to accommodate for non-Python ExtensionArrays, like `pyarrow.FixedShapeTensorArray`.~ ~This is at the moment a proof-of-concept. If you like the idea, I suppose the registration function may need to move to a better place, and docs need adding.~ The implementation now works without registration on any extension type. ## Related issue number Closes ray-project#51959 ## Checks - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Generalizes Arrow array (de)serialization to any `pyarrow.BaseExtensionType`, removing tensor-specific handling and adding tests for fixed/variable-shape tensors. > > - **Arrow (De)serialization**: > - Switch from tensor-specific checks to generic `pyarrow.BaseExtensionType` handling. > - Reconstruct extension arrays via `type.wrap_array(storage)`; serialize via storage payload wrapped with extension metadata. > - Remove `ray.air.util.tensor_extensions.arrow` dependencies and special-casing. > - **Tests**: > - Add roundtrip tests for `pa.FixedShapeTensorArray` and a custom variable-shape `ExtensionType`. > - Import `PicklableArrayPayload` in tests for constructing payloads. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 4bbcdbe. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Pim de Haan <pim@cusp.ai> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? This feature adds the ability to (de)serialize arbitrary PyArrow extension arrays. This is needed to use Ray in code bases that use extension arrays. ~The serialization already seemed sufficiently general, but as far as I can tell, the deserialization can not be done in generality. Hence, this setup allows registration of custom deserializers for extension types.~ ~For serialization, the selector has been changed from `ExtensionType` to `BaseExtensionType` to accommodate for non-Python ExtensionArrays, like `pyarrow.FixedShapeTensorArray`.~ ~This is at the moment a proof-of-concept. If you like the idea, I suppose the registration function may need to move to a better place, and docs need adding.~ The implementation now works without registration on any extension type. ## Related issue number Closes ray-project#51959 ## Checks - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Generalizes Arrow array (de)serialization to any `pyarrow.BaseExtensionType`, removing tensor-specific handling and adding tests for fixed/variable-shape tensors. > > - **Arrow (De)serialization**: > - Switch from tensor-specific checks to generic `pyarrow.BaseExtensionType` handling. > - Reconstruct extension arrays via `type.wrap_array(storage)`; serialize via storage payload wrapped with extension metadata. > - Remove `ray.air.util.tensor_extensions.arrow` dependencies and special-casing. > - **Tests**: > - Add roundtrip tests for `pa.FixedShapeTensorArray` and a custom variable-shape `ExtensionType`. > - Import `PicklableArrayPayload` in tests for constructing payloads. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 4bbcdbe. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Pim de Haan <pim@cusp.ai>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? This feature adds the ability to (de)serialize arbitrary PyArrow extension arrays. This is needed to use Ray in code bases that use extension arrays. ~The serialization already seemed sufficiently general, but as far as I can tell, the deserialization can not be done in generality. Hence, this setup allows registration of custom deserializers for extension types.~ ~For serialization, the selector has been changed from `ExtensionType` to `BaseExtensionType` to accommodate for non-Python ExtensionArrays, like `pyarrow.FixedShapeTensorArray`.~ ~This is at the moment a proof-of-concept. If you like the idea, I suppose the registration function may need to move to a better place, and docs need adding.~ The implementation now works without registration on any extension type. ## Related issue number Closes ray-project#51959 ## Checks - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Generalizes Arrow array (de)serialization to any `pyarrow.BaseExtensionType`, removing tensor-specific handling and adding tests for fixed/variable-shape tensors. > > - **Arrow (De)serialization**: > - Switch from tensor-specific checks to generic `pyarrow.BaseExtensionType` handling. > - Reconstruct extension arrays via `type.wrap_array(storage)`; serialize via storage payload wrapped with extension metadata. > - Remove `ray.air.util.tensor_extensions.arrow` dependencies and special-casing. > - **Tests**: > - Add roundtrip tests for `pa.FixedShapeTensorArray` and a custom variable-shape `ExtensionType`. > - Import `PicklableArrayPayload` in tests for constructing payloads. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 4bbcdbe. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Pim de Haan <pim@cusp.ai>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? This feature adds the ability to (de)serialize arbitrary PyArrow extension arrays. This is needed to use Ray in code bases that use extension arrays. ~The serialization already seemed sufficiently general, but as far as I can tell, the deserialization can not be done in generality. Hence, this setup allows registration of custom deserializers for extension types.~ ~For serialization, the selector has been changed from `ExtensionType` to `BaseExtensionType` to accommodate for non-Python ExtensionArrays, like `pyarrow.FixedShapeTensorArray`.~ ~This is at the moment a proof-of-concept. If you like the idea, I suppose the registration function may need to move to a better place, and docs need adding.~ The implementation now works without registration on any extension type. ## Related issue number Closes ray-project#51959 ## Checks - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Generalizes Arrow array (de)serialization to any `pyarrow.BaseExtensionType`, removing tensor-specific handling and adding tests for fixed/variable-shape tensors. > > - **Arrow (De)serialization**: > - Switch from tensor-specific checks to generic `pyarrow.BaseExtensionType` handling. > - Reconstruct extension arrays via `type.wrap_array(storage)`; serialize via storage payload wrapped with extension metadata. > - Remove `ray.air.util.tensor_extensions.arrow` dependencies and special-casing. > - **Tests**: > - Add roundtrip tests for `pa.FixedShapeTensorArray` and a custom variable-shape `ExtensionType`. > - Import `PicklableArrayPayload` in tests for constructing payloads. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 4bbcdbe. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Pim de Haan <pim@cusp.ai> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? This feature adds the ability to (de)serialize arbitrary PyArrow extension arrays. This is needed to use Ray in code bases that use extension arrays. ~The serialization already seemed sufficiently general, but as far as I can tell, the deserialization can not be done in generality. Hence, this setup allows registration of custom deserializers for extension types.~ ~For serialization, the selector has been changed from `ExtensionType` to `BaseExtensionType` to accommodate for non-Python ExtensionArrays, like `pyarrow.FixedShapeTensorArray`.~ ~This is at the moment a proof-of-concept. If you like the idea, I suppose the registration function may need to move to a better place, and docs need adding.~ The implementation now works without registration on any extension type. ## Related issue number Closes ray-project#51959 ## Checks - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Generalizes Arrow array (de)serialization to any `pyarrow.BaseExtensionType`, removing tensor-specific handling and adding tests for fixed/variable-shape tensors. > > - **Arrow (De)serialization**: > - Switch from tensor-specific checks to generic `pyarrow.BaseExtensionType` handling. > - Reconstruct extension arrays via `type.wrap_array(storage)`; serialize via storage payload wrapped with extension metadata. > - Remove `ray.air.util.tensor_extensions.arrow` dependencies and special-casing. > - **Tests**: > - Add roundtrip tests for `pa.FixedShapeTensorArray` and a custom variable-shape `ExtensionType`. > - Import `PicklableArrayPayload` in tests for constructing payloads. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 4bbcdbe. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Pim de Haan <pim@cusp.ai> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Why are these changes needed?
This feature adds the ability to (de)serialize arbitrary PyArrow extension arrays. This is needed to use Ray in code bases that use extension arrays.
The serialization already seemed sufficiently general, but as far as I can tell, the deserialization can not be done in generality. Hence, this setup allows registration of custom deserializers for extension types.For serialization, the selector has been changed fromExtensionTypetoBaseExtensionTypeto accommodate for non-Python ExtensionArrays, likepyarrow.FixedShapeTensorArray.This is at the moment a proof-of-concept. If you like the idea, I suppose the registration function may need to move to a better place, and docs need adding.The implementation now works without registration on any extension type.
Related issue number
Closes #51959
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.Note
Generalizes Arrow array (de)serialization to any
pyarrow.BaseExtensionType, removing tensor-specific handling and adding tests for fixed/variable-shape tensors.pyarrow.BaseExtensionTypehandling.type.wrap_array(storage); serialize via storage payload wrapped with extension metadata.ray.air.util.tensor_extensions.arrowdependencies and special-casing.pa.FixedShapeTensorArrayand a custom variable-shapeExtensionType.PicklableArrayPayloadin tests for constructing payloads.Written by Cursor Bugbot for commit 4bbcdbe. This will update automatically on new commits. Configure here.