
Conversation

Contributor

@xinyuangui2 xinyuangui2 commented Nov 13, 2025

Description

This PR adds caching for PyArrow schema operations to improve performance during batching operations, especially for tables with a large number of columns.

Main Changes

  • Caching for Tensor Type Serialization/Deserialization: Added a cache for tensor extension type serialization and deserialization. This significantly reduces the overhead of repeatedly serializing and deserializing the same tensor types during schema operations; a minimal sketch of the idea follows below.

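A minimal sketch of the approach, assuming a tensor extension type whose serialized payload is immutable once built (the class and attribute names here are illustrative, not the actual Ray Data implementation):

```python
import pickle
import pyarrow as pa

class CachedTensorType(pa.ExtensionType):
    """Illustrative extension type that memoizes its serialized payload."""

    _deserialize_cache = {}  # class-level cache keyed by storage type and payload

    def __init__(self, shape, dtype):
        self._shape = tuple(shape)
        self._serialize_cache = None
        super().__init__(pa.list_(dtype), "example.tensor")

    def __arrow_ext_serialize__(self):
        # Pickling the metadata is the expensive part; do it at most once.
        if self._serialize_cache is None:
            self._serialize_cache = pickle.dumps(self._shape)
        return self._serialize_cache

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Reuse one instance per unique payload instead of constructing a
        # new object on every schema access.
        key = (storage_type, serialized)
        if key not in cls._deserialize_cache:
            shape = pickle.loads(serialized)
            cls._deserialize_cache[key] = cls(shape, storage_type.value_type)
        return cls._deserialize_cache[key]
```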
Performance Impact

This optimization is particularly beneficial during batching operations for tables with a large number of columns. In one of our tests with 200 columns, the per-batch time decreased from 0.30s to 0.11s (~63% improvement).

Without cache:

[Screenshot: profile without cache]
We can see `__arrow_ext_deserialize__` and `__arrow_ext_serialize__` in several places. Each call to `__arrow_ext_deserialize__` creates a new object, and `__arrow_ext_serialize__` involves an expensive pickle.

With cache:

[Screenshot: profile with cache]
The time spent in `__arrow_ext_deserialize__` and `__arrow_ext_serialize__` is no longer a bottleneck.


# Remove metadata for hashability
schemas[0].remove_metadata()                # before: no effect, the return value is discarded
schemas[0] = schemas[0].remove_metadata()   # after: reassign the new schema

Contributor

Nice catch

Contributor Author

I think we can't remove the metadata in place. Doing so fails some release tests:

[2025-11-14T23:48:47Z]     raise ValueError(msg.format(self.feature_names, feature_names))
[2025-11-14T23:48:47Z] ValueError: feature_names mismatch: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27', 'feature_28', 'feature_29', 'feature_30', 'feature_31', 'feature_32', 'feature_33', 'feature_34', 'feature_35', 'feature_36', 'feature_37', 'feature_38', 'feature_39', 'partition'] ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27', 'feature_28', 'feature_29', 'feature_30', 'feature_31', 'feature_32', 'feature_33', 'feature_34', 'feature_35', 'feature_36', 'feature_37', 'feature_38', 'feature_39', '__index_level_0__', 'partition']
[2025-11-14T23:48:47Z] training data did not have the following fields: __index_level_0__

in https://buildkite.com/ray-project/premerge/builds/53867#019a84b7-88cd-4186-8a4e-b89b9e4604e1

I updated this a bit @goutamvenkat-anyscale
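
For reference, a minimal demonstration that `pyarrow.Schema.remove_metadata` returns a new schema rather than mutating the receiver, which is why the reassignment above is needed:

```python
import pyarrow as pa

schema = pa.schema([pa.field("x", pa.int64())]).with_metadata({b"k": b"v"})

schema.remove_metadata()             # no effect: the return value is discarded
assert schema.metadata == {b"k": b"v"}

schema = schema.remove_metadata()    # correct: rebind the name to the new schema
assert schema.metadata is None
```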

# NOTE: Type promotions aren't available in Arrow < 14.0
subset_blocks = []
for block in blocks:
    cols_to_select = [
Contributor Author

The profiler shows that `col_name in block.schema.names` is expensive, since `names` is a list and each membership check scans it. We use a set here instead.
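
A sketch of that change with hypothetical names (`blocks` as a sequence of Arrow tables and `columns` as the requested subset; the actual helper in the PR may differ):

```python
import pyarrow as pa

def select_existing_columns(blocks, columns):
    """Select only the requested columns that each block actually contains."""
    subset_blocks = []
    for block in blocks:
        # block.schema.names is a list, so each `in` check is O(n).
        # Building a set once makes every membership check O(1).
        schema_names = set(block.schema.names)
        cols_to_select = [c for c in columns if c in schema_names]
        subset_blocks.append(block.select(cols_to_select))
    return subset_blocks
```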

xinyuangui2 and others added 4 commits November 13, 2025 15:40
@xinyuangui2 xinyuangui2 changed the title from "Several optimizations to arrow schema operations" to "Cache PyArrow schema operations" Nov 14, 2025
@xinyuangui2 xinyuangui2 marked this pull request as ready for review November 14, 2025 06:03
@xinyuangui2 xinyuangui2 requested review from a team as code owners November 14, 2025 06:03
xinyuangui2 and others added 4 commits November 13, 2025 22:22
@ray-gardener ray-gardener bot added the "data" (Ray Data-related issues) label Nov 14, 2025
@xinyuangui2 xinyuangui2 requested a review from raulchen November 14, 2025 21:24
@raulchen raulchen added the "go" (add ONLY when ready to merge, run all tests) label Nov 14, 2025
Contributor

@srinathk10 srinathk10 left a comment


LGTM

@raulchen raulchen merged commit bd8491e into ray-project:master Nov 17, 2025
6 checks passed
with self._cache_lock:
    if self._serialize_cache is None:
        self._serialize_cache = self._arrow_ext_serialize_compute()
    return self._serialize_cache
Contributor

If it's already serialized, you can skip the lock.
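
A sketch of that suggestion as a double-checked locking pattern, assuming the cache is only ever written under the lock and never reset (`_arrow_ext_serialize_compute` is the hypothetical helper from the excerpt above):

```python
import threading

class CachedSerialization:
    def __init__(self):
        self._cache_lock = threading.Lock()
        self._serialize_cache = None

    def serialize(self):
        # Fast path: once populated, the cache never changes, so it can be
        # read without taking the lock.
        if self._serialize_cache is not None:
            return self._serialize_cache
        # Slow path: acquire the lock and re-check, since another thread may
        # have filled the cache between the check above and this point.
        with self._cache_lock:
            if self._serialize_cache is None:
                self._serialize_cache = self._arrow_ext_serialize_compute()
            return self._serialize_cache

    def _arrow_ext_serialize_compute(self):
        # Hypothetical stand-in for the expensive pickle-based serialization.
        import pickle
        return pickle.dumps({"dtype": "float32", "shape": (1, 2)})
```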

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025