# Cache PyArrow schema operations #58583

## Conversation
```diff
  # Remove metadata for hashability
- schemas[0].remove_metadata()
+ schemas[0] = schemas[0].remove_metadata()
```
This `remove_metadata()` doesn't mutate the schema in place; it returns a new one.
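For anyone skimming: PyArrow's `Schema.remove_metadata()` returns a metadata-free copy rather than modifying the receiver. A quick standalone check (not from the PR):

```python
import pyarrow as pa

schema = pa.schema([pa.field("x", pa.int64())], metadata={"k": "v"})

schema.remove_metadata()            # returns a new schema; result discarded
assert schema.metadata is not None  # original is unchanged

schema = schema.remove_metadata()   # rebind to the metadata-free copy
assert schema.metadata is None
```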
Nice catch
I think we cannot remove the metadata in place; doing so fails some release tests:
```
[2025-11-14T23:48:47Z] raise ValueError(msg.format(self.feature_names, feature_names))
[2025-11-14T23:48:47Z] ValueError: feature_names mismatch: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27', 'feature_28', 'feature_29', 'feature_30', 'feature_31', 'feature_32', 'feature_33', 'feature_34', 'feature_35', 'feature_36', 'feature_37', 'feature_38', 'feature_39', 'partition'] ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27', 'feature_28', 'feature_29', 'feature_30', 'feature_31', 'feature_32', 'feature_33', 'feature_34', 'feature_35', 'feature_36', 'feature_37', 'feature_38', 'feature_39', '__index_level_0__', 'partition']
[2025-11-14T23:48:47Z] training data did not have the following fields: __index_level_0__
```
From https://buildkite.com/ray-project/premerge/builds/53867#019a84b7-88cd-4186-8a4e-b89b9e4604e1

I updated this a bit, @goutamvenkat-anyscale.
```python
# NOTE: Type promotions aren't available in Arrow < 14.0
subset_blocks = []
for block in blocks:
    cols_to_select = [
```
The profiler shows that `col_name in block.schema.names` is heavy: it's a linear scan over the column names on every lookup. We use a set here instead.
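A minimal sketch of that optimization, assuming `blocks` are `pyarrow.Table`s (the function wrapper and names are hypothetical, not the exact Ray code):

```python
import pyarrow as pa

def select_existing_columns(blocks: list, columns: list) -> list:
    subset_blocks = []
    for block in blocks:
        # Build the set once per block: each membership check becomes O(1)
        # instead of a linear scan of block.schema.names per column.
        schema_names = set(block.schema.names)
        cols_to_select = [c for c in columns if c in schema_names]
        subset_blocks.append(block.select(cols_to_select))
    return subset_blocks
```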
This reverts commit aece4fd.
srinathk10 left a comment:
LGTM
```python
with self._cache_lock:
    if self._serialize_cache is None:
        self._serialize_cache = self._arrow_ext_serialize_compute()
    return self._serialize_cache
```
If it's already serialized you can skip the lock: check the cache before acquiring it.
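That is the classic double-checked pattern: read the cache lock-free on the hot path, and only lock for the one-time compute-and-store. A hedged sketch (class and method names mirror the diff above, not the exact Ray code):

```python
import threading

class _CachedSerializer:
    def __init__(self):
        self._cache_lock = threading.Lock()
        self._serialize_cache = None

    def _arrow_ext_serialize_compute(self) -> bytes:
        # Placeholder for the expensive one-time pickle.
        raise NotImplementedError

    def serialize(self) -> bytes:
        # Fast path: once populated, the cache never changes, and an
        # attribute read is atomic in CPython, so no lock is needed.
        cache = self._serialize_cache
        if cache is not None:
            return cache
        with self._cache_lock:
            # Re-check under the lock in case another thread won the race.
            if self._serialize_cache is None:
                self._serialize_cache = self._arrow_ext_serialize_compute()
            return self._serialize_cache
```

The worst case is two threads computing concurrently before the first store; the re-check under the lock keeps the result consistent either way.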
## Description

This PR adds caching for PyArrow schema operations to improve performance during batching operations, especially for tables with a large number of columns.

### Main Changes

- **Caching for tensor type serialization/deserialization**: added a cache for tensor type serialization and deserialization operations. This significantly reduces overhead for frequently accessed tensor types during schema operations.

### Performance Impact

This optimization is particularly beneficial during batching operations on tables with a large number of columns. In one of our tests with 200 columns, the batching time per batch decreased from **0.30s to 0.11s** (~63% improvement).

#### Without cache

[Profiler screenshot: https://github.com/user-attachments/assets/46122634-dd09-40ed-a2a8-725d14f85728]

`__arrow_ext_deserialize__` and `__arrow_ext_serialize__` show up in several places. Each `__arrow_ext_deserialize__` call creates a new object, and `__arrow_ext_serialize__` includes an expensive pickle.

#### With cache

[Profiler screenshot: https://github.com/user-attachments/assets/50e77253-d69d-40d9-9e1f-56e9341bc131]

The time spent in `__arrow_ext_deserialize__` and `__arrow_ext_serialize__` is no longer a bottleneck.
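To make the main change concrete, here is a minimal sketch of the caching idea applied to a PyArrow extension type. Everything here is illustrative: `ShapeTensorType`, its storage type, and the cache placement are assumptions, not Ray's actual implementation.

```python
import pickle
import threading
from functools import lru_cache

import pyarrow as pa

class ShapeTensorType(pa.ExtensionType):
    """Toy extension type that caches its serialized form."""

    def __init__(self, shape):
        self._shape = tuple(shape)
        self._serialize_lock = threading.Lock()
        self._serialize_cache = None
        super().__init__(pa.list_(pa.float32()), "example.shape_tensor")

    def __arrow_ext_serialize__(self) -> bytes:
        # Cache the pickle so repeated schema operations don't pay for it.
        cache = self._serialize_cache
        if cache is not None:
            return cache
        with self._serialize_lock:
            if self._serialize_cache is None:
                self._serialize_cache = pickle.dumps(self._shape)
            return self._serialize_cache

    @classmethod
    @lru_cache(maxsize=None)
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Identical (storage_type, serialized) pairs return the same type
        # object instead of rebuilding it on every schema access.
        return cls(pickle.loads(serialized))
```

The shape of the win matches the profiles above: the first serialize pays for the pickle once, and repeated deserializations of the same bytes reduce to a cache lookup. Sharing the returned type objects is safe because PyArrow types are immutable.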