[data] Fix errors with concatenation with mixed pyarrow native and extension types #56811
Conversation
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Code Review
This pull request refactors the concat logic in transform_pyarrow.py to correctly handle concatenation of tables with a mix of native pyarrow types and extension types, fixing a type mismatch error. The approach is to separate columns by type (native vs. extension), concatenate each group using the appropriate method, and then rejoin them. A new test case is added to validate this fix.
The refactoring is a significant improvement in correctness and structure. I've made a few suggestions to further improve efficiency and code clarity by optimizing the column processing loop, removing an obsolete TODO, and ensuring type hint consistency.
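To make the approach concrete, here is a minimal, self-contained sketch of the same idea at the pure pyarrow level: split columns into native and extension groups, concatenate each group with the mechanism that handles it, then rejoin. The names (e.g. `concat_mixed`) are illustrative rather than the PR's actual helpers, and it assumes a recent pyarrow where `concat_tables` accepts `promote_options`:

```python
import pyarrow as pa


def concat_mixed(tables):
    # Assumes all tables share the same schema / column names.
    schema = tables[0].schema
    native, extension = [], []
    for name in schema.names:
        if isinstance(schema.field(name).type, pa.ExtensionType):
            extension.append(name)  # e.g. tensor or object extension columns
        else:
            native.append(name)

    concatenated = {}
    if native:
        # Native columns can be concatenated (with type promotion) by pyarrow itself.
        native_tbl = pa.concat_tables(
            [t.select(native) for t in tables], promote_options="permissive"
        )
        concatenated.update({n: native_tbl.column(n) for n in native})
    for name in extension:
        # Extension columns are stitched together chunk by chunk instead.
        chunks = [chunk for t in tables for chunk in t.column(name).chunks]
        concatenated[name] = pa.chunked_array(chunks)

    # Rejoin the columns in the original schema order.
    return pa.table({n: concatenated[n] for n in schema.names})
```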
```python
for col_name in schema.names:
    col_type = schema.field(col_name).type

    col_chunked_arrays = []
    for block in blocks:
        col_chunked_arrays.append(block.column(col_name))

    if col_name in cols_with_null_list:
        concatenated_cols[col_name] = _concat_cols_with_null_list(
            col_chunked_arrays
        )
    elif isinstance(col_type, tensor_types):
        concatenated_cols[col_name] = _concat_cols_with_extension_tensor_types(
            col_chunked_arrays
        )
    elif isinstance(col_type, ArrowPythonObjectType):
        concatenated_cols[col_name] = _concat_cols_with_extension_object_types(
            col_chunked_arrays
        )
    else:
        # Add to the list of native pyarrow columns, these will be
        # concatenated after the loop using pyarrow.concat_tables
        native_pyarrow_cols.add(col_name)
```
This loop is a bit inefficient. For every column, it constructs col_chunked_arrays by iterating over all blocks, but this list is only used for extension-type columns. For native pyarrow columns, this work is discarded.
To improve performance and clarity, you could first iterate through the schema to categorize all column names into native or different extension types. Then, process the native columns in a single batch call, and iterate through the extension columns to process them individually. This avoids redundantly creating col_chunked_arrays for native columns.
Here's a sketch of the proposed refactoring:
```python
# Concatenate the columns according to their type
concatenated_cols = {}
native_pyarrow_cols = set()
extension_cols_to_process = {}

# First, categorize columns
for col_name in schema.names:
    col_type = schema.field(col_name).type
    if col_name in cols_with_null_list:
        extension_cols_to_process[col_name] = "null_list"
    elif isinstance(col_type, tensor_types):
        extension_cols_to_process[col_name] = "tensor"
    elif isinstance(col_type, ArrowPythonObjectType):
        extension_cols_to_process[col_name] = "object"
    else:
        native_pyarrow_cols.add(col_name)

# Process native columns in one batch
if native_pyarrow_cols:
    concatenated_cols.update(
        _concat_cols_with_native_pyarrow_types(
            list(native_pyarrow_cols), blocks, promote_types
        )
    )

# Process extension columns individually
for col_name, ext_type in extension_cols_to_process.items():
    col_chunked_arrays = [block.column(col_name) for block in blocks]
    if ext_type == "null_list":
        concatenated_cols[col_name] = _concat_cols_with_null_list(
            col_chunked_arrays
        )
    elif ext_type == "tensor":
        concatenated_cols[col_name] = _concat_cols_with_extension_tensor_types(
            col_chunked_arrays
        )
    elif ext_type == "object":
        concatenated_cols[col_name] = _concat_cols_with_extension_object_types(
            col_chunked_arrays
        )
```

Signed-off-by: Matthew Owen <mowen@anyscale.com>
alexeykudinkin left a comment
LGTM, minor comments
```python
# For each opaque list column, iterate through all schemas until
# we find a valid value_type that can be used to override the
# column types in the following for-loop.
scalar_type = None
for arr in col_chunked_arrays:
    if not pa.types.is_list(arr.type) or not pa.types.is_null(arr.type.value_type):
        scalar_type = arr.type
        break

if scalar_type is not None:
    for c_idx in range(len(col_chunked_arrays)):
        c = col_chunked_arrays[c_idx]
        if pa.types.is_list(c.type) and pa.types.is_null(c.type.value_type):
            if pa.types.is_list(scalar_type):
                # If we are dealing with a list input,
                # cast the array to the scalar_type found above.
                col_chunked_arrays[c_idx] = c.cast(scalar_type)
            else:
                # If we are dealing with a single value, construct
                # a new array with null values filled.
                col_chunked_arrays[c_idx] = pa.chunked_array(
                    [pa.nulls(c.length(), type=scalar_type)]
                )

return _concatenate_chunked_arrays(col_chunked_arrays)
```
I know you're just moving this code around rather than adding it, but it looks dubious to me, so I'd very much like to delete it unless there are tests that would fail.
Let's delete this code and see if there are any tests covering it.
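For context on what this branch handles, here is a small standalone pyarrow sketch (illustrative only, not from the PR) of the situation the code targets: a column inferred as `list<null>` because one block only saw empty or all-null lists, which is then cast once a concrete list type is found in another block:

```python
import pyarrow as pa

# One block only saw empty/None lists, so the column type is list<null>.
opaque = pa.chunked_array([pa.array([[], None], type=pa.list_(pa.null()))])
# Another block saw real values, giving the "scalar_type" to standardize on.
concrete = pa.chunked_array([pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))])

scalar_type = concrete.type                       # list<int64>
fixed = opaque.cast(scalar_type)                  # list<null> -> list<int64>
combined = pa.chunked_array(fixed.chunks + concrete.chunks)
print(combined.type)                              # list<item: int64>
```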
Signed-off-by: Matthew Owen <mowen@anyscale.com>
…tension types (#57566)

## Why are these changes needed?

Cherry-pick #56811

Original description:

If we had an execution where we needed to concatenate native pyarrow types and pyarrow extension types, we would get errors like the following:

```
⚠️ Dataset dataset_5_0 execution failed: : 0.00 row [00:00, ? row/s]
- Repartition 1: 0.00 row [00:00, ? row/s]
*- Split Repartition: : 0.00 row [00:00, ? row/s]
2025-09-22 17:21:34,068 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2025-09-22 17:21:34,068 ERROR exceptions.py:81 -- Full stack trace:
Traceback (most recent call last):
  File "/Users/mowen/code/ray/python/ray/data/exceptions.py", line 49, in handle_trace
    return fn(*args, **kwargs)
  File "/Users/mowen/code/ray/python/ray/data/_internal/plan.py", line 533, in execute
    blocks = execute_to_legacy_block_list(
  File "/Users/mowen/code/ray/python/ray/data/_internal/execution/legacy_compat.py", line 127, in execute_to_legacy_block_list
    block_list = _bundles_to_block_list(bundles)
  File "/Users/mowen/code/ray/python/ray/data/_internal/execution/legacy_compat.py", line 175, in _bundles_to_block_list
    bundle_list = list(bundles)
  File "/Users/mowen/code/ray/python/ray/data/_internal/execution/interfaces/executor.py", line 34, in __next__
    return self.get_next()
  File "/Users/mowen/code/ray/python/ray/data/_internal/execution/streaming_executor.py", line 680, in get_next
    bundle = state.get_output_blocking(output_split_idx)
  File "/Users/mowen/code/ray/python/ray/data/_internal/execution/streaming_executor_state.py", line 373, in get_output_blocking
    raise self._exception
  File "/Users/mowen/code/ray/python/ray/data/_internal/execution/streaming_executor.py", line 331, in run
    continue_sched = self._scheduling_loop_step(self._topology)
  File "/Users/mowen/code/ray/python/ray/data/_internal/execution/streaming_executor.py", line 475, in _scheduling_loop_step
    update_operator_states(topology)
  File "/Users/mowen/code/ray/python/ray/data/_internal/execution/streaming_executor_state.py", line 586, in update_operator_states
    op.all_inputs_done()
  File "/Users/mowen/code/ray/python/ray/data/_internal/execution/operators/base_physical_operator.py", line 122, in all_inputs_done
    self._output_buffer, self._stats = self._bulk_fn(self._input_buffer, ctx)
  File "/Users/mowen/code/ray/python/ray/data/_internal/planner/repartition.py", line 84, in split_repartition_fn
    return scheduler.execute(refs, num_outputs, ctx)
  File "/Users/mowen/code/ray/python/ray/data/_internal/planner/exchange/split_repartition_task_scheduler.py", line 106, in execute
    ] = reduce_bar.fetch_until_complete(list(reduce_metadata_schema))
  File "/Users/mowen/code/ray/python/ray/data/_internal/progress_bar.py", line 166, in fetch_until_complete
    for ref, result in zip(done, ray.get(done)):
  File "/Users/mowen/code/ray/python/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/Users/mowen/code/ray/python/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/Users/mowen/code/ray/python/ray/_private/worker.py", line 2952, in get
    values, debugger_breakpoint = worker.get_objects(
  File "/Users/mowen/code/ray/python/ray/_private/worker.py", line 1025, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::reduce() (pid=7442, ip=127.0.0.1)
  File "/Users/mowen/code/ray/python/ray/data/_internal/planner/exchange/shuffle_task_spec.py", line 128, in reduce
    new_block = builder.build()
  File "/Users/mowen/code/ray/python/ray/data/_internal/delegating_block_builder.py", line 68, in build
    return self._builder.build()
  File "/Users/mowen/code/ray/python/ray/data/_internal/table_block.py", line 144, in build
    return self._concat_tables(tables)
  File "/Users/mowen/code/ray/python/ray/data/_internal/arrow_block.py", line 161, in _concat_tables
    return transform_pyarrow.concat(tables, promote_types=True)
  File "/Users/mowen/code/ray/python/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 706, in concat
    col = _concatenate_chunked_arrays(col_chunked_arrays)
  File "/Users/mowen/code/ray/python/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 397, in _concatenate_chunked_arrays
    raise RuntimeError(f"Types mismatch: {type_} != {arr.type}")
RuntimeError: Types mismatch: uint64 != double
```

This PR adds a test that replicates this and fixes the underlying issue by concatenating extension types and native types separately before rejoining them.

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
…tension types (ray-project#56811) Signed-off-by: Josh Kodi <joshkodi@gmail.com>
…tension types (ray-project#56811) Signed-off-by: xgui <xgui@anyscale.com>
…tension types (#56811) Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…tension types (ray-project#56811) Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…tension types (ray-project#56811) Signed-off-by: Future-Outlier <eric901201@gmail.com>
Why are these changes needed?
If we had an execution where we needed to concatenate native pyarrow types and pyarrow extension types, we would get errors like `RuntimeError: Types mismatch: uint64 != double`, raised from `_concatenate_chunked_arrays` in `transform_pyarrow.py` (full traceback above).
This PR adds a test that replicates this and fixes the underlying issue by concatenating extension types and native types separately before rejoining them.
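As a rough illustration of the failure mode, here is a hypothetical repro sketch (not the test added in this PR); it assumes Ray Data stores the numpy column as a tensor extension type and infers different numeric types for the two datasets' "value" column:

```python
import numpy as np
import ray

# Two datasets whose "value" columns get different native arrow types
# (integer vs. floating point), alongside a tensor extension column.
ds1 = ray.data.from_items([{"value": i, "embedding": np.zeros(4)} for i in range(4)])
ds2 = ray.data.from_items([{"value": float(i), "embedding": np.zeros(4)} for i in range(4)])

# Repartitioning the union forces blocks with mixed native/extension columns
# to be concatenated; before this fix, that could raise
# "RuntimeError: Types mismatch: ..." instead of promoting the numeric types.
combined = ds1.union(ds2).repartition(1)
print(combined.take_all()[:2])
```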
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.