[Data] Fixing DelegatingBlockBuilder to avoid re-serializing objects multiple times (#48509)

Currently, we're serializing the first row of every block twice when adding
it through `DelegatingBlockBuilder`, which carries tangible overhead and
impacts latency for large enough rows.

Now that `ArrowBlockBuilder` is able to handle arbitrary Python
objects, we can deprecate `DelegatingBlockBuilder` altogether.
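The overhead being removed can be illustrated with a minimal sketch. The `Builder` class below is hypothetical (not Ray's actual `BlockBuilder`); it only counts how many times a row is added, standing in for per-add serialization cost. The old path probed the first row with a throwaway builder before adding it to the real one, so the first row was serialized twice; the new path adds it once:

```python
# Hypothetical stand-in for a block builder; each add() represents
# one serialization of the row. Not Ray's actual implementation.
class Builder:
    def __init__(self):
        self.adds = 0
        self.rows = []

    def add(self, row):
        self.adds += 1          # each add() serializes the row once
        self.rows.append(row)

    def build(self):
        return list(self.rows)


def old_first_add(first_row):
    # Old path: probe with a throwaway builder, then add to the real one.
    probe = Builder()
    probe.add(first_row)        # first serialization (result discarded)
    probe.build()
    real = Builder()
    real.add(first_row)         # second serialization
    return probe.adds + real.adds


def new_first_add(first_row):
    # New path: add directly to the real builder.
    real = Builder()
    real.add(first_row)         # single serialization
    return real.adds
```

Here `old_first_add({"a": 1})` performs two serializations of the row while `new_first_add({"a": 1})` performs one, which is the per-block saving this commit targets.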

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
alexeykudinkin authored Nov 13, 2024
1 parent 4f6a419 commit 8ef918a
Showing 1 changed file with 1 addition and 12 deletions.
13 changes: 1 addition & 12 deletions python/ray/data/_internal/delegating_block_builder.py
@@ -1,10 +1,8 @@
 import collections
 from typing import Any, Mapping, Optional
 
-from ray.air.util.tensor_extensions.arrow import ArrowConversionError
 from ray.data._internal.arrow_block import ArrowBlockBuilder
 from ray.data._internal.block_builder import BlockBuilder
-from ray.data._internal.pandas_block import PandasBlockBuilder
 from ray.data.block import Block, BlockAccessor, BlockType, DataBatch
 
 
@@ -23,17 +21,8 @@ def _inferred_block_type(self) -> Optional[BlockType]:
     def add(self, item: Mapping[str, Any]) -> None:
         assert isinstance(item, collections.abc.Mapping), item
 
-        import pyarrow
-
         if self._builder is None:
-            try:
-                check = ArrowBlockBuilder()
-                check.add(item)
-                check.build()
-                self._builder = ArrowBlockBuilder()
-            except (TypeError, pyarrow.lib.ArrowInvalid, ArrowConversionError):
-                # Can also handle nested Python objects, which Arrow cannot.
-                self._builder = PandasBlockBuilder()
+            self._builder = ArrowBlockBuilder()
 
         self._builder.add(item)
