[Python] RecordBatch.from_pylist fails for large rows #48781

@benedikt-grl

Description

Describe the bug, including details regarding any error messages, version, and platform.

When I try to create a RecordBatch from a list with large objects, RecordBatch.from_pylist raises a TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array.

MWE:

import pyarrow as pa
import numpy as np


# Create a random array of shape [3, 720, 1280]
rng = np.random.default_rng(42)
image = rng.integers(low=0, high=255, size=(3, 720, 1280))

# Wrap into dict
row = {
    "image": {
        "data": image.tobytes(),
        "shape": image.shape,
    }
}

# Define schema
schema = pa.schema({
    "image": pa.struct({"data": pa.binary(), "shape": pa.list_(pa.uint16(), 3)})
})

# Convert to record batch
num_rows = 98
pylist = [row] * num_rows
batch = pa.RecordBatch.from_pylist(pylist, schema=schema)

Traceback:

Traceback (most recent call last):
  File "mwe.py", line 22, in <module>
    batch = pa.RecordBatch.from_pylist(pylist, schema=schema)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 2049, in pyarrow.lib._Tabular.from_pylist
  File "pyarrow/table.pxi", line 6460, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 3550, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

When num_rows is reduced to 97, the example above runs without any error.

I suspect the issue is related to the size in bytes of the pylist. Each image occupies 3 * 720 * 1280 * 8 = 22,118,400 bytes, because rng.integers returns int64 values by default.
98 images have 2,167,603,200 bytes.
97 images have 2,145,484,800 bytes.
2^31 is 2,147,483,648 which is right in between these two numbers.
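
The byte counts above can be double-checked with a short sketch. It assumes 8 bytes per element (the default int64 dtype of rng.integers) and that the relevant threshold is the largest signed 32-bit value, which is my suspicion rather than something confirmed by the traceback. The names used here are illustrative only.

```python
# Sanity check of the sizes quoted above (pure Python, no pyarrow needed).
# Assumption: 8 bytes per element, i.e. the default int64 dtype.
BYTES_PER_ELEMENT = 8
bytes_per_image = 3 * 720 * 1280 * BYTES_PER_ELEMENT

# Assumption: the limit is the largest signed 32-bit value (2^31 - 1).
max_total_bytes = 2**31 - 1

for num_rows in (97, 98):
    total = num_rows * bytes_per_image
    status = "fits" if total <= max_total_bytes else "exceeds the limit"
    print(f"{num_rows} rows: {total:,} bytes ({status})")

# Largest number of rows whose combined payload stays under the limit.
max_rows = max_total_bytes // bytes_per_image
print(f"max rows below the limit: {max_rows}")
```

Under these assumptions the cutoff falls exactly between 97 and 98 rows, which matches the observed behaviour.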

While in this MWE the images consume more bytes than needed, in my use case I cannot use fewer bytes.
Is there a simple way to solve this issue?

Component(s)

Python
