Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
When I try to create a RecordBatch from a list with large objects, RecordBatch.from_pylist raises a TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array.
MWE:
import pyarrow as pa
import numpy as np
# Create a random array of shape [3, 720, 1280]
rng = np.random.default_rng(42)
image = rng.integers(low=0, high=255, size=(3, 720, 1280))
# Wrap into dict
row = {
    "image": {
        "data": image.tobytes(),
        "shape": image.shape,
    }
}
# Define schema
schema = pa.schema({
    "image": pa.struct({"data": pa.binary(), "shape": pa.list_(pa.uint16(), 3)})
})
# Convert to record batch
num_rows = 98
pylist = [row] * num_rows
batch = pa.RecordBatch.from_pylist(pylist, schema=schema)
Traceback:
Traceback (most recent call last):
  File "mwe.py", line 22, in <module>
    batch = pa.RecordBatch.from_pylist(pylist, schema=schema)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 2049, in pyarrow.lib._Tabular.from_pylist
  File "pyarrow/table.pxi", line 6460, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 3550, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
When num_rows is reduced to 97, the example above runs without any error.
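I suspect the issue is related to the total size in bytes of the binary data in the pylist. Each image takes 3 * 720 * 1280 * 8 = 22,118,400 bytes (rng.integers returns int64 by default, so 8 bytes per element).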
98 images have 2,167,603,200 bytes.
97 images have 2,145,484,800 bytes.
2^31 is 2,147,483,648 which is right in between these two numbers.
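The numbers above can be double-checked with a few lines of plain Python. This is only a sanity check of the arithmetic, assuming 8 bytes per element as in the MWE; the variable names are made up for illustration.

```python
# Size of one image in the MWE: 3 x 720 x 1280 elements, 8 bytes each (int64).
bytes_per_image = 3 * 720 * 1280 * 8

# Suspected threshold: 2^31 bytes.
limit = 2**31

for num_rows in (97, 98):
    total = num_rows * bytes_per_image
    print(f"{num_rows} rows: {total:,} bytes, exceeds 2^31: {total > limit}")
```

With 97 rows the total stays below 2^31, and with 98 rows it goes above, which matches where the error starts to appear.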
While in this MWE the images consume more bytes than needed, in my use case I cannot use fewer bytes.
Is there a simple way to solve this issue?
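One workaround that might apply, assuming the failure really is caused by the binary column crossing the 2^31-byte boundary, is to split the pylist into smaller pieces before conversion so that no single batch exceeds the limit. The helper below is a hypothetical sketch, not part of pyarrow: `split_rows_by_bytes`, `payload_size`, and the default `max_bytes` are names and values I made up, and the exact internal limit may be slightly lower than the default used here.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")

# Assumed budget: just under 2^31 bytes of binary data per batch.
# The real internal limit may differ slightly; lower this to be safe.
MAX_BINARY_BYTES = 2**31 - 1


def split_rows_by_bytes(
    rows: List[T],
    payload_size: Callable[[T], int],
    max_bytes: int = MAX_BINARY_BYTES,
) -> Iterator[List[T]]:
    """Yield consecutive sublists of `rows` whose summed payload size fits in `max_bytes`.

    A single row larger than `max_bytes` is still yielded on its own,
    so it would need separate handling.
    """
    chunk: List[T] = []
    used = 0
    for row in rows:
        size = payload_size(row)
        # Start a new chunk when adding this row would exceed the budget.
        if chunk and used + size > max_bytes:
            yield chunk
            chunk, used = [], 0
        chunk.append(row)
        used += size
    if chunk:
        yield chunk
```

Each yielded sublist could then be passed to pa.RecordBatch.from_pylist separately, with `payload_size` set to something like `lambda r: len(r["image"]["data"])`. Alternatively, declaring the field as pa.large_binary() instead of pa.binary() may avoid the limit altogether, since that type uses 64-bit offsets, but I have not verified that it fixes this particular error.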
Component(s)
Python