optimize _combine_positional_deletes
#1271
Comments
Hi @kevinjqliu
#1259 (comment)
@omkenge I've tried a …
@kevinjqliu I did find a pure-python approach that is faster (~2.4x on my machine) than the current approach:

```python
import ctypes

import pyarrow as pa


def create_arrow_range(start: int, end: int) -> pa.Array:
    if start >= end:
        raise ValueError("start must be less than end")
    length = end - start
    # Allocate a raw Arrow buffer and fill it in place via ctypes,
    # avoiding per-element Python-object conversion.
    buf = pa.allocate_buffer(length * 8, resizable=False)
    ptr: ctypes.Array = (ctypes.c_int64 * length).from_buffer(buf)
    for i in range(length):
        ptr[i] = start + i
    # Wrap the filled buffer as an int64 array (no validity buffer needed).
    return pa.Array.from_buffers(pa.int64(), length, [None, buf])
```
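(Editor's note, not from the original thread: a rough micro-benchmark one could use to reproduce the ~2.4x figure. The baseline function `range_baseline`, the input size, and the timing parameters are illustrative assumptions; it assumes `create_arrow_range` from the snippet above is in scope.)

```python
import timeit

import pyarrow as pa


def range_baseline(start: int, end: int) -> pa.Array:
    # The pure-Python path the issue describes as the regression:
    # materialize a range() into an Arrow array element by element.
    return pa.array(range(start, end), type=pa.int64())


n = 10_000_000
t_buf = timeit.timeit(lambda: create_arrow_range(0, n), number=5)
t_rng = timeit.timeit(lambda: range_baseline(0, n), number=5)
print(f"buffer fill: {t_buf:.3f}s  range(): {t_rng:.3f}s")
```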
@corleyma that's awesome, thanks! Would you like to open a PR and contribute the change?
Apache Iceberg version
None
Please describe the bug 🐞
As part of the effort to remove `numpy` as a dependency in #1259, we changed the `_combine_positional_deletes` function to use `range` instead of `np.arange`. This causes a performance regression. We chose to move forward for now, since it is not very common for a file to be affected by multiple positional deletes. apache/arrow#44583 is open to add equivalent functionality to pyarrow, which we can then port into pyiceberg.
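(Editor's note, not the actual pyiceberg implementation: a minimal sketch of what a `_combine_positional_deletes`-style helper does, showing where the `range` materialization dominates. The helper name and the use of `pyarrow.compute` are assumptions for illustration.)

```python
import pyarrow as pa
import pyarrow.compute as pc


def combine_positional_deletes_sketch(
    deletes: pa.Array, start: int, end: int
) -> pa.Array:
    # Positions of all rows in the file slice [start, end).
    # Materializing this via range() is the hot path that regressed
    # relative to np.arange.
    all_positions = pa.array(range(start, end), type=pa.int64())
    # Keep only the positions that are NOT marked as deleted.
    keep_mask = pc.invert(pc.is_in(all_positions, value_set=deletes))
    return pc.filter(all_positions, keep_mask)
```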