optimize `_combine_positional_deletes` #1271

kevinjqliu · 2024-10-30T17:49:03Z

Apache Iceberg version

None

Please describe the bug 🐞

As part of the effort to remove numpy as a dependency in #1259, we changed the _combine_positional_deletes function to use range instead of np.arange. This causes a performance regression. We chose to move forward for now, since it is uncommon for a file to be affected by multiple positional deletes.

apache/arrow/#44583 has been opened to add equivalent functionality in pyarrow, which we can then port into pyiceberg.


omkenge · 2024-10-30T20:34:54Z

Hi @kevinjqliu
Can we rewrite the _combine_positional_deletes function using a set-based approach instead of the previous NumPy method? A set-based approach should significantly improve performance, particularly when handling large arrays of deleted positions.

kevinjqliu · 2024-10-30T22:13:19Z

#1259 (comment)
possible solution using pyarrow cython/C++ API

kevinjqliu · 2024-10-30T22:16:27Z

@omkenge I've tried a set-based approach but didn't see any performance improvement. I used #1259 (comment) to test.
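For readers following along, a set-based variant might look roughly like the minimal sketch below. It is not the pyiceberg implementation: the name `surviving_positions`, its signature, and the choice to return absolute positions are all assumptions made for illustration. It collects every deleted position into a set and keeps the positions in [start, end) that are not in it.

```python
from typing import Iterable, List


def surviving_positions(
    delete_lists: Iterable[Iterable[int]], start: int, end: int
) -> List[int]:
    """Hypothetical helper: positions in [start, end) not marked as deleted."""
    deleted = set()
    for positions in delete_lists:
        # Duplicates across delete files collapse naturally in the set.
        deleted.update(positions)
    # Membership tests on a set are O(1) on average, but the loop over the
    # full range still runs in the Python interpreter.
    return [pos for pos in range(start, end) if pos not in deleted]


kept = surviving_positions([[1, 3], [3, 5]], 0, 7)
print(kept)  # [0, 2, 4, 6]
```

Because the iteration over the full range still happens in Python, it is plausible that this shows no speedup over building the range and filtering it with pyarrow.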

corleyma · 2024-11-01T00:13:03Z

@kevinjqliu I did find a pure-Python approach that is faster (~2.4x on my machine) than pyarrow.array(range(...)):

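import ctypes

import pyarrow as pa


def create_arrow_range(start: int, end: int) -> pa.Array:
    """Build an int64 Arrow array holding start, start + 1, ..., end - 1."""
    if start >= end:
        raise ValueError("start must be less than end")

    length = end - start

    # Allocate a fixed-size buffer with 8 bytes per int64 value.
    buf = pa.allocate_buffer(length * 8, resizable=False)

    # View the buffer as a ctypes int64 array and fill it in place.
    ptr: ctypes.Array = (ctypes.c_int64 * length).from_buffer(buf)
    for i in range(length):
        ptr[i] = start + i

    # Wrap the filled buffer without copying; None means no validity bitmap.
    array = pa.Array.from_buffers(pa.int64(), length, [None, buf])

    return array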
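To see what the fill loop writes without needing pyarrow installed, the same ctypes technique can be tried on a plain bytearray. This is only an illustrative sketch: `fill_int64_range` is a hypothetical name, and a bytearray stands in for the Arrow buffer.

```python
import ctypes


def fill_int64_range(start: int, end: int) -> bytearray:
    """Hypothetical helper: raw native-endian int64 bytes for [start, end)."""
    if start >= end:
        raise ValueError("start must be less than end")
    length = end - start
    raw = bytearray(length * 8)  # 8 bytes per int64 value
    # from_buffer shares memory with `raw`, so writes land in it directly.
    view = (ctypes.c_int64 * length).from_buffer(raw)
    for i in range(length):
        view[i] = start + i
    return raw


raw = fill_int64_range(5, 9)
# Reinterpret the bytes as native signed 64-bit integers to check the layout.
print(memoryview(raw).cast("q").tolist())  # [5, 6, 7, 8]
```

The buffer holds contiguous native int64 values, which is the layout pa.Array.from_buffers is given for pa.int64() in the snippet above.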

kevinjqliu · 2024-11-01T18:45:36Z

@corleyma that's awesome, thanks! Would you like to open a PR and contribute the change?

kevinjqliu mentioned this issue Oct 30, 2024

Remove numpy as a hard dependency #1270

Merged
