Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize _combine_positional_deletes #1271

Open
kevinjqliu opened this issue Oct 30, 2024 · 5 comments
Open

optimize _combine_positional_deletes #1271

kevinjqliu opened this issue Oct 30, 2024 · 5 comments

Comments

@kevinjqliu
Copy link
Contributor

Apache Iceberg version

None

Please describe the bug 🐞

As part of the effort to remove numpy as a dependency in #1259, we changed _combine_positional_deletes function to use range instead of np.arrange. This causes a performance regression. We choose to move forward for now since it is not very common to have a file affected by multiple positional deletes.

apache/arrow/#44583 is opened to add equivalent functionality in pyarrow which we can then port into pyiceberg.

@omkenge
Copy link
Contributor

omkenge commented Oct 30, 2024

Hi @kevinjqliu
Can we rewrite _combine_positional_deletes function by using a set-based approach instead of the previous NumPy method. The set method significantly improves performance, particularly when handling large arrays of deleted positions.

@kevinjqliu
Copy link
Contributor Author

#1259 (comment)
possible solution using pyarrow cython/C++ API

@kevinjqliu
Copy link
Contributor Author

@omkenge i've tried a set-based approach but didn't see any performance improvements. I used #1259 (comment) to test

@corleyma
Copy link

corleyma commented Nov 1, 2024

@kevinjqliu I did find a pure-python approach that is faster (~2.4x on my machine) than pyarrow.array(range(...)):

import pyarrow as pa
import ctypes

def create_arrow_range(start: int, end: int) -> pa.Array:
    if start >= end:
        raise ValueError("start must be less than end")

    length = end - start

    buf = pa.allocate_buffer(length * 8, resizable=False)

    ptr: ctypes.Array = (ctypes.c_int64 * length).from_buffer(buf)
    for i in range(length):
        ptr[i] = start + i

    array = pa.Array.from_buffers(pa.int64(), length, [None, buf])

    return array

@kevinjqliu
Copy link
Contributor Author

@corleyma thats awesome, thanks! Would you like to open a PR and contribute the change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants