[Python] compute.replace_substring_regex sometimes returns incorrect offsets, causing crashes/ub #28619

@asfimport

Description

I've come across cases where using the result of pyarrow.compute.replace_substring_regex caused a segfault. After some experimentation, I found that the problem lies in the offsets buffer of the result.

Here is a Dockerfile that reproduces the problem in a few lines (though without an immediate crash):

FROM python:3.8
RUN pip install pyarrow
RUN echo "import pyarrow; \
    import pyarrow.compute; \
    options = pyarrow.compute.ReplaceSubstringOptions('a', ''); \
    values = [''] * 16; \
    arr = pyarrow.array(values, pyarrow.string()); \
    res = pyarrow.compute.replace_substring_regex(arr, options=options); \
    offsets = res.buffers()[1]; \
    assert any(offset != 0 for offset in offsets[-4:]);" > /test.py
RUN python /test.py

The Docker image installs pyarrow (4.0.0 at the time of submitting this issue) and then runs Python code that creates an array of 16 empty strings and calls replace_substring_regex on it.
The script then asserts that the last 4 bytes of the offsets buffer (which hold the last offset) are not all zero; this assertion fails.

Everything but the last offset looks fine: the validity buffer, the rest of the offsets, and the data buffer.
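For anyone who wants to inspect the offsets buffer more systematically than by looking at its last 4 bytes, here is a minimal sketch of a checker. It is not part of the original report: the helper name check_string_offsets is hypothetical, and it assumes an unsliced pyarrow.string() array whose offsets are little-endian 32-bit integers, with one more offset than there are strings. It uses only the standard library, so it can be run on raw bytes copied out of any result.

```python
import struct


def check_string_offsets(offsets_bytes, length, data_size):
    """Return a list of problems found in a 32-bit string offsets buffer.

    offsets_bytes: raw bytes of the offsets buffer
    length:        number of strings in the array
    data_size:     size in bytes of the data buffer
    Assumes an unsliced array with little-endian int32 offsets.
    """
    needed = (length + 1) * 4  # an array of n strings needs n + 1 offsets
    if len(offsets_bytes) < needed:
        return [f"offsets buffer has {len(offsets_bytes)} bytes, need {needed}"]

    # Ignore any padding bytes after the last offset.
    offsets = struct.unpack(f"<{length + 1}i", offsets_bytes[:needed])

    problems = []
    if offsets[0] < 0:
        problems.append(f"first offset is negative: {offsets[0]}")
    for i in range(length):
        if offsets[i + 1] < offsets[i]:
            problems.append(
                f"offset {i + 1} decreases: {offsets[i]} -> {offsets[i + 1]}"
            )
    if offsets[-1] > data_size:
        problems.append(
            f"last offset {offsets[-1]} exceeds data size {data_size}"
        )
    return problems
```

With pyarrow, the inputs could be taken from the result as res.buffers()[1].to_pybytes(), len(res), and the size of res.buffers()[2] (treating a missing data buffer as size 0). An empty return value only means that these particular invariants hold; it does not prove the array is otherwise valid.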

I have more elaborate examples of arrays where the last offset comes back as a random value, which causes crashes sooner than a plain 0 at the end does.
Another hint that might help: the problem occurs at multiples of 16, i.e. changing 16 to 32, 48, etc. still shows the problem, while other lengths do not.
 
When I cloned the latest master, built Arrow, and ran the example, there was no problem. But since I didn't see the issue here on JIRA, I thought I should post it anyway. I'm not sure I'm building correctly, so I may be adding a bug to a bug :)

Environment: ubuntu 20.04 or macos catalina running docker engine 20.10.2 and python 3.8.6
Reporter: Dror Speiser

Note: This issue was originally created as ARROW-12889. Please see the migration documentation for further details.
