Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] RecordBatch.filter() segfaults if passed a ChunkedArray #38770

Closed
nph opened this issue Nov 18, 2023 · 6 comments
Closed

[C++][Python] RecordBatch.filter() segfaults if passed a ChunkedArray #38770

nph opened this issue Nov 18, 2023 · 6 comments
Assignees
Labels
Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. good-second-issue Type: bug
Milestone

Comments

@nph
Copy link
Contributor

nph commented Nov 18, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Filtering a record batch with a boolean mask in the form of a ChunkedArray results in a segmentation fault.

Reproduction on PyArrow 14.0.1:

import pandas as pd
import pyarrow as pa

# Create a simple record batch
df = pd.DataFrame(range(1, 7), columns=['x'])
record_batch = pa.RecordBatch.from_pandas(df)

# Generate a boolean mask as a chunked array
mask = pa.chunked_array([[True, False], [True, False], [True, False]])

# Flatten the chunked array mask and use this to filter the RecordBatch - all good
print(record_batch.filter(mask.combine_chunks()))
# pyarrow.RecordBatch
# x: int64 
# ----
# x: [1,3,5]

# Try filtering with the original chunked array - segfaults
print(record_batch.filter(mask))
# libc++abi: terminating with uncaught exception of type std::bad_variant_access: bad_variant_access
# Abort trap: 6

Note - this problem doesn't occur when filtering a table.

Component(s)

Python

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 23, 2023

@nph thanks for the report, can confirm the crash.

GDB back trace:

#0  0x00007ffff7c7800b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7c57859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fffb5e36026 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007fffb5e34514 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x00007fffb5e34566 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x00007fffb5e34758 in __cxxabiv1::__cxa_throw (obj=0x555556343740, tinfo=tinfo@entry=0x7fffb7836e48 <typeinfo for std::bad_variant_access>, 
    dest=dest@entry=0x7fffb649cdd0 <std::bad_variant_access::~bad_variant_access()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:98
#6  0x00007fffb614197d in std::__throw_bad_variant_access (__what=0x7fffb74b9898 "std::get: wrong index for variant", __what@entry=0x7fffb5f38000 "\177ELF\002\001\001\003")
    at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/10.4.0/variant:1284
#7  0x00007fffb614199b in std::__throw_bad_variant_access (__valueless=__valueless@entry=false) at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/10.4.0/variant:1292
#8  0x00007fffb6180aca in std::get<2ul, arrow::Datum::Empty, std::shared_ptr<arrow::Scalar>, std::shared_ptr<arrow::ArrayData>, std::shared_ptr<arrow::ChunkedArray>, std::shared_ptr<arrow::RecordBatch>, std::shared_ptr<arrow::Table> > (__v=...) at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/10.4.0/variant:1688
#9  std::get<std::shared_ptr<arrow::ArrayData>, arrow::Datum::Empty, std::shared_ptr<arrow::Scalar>, std::shared_ptr<arrow::ArrayData>, std::shared_ptr<arrow::ChunkedArray>, std::shared_ptr<arrow::RecordBatch>, std::shared_ptr<arrow::Table> > (__v=...) at /home/joris/miniconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/10.4.0/variant:1111
#10 arrow::Datum::array (this=<optimized out>) at /home/joris/scipy/repos/arrow/cpp/src/arrow/datum.h:196
#11 arrow::compute::internal::(anonymous namespace)::FilterRecordBatch (ctx=0x7fffffffb780, options=0x555556207c20, filter=..., batch=...)
    at /home/joris/scipy/repos/arrow/cpp/src/arrow/compute/kernels/vector_selection_filter_internal.cc:904

I think we seem to simply assume that filter is an array for the FilterRecordbatch case, and do not guard for that. At least we should raise a proper error that filtering with a chunked array is not support for RecordBatch (or in that case take the code path for FilterTable, which does support chunked filter).

@jorisvandenbossche jorisvandenbossche changed the title [Python]RecordBatch.filter() segfaults if passed a ChunkedArray [C++][Python] RecordBatch.filter() segfaults if passed a ChunkedArray Nov 23, 2023
@jorisvandenbossche jorisvandenbossche added this to the 15.0.0 milestone Nov 23, 2023
@nph
Copy link
Contributor Author

nph commented Nov 23, 2023

Thanks @jorisvandenbossche . On a related note, I think it would be useful if RecordBatch.filter could take an Expression, as well as a boolean mask Array, matching the functionality of Table.filter. Would it be possible to leverage the code path for FilterTable to add this functionality?

@jorisvandenbossche
Copy link
Member

On a related note, I think it would be useful if RecordBatch.filter could take an Expression

I think that would indeed be a good idea to make this consistent with Table. Do you want to open a separate issue for that? (and a PR is certainly welcome as well! ;)). I think it should be relatively straightforward to update the _filter_table helper to also work with a recordbatch (eg just converting the record batch to a table)

@nph
Copy link
Contributor Author

nph commented Dec 13, 2023

Thanks @jorisvandenbossche for the suggestion - I'll open a separate issue for this and will work on the PR.

@AlenkaF
Copy link
Member

AlenkaF commented Apr 3, 2024

I have opened a PR where in case of a ChunkedArray mask, the FilterTable code path is taken. Note the result will be a table not a record batch. The PR: #40971

@AlenkaF AlenkaF removed this from the 16.0.0 milestone Apr 9, 2024
AlenkaF added a commit that referenced this issue May 8, 2024
…unkedArray (#40971)

### Rationale for this change

Filtering a record batch with a boolean mask in the form of a `ChunkedArray` results in a segmentation fault.

### What changes are included in this PR?

In case chunked array is passed as a mask to filter record batch, the code path for `pa.Table.filter()` is taken resulting in a filtered table.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #38770

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
@AlenkaF AlenkaF added this to the 17.0.0 milestone May 8, 2024
@AlenkaF
Copy link
Member

AlenkaF commented May 8, 2024

Issue resolved by pull request 40971
#40971

@AlenkaF AlenkaF closed this as completed May 8, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…d a ChunkedArray (apache#40971)

### Rationale for this change

Filtering a record batch with a boolean mask in the form of a `ChunkedArray` results in a segmentation fault.

### What changes are included in this PR?

In case chunked array is passed as a mask to filter record batch, the code path for `pa.Table.filter()` is taken resulting in a filtered table.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#38770

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. good-second-issue Type: bug
Projects
None yet
Development

No branches or pull requests

5 participants