Describe the bug, including details regarding any error messages, version, and platform.
I have a table with ~100k records split into 4 chunks of sizes [32768, 32768, 29599, 16692].
key1 has high cardinality (30,000 distinct values); key2 has low cardinality (10 distinct values). A few sample rows:
key1        key2     value
KEY1_12226  KEY1_08  0.348599
KEY1_10214  KEY1_08  0.954173
KEY1_26821  KEY1_09  0.416615
KEY1_24557  KEY1_06  0.883226
KEY1_27823  KEY1_08  0.0127225
I'm trying to group by key1 and key2 to get the sum of value. This works fine in general, but when my preprocessing misaligns the chunks in the table, it fails.
import numpy as np
import pyarrow as pa

# 30,000 distinct values for key1 (high cardinality),
# 10 distinct values for key2 (low cardinality).
KEYS_1 = [f"KEY1_{i:05d}" for i in range(30_000)]
KEYS_2 = [f"KEY1_{i:02d}" for i in range(10)]
SIDE = ["LEFT", "RIGHT"]  # unused in this repro

def generate_table(sizes):
    # Build one record batch per requested size, then stack them into
    # a table so every column starts with the same chunk layout.
    batches = [
        pa.record_batch(
            [
                np.random.choice(KEYS_1, size),
                np.random.choice(KEYS_2, size),
                np.random.rand(size),
            ],
            ["key1", "key2", "value"],
        )
        for size in sizes
    ]
    return pa.Table.from_batches(batches)

table = generate_table([32768, 32768, 29599, 16692])

# This works well:
pa.TableGroupBy(table, ["key1", "key2"]).aggregate(
    [
        ["value", "sum"],
    ]
)

# This misaligns the chunks: "value" is collapsed into a single chunk
# while key1 and key2 still have four chunks each.
table = table.set_column(
    table.schema.get_field_index("value"),
    "value",
    table["value"].combine_chunks(),
)

print("HERE")
pa.TableGroupBy(table, ["key1", "key2"]).aggregate(
    [
        ["value", "sum"],
    ]
)  # segfault :-(
print("NEVER THERE")
It took me a while to get to the bottom of the problem. The chunk sizes and the key cardinalities seem to be important factors in whether it fails or not.
The short-term solution for me is to call combine_chunks before the groupby, as sketched below.
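For reference, a minimal sketch of that workaround, reusing the table from the repro above (combine_chunks copies the data so that every column ends up with the same chunk layout; the names aligned and result are just for illustration):

# Workaround: re-chunk the whole table so all columns share one
# consistent chunk layout before grouping.
aligned = table.combine_chunks()
result = pa.TableGroupBy(aligned, ["key1", "key2"]).aggregate(
    [
        ["value", "sum"],
    ]
)
print(result)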
This is on PyArrow 11.0.0, Python 3.9.
Component(s)
Python
lidavidm changed the title from "Segfault when calling groupby on table with misaligned chunks" to "[C++][Python] Segfault when calling groupby on table with misaligned chunks" on Feb 17, 2023.
True, though I think we are using two different senses of "align" here.
There is a known issue with "data alignment": we require at least 8-byte-aligned pointers. There is a PR in the works (#14758) to enforce alignment (which requires a data copy) as data goes into Acero.
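For illustration, a small sketch of what "data alignment" means in that sense (inspecting buffer addresses is just for demonstration here; it is not the enforcement mechanism from #14758):

# Check whether an array's underlying buffers start at 8-byte-aligned
# addresses, the minimum alignment mentioned above.
arr = pa.array([1, 2, 3], type=pa.int64())
for buf in arr.buffers():
    if buf is not None:
        print(hex(buf.address), "aligned:", buf.address % 8 == 0)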
However, I believe what @0x26res is referring to here is chunk alignment within a table. Generally one assumes that a table consists of consistent chunks, i.e. a vector of record batches. However, it is possible for some columns to be chunked differently than others. Acero handles this by requiring that all input be record batches, so misaligned tables are aligned by zero-copy slicing into appropriately sized chunks. This defect stems from the fact that group_by was previously implemented outside of Acero using an internal helper function.
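To make that concrete, a small sketch (assuming the table variable from the repro above, after set_column) showing the per-column chunk layout and how Table.to_batches() re-slices it into consistent record batches:

# After set_column, "value" has 1 chunk but the keys still have 4:
for name in table.column_names:
    print(name, table[name].num_chunks)

# to_batches() slices the columns (zero-copy) at the union of all
# chunk boundaries, yielding a consistent vector of record batches.
for batch in table.to_batches():
    print(batch.num_rows)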
TL;DR: this is fixed by #14867. I ran the given example (thanks for this!) and can reproduce the crash on master but not on #14867.