Support union arrays in concat_tables #44397

Open
rudolfbyker opened this issue Oct 14, 2024 · 1 comment

Describe the enhancement requested

It would be nice to have another step up from promote_options="permissive", e.g., promote_options="union" which uses dense unions when columns are heterogeneous across schemas. For example:

from pyarrow import table, concat_tables

t1 = table({"a": [1, 2, 3]})
t2 = table({"a": ["a", "b", "c"]})

concat_tables(tables=[t1, t2], promote_options="permissive")  # Currently raises `ArrowTypeError`.
concat_tables(tables=[t1, t2], promote_options="union")  # Does not exist at the moment.

The latter should use a dense union for column "a".

I've implemented this myself, but it is hard to do: there is no is_mergeable function that exposes the logic used by concat_tables(tables=…, promote_options="permissive"), so I have to re-implement it, either by guesswork or with lots of try/except blocks. Here is a rough attempt, which works for some cases, but not all. It does not preserve metadata and does not support missing columns:

from itertools import chain
from logging import getLogger
from typing import Sequence

from pyarrow import (
    Table,
    concat_tables,
    ArrowTypeError,
    table,
    chunked_array,
    ArrowInvalid,
    UnionArray,
    array,
    int8,
    int32,
)

logger = getLogger(__name__)


def concat_tables_heterogeneous(tables: Sequence[Table]) -> Table:
    """
    Concatenate multiple tables vertically.
    This is similar to `pyarrow.concat_tables`, but it allows for heterogeneous schemas by using dense unions.
    """
    try:
        return concat_tables(tables=tables, promote_options="permissive")
    except ArrowTypeError:
        logger.warning(
            "Heterogeneous table schemas detected. "
            "Some columns will be represented as dense unions, which are slower."
        )

    # TODO: Ask the `pyarrow` maintainers to give us an `is_mergeable` function that we can use to check which columns
    #   are mergeable without using dense unions, instead of maintaining our own heuristics here.
    it = iter(tables)
    column_names = next(it).column_names
    for t in it:
        if t.column_names != column_names:
            raise NotImplementedError(
                "The tables don't all have the same column names."
            )

    result = {}
    for column_name in column_names:
        try:
            result[column_name] = chunked_array([t[column_name] for t in tables])
        except ArrowInvalid:
            # These can't be concatenated into a normal `ChunkedArray`. Use a dense union.
            result[column_name] = UnionArray.from_dense(
                array(
                    list(chain(*([i] * t.num_rows for i, t in enumerate(tables)))),
                    type=int8(),
                ),
                array(
                    list(chain(*(range(t.num_rows) for t in tables))),
                    type=int32(),
                ),
                [array(t[column_name]) for t in tables],
            )

    return table(data=result)

Component(s)

Python

rudolfbyker commented Oct 14, 2024

One problem with my implementation above (besides the fact that it's done in Python rather than at a lower level in the Arrow engine) is that chunked_array will fail on columns where concat_tables would have succeeded (e.g., when merging int64, float64, and null). So instead of chunked_array we could use:

        try:
            result[column_name] = concat_tables(
                tables=[t.select([column_name]) for t in tables],
                promote_options="permissive",
            )[column_name]
        except ArrowTypeError:

But that still feels hacky...
