Use a join for upsert deduplication #1685

Merged
merged 11 commits into apache:main from fd-join on Feb 21, 2025
Conversation

@Fokko (Contributor) commented Feb 19, 2025

This changes the deduplication logic to use a join to deduplicate the rows. While the original design wasn't wrong, it is more efficient to push the work down into PyArrow, which gives better multi-threading and avoids the GIL.
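Roughly, the new path looks like the sketch below. This is not the exact PR code (fragments of the real implementation are quoted in the review comments further down, and the real code also casts the source to the target schema first), and the dict form of rename_columns assumes a recent PyArrow:

import functools
import operator

import pyarrow as pa
import pyarrow.compute as pc


def rows_to_update(source: pa.Table, target: pa.Table, join_cols: list[str]) -> pa.Table:
    """Return the source rows whose key exists in target but whose non-key values differ."""
    non_key_cols = set(source.column_names) - set(join_cols)

    # Inner-join on the key columns; colliding non-key columns get a per-side suffix.
    joined = source.join(
        target, keys=join_cols, join_type="inner", left_suffix="-lhs", right_suffix="-rhs"
    )

    # A row needs an update when any non-key column differs between the two sides.
    diff_expr = functools.reduce(
        operator.or_,
        [pc.field(f"{col}-lhs") != pc.field(f"{col}-rhs") for col in non_key_cols],
    )

    return (
        joined.filter(diff_expr)
        .drop_columns([f"{col}-rhs" for col in non_key_cols])
        # Strip the "-lhs" suffix so the result matches the source schema again.
        .rename_columns({f"{col}-lhs" if col not in join_cols else col: col for col in source.column_names})
    )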

I did a small benchmark:

import time
import pyarrow as pa

from pyiceberg.catalog import Catalog
from pyiceberg.exceptions import NoSuchTableError
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, IntegerType


def _drop_table(catalog: Catalog, identifier: str) -> None:
    try:
        catalog.drop_table(identifier)
    except NoSuchTableError:
        pass


def test_vo(session_catalog: Catalog):
    catalog = session_catalog
    identifier = "default.test_upsert_benchmark"
    _drop_table(catalog, identifier)

    schema = Schema(
        NestedField(1, "idx", IntegerType(), required=True),
        NestedField(2, "number", IntegerType(), required=True),
        # Mark idx as the identifier field, also known as the primary key
        identifier_field_ids=[1],
    )

    tbl = catalog.create_table(identifier, schema=schema)

    arrow_schema = pa.schema(
        [
            pa.field("idx", pa.int32(), nullable=False),
            pa.field("number", pa.int32(), nullable=False),
        ]
    )

    # Write some data
    df = pa.Table.from_pylist(
        [
            {"idx": idx, "number": idx}
            for idx in range(1, 100000)
        ],
        schema=arrow_schema,
    )
    tbl.append(df)

    df_upsert = pa.Table.from_pylist(
        # Overlap: existing keys with unchanged values (should be a no-op)
        [{"idx": idx, "number": idx} for idx in range(80000, 90000)]
        # Update: existing keys with changed values
        + [{"idx": idx, "number": idx + 1} for idx in range(90000, 100000)]
        # Insert: new keys
        + [{"idx": idx, "number": idx} for idx in range(100000, 110000)],
        schema=arrow_schema,
    )

    start = time.time()

    tbl.upsert(df_upsert)

    stop = time.time()

    print(f"Took {stop-start} seconds")

And the result was:

Took 2.0412521362304688 seconds on the fd-join branch
Took 3.5236432552337646 seconds on latest main

@Fokko changed the title from "Use a join for deduplication" to "Use a join for upsert deduplication" on Feb 19, 2025
@kevinjqliu (Contributor) left a comment

LGTM! a few nit comments

return rows_to_update_table
non_key_cols = all_columns - join_cols_set

diff_expr = functools.reduce(operator.or_, [pc.field(f"{col}-lhs") != pc.field(f"{col}-rhs") for col in non_key_cols])
Contributor

De Morgan's law in the wild 🥇
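(For illustration, with hypothetical column names: the filter keeps the rows where at least one non-key column differs, i.e. the complement of "all non-key columns are equal".)

import functools
import operator

import pyarrow.compute as pc

non_key_cols = ["number", "name"]  # hypothetical non-key columns
any_differs = functools.reduce(
    operator.or_, [pc.field(f"{c}-lhs") != pc.field(f"{c}-rhs") for c in non_key_cols]
)
all_equal = functools.reduce(
    operator.and_, [pc.field(f"{c}-lhs") == pc.field(f"{c}-rhs") for c in non_key_cols]
)
# any_differs selects exactly the rows that ~all_equal would select.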

source_table
# We already know that the schema is compatible, this is to fix large_ types
.cast(target_table.schema)
.join(target_table, keys=list(join_cols_set), join_type="inner", left_suffix="-lhs", right_suffix="-rhs")
Contributor

nit: should we add coalesce_keys=True here to avoid duplicates in the resulting join table?

Since we only check whether source_table has duplicates, the target_table might still produce duplicates.

Contributor Author

Great catch! Since we've already filtered the target_table, I think we could also do the check there; it isn't that expensive anymore.

Contributor Author

Included a test 👍
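As a toy illustration of the fan-out discussed above (hypothetical tables, not the PR's test): a duplicate key on the target side multiplies the matching source row in an inner join.

import pyarrow as pa

left = pa.table({"idx": [1], "number": [10]})
right = pa.table({"idx": [1, 1], "number": [10, 11]})  # duplicate key on the target side

joined = left.join(right, keys=["idx"], join_type="inner",
                   left_suffix="-lhs", right_suffix="-rhs")
print(joined.num_rows)  # 2 -- the single source row matched twice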

.join(target_table, keys=list(join_cols_set), join_type="inner", left_suffix="-lhs", right_suffix="-rhs")
.filter(diff_expr)
.drop_columns([f"{col}-rhs" for col in non_key_cols])
.rename_columns({f"{col}-lhs" if col not in join_cols else col: col for col in source_table.column_names})
Contributor

Oh, this is a dictionary! https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.rename_columns
And the non-join columns will be ignored by create_match_filter.

Contributor Author

Yes, only the non-join columns get suffixed :)
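A quick sketch of the mapping form (hypothetical table; assumes a PyArrow release where rename_columns accepts a dict, per the docs linked above). Mirroring the PR's comprehension, key columns map to themselves and suffixed columns get the suffix stripped:

import pyarrow as pa

t = pa.table({"idx": [1], "number-lhs": [10]})
t = t.rename_columns({"idx": "idx", "number-lhs": "number"})  # old name -> new name
print(t.column_names)  # ['idx', 'number']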

@kevinjqliu (Contributor) left a comment

LGTM! Thanks for adding the benchmark numbers too!

@kevinjqliu merged commit b95e792 into apache:main on Feb 21, 2025
7 checks passed