Use a join for upsert deduplication #1685
Conversation
LGTM! A few nit comments.
```python
return rows_to_update_table

non_key_cols = all_columns - join_cols_set

diff_expr = functools.reduce(operator.or_, [pc.field(f"{col}-lhs") != pc.field(f"{col}-rhs") for col in non_key_cols])
```
De Morgan's law in the wild 🥇
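For context, a minimal self-contained sketch of the trick (the table contents here are illustrative, not from the PR): instead of negating "all non-key columns are equal", the filter ORs together per-column inequalities, which by De Morgan's law selects exactly the changed rows.

```python
import functools
import operator

import pyarrow as pa
import pyarrow.compute as pc

# Joined table with "-lhs" (source) and "-rhs" (target) copies of each non-key column.
joined = pa.table({
    "id": [1, 2, 3],
    "name-lhs": ["a", "b", "c"],
    "name-rhs": ["a", "x", "c"],
    "qty-lhs": [10, 20, 30],
    "qty-rhs": [10, 20, 31],
})
non_key_cols = ["name", "qty"]

# NOT(all columns equal) == (at least one column differs),
# so a single OR over per-column inequalities finds the rows to update.
diff_expr = functools.reduce(
    operator.or_,
    [pc.field(f"{col}-lhs") != pc.field(f"{col}-rhs") for col in non_key_cols],
)

print(joined.filter(diff_expr))  # rows with id 2 and 3
```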
```python
source_table
# We already know that the schema is compatible, this is to fix large_ types
.cast(target_table.schema)
.join(target_table, keys=list(join_cols_set), join_type="inner", left_suffix="-lhs", right_suffix="-rhs")
```
Nit: should we add `coalesce_keys=True` here to avoid duplicates in the resulting join table? Since we only check whether `source_table` has duplicates, the `target_table` might still produce duplicates. A sketch of the fan-out this is guarding against is shown below.
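A small hypothetical example of the failure mode being described: duplicate keys on the right side of an inner join fan out into duplicate result rows, even when the left side is clean.

```python
import pyarrow as pa

source = pa.table({"id": [1], "val": ["new"]})
# target carries a duplicate key, which a source-side-only duplicate check misses
target = pa.table({"id": [1, 1], "val": ["old", "older"]})

joined = source.join(
    target, keys=["id"], join_type="inner", left_suffix="-lhs", right_suffix="-rhs"
)
print(joined.num_rows)  # 2: the single source row fans out into two result rows
```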
Great catch! Since we've already filtered the `target_table`, I think we could also do the check there; it isn't that expensive anymore.
Included a test 👍
```python
.join(target_table, keys=list(join_cols_set), join_type="inner", left_suffix="-lhs", right_suffix="-rhs")
.filter(diff_expr)
.drop_columns([f"{col}-rhs" for col in non_key_cols])
.rename_columns({f"{col}-lhs" if col not in join_cols else col: col for col in source_table.column_names})
```
Oh, this is a dictionary! https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.rename_columns
And the non-join columns will be ignored by `create_match_filter`.
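For reference, a sketch of the dict form of `rename_columns` (supported in newer PyArrow releases), which strips the `-lhs` suffix from non-key columns only; the table and column names are illustrative:

```python
import pyarrow as pa

joined = pa.table({"id": [1], "name-lhs": ["a"], "name-rhs": ["b"]})
join_cols = {"id"}

# The dict form maps old name -> new name; columns absent from the dict keep their names.
renamed = joined.rename_columns(
    {f"{col}-lhs" if col not in join_cols else col: col for col in ["id", "name"]}
)
print(renamed.column_names)  # ['id', 'name', 'name-rhs']
```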
Yes, only the non-join columns get postfixed :)
LGTM! Thanks for adding the benchmark numbers too!
This changes the deduplication logic to use a join to deduplicate the rows. While the original design wasn't wrong, it is more efficient to push the work down into PyArrow to get better multi-threading and avoid the GIL.
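Putting the diff fragments above together, here is a minimal self-contained sketch of the join-based approach. The function name `rows_to_update` and the sample tables are illustrative stand-ins, not the project's actual code:

```python
import functools
import operator

import pyarrow as pa
import pyarrow.compute as pc


def rows_to_update(source_table: pa.Table, target_table: pa.Table, join_cols: list[str]) -> pa.Table:
    """Illustrative stand-in for the join-based deduplication in the PR."""
    join_cols_set = set(join_cols)
    non_key_cols = set(source_table.column_names) - join_cols_set

    # A row needs an update if at least one non-key column differs.
    diff_expr = functools.reduce(
        operator.or_,
        [pc.field(f"{col}-lhs") != pc.field(f"{col}-rhs") for col in non_key_cols],
    )

    return (
        source_table
        # Schema is assumed compatible; the cast normalizes large_ types.
        .cast(target_table.schema)
        .join(target_table, keys=list(join_cols_set), join_type="inner",
              left_suffix="-lhs", right_suffix="-rhs")
        .filter(diff_expr)
        .drop_columns([f"{col}-rhs" for col in non_key_cols])
        .rename_columns({f"{col}-lhs" if col not in join_cols else col: col
                         for col in source_table.column_names})
    )


source = pa.table({"id": [1, 2], "val": ["new", "same"]})
target = pa.table({"id": [1, 2], "val": ["old", "same"]})
print(rows_to_update(source, target, ["id"]))  # only the row with id 1
```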
I did a small benchmark:
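The original benchmark script isn't reproduced here; a hypothetical stand-in that times just the join-and-filter core might look like this (the workload size and shape are made up):

```python
import time

import pyarrow as pa
import pyarrow.compute as pc

# Made-up workload: n rows, one key column, one value column.
n = 5_000_000
source = pa.table({"id": list(range(n)), "val": [i % 7 for i in range(n)]})
target = pa.table({"id": list(range(n)), "val": [i % 5 for i in range(n)]})

start = time.perf_counter()
changed = (
    source.join(target, keys=["id"], join_type="inner",
                left_suffix="-lhs", right_suffix="-rhs")
    .filter(pc.field("val-lhs") != pc.field("val-rhs"))
)
print(f"{changed.num_rows} changed rows in {time.perf_counter() - start:.2f}s")
```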
And the result was: