Writing with large arrow types in MERGE #1753

ion-elgreco · 2023-10-22T10:52:12Z

Description

It seems it's not yet possible to write with large Arrow types in .merge(). Write support already exists via write_deltalake(large_dtypes=True), so we should add the same to merge.

Use Case
Converting Polars DataFrames to Arrow and merging them immediately, instead of first casting to normal Arrow types, which may not fit if the arrays are too large. Attempting this currently fails with:

DeltaError: Generic DeltaTable error: Execution error: Fail to build join indices in NestedLoopJoinExec, error:Arrow error: Invalid argument error: Invalid comparison operation: LargeUtf8 == Utf8

# Description
This refactors the merge operation to use DataFusion's DataFrame and LogicalPlan APIs.

The NLJ (nested loop join) is eliminated and the query planner can pick the optimal join operator. This also enables the operation to use multiple threads and should result in a significant speedup.

Merge is still limited to a single thread in some areas. When collecting benchmarks, I encountered multiple OOM issues with DataFusion's hash join implementation. There are multiple tickets open upstream regarding this. For now, I've limited the number of partitions to 1 to prevent this.

Predicates passed as SQL are also easier to use now. Previously, manual casting was required to ensure data types were aligned. Now the logical plan performs type coercion when the plan is optimized.

# Related Issues
- enhances #850
- closes #1790
- closes #1753


…iter/merge (#1820)

# Description
This ports some functionality that @stinodego and I had worked on in Polars, where we converted a pyarrow schema to a compatible delta schema. It converts the following:
- uint -> int
- timestamp(any timeunit) -> timestamp(us)

I adjusted the functionality to do schema conversion from large to normal types when necessary, which is still needed in MERGE as a workaround for #1753.

Additional things I've added:
- Schema conversion for every input in write_deltalake/merge
- Add Pandas dataframe conversion
- Add Pandas dataframe as input in merge

# Related Issue(s)
- closes #686
- closes #1467

---------

Co-authored-by: Will Jones <willjones127@gmail.com>


ion-elgreco added the enhancement New feature or request label Oct 22, 2023

ion-elgreco mentioned this issue Nov 7, 2023

feat(python): add pyarrow to delta compatible schema conversion in writer/merge #1820

Merged

ion-elgreco mentioned this issue Nov 19, 2023

refactor: merge to use logical plans #1720

Merged

Blajda closed this as completed in #1720 Nov 19, 2023
