table_scan execution panic for tables with matching data and delete files #217

@gruuya

Description

If I change the following

.sql("select product_id, sum(amount) from warehouse.test.orders group by product_id order by product_id")

to be a select * instead of an aggregation, I see the following panic:

thread 'test_equality_delete' panicked at datafusion_iceberg/tests/equality_delete.rs:182:10:
Failed to execute select query: Context("SanityCheckPlan", Plan("Plan: [\"SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false]\", \"  DataSourceExec: file_groups={0 groups: []}, projection=[id, customer_id, product_id, date, amount], file_type=parquet\"] does not satisfy distribution requirements: SinglePartition. Child-0 output partitioning: UnknownPartitioning(0)"))

The plan for that query looks like this:

    "| initial_physical_plan                                      | UnionExec                                                                                                                                                                                                                                                                                                      |",
    "|                                                            |   ProjectionExec: expr=[id@0 as id, customer_id@1 as customer_id, product_id@2 as product_id, date@3 as date, amount@4 as amount]                                                                                                                                                                              |",
    "|                                                            |     HashJoinExec: mode=CollectLeft, join_type=RightAnti, on=[(id@0, id@0), (customer_id@1, customer_id@1), (product_id@2, product_id@2), (date@3, date@3)]                                                                                                                                                     |",
    "|                                                            |       DataSourceExec: file_groups={1 group: [[test/orders/data/date_day=18262/64b47434-6d07-11f0-88f8-de51894a27a1.parquet]]}, projection=[id, customer_id, product_id, date, date_day], file_type=parquet                                                                                                     |",
    "|                                                            |       DataSourceExec: file_groups={1 group: [[test/orders/data/date_day=18262/64b02528-6d07-11f0-88f7-6d5546fe4045.parquet]]}, projection=[id, customer_id, product_id, date, amount], file_type=parquet                                                                                                       |",
    "|                                                            |   ProjectionExec: expr=[id@0 as id, customer_id@1 as customer_id, product_id@2 as product_id, date@3 as date, amount@4 as amount]                                                                                                                                                                              |",
    "|                                                            |     HashJoinExec: mode=CollectLeft, join_type=RightAnti, on=[(id@0, id@0), (customer_id@1, customer_id@1), (product_id@2, product_id@2), (date@3, date@3)]                                                                                                                                                     |",
    "|                                                            |       DataSourceExec: file_groups={1 group: [[test/orders/data/date_day=18294/64b48f32-6d07-11f0-88f9-bf3bba452bfc.parquet]]}, projection=[id, customer_id, product_id, date, date_day], file_type=parquet                                                                                                     |",
    "|                                                            |       DataSourceExec: file_groups={1 group: [[test/orders/data/date_day=18294/64af9cb6-6d07-11f0-88f6-1d36c1f3beb3.parquet]]}, projection=[id, customer_id, product_id, date, amount], file_type=parquet                                                                                                       |",
    "|                                                            |   DataSourceExec: file_groups={0 groups: []}, projection=[id, customer_id, product_id, date, amount], file_type=parquet                                                                                                                                                                                        |",
    "|                                                            |                                                                                                                                                                                                                                                                                                                |",

That last DataSourceExec: file_groups={0 groups: []} is the cause of the problem. It appears because in this test every data file group has a matching equality delete file group, so once all of them have been paired up for an (anti) join, there are no file groups left when constructing the other plan here:

let file_scan_config = FileScanConfigBuilder::new(object_store_url, file_schema, file_source)
    .with_file_groups(file_groups)
    .with_statistics(statistics)
    .with_projection(projection)
    .with_limit(limit)
    .with_table_partition_cols(table_partition_cols)
    .build();
let other_plan = ParquetFormat::default()
    .create_physical_plan(session, file_scan_config)
    .await?;

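Consequently, that last no-op plan is still added to the union. Since it has no file groups, its output partitioning is UnknownPartitioning(0), which does not satisfy the SinglePartition distribution requirement reported in the error above, and the sanity check panics.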
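One possible direction, sketched here with simplified stand-in types rather than the real DataFusion or table_scan code, is to skip the trailing scan when no unmatched file groups remain. `Plan`, `FileGroup` and `union_inputs` are hypothetical names introduced only for illustration; this is not a proposed patch, just the shape of the guard.

```rust
// Hypothetical, simplified stand-ins; none of these are DataFusion types.
type FileGroup = Vec<String>;

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Anti join of a data file group against its matching equality delete group.
    AntiJoin { deletes: FileGroup, data: FileGroup },
    /// Plain scan over the data file groups that have no matching delete files.
    Scan { file_groups: Vec<FileGroup> },
}

/// Collect the inputs of the union. The plain scan is only appended when at
/// least one non-empty unmatched file group is left, so a scan with zero file
/// groups (and therefore zero output partitions) never enters the plan.
fn union_inputs(joins: Vec<Plan>, unmatched: Vec<FileGroup>) -> Vec<Plan> {
    let mut inputs = joins;
    let remaining: Vec<FileGroup> = unmatched
        .into_iter()
        .filter(|group| !group.is_empty())
        .collect();
    if !remaining.is_empty() {
        inputs.push(Plan::Scan { file_groups: remaining });
    }
    inputs
}
```

With all data files matched, as in this test, `union_inputs` returns only the anti joins. The case where there are neither joins nor unmatched groups would still need separate handling (for example an explicitly empty plan), which this sketch does not cover.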
