Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repartition Optimisation Pass Breaks Sorting (wrong answer with limit) #423

Closed
tustvold opened this issue May 25, 2021 · 4 comments · Fixed by #1776
Closed

Repartition Optimisation Pass Breaks Sorting (wrong answer with limit) #423

tustvold opened this issue May 25, 2021 · 4 comments · Fixed by #1776
Assignees
Labels
bug Something isn't working

Comments

@tustvold
Copy link
Contributor

Describe the bug

The Repartition optimisation pass fails to account for order sensitive operations, which causes it to repartition a sorted partition. As there isn't (yet) an order preserving merge operator (#362) this results in an incorrect final physical plan.

To Reproduce

For example,

LogicalPlanBuilder::scan_csv(&path, options, None)?
            .sort(vec![col("c4").sort(true, true)])?
            .filter(col("c2").eq(lit(1)))?
            .limit(10)?
            .build()?;

Results in

GlobalLimitExec {
    input: MergeExec {
        input: CoalesceBatchesExec {
            input: FilterExec {
                predicate: BinaryExpr {
                    left: Column {
                        name: "c2",
                    },
                    op: Eq,
                    right: TryCastExpr {
                        expr: Literal {
                            value: Int32(1),
                        },
                        cast_type: Int64,
                    },
                },
                input: RepartitionExec {
                    input: SortExec {
                        input: CsvExec {
                            source: PartitionedFiles {
                                path: "/home/raphael/repos/external/arrow-datafusion/testing/data/csv/aggregate_test_100.csv",
                                filenames: [
                                    "/home/raphael/repos/external/arrow-datafusion/testing/data/csv/aggregate_test_100.csv",
                                ],
                            },
                            ...
                        },
                        expr: [
                            PhysicalSortExpr {
                                expr: Column {
                                    name: "c4",
                                },
                                options: SortOptions {
                                    descending: false,
                                    nulls_first: true,
                                },
                            },
                        ],
                        ...
                    },
                    ...
                },
            },
            target_batch_size: 4096,
        },
    },
    limit: 10,
}

This is incorrect, as the MergeExec does not preserve ordering and therefore the limit will return an arbitrary set of rows

Expected behavior

I would expect the Repartition pass to not introduce partitioning on paths with operators that are order sensitive. Potentially once there is an order preserving merge, this restriction could be relaxed provided the inserted merge is order preserving, but until then this optimisation is incorrect.

I will shortly be creating a ticket with some proposals on how we could make DataFusion sort order aware

Additional context

This issue was spawned out of investigation on #378 (comment)

@tustvold tustvold added the bug Something isn't working label May 25, 2021
@alamb alamb changed the title Repartition Optimisation Pass Breaks Sorting Repartition Optimisation Pass Breaks Sorting (wrong answer with limit) May 25, 2021
@alamb
Copy link
Contributor

alamb commented May 25, 2021

This is a good find 🕵️ 👍 @tustvold

@Dandandan
Copy link
Contributor

@tustvold thanks for the detailed write up!

@rdettai
Copy link
Contributor

rdettai commented Oct 29, 2021

Didn't look into this in details, but it looks very much like something that would require to dig up #92 to be solved 😃

@alamb alamb self-assigned this Feb 4, 2022
@alamb
Copy link
Contributor

alamb commented Feb 4, 2022

I just hit a variation of this in IOx -- see https://github.com/influxdata/influxdb_iox/pull/3633#issuecomment-1030126757

Conveniently @tustvold added some infrastructure to support this in #1732

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants