Intermittent limit.slt failure #9450

Closed
alamb opened this issue Mar 4, 2024 · 4 comments · Fixed by #9474
Labels
bug Something isn't working

Comments

@alamb
Contributor

alamb commented Mar 4, 2024

Describe the bug

When I run limit.slt locally it fails like this:

cargo test --test sqllogictests -- limit
(venv) andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ cargo test --test sqllogictests -- limit
...
    Finished test [unoptimized + debuginfo] target(s) in 15.35s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-5fa91f6f3736c5fa)
Running "limit.slt"
External error: query result mismatch:
[SQL] EXPLAIN SELECT DISTINCT i FROM t1000;
[Diff] (-expected|+actual)
    logical_plan
    Aggregate: groupBy=[[t1000.i]], aggr=[[]]
    --TableScan: t1000 projection=[i]
    physical_plan
    AggregateExec: mode=FinalPartitioned, gby=[i@0 as i], aggr=[]
    --CoalesceBatchesExec: target_batch_size=8192
    ----RepartitionExec: partitioning=Hash([i@0], 4), input_partitions=4
    ------AggregateExec: mode=Partial, gby=[i@0 as i], aggr=[]
-   --------MemoryExec: partitions=4, partition_sizes=[1, 2, 1, 1]
+   --------MemoryExec: partitions=4, partition_sizes=[1, 1, 2, 1]
at test_files/limit.slt:392

Error: Execution("1 failures")
error: test failed, to rerun pass `-p datafusion-sqllogictest --test sqllogictests`

Caused by:
  process didn't exit successfully: `/Users/andrewlamb/Software/arrow-datafusion/target/debug/deps/sqllogictests-5fa91f6f3736c5fa limit` (exit status: 1)

@huaxingao also saw this failure on #9411 (comment)

To Reproduce

Here is an example failure on CI showing the same failure mode: https://github.com/apache/arrow-datafusion/actions/runs/8133918181/job/22226135558?pr=9411

Expected behavior

No response

Additional context

No response

@alamb alamb added the bug Something isn't working label Mar 4, 2024
@alamb
Contributor Author

alamb commented Mar 4, 2024

I think the issue is that the partitioning in the test is not deterministic: the PARTITION BY t1.column1 column has the same value in every row.

# generate BIGINT data from 1 to 1000 in multiple partitions
statement ok
CREATE TABLE t1000 (i BIGINT) AS
WITH t AS (VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0))
SELECT ROW_NUMBER() OVER (PARTITION BY t1.column1) FROM t t1, t t2, t t3;

# verify that there are multiple partitions in the input (i.e. MemoryExec says
# there are 4 partitions) so that this tests multi-partition limit.
query TT
EXPLAIN SELECT DISTINCT i FROM t1000;
----
logical_plan
Aggregate: groupBy=[[t1000.i]], aggr=[[]]
--TableScan: t1000 projection=[i]
physical_plan
AggregateExec: mode=FinalPartitioned, gby=[i@0 as i], aggr=[]
--CoalesceBatchesExec: target_batch_size=8192
----RepartitionExec: partitioning=Hash([i@0], 4), input_partitions=4
------AggregateExec: mode=Partial, gby=[i@0 as i], aggr=[]
--------MemoryExec: partitions=4, partition_sizes=[1, 2, 1, 1]

I think we can fix this by changing the test to use different values.
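As a rough check of the table definition quoted above, the following Python sketch mimics the three-way cross join and the ROW_NUMBER() window. It is only an illustration of the arithmetic (the variable names are made up here and it does not use DataFusion): every row ends up with the same PARTITION BY key, so the window has a single group numbered 1 to 1000.

```python
from itertools import product

# The ten-row VALUES list from the test: every row holds the same value.
t = [0] * 10

# "t t1, t t2, t t3" is a three-way cross join; keep only t1.column1 as the key.
partition_keys = [c1 for c1, _c2, _c3 in product(t, t, t)]

# ROW_NUMBER() OVER (PARTITION BY key): number the rows within each key group.
counters = {}
row_numbers = []
for key in partition_keys:
    counters[key] = counters.get(key, 0) + 1
    row_numbers.append(counters[key])

distinct_keys = set(partition_keys)
print(len(row_numbers), distinct_keys, min(row_numbers), max(row_numbers))
# prints: 1000 {0} 1 1000
```

With only one distinct key, the PARTITION BY clause does nothing to spread the rows, so the sketch says nothing about how the rows are split across the 4 partitions reported by MemoryExec.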

@alamb
Contributor Author

alamb commented Mar 4, 2024

So I think the problem is that the input is hash partitioned into 4 partitions, but somehow one of the partitions gets two batches, and which partition gets the two batches is nondeterministic.

explain CREATE TABLE t1000 (i BIGINT) AS
WITH t AS (VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0))
SELECT ROW_NUMBER() OVER (PARTITION BY t1.column1) FROM t t1, t t2, t t3;
----
logical_plan
CreateMemoryTable: Bare { table: "t1000" }
--Projection: CAST(ROW_NUMBER() PARTITION BY [t1.column1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS Int64) AS i
----WindowAggr: windowExpr=[[ROW_NUMBER() PARTITION BY [t1.column1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
------CrossJoin:
--------CrossJoin:
----------SubqueryAlias: t1
------------SubqueryAlias: t
--------------Values: (Int64(0)), (Int64(0)), (Int64(0)), (Int64(0)), (Int64(0))...
----------SubqueryAlias: t2
------------SubqueryAlias: t
--------------Projection: 
----------------Values: (Int64(0)), (Int64(0)), (Int64(0)), (Int64(0)), (Int64(0))...
--------SubqueryAlias: t3
----------SubqueryAlias: t
------------Projection: 
--------------Values: (Int64(0)), (Int64(0)), (Int64(0)), (Int64(0)), (Int64(0))...

Another way to fix the issue might be to add a configuration option such as datafusion.explain.show_sizes that would control whether partition_sizes is output in the explain plan.

Something like

set datafusion.explain.show_sizes = false;

And then the MemoryExec output would be generated without partition_sizes:

MemoryExec: partitions=4
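To illustrate why dropping partition_sizes would make the comparison stable, here is a minimal Python sketch that removes that field from the two MemoryExec lines in the diff above. The helper name strip_partition_sizes is hypothetical and this is not how DataFusion would implement the option; it only shows that the expected and actual lines differ in nothing else.

```python
import re

# Matches the ", partition_sizes=[...]" suffix seen on the MemoryExec lines.
_PARTITION_SIZES = re.compile(r",\s*partition_sizes=\[[^\]]*\]")


def strip_partition_sizes(plan_line: str) -> str:
    """Return the plan line with its partition_sizes field removed (hypothetical helper)."""
    return _PARTITION_SIZES.sub("", plan_line)


# The two lines from the failing diff in the issue description.
expected = "--------MemoryExec: partitions=4, partition_sizes=[1, 2, 1, 1]"
actual = "--------MemoryExec: partitions=4, partition_sizes=[1, 1, 2, 1]"

print(strip_partition_sizes(expected))
# prints: --------MemoryExec: partitions=4
print(strip_partition_sizes(expected) == strip_partition_sizes(actual))
# prints: True
```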

@alamb alamb changed the title from "Intermittent limit.slt fails sometimes" to "Intermittent limit.slt failure" Mar 4, 2024
@alamb
Contributor Author

alamb commented Mar 4, 2024

This test was recently changed in #9444

@mustafasrepo
Contributor

The second option seems like the more robust solution, since with the current setup of the test we might need to update the test each time the hash function changes.
