Skip to content

Conversation

@NGA-TRAN
Copy link
Contributor

Reproducer for #18341

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 28, 2025
@NGA-TRAN
Copy link
Contributor Author

@gene-bordegaray : can you help review this?

Copy link
Contributor

@gene-bordegaray gene-bordegaray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. Just want to ask clarifying question to make sure I am understanding:

  • the Round Robin Repartition into the Aggregate is useful because we can disperse work across partitions and then accumulate their results. Using the aggregated results we can use the Hash Repartition to hand off work with the same key (such as env = 'prod') to workers, thus is more efficient
  • the parquet query is not working this way as the Reparititons are not separated by the Aggregate. The Aggregate does all this work on a single partition then does Repartitioning too late.

@alamb
Copy link
Contributor

alamb commented Oct 29, 2025

This makes sense. Just want to ask clarifying question to make sure I am understanding:

  • the Round Robin Repartition into the Aggregate is useful because we can disperse work across partitions and then accumulate their results. Using the aggregated results we can use the Hash Repartition to hand off work with the same key (such as env = 'prod') to workers, thus is more efficient

Yes -- you can read more about the two phase aggregation in general here https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html#tymethod.state

@NGA-TRAN
Copy link
Contributor Author

the parquet query is not working this way as the Reparititons are not separated by the Aggregate. The Aggregate does all this work on a single partition then does Repartitioning too late.

Correct.

And the ticket of this reproducer #18341 proposes 3 different ways to improve it

@gene-bordegaray
Copy link
Contributor

the parquet query is not working this way as the Reparititons are not separated by the Aggregate. The Aggregate does all this work on a single partition then does Repartitioning too late.

Correct.

And the ticket of this reproducer #18341 proposes 3 different ways to improve it

Ok awesome, I will take some time to understand these approaches and give some of my thoughts

@alamb alamb added this pull request to the merge queue Oct 30, 2025
@alamb
Copy link
Contributor

alamb commented Oct 30, 2025

Thanks @NGA-TRAN @gene-bordegaray and @Dandandan

Merged via the queue into apache:main with commit f57da83 Oct 30, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants