Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Final aggregation should not bind to the input of partial aggregation #155

Merged
merged 5 commits into from
Mar 5, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Mar 3, 2024

Which issue does this PR close?

Closes #156.

Rationale for this change

We now bind aggregate expressions to the input of partial aggregate for HashAggregate no matter what the mode is. However, the output of partial aggregate is not same as its input. This incorrect binding will cause unmatched column index. Currently it doesn't expose issues because we don't check the index of bound reference in native side, and the bound columns in aggregate expressions are not used.

However, this is a potential bug and causes some issues when starting to check the index of bound reference.

What changes are included in this PR?

This patch adds the check of the index of bound reference. The aggregate expressions of final aggregation are not bound to the input of partial aggregation anymore but sent to native side as unbound expressions.

How are these changes tested?

@viirya
Copy link
Member Author

viirya commented Mar 3, 2024

cc @huaxingao

@viirya viirya force-pushed the fix_final_agg_binding branch from 375f11c to f5c4758 Compare March 3, 2024 09:03
@viirya
Copy link
Member Author

viirya commented Mar 3, 2024

cc @sunchao

Copy link
Contributor

@huaxingao huaxingao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for catching the problem.

@@ -38,6 +38,25 @@ import org.apache.comet.CometSparkSessionExtensions.isSpark34Plus
* Test suite dedicated to Comet native aggregate operator
*/
class CometAggregateSuite extends CometTestBase with AdaptiveSparkPlanHelper {
import testImplicits._

test("Final aggregation should not bind to the input of partial aggregation") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm does this test the issue in this PR? seems it passes in the main branch without the fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After adding the bound index check, it fails. It passes now because we don't have the check now. As I mentioned in the description, the bound is incorrect but because we don't check it, it is not exposed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, we need to add the bound check as required for SortMergeJoin work. In that work, it checks joining keys bindings internally in DataFusion. Which checks both binding index and binding column name. So we need to bind column reference to input schema.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@viirya viirya merged commit a131c44 into apache:main Mar 5, 2024
19 checks passed
@viirya
Copy link
Member Author

viirya commented Mar 5, 2024

Merged. Thanks.

@viirya viirya deleted the fix_final_agg_binding branch March 5, 2024 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Final aggregation should not bind to the input of partial aggregation
3 participants