@asl3
Contributor

@asl3 asl3 commented Oct 20, 2025

What changes were proposed in this pull request?

Add a numSourceRows metric for MergeIntoExec, taken from the source node's numOutputRows.

This assumes that all child nodes have numOutputRows. If it is not found, numSourceRows is -1.

Why are the changes needed?

Improve completeness and debuggability of Merge Into metrics.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test cases for the numSourceRows metric.

Was this patch authored or co-authored using generative AI tooling?

No.

@asl3 asl3 changed the title [SPARK-52578][SQL] Add numSourceRows metric for MergeIntoExec [SPARK-52578][SQL] Add numSourceRows metric for MergeIntoExec Oct 20, 2025
Member

@szehon-ho szehon-ho left a comment

Thanks @asl3, I left some initial style comments.

None
}

sourceChild.flatMap { child =>
Member

@szehon-ho szehon-ho Oct 22, 2025

Actually, why do we need to traverse again here? I thought join.left and join.right are already the children, so we can check that node directly. We don't want to traverse, because each node without numOutputRows risks reporting wrong information (that node may change the row count relative to its child).

Contributor Author

I renamed it to findSourceSide, as we still need a step to find the source node with numOutputRows.

For example, with:

+- *(2) BroadcastHashJoin ...
   :- *(2) Project ...
   :  +- BatchScan ...
   +- BroadcastQueryStage ...
      +- BroadcastExchange ...
         +- *(1) Project ...
            +- *(1) LocalTableScan ...

we find that BroadcastQueryStage holds the source table (after checking isTargetTableScan), but we still need a step to traverse down to LocalTableScan. Since it uses collectFirst, I don't think we need to worry about traversing too far.
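
The two steps described above (pick the join child that is not the target side, then take the first node under it that reports numOutputRows) can be sketched roughly as follows. This is only an illustration on a toy tree: `PlanNode`, `containsTargetScan`, and the `SourceRows` object are hypothetical stand-ins and not Spark classes, and treating any subtree that contains a node named `BatchScan` as the target side is an assumption made for the example, not the PR's actual isTargetTableScan check.

```scala
// Toy stand-in for a physical plan node; this is NOT Spark's SparkPlan.
final case class PlanNode(
    name: String,
    metrics: Map[String, Long] = Map.empty,
    children: Seq[PlanNode] = Seq.empty) {

  // Pre-order search: return the result for the first node the partial function accepts.
  def collectFirst[B](pf: PartialFunction[PlanNode, B]): Option[B] =
    pf.lift(this).orElse(
      children.iterator.map(_.collectFirst(pf)).collectFirst { case Some(b) => b })
}

object SourceRows {
  // Hypothetical check (assumption): a subtree is the target side if it contains "BatchScan".
  def containsTargetScan(node: PlanNode): Boolean =
    node.collectFirst { case n if n.name == "BatchScan" => n }.isDefined

  // Step 1: pick the join child that does not read the target table.
  def findSourceSide(join: PlanNode): Option[PlanNode] =
    join.children.find(child => !containsTargetScan(child))

  // Step 2: first node on the source side that reports numOutputRows, else -1.
  def numSourceRows(join: PlanNode): Long =
    findSourceSide(join)
      .flatMap(_.collectFirst {
        case n if n.metrics.contains("numOutputRows") => n.metrics("numOutputRows")
      })
      .getOrElse(-1L)
}
```

On a toy tree shaped like the plan above, where only the scan nodes carry the metric, the search on the source side passes through the stage, exchange, and project nodes and stops at the scan. The reviewer's concern still applies to this shape of code: any node skipped for lacking the metric could have changed the row count, so the value found further down may not match what the join actually received.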

Member

@szehon-ho szehon-ho left a comment

Some more comments on the tests

Member

@szehon-ho szehon-ho left a comment

Thanks @asl3, a few more comments.

None
}

sourceSide.flatMap { side =>
Member

Maybe we don't need this. Don't we just want to find the first join child (on the source side) with numOutputRows, else -1?

}.isDefined
}

def findSourceScan(join: BaseJoinExec): Option[SparkPlan] = {
Member

The method name should reflect that it's finding the source-side child, not the source itself.

HyukjinKwon pushed a commit that referenced this pull request Oct 31, 2025
…mmary metric is not found

### What changes were proposed in this pull request?
Clarify javadocs to explain situations where the metric is not found.

### Why are the changes needed?
As we begin to handle more write summaries like in #52669, involving more complex walks of the executed plan graph, the code that calculates the merge summary may encounter a plan it does not expect and would need to populate -1.

This was actually called out in #52595 (comment). It was not done because it was not the case then, but now I realize it will be possible as this code evolves, especially as we plan to still populate MergeSummary in cases where the optimizer rewrites the Merge plan to get rid of MergeRowsExec or Join.

As it is more of an error-handling case, we don't need to change the model of the MergeSummary to return Long or OptionalLong, so we can put -1 and indicate this in the javadoc.
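
As a rough illustration of that design choice, the sketch below keeps a plain Long field and documents -1 as the "metric not found" value, while still letting callers recover an Option if they want one. `ToyMergeSummary`, `NotFound`, `fromMetric`, and `sourceRowsOption` are hypothetical names invented for this example; this is not Spark's MergeSummary interface.

```scala
/** Hypothetical stand-in for a merge summary; NOT Spark's MergeSummary interface. */
trait ToyMergeSummary {
  /** Number of source rows read, or -1 if the metric was not found in the plan. */
  def numSourceRows: Long
}

object ToyMergeSummary {
  // Sentinel documented in the scaladoc above (assumed name, for illustration only).
  val NotFound: Long = -1L

  // Build a summary from a metric lookup that may have failed.
  def fromMetric(metric: Option[Long]): ToyMergeSummary = new ToyMergeSummary {
    val numSourceRows: Long = metric.getOrElse(NotFound)
  }

  // Callers that prefer an Option can recover it from the sentinel.
  def sourceRowsOption(summary: ToyMergeSummary): Option[Long] =
    Some(summary.numSourceRows).filter(_ >= 0)
}
```

The trade-off in this sketch is that the accessor's type stays unchanged, at the cost of every caller having to know the documented sentinel.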

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #52797 from szehon-ho/SPARK-53891-follow.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>