          [SPARK-52578][SQL] Add numSourceRows metric for MergeIntoExec
          #52669
        
base: master
Conversation
Thanks @asl3, I left some initial style comments
        
          
...e/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
    None
  }

  sourceChild.flatMap { child =>
Actually, why do we need to traverse again here? I thought join.left and join.right are already the children, so we can directly check that node? We don't want to traverse, as each node without numOutputRows risks giving wrong information (because that node may change the numOutputRows from its child).
I renamed it to findSourceSide, as we still need a step to find the source node with numOutputRows.
For example, with:
+- *(2) BroadcastHashJoin ...
                     :- *(2) Project ... 
                     :  +- BatchScan ... 
                     +- BroadcastQueryStage ...
                        +- BroadcastExchange ... 
                           +- *(1) Project ...
                              +- *(1) LocalTableScan ...
we find that BroadcastQueryStage has the source table (after checking isTargetTableScan), but we still need a step to traverse down to LocalTableScan. As it is collectFirst, I don't think we need to worry about traversing too far
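As a rough illustration of the collectFirst step described above, here is a self-contained sketch. The Node class is a hypothetical stand-in; the real Spark types are SparkPlan, TreeNode.collectFirst, and SQLMetric.

```scala
// Hypothetical stand-in for a physical plan node; illustrative only.
case class Node(name: String, metrics: Map[String, Long], children: Node*) {
  // Pre-order depth-first search, analogous to Spark's TreeNode.collectFirst.
  def collectFirst[A](pf: PartialFunction[Node, A]): Option[A] =
    pf.lift(this).orElse(
      children.iterator.map(_.collectFirst(pf)).collectFirst { case Some(a) => a })
}

// Take numOutputRows from the first node in the source-side subtree
// that reports it; -1 when no such node exists.
def numSourceRows(sourceSide: Node): Long =
  sourceSide.collectFirst {
    case n if n.metrics.contains("numOutputRows") => n.metrics("numOutputRows")
  }.getOrElse(-1L)

// Mirrors the plan above: BroadcastQueryStage itself carries no
// numOutputRows, but its LocalTableScan leaf does.
val scan  = Node("LocalTableScan", Map("numOutputRows" -> 42L))
val stage = Node("BroadcastQueryStage", Map.empty,
  Node("Project", Map.empty, scan))
```

Here numSourceRows(stage) reaches the leaf metric (42) even though the intermediate nodes report nothing, which is the traversal the reply argues is still needed.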
Some more comments on the tests
        
          
sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala
Thanks @asl3, a few more comments
    None
  }

  sourceSide.flatMap { side =>
Maybe we don't need this; we want to find the first join child (on the source side) with numOutputRows, else -1?
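The simpler shape suggested here might look like the following sketch. The Plan class is an illustrative stand-in; the real types are BaseJoinExec, SparkPlan, and SQLMetric.

```scala
// Illustrative stand-in for a physical plan node.
case class Plan(metrics: Map[String, Long], children: Plan*)

// Read numOutputRows from the source-side join child directly,
// defaulting to -1 instead of threading an extra Option step.
def sourceRows(sourceSideChild: Plan): Long =
  sourceSideChild.metrics.getOrElse("numOutputRows", -1L)
```

The -1 default keeps the caller's code flat: there is no separate "metric missing" branch to handle.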
  }.isDefined
}

def findSourceScan(join: BaseJoinExec): Option[SparkPlan] = {
The method name should reflect that it's finding the source-side child, not the source itself.
…mmary metric is not found

### What changes were proposed in this pull request?
Clarify javadocs to explain situations where the metric is not found.

### Why are the changes needed?
As we begin to handle more write summaries, as in #52669, involving more complex walks of the executed plan graph, the code that calculates the merge summary may encounter a plan it does not expect and would need to populate -1. This was actually called out in #52595 (comment); it was not done then because it was not the case yet, but now I realize it will be possible as this code evolves, especially as we plan to still populate MergeSummary in cases where the optimizer rewrites the Merge plan to get rid of MergeRowsExec or the Join. As this is more of an error-handling case, we don't need to change the model of MergeSummary to return Long or OptionalLong; we can put -1 and indicate this in the javadoc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #52797 from szehon-ho/SPARK-53891-follow.
Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
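The -1 convention described in the follow-up could be documented roughly as below. This is a hypothetical trait written for illustration; the real interface lives in Spark's connector write API and may differ in shape and naming.

```scala
// Hypothetical sketch of the documented contract; names are illustrative.
trait MergeSummary {
  /**
   * Number of source rows read by the MERGE operation.
   *
   * Note: returns -1 when the value could not be computed, e.g. when
   * the optimizer rewrote the plan and the expected node was missing.
   */
  def numSourceRows: Long
}

// A summary whose metric was found, and one where it was not.
val known   = new MergeSummary { val numSourceRows = 10L }
val unknown = new MergeSummary { val numSourceRows = -1L }
```

Keeping the return type a plain Long with a documented -1 sentinel, rather than OptionalLong, matches the reasoning in the commit message: the missing-metric case is error handling, not a first-class state of the model.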
What changes were proposed in this pull request?
Add a numSourceRows metric for MergeIntoExec, taken from the source node's numOutputRows. The assumption is that all child nodes have numOutputRows; if it is not found, numSourceRows will be -1.
Why are the changes needed?
Improve completeness and debuggability of Merge Into metrics.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test cases for the numSourceRows metric.
Was this patch authored or co-authored using generative AI tooling?
No.