          [SPARK-52578][SQL] Add numSourceRows metric for MergeIntoExec
          #52669
        
base: master
Conversation
Thanks @asl3, I left some initial style comments
        
          
...e/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
    None
  }

  sourceChild.flatMap { child =>
Actually, why do we need to traverse again here? I thought join.left and join.right are already the children, so we can directly check that node? We don't want to traverse, as each node without numOutputRows risks giving wrong information (because that node may change the numOutputRows from its child).
I renamed it to findSourceSide, as we still need a step to find the source node with numOutputRows.
For example, with:
+- *(2) BroadcastHashJoin ...
                     :- *(2) Project ... 
                     :  +- BatchScan ... 
                     +- BroadcastQueryStage ...
                        +- BroadcastExchange ... 
                           +- *(1) Project ...
                              +- *(1) LocalTableScan ...
we find that BroadcastQueryStage has the source table (after checking isTargetTableScan), but we still need a step to traverse down to LocalTableScan. As it is collectFirst, I don't think we need to worry about traversing too far
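As a rough illustration of the collectFirst step described above, here is a self-contained sketch. The Node class is a hypothetical stand-in; the real Spark types are SparkPlan, TreeNode.collectFirst, and SQLMetric.

```scala
// Hypothetical stand-in for a physical plan node; illustrative only.
case class Node(name: String, metrics: Map[String, Long], children: Node*) {
  // Pre-order depth-first search, analogous to Spark's TreeNode.collectFirst.
  def collectFirst[A](pf: PartialFunction[Node, A]): Option[A] =
    pf.lift(this).orElse(
      children.iterator.map(_.collectFirst(pf)).collectFirst { case Some(a) => a })
}

// Take numOutputRows from the first node in the source-side subtree
// that reports it; -1 when no such node exists.
def numSourceRows(sourceSide: Node): Long =
  sourceSide.collectFirst {
    case n if n.metrics.contains("numOutputRows") => n.metrics("numOutputRows")
  }.getOrElse(-1L)

// Mirrors the plan above: BroadcastQueryStage itself carries no
// numOutputRows, but its LocalTableScan leaf does.
val scan  = Node("LocalTableScan", Map("numOutputRows" -> 42L))
val stage = Node("BroadcastQueryStage", Map.empty,
  Node("Project", Map.empty, scan))
```

Here numSourceRows(stage) reaches the leaf metric (42) even though the intermediate nodes report nothing, which is the traversal the reply argues is still needed.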
Some more comments on the tests
        
          
sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala
Thanks @asl3, a few more comments
    None
  }

  sourceSide.flatMap { side =>
Maybe we don't need this; we want to find the first join child (on the source side) with numOutputRows, else -1?
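The simpler shape suggested here might look like the following sketch. The Plan class is an illustrative stand-in; the real types are BaseJoinExec, SparkPlan, and SQLMetric.

```scala
// Illustrative stand-in for a physical plan node.
case class Plan(metrics: Map[String, Long], children: Plan*)

// Read numOutputRows from the source-side join child directly,
// defaulting to -1 instead of threading an extra Option step.
def sourceRows(sourceSideChild: Plan): Long =
  sourceSideChild.metrics.getOrElse("numOutputRows", -1L)
```

The -1 default keeps the caller's code flat: there is no separate "metric missing" branch to handle.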
  }.isDefined
}

def findSourceScan(join: BaseJoinExec): Option[SparkPlan] = {
The method name should reflect that it's finding the source-side child, not the source itself.
…mmary metric is not found

### What changes were proposed in this pull request?
Clarify javadocs to explain situations where the metric is not found.

### Why are the changes needed?
As we begin to handle more write summaries, as in #52669, involving more complex walks of the executed plan graph, the code that calculates the merge summary may encounter a plan it does not expect and would need to populate -1. This was actually called out in #52595 (comment); it was not done then because it was not the case yet, but now I realize it will be possible as this code evolves, especially as we plan to still populate MergeSummary in cases where the optimizer rewrites the Merge plan to get rid of MergeRowsExec or the Join. As this is more of an error-handling case, we don't need to change the model of MergeSummary to return Long or OptionalLong; we can put -1 and indicate this in the javadoc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #52797 from szehon-ho/SPARK-53891-follow.
Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
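The -1 convention described in the follow-up could be documented roughly as below. This is a hypothetical trait written for illustration; the real interface lives in Spark's connector write API and may differ in shape and naming.

```scala
// Hypothetical sketch of the documented contract; names are illustrative.
trait MergeSummary {
  /**
   * Number of source rows read by the MERGE operation.
   *
   * Note: returns -1 when the value could not be computed, e.g. when
   * the optimizer rewrote the plan and the expected node was missing.
   */
  def numSourceRows: Long
}

// A summary whose metric was found, and one where it was not.
val known   = new MergeSummary { val numSourceRows = 10L }
val unknown = new MergeSummary { val numSourceRows = -1L }
```

Keeping the return type a plain Long with a documented -1 sentinel, rather than OptionalLong, matches the reasoning in the commit message: the missing-metric case is error handling, not a first-class state of the model.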
What changes were proposed in this pull request?
Add a numSourceRows metric for MergeIntoExec, taken from the source node's numOutputRows. The assumption is that all child nodes have numOutputRows; if it is not found, numSourceRows will be -1.
Why are the changes needed?
Improve completeness and debuggability of Merge Into metrics.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test cases for the numSourceRows metric.
Was this patch authored or co-authored using generative AI tooling?
No.