Skip to content

Conversation

LennonChin
Copy link
Contributor

@LennonChin LennonChin commented Sep 12, 2025

Spark Optimizer's ColumnPruning will replace count(*)/count(1) Aggregate plan's child to a Project node with empty projection list:

object ColumnPruning extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(
    plan.transformWithPruning(AlwaysProcess.fn, ruleId) {
    ...
    // Prunes the unused columns from child of Aggregate/Expand/Generate/ScriptTransformation
    case a @ Aggregate(_, _, child) if !child.outputSet.subsetOf(a.references) =>
      a.copy(child = prunedChild(child, a.references))
    ...
}

but AuthZ plugin's PrivilegesBuilder.buildQuery method will ignore to check child node when plan's inputSet is empty, in this scenario, Aggregate node's child plan's privileges are ignored, which cause count(*)/ count(1) will ignored all privileges that should be checked.

Why are the changes needed?

this patch add a holder node ChildOutputHolder when Aggregate node's references and it's child node's outputSet hava no intersection, it will hold the child node's outputSet used for build privilege objects, and ChildOutputHolder node will be eliminated after RuleAuthorization rule work completed.

How was this patch tested?

updated old unit tests and added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@LennonChin
Copy link
Contributor Author

LennonChin commented Sep 12, 2025

related issue: #7173, cc @bowenliang123

@codecov-commenter
Copy link

codecov-commenter commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 0% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (8b56295) to head (952548f).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
.../spark/authz/rule/plan/RuleChildOutputMarker.scala 0.00% 6 Missing ⚠️
...ugin/spark/authz/rule/plan/ChildOutputHolder.scala 0.00% 4 Missing ⚠️
...rk/authz/rule/RuleEliminateChildOutputHolder.scala 0.00% 3 Missing ⚠️
.../kyuubi/plugin/spark/authz/PrivilegesBuilder.scala 0.00% 2 Missing ⚠️
...ugin/spark/authz/ranger/RangerSparkExtension.scala 0.00% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7204   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         695     698    +3     
  Lines       43505   43495   -10     
  Branches     5888    5886    -2     
======================================
+ Misses      43505   43495   -10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.


import org.apache.kyuubi.plugin.spark.authz.util.WithInternalChild

case class ChildOutputHolder(child: LogicalPlan, fixedOutput: Seq[Attribute])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add comments to explain why we need this.

fixedOutput is not a good name, it fixed what? maybe just call it childOutput

if (columnPrune(p.references.toSeq ++ p.output, p.inputSet).isEmpty) {
// If plan is project and output don't have relation to input, can ignore.
if (!p.isInstanceOf[Project]) {
// If plan tree exists ChildOutputHolder, we should build child logic plan.
Copy link
Member

@pan3793 pan3793 Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the variable name existsChildOutputHolder itself explains what it does, a good comment explains WHY, not WHAT.

@pan3793 pan3793 requested a review from wForget September 12, 2025 15:19
@pan3793
Copy link
Member

pan3793 commented Sep 12, 2025

this approach lgtm, I suggest adding some comments.

cc @wForget will this affect lineage?

also cc @zhouyifan279 since you have worked closely on this part

@LennonChin LennonChin requested a review from pan3793 September 15, 2025 01:37
@wForget
Copy link
Member

wForget commented Sep 15, 2025

cc @wForget will this affect lineage?

It seems not, ChildOutputHolder extends UnaryNode, so it will pass the child's lineage.

// some nodes in the tree is fixed by RuleChildOutputMarker in some special
// scenarios, such as the Aggregate(count(*)) child node. To avoid missing child node
// permissions, we need to continue checking down.
if (!p.isInstanceOf[Project] || existsChildOutputHolder) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this affect children that are not ChildOutputHolder ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this affect children that are not ChildOutputHolder ?

This check is to allow PrivilegeBuilder.buildQuery method continue drill down to the child nodes of the ChildOutputHolder node. Without this check, the PrivilegeBuilder.buildQuery method will terminate before hitting ChildOutputHolder. Since the outputs held by ChildOutputHolder are all useful, combined with columnPrune method, I think other child nodes will not be affected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is in for (child <- p.children) {, which will be applied to all children of p. Do we need it for children of p that are not ChildOutputHolder?

Copy link
Contributor Author

@LennonChin LennonChin Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is in for (child <- p.children) {, which will be applied to all children of p. Do we need it for children of p that are not ChildOutputHolder?

Normally, recursive checking of deeper nodes is necessary, but the presence of an Aggregate(count(*)) node is an exception, as it blocks recursion. In current case, ChildOutputHolder is only added when encountered Aggregate(count(*)) node. When a ChildOutputHolder node exists in the plan tree, it indicates that nodes deeper than ChildOutputHolder holds useful output information, we need to continue checking deeper nodes. At the same time, when recursing to the node deeper than ChildOutputHolder, we regress to normal judgment logic. Therefore, I think the judgment here is reasonable.

@LennonChin LennonChin requested a review from wForget September 15, 2025 07:59
@LennonChin LennonChin changed the title [AUTHZ] select(*)/select(1) should check sub nodes' privileges [AUTHZ] count(*)/count(1) should check sub nodes' privileges Sep 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants