[SPARK-52060][SQL] Make OneRowRelationExec node #50849
Conversation
```scala
override def simpleString(maxFields: Int): String = {
  s"$nodeName${truncatedString(output, "[", ",", "]", maxFields)}"
}
```
How is this different from the default implementation?
The default implementation returns `Scan OneRowRelation`, while the existing implementation (using RDDScan) returns `Scan OneRowRelation[]`. I figured we shouldn't change this on the off chance that someone is relying on it.
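For illustration, here is a minimal plain-Scala sketch of the two renderings. The `truncated` helper below is a stand-in for Spark's `truncatedString`, not the real implementation:

```scala
// Stand-in for truncatedString: joins fields inside "[...]" delimiters.
def truncated(fields: Seq[String], maxFields: Int): String =
  fields.take(maxFields).mkString("[", ",", "]")

val nodeName = "Scan OneRowRelation"
val defaultStr = nodeName                              // default: node name only
val overriddenStr = s"$nodeName${truncated(Nil, 25)}"  // override: appends empty output list
```

Even with no output attributes, the override appends the empty `[]` suffix, matching what `RDDScanExec` produced before.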
There are failures like the ones above, which I also see in other PRs, so I think the failures are unrelated.
```scala
private val emptyRow: InternalRow = InternalRow.empty

private val rdd = session.sparkContext.parallelize(Seq(emptyRow), 1)
```
I think we can do this:

```scala
private val rdd = {
  val proj = UnsafeProjection.create(schema)
  val emptyRow = proj(InternalRow.empty)
  session.sparkContext.parallelize(Seq(emptyRow), 1)
}
```
then `doExecute()` can just return this RDD
Hmm, sure, I implemented this. But because we still want to increment the number of output rows at the time when the row is actually processed, I did not end up simply returning `rdd` from `doExecute()`... let me know if you think there's a better way.
```scala
override def inputRDD: RDD[InternalRow] = rdd

override protected val createUnsafeProjection: Boolean = true
```
If we do https://github.com/apache/spark/pull/50849/files#r2097521541 , then this can be false.
@richardc-db can you do a rebase to re-trigger the CI? We fixed an OOM test issue recently.
Force-pushed from c993f51 to eef16fe.
Hmm, @cloud-fan, seems like this causes a test to fail. I'm guessing this is because the inputRDD has unsafe rows? The test passes after undoing the change described here... do you think we should undo this change and go back to adding an unsafe projection and setting the flag as before?
@richardc-db let's change it back, didn't realize this serde issue...
@cloud-fan sounds good! I switched it back and it seems like tests pass now.
```scala
override val output: Seq[Attribute] = Nil

private val rdd: RDD[InternalRow] = session.sparkContext.parallelize(Seq(InternalRow.empty), 1)
```
Thinking about it more, I think we can avoid serializing any row:

```scala
session.sparkContext.parallelize(Nil, 1).mapPartitionsInternal { _ =>
  val proj = UnsafeProjection.create(Seq.empty[Expression])
  Iterator(proj.apply(InternalRow.empty))
}
```
Now the unsafe row is generated on the fly at the worker side, no serialization is needed.
Done, thanks for the help! I ended up doing:

```scala
private val rdd: RDD[InternalRow] = {
  val numOutputRows = longMetric("numOutputRows")
  session
    .sparkContext
    .parallelize(Seq(InternalRow()), 1)
    .mapPartitionsInternal { _ =>
      val proj = UnsafeProjection.create(Seq.empty[Expression])
      Iterator(proj.apply(InternalRow.empty)).map { r =>
        numOutputRows += 1
        r
      }
    }
}
```

to ensure the metrics are filled properly.
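The reason the increment lives inside the iterator's `map` is that Scala iterators are lazy: the counter is bumped only when a row is actually consumed. A plain-Scala sketch of that pattern, using a local `var` as a stand-in for the `numOutputRows` SQL metric:

```scala
// Stand-in for the numOutputRows SQLMetric; a plain var for illustration.
var numOutputRows = 0L
val rows = Iterator("row").map { r =>
  numOutputRows += 1  // bumped only when this row is consumed
  r
}
assert(numOutputRows == 0L)  // Iterator.map is lazy: nothing counted yet
rows.foreach(_ => ())        // consuming the iterator triggers the count
```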
LGTM except for one idea to improve the perf: https://github.com/apache/spark/pull/50849/files#r2107403483
thanks, merging to master!
What changes were proposed in this pull request?
Creates a new `OneRowRelationExec` node, which is more or less a copy of the `RDDScanExec` node.
We want a dedicated node because it makes it clearer when a one-row relation is used, e.g. for patterns like `SELECT version()`.

Why are the changes needed?
This makes it clearer in the code that a one-row relation is used, and allows us to avoid checking the hard-coded "OneRowRelation" string when pattern matching.
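As a hypothetical illustration of that benefit (the case classes below are stand-ins, not Spark's real `SparkPlan` hierarchy), compare matching on a node-name string with matching on a dedicated node type:

```scala
sealed trait Plan
case class RDDScanExec(name: String) extends Plan
case object OneRowRelationExec extends Plan

def isOneRowRelation(p: Plan): Boolean = p match {
  case RDDScanExec(name)  => name == "OneRowRelation"  // before: brittle string check
  case OneRowRelationExec => true                      // after: explicit node type
}
```

The dedicated node makes the intent explicit at the type level, so callers no longer depend on the exact spelling of the relation name.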
Does this PR introduce any user-facing change?
Yes, the plan will now be `OneRowRelationExec` rather than `RDDScanExec`. The plan string should be the same, however.

How was this patch tested?
Added UTs.
Was this patch authored or co-authored using generative AI tooling?