-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-50572][SQL] Fix race condition in CachedRDDBuilder.cachedColumnBuffers #49179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The previous code had a race condition that mean that `cachedColumnBuffers` could return `null` if another thread was concurrently was calling `clearCache`. The bug is caused by us checking _cachedColumnBuffers and return it as two separate operations outside a synchronized block. So it possible for another thread to set it to `null` after the check but before the return. ``` java.lang.NullPointerException: null at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.filteredCachedBatches(InMemoryTableScanExec.scala:156) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.inputRDD$lzycompute(InMemoryTableScanExec.scala:98) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.inputRDD(InMemoryTableScanExec.scala:84) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.doExecute(InMemoryTableScanExec.scala:163) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:195) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:191) at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:527) at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:455) at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:454) at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:498) at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:51) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:751) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:195) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:191) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:364) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:445) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4218) at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3459) at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4208) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526) ... 23 more ```
|
@cloud-fan since you were involved in #21018 where the problematic code was introduced. |
|
An alternative fix could be to mark the entire method as synchronized if we do not think these are on a critical path. |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
|
I'd like to have a look at it, when I have time. |
It was just waiting for someone to review it. As far as I am aware it good to go as it is and it fixes a real race condition. |
|
I find this PR quite interesting indeed. It's a pity this hasn't gone ahead. Did you send a message to the dev mail list? |
|
Sorry about the late response, but no I don't think I did. |
What changes were proposed in this pull request?
Fix race condition in the class
CachedRDDBuilder.Why are the changes needed?
The previous code had a race condition that mean that
cachedColumnBufferscould returnnullif another thread was concurrently was callingclearCache.The bug is caused by us checking
_cachedColumnBuffersand return it as two separate operations outside a synchronized block. So it possible for another thread to set it tonullafter the check but before the return.Does this PR introduce any user-facing change?
Yes, fixes a race condition that can cause crashes.
How was this patch tested?
Adds a test that shows that reliably fails on the old code. Not sure if want to merge that style of test as it very specific for this bug and takes 3 seconds to run.
Was this patch authored or co-authored using generative AI tooling?
No.