[SPARK-55411][SQL] SPJ may throw ArrayIndexOutOfBoundsException when join keys are less than cluster keys #54182
Conversation
JIRA Issue Information: Bug SPARK-55411 (this comment was automatically generated by GitHub Actions)
sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
Thanks for the repro, I'll try to take a look.
```diff
  partitioning.numPartitions,
- partitioning.partitionValues)
+ partitioning.partitionValues,
+ partitioning.originalPartitionValues)
```
I found `originalPartitionValues` is not always populated. Is that intentional?
```diff
  projectedExpressions.map(_.dataType))
  basePartitioning.partitionValues.map { r =>
-   val projectedRow = KeyGroupedPartitioning.project(expressions,
+   val projectedRow = KeyGroupedPartitioning.project(basePartitioning.expressions,
```
Actually, the wrong projected expression is the root cause of the `ArrayIndexOutOfBoundsException` you hit, and passing in `basePartitioning.expressions` looks good.
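A minimal sketch of that failure mode (a hypothetical Java stand-in, not Spark's actual `project` implementation): the join-key ordinals are computed against the full cluster-key list, so handing the routine an already-projected, shorter row makes an ordinal point past the end, reproducing the "Index 1 out of bounds for length 1" shape from the stack trace.

```java
// Hypothetical, simplified stand-in for KeyGroupedPartitioning#project:
// pick the join-key values out of a partition row by precomputed ordinals.
public class ProjectionSketch {
    static Object[] project(Object[] partitionRow, int[] joinKeyOrdinals) {
        Object[] out = new Object[joinKeyOrdinals.length];
        for (int i = 0; i < joinKeyOrdinals.length; i++) {
            // Throws ArrayIndexOutOfBoundsException when the row is shorter
            // than the ordinals assume (the projected-row bug).
            out[i] = partitionRow[joinKeyOrdinals[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ordinals = {1};           // join key is the 2nd of 2 cluster keys
        Object[] fullRow = {7, 19};     // one value per full partition expression
        Object[] projectedRow = {19};   // row already projected down to join keys

        System.out.println(project(fullRow, ordinals)[0]); // correct: 19
        try {
            project(projectedRow, ordinals); // wrong input: length-1 row, ordinal 1
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("AIOOBE, as in the reported stack trace");
        }
    }
}
```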
But the test you added is unlikely to pass, as there is an issue with the test framework.
I left a note here:
spark/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
Lines 2801 to 2802 in 3405255
```scala
// Do not use `bucket()` in "one side partition" tests as its implementation in
// `InMemoryBaseTable` conflicts with `BucketFunction`
```
So don't use `bucket()` in these one side shuffle tests.
The problem is that the `bucket()` implementation here:
Lines 93 to 95 in 3405255
```scala
override def produceResult(input: InternalRow): Int = {
  (input.getLong(1) % input.getInt(0)).toInt
}
```
and the one in `InMemoryBaseTable` (spark/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala, lines 240 to 247 in 3405255) don't agree:
```scala
val valueTypePairs = cols.map(col => extractor(col.fieldNames, cleanedSchema, row))
var valueHashCode = 0
valueTypePairs.foreach( pair =>
  if ( pair._1 != null) valueHashCode += pair._1.hashCode()
)
var dataTypeHashCode = 0
valueTypePairs.foreach(dataTypeHashCode += _._2.hashCode())
((valueHashCode + 31 * dataTypeHashCode) & Integer.MAX_VALUE) % numBuckets
```
So technically, the partition keys that the datasource reports and the calculated keys of the partitions where the partitioner puts the shuffled records don't match.
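That mismatch can be illustrated with a deliberately simplified Java sketch (both formulas below are hypothetical stand-ins for the two implementations, not the actual Spark code): the same value lands in different buckets depending on which formula computes the key.

```java
public class BucketMismatch {
    // Stand-in for the test UDF's bucketing: value % numBuckets.
    static int udfBucket(long value, int numBuckets) {
        return (int) (value % numBuckets);
    }

    // Stand-in for hashCode-based bucketing in the InMemoryBaseTable style
    // (simplified here to a single long column).
    static int hashBucket(long value, int numBuckets) {
        return (Long.hashCode(value) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        long v = 4294967297L; // 2^32 + 1: Long.hashCode folds the high bits in
        System.out.println(udfBucket(v, 4));  // 1
        System.out.println(hashBucket(v, 4)); // 0 -- a different bucket for the same value
    }
}
```

Once the two sides disagree like this, the rows the scan reports for a partition key and the rows the shuffle routes to that key's partition diverge, which is exactly why the join produces wrong results rather than an error.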
@pan3793, could you please keep your fix in KeyGroupedPartitionedScan.scala and fix the BucketTransform key calculation in InMemoryBaseTable?
You don't need the other changes. `originalPartitionValues` seems unrelated, as it is used only when partially clustered distribution is enabled.
BTW, I'm working on refactoring SPJ based on this idea: #53859 (comment). It looks promising so far, but I need some more days to wrap it up.
```scala
// Do not use `bucket()` in "one side partition" tests as its implementation in
// `InMemoryBaseTable` conflicts with `BucketFunction`
```
Oh, god, @peter-toth, thanks a lot for pointing this out. I wasn't aware of it and have spent a few hours trying to figure out why the SMJ partition key values mismatched and produced wrong results after fixing the `ArrayIndexOutOfBoundsException`...
Actually, the current code changes are just a draft; the test cases have not yet passed. I will try to fix it following your guidance. Thank you again, @peter-toth!
Force-pushed from bbf8c3b to cb78da6
```scala
  case (v, t) =>
    throw new IllegalArgumentException(s"Match: unsupported argument(s) type - ($v, $t)")
}
(acc + valueHash) & 0xFFFFFFFFFFFFL
```
```scala
scala> Long.MaxValue + 1L
res0: Long = -9223372036854775808

scala> (Long.MaxValue + 1L) & 0xFFFFFFFFFFFFL
res1: Long = 0

scala> (Long.MaxValue + 2L) & 0xFFFFFFFFFFFFL
res2: Long = 1
```
Ah, this is needed because `% N` can return negative results, isn't it? That seems like a problem at both places, as `bucket(N)` should return at most N different values.
Should we use Math.floorMod()?
The bucket num should be >= 1 (though it seems we don't have such a check), so `(non_negative_long % positive_int)` should always be non-negative?
Yeah, that's correct, but if this (lines 93 to 95 in 3405255):
```scala
override def produceResult(input: InternalRow): Int = {
  (input.getLong(1) % input.getInt(0)).toInt
}
```
used `Math.floorMod()`, then we wouldn't need that `& 0xFFFFFFFFFFFFL` non-negative conversion.
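The sign behavior in question is easy to check directly (Java shown here; Scala's `%` behaves the same way on negative operands):

```java
public class ModSign {
    public static void main(String[] args) {
        int hash = -7; // e.g. a hash value that came out negative
        int n = 4;
        // Java/Scala % keeps the sign of the dividend: -7 % 4 == -3.
        System.out.println(hash % n);
        // Math.floorMod always returns a result in [0, n): floorMod(-7, 4) == 1.
        System.out.println(Math.floorMod(hash, n));
    }
}
```

So with `Math.floorMod()` the bucket index is already in `[0, numBuckets)` and no masking step is needed.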
Looks good to me, let's wait for CI.
dongjoon-hyun left a comment
Thanks @pan3793 for the fix and @dongjoon-hyun for the review. Merged to … As the bug affects earlier versions too and the …
Thank you so much again, @pan3793 and @peter-toth! +1 for backporting, too.
…when join keys are less than cluster keys

Fix a `java.lang.ArrayIndexOutOfBoundsException` when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true`, by correcting the `expression` (the full partition expression should be passed instead of the projected one) passed to `KeyGroupedPartitioning#project`.

Also, fix a test code issue: change the calculation result of `BucketTransform` defined at `InMemoryBaseTable.scala` to match `BucketFunctions` defined at `transformFunctions.scala` (thanks peter-toth for pointing this out!)

It's a bug fix. Some queries that failed when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true` now run normally.

A new UT is added; previously it failed with `ArrayIndexOutOfBoundsException`, now it passes.

```
$ build/sbt "sql/testOnly *KeyGroupedPartitioningSuite -- -z SPARK-55411"
...
[info] - bug *** FAILED *** (1 second, 884 milliseconds)
[info]   java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
[info]   at scala.collection.immutable.ArraySeq$ofRef.apply(ArraySeq.scala:331)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1(partitioning.scala:471)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1$adapted(partitioning.scala:471)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.project(partitioning.scala:471)
[info]   at org.apache.spark.sql.execution.KeyGroupedPartitionedScan.$anonfun$getOutputKeyGroupedPartitioning$5(KeyGroupedPartitionedScan.scala:58)
...
```

UTs affected by the `bucket()` calculation logic change are tuned.

No.

Closes apache#54182 from pan3793/spj-subset-joinkey-bug.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
The issue was introduced by SPARK-44647 (4.0.0). I opened backport PRs to branch-4.1 and branch-4.0.
Late LGTM. Thanks @pan3793 and @peter-toth for the fix!
…when join keys are less than cluster keys

Backport #54182 to branch-4.1

### What changes were proposed in this pull request?

Fix a `java.lang.ArrayIndexOutOfBoundsException` when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true`, by correcting the `expression` (the full partition expression should be passed instead of the projected one) passed to `KeyGroupedPartitioning#project`.

Also, fix a test code issue: change the calculation result of `BucketTransform` defined at `InMemoryBaseTable.scala` to match `BucketFunctions` defined at `transformFunctions.scala` (thanks peter-toth for pointing this out!)

### Why are the changes needed?

It's a bug fix.

### Does this PR introduce _any_ user-facing change?

Some queries that failed when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true` now run normally.

### How was this patch tested?

New UT is added; previously it failed with `ArrayIndexOutOfBoundsException`, now it passes.

```
$ build/sbt "sql/testOnly *KeyGroupedPartitioningSuite -- -z SPARK-55411"
...
[info] - bug *** FAILED *** (1 second, 884 milliseconds)
[info]   java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
[info]   at scala.collection.immutable.ArraySeq$ofRef.apply(ArraySeq.scala:331)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1(partitioning.scala:471)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1$adapted(partitioning.scala:471)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.project(partitioning.scala:471)
[info]   at org.apache.spark.sql.execution.KeyGroupedPartitionedScan.$anonfun$getOutputKeyGroupedPartitioning$5(KeyGroupedPartitionedScan.scala:58)
...
```

UTs affected by the `bucket()` calculation logic change are tuned.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54259 from pan3793/SPARK-55411-4.1.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
What changes were proposed in this pull request?
Fix a `java.lang.ArrayIndexOutOfBoundsException` when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true`, by correcting the `expression` (the full partition expression should be passed instead of the projected one) passed to `KeyGroupedPartitioning#project`.

Also, fix a test code issue: change the calculation result of `BucketTransform` defined at `InMemoryBaseTable.scala` to match `BucketFunctions` defined at `transformFunctions.scala` (thanks @peter-toth for pointing this out!)

Why are the changes needed?
It's a bug fix.
Does this PR introduce any user-facing change?
Some queries that failed when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true` now run normally.

How was this patch tested?
New UT is added; previously it failed with `ArrayIndexOutOfBoundsException`, now it passes.

UTs affected by the `bucket()` calculation logic change are tuned.

Was this patch authored or co-authored using generative AI tooling?
No.