[SPARK-55411][SQL] SPJ may throw ArrayIndexOutOfBoundsException when join keys are less than cluster keys #54182
Conversation
JIRA Issue Information: Bug SPARK-55411 (this comment was automatically generated by GitHub Actions)
sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
Thanks for the repro, I'll try to take a look.
```diff
  partitioning.numPartitions,
- partitioning.partitionValues)
+ partitioning.partitionValues,
+ partitioning.originalPartitionValues)
```
I found `originalPartitionValues` is not always populated. Is that intentional?
```diff
  projectedExpressions.map(_.dataType))
  basePartitioning.partitionValues.map { r =>
-   val projectedRow = KeyGroupedPartitioning.project(expressions,
+   val projectedRow = KeyGroupedPartitioning.project(basePartitioning.expressions,
```
Actually, the wrong projected expression is the root cause of the `ArrayIndexOutOfBoundsException` you hit, and passing in `basePartitioning.expressions` looks good.
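A minimal sketch of that failure mode (a hypothetical Java stand-in, not Spark's actual `project` implementation): the join-key ordinals are computed against the full cluster-key list, so handing the routine an already-projected, shorter row makes an ordinal point past the end, reproducing the "Index 1 out of bounds for length 1" shape from the stack trace.

```java
// Hypothetical, simplified stand-in for KeyGroupedPartitioning#project:
// pick the join-key values out of a partition row by precomputed ordinals.
public class ProjectionSketch {
    static Object[] project(Object[] partitionRow, int[] joinKeyOrdinals) {
        Object[] out = new Object[joinKeyOrdinals.length];
        for (int i = 0; i < joinKeyOrdinals.length; i++) {
            // Throws ArrayIndexOutOfBoundsException when the row is shorter
            // than the ordinals assume (the projected-row bug).
            out[i] = partitionRow[joinKeyOrdinals[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ordinals = {1};           // join key is the 2nd of 2 cluster keys
        Object[] fullRow = {7, 19};     // one value per full partition expression
        Object[] projectedRow = {19};   // row already projected down to join keys

        System.out.println(project(fullRow, ordinals)[0]); // correct: 19
        try {
            project(projectedRow, ordinals); // wrong input: length-1 row, ordinal 1
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("AIOOBE, as in the reported stack trace");
        }
    }
}
```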
But the test you added is unlikely to pass, as there is an issue with the test framework.
I left a note here:
spark/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
Lines 2801 to 2802 in 3405255
```scala
// Do not use `bucket()` in "one side partition" tests as its implementation in
// `InMemoryBaseTable` conflicts with `BucketFunction`
```
So don't use `bucket()` in these one side shuffle tests.
The problem is that the `bucket()` implementation here:
Lines 93 to 95 in 3405255
```scala
override def produceResult(input: InternalRow): Int = {
  (input.getLong(1) % input.getInt(0)).toInt
}
```
and the one in `InMemoryBaseTable` (spark/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala, lines 240 to 247 in 3405255) don't agree:
```scala
val valueTypePairs = cols.map(col => extractor(col.fieldNames, cleanedSchema, row))
var valueHashCode = 0
valueTypePairs.foreach( pair =>
  if ( pair._1 != null) valueHashCode += pair._1.hashCode()
)
var dataTypeHashCode = 0
valueTypePairs.foreach(dataTypeHashCode += _._2.hashCode())
((valueHashCode + 31 * dataTypeHashCode) & Integer.MAX_VALUE) % numBuckets
```
So technically, the partition keys that the datasource reports and the calculated keys of the partitions where the partitioner puts the shuffled records don't match.
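That mismatch can be illustrated with a deliberately simplified Java sketch (both formulas below are hypothetical stand-ins for the two implementations, not the actual Spark code): the same value lands in different buckets depending on which formula computes the key.

```java
public class BucketMismatch {
    // Stand-in for the test UDF's bucketing: value % numBuckets.
    static int udfBucket(long value, int numBuckets) {
        return (int) (value % numBuckets);
    }

    // Stand-in for hashCode-based bucketing in the InMemoryBaseTable style
    // (simplified here to a single long column).
    static int hashBucket(long value, int numBuckets) {
        return (Long.hashCode(value) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        long v = 4294967297L; // 2^32 + 1: Long.hashCode folds the high bits in
        System.out.println(udfBucket(v, 4));  // 1
        System.out.println(hashBucket(v, 4)); // 0 -- a different bucket for the same value
    }
}
```

Once the two sides disagree like this, the rows the scan reports for a partition key and the rows the shuffle routes to that key's partition diverge, which is exactly why the join produces wrong results rather than an error.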
@pan3793, could you please keep your fix in KeyGroupedPartitionedScan.scala and fix the BucketTransform key calculation in InMemoryBaseTable?
You don't need the other changes. `originalPartitionValues` seems unrelated, as it is used only when partially clustered distribution is enabled.
BTW, I'm working on refactoring SPJ based on this idea: #53859 (comment). It looks promising so far, but I need some more days to wrap it up.
```scala
// Do not use `bucket()` in "one side partition" tests as its implementation in
// `InMemoryBaseTable` conflicts with `BucketFunction`
```
Oh, god, @peter-toth, thanks a lot for pointing this out. I wasn't aware of it and have spent a few hours trying to figure out why the SMJ partition key values mismatched and produced wrong results after fixing the `ArrayIndexOutOfBoundsException`...
Actually, the current code changes are just a draft; the test cases have not yet passed. I will try to fix it following your guidance. Thank you again, @peter-toth!
Force-pushed from bbf8c3b to cb78da6
```scala
  case (v, t) =>
    throw new IllegalArgumentException(s"Match: unsupported argument(s) type - ($v, $t)")
}
(acc + valueHash) & 0xFFFFFFFFFFFFL
```
```scala
scala> Long.MaxValue + 1L
res0: Long = -9223372036854775808

scala> (Long.MaxValue + 1L) & 0xFFFFFFFFFFFFL
res1: Long = 0

scala> (Long.MaxValue + 2L) & 0xFFFFFFFFFFFFL
res2: Long = 1
```
Ah, this is needed because `% N` can return negative results, isn't it? That seems like a problem at both places, as `bucket(N)` should return at most N different values.
Should we use Math.floorMod()?
The bucket num should be >= 1 (though it seems we don't have such a check), so `(non_negative_long % positive_int)` should always be non-negative?
Yeah, that's correct, but if this (lines 93 to 95 in 3405255):
```scala
override def produceResult(input: InternalRow): Int = {
  (input.getLong(1) % input.getInt(0)).toInt
}
```
used `Math.floorMod()`, then we wouldn't need that `& 0xFFFFFFFFFFFFL` non-negative conversion.
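The sign behavior in question is easy to check directly (Java shown here; Scala's `%` behaves the same way on negative operands):

```java
public class ModSign {
    public static void main(String[] args) {
        int hash = -7; // e.g. a hash value that came out negative
        int n = 4;
        // Java/Scala % keeps the sign of the dividend: -7 % 4 == -3.
        System.out.println(hash % n);
        // Math.floorMod always returns a result in [0, n): floorMod(-7, 4) == 1.
        System.out.println(Math.floorMod(hash, n));
    }
}
```

So with `Math.floorMod()` the bucket index is already in `[0, numBuckets)` and no masking step is needed.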
Looks good to me, let's wait for CI.
dongjoon-hyun left a comment
Thanks @pan3793 for the fix and @dongjoon-hyun for the review. Merged to … As the bug affects earlier versions too and the …
Thank you so much again, @pan3793 and @peter-toth! +1 for backporting, too.
…when join keys are less than cluster keys

Fix a `java.lang.ArrayIndexOutOfBoundsException` when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true`, by correcting the `expression` (the full partition expression should be passed instead of the projected one) passed to `KeyGroupedPartitioning#project`.

Also, fix a test code issue: change the calculation result of `BucketTransform` defined at `InMemoryBaseTable.scala` to match `BucketFunctions` defined at `transformFunctions.scala` (thanks peter-toth for pointing this out!)

It's a bug fix. Some queries that failed when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true` now run normally.

A new UT is added; previously it failed with `ArrayIndexOutOfBoundsException`, now it passes.

```
$ build/sbt "sql/testOnly *KeyGroupedPartitioningSuite -- -z SPARK-55411"
...
[info] - bug *** FAILED *** (1 second, 884 milliseconds)
[info]   java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
[info]   at scala.collection.immutable.ArraySeq$ofRef.apply(ArraySeq.scala:331)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1(partitioning.scala:471)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1$adapted(partitioning.scala:471)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.project(partitioning.scala:471)
[info]   at org.apache.spark.sql.execution.KeyGroupedPartitionedScan.$anonfun$getOutputKeyGroupedPartitioning$5(KeyGroupedPartitionedScan.scala:58)
...
```

UTs affected by the `bucket()` calculation logic change are tuned.

No.

Closes apache#54182 from pan3793/spj-subset-joinkey-bug.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
The issue was introduced by SPARK-44647 (4.0.0). I opened backport PRs to branch-4.1 and branch-4.0.
Late LGTM. Thanks @pan3793 and @peter-toth for the fix!
…when join keys are less than cluster keys

Backport #54182 to branch-4.1

### What changes were proposed in this pull request?

Fix a `java.lang.ArrayIndexOutOfBoundsException` when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true`, by correcting the `expression` (the full partition expression should be passed instead of the projected one) passed to `KeyGroupedPartitioning#project`.

Also, fix a test code issue: change the calculation result of `BucketTransform` defined at `InMemoryBaseTable.scala` to match `BucketFunctions` defined at `transformFunctions.scala` (thanks peter-toth for pointing this out!)

### Why are the changes needed?

It's a bug fix.

### Does this PR introduce _any_ user-facing change?

Some queries that failed when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true` now run normally.

### How was this patch tested?

New UT is added; previously it failed with `ArrayIndexOutOfBoundsException`, now it passes.

```
$ build/sbt "sql/testOnly *KeyGroupedPartitioningSuite -- -z SPARK-55411"
...
[info] - bug *** FAILED *** (1 second, 884 milliseconds)
[info]   java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
[info]   at scala.collection.immutable.ArraySeq$ofRef.apply(ArraySeq.scala:331)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1(partitioning.scala:471)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1$adapted(partitioning.scala:471)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.project(partitioning.scala:471)
[info]   at org.apache.spark.sql.execution.KeyGroupedPartitionedScan.$anonfun$getOutputKeyGroupedPartitioning$5(KeyGroupedPartitionedScan.scala:58)
...
```

UTs affected by the `bucket()` calculation logic change are tuned.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54259 from pan3793/SPARK-55411-4.1.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
What changes were proposed in this pull request?
Fix a `java.lang.ArrayIndexOutOfBoundsException` when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true`, by correcting the `expression` (the full partition expression should be passed instead of the projected one) passed to `KeyGroupedPartitioning#project`.

Also, fix a test code issue: change the calculation result of `BucketTransform` defined at `InMemoryBaseTable.scala` to match `BucketFunctions` defined at `transformFunctions.scala` (thanks @peter-toth for pointing this out!)

Why are the changes needed?
It's a bug fix.
Does this PR introduce any user-facing change?
Some queries that failed when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true` now run normally.

How was this patch tested?
New UT is added; previously it failed with `ArrayIndexOutOfBoundsException`, now it passes.

UTs affected by the `bucket()` calculation logic change are tuned.

Was this patch authored or co-authored using generative AI tooling?
No.