@pan3793 pan3793 commented Feb 11, 2026

Backport #54182 to branch-4.0

What changes were proposed in this pull request?

Fix a java.lang.ArrayIndexOutOfBoundsException thrown when spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true. The fix corrects the expression passed to KeyGroupedPartitioning#project: the full partition expression should be passed, not the already projected one.
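
The failure mode can be modelled in a few lines. This is a minimal sketch in Python, not Spark's Scala code: the `project` helper, the expression strings, and the position list are all illustrative assumptions. It only shows why applying join-key positions to an already projected list goes out of bounds, while applying them to the full list works.

```python
from typing import List, Sequence


def project(expressions: Sequence[str], join_key_positions: Sequence[int]) -> List[str]:
    """Hypothetical stand-in for KeyGroupedPartitioning#project:
    keep only the expressions at the given positions."""
    return [expressions[i] for i in join_key_positions]


# Assumed example: a table partitioned by two expressions, joined on the second only.
full_partition_exprs = ["years(ts)", "bucket(8, id)"]
join_key_positions = [1]

# Intended call: the positions index into the full partition expression list.
projected = project(full_partition_exprs, join_key_positions)


def projecting_twice_fails() -> bool:
    """Model of the bug: the same positions applied to the projected list.
    The list has length 1, so position 1 is out of bounds."""
    try:
        project(projected, join_key_positions)
    except IndexError:
        return True
    return False
```

In this model, `projected` holds the single join-key expression, and `projecting_twice_fails()` reports that the second projection raises, which corresponds to the "Index 1 out of bounds for length 1" error in the test log.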

Also fix a test-code issue: change the value computed by BucketTransform in InMemoryBaseTable.scala to match BucketFunctions in transformFunctions.scala (thanks to peter-toth for pointing this out).

Why are the changes needed?

It's a bug fix.

Does this PR introduce any user-facing change?

Yes. Some queries that previously failed with ArrayIndexOutOfBoundsException when spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true now run normally.

How was this patch tested?

A new unit test is added. Before the fix it failed with ArrayIndexOutOfBoundsException; it now passes.

$ build/sbt "sql/testOnly *KeyGroupedPartitioningSuite -- -z SPARK=55411"
...
[info] - bug *** FAILED *** (1 second, 884 milliseconds)
[info]   java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
[info]   at scala.collection.immutable.ArraySeq$ofRef.apply(ArraySeq.scala:331)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1(partitioning.scala:471)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1$adapted(partitioning.scala:471)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.project(partitioning.scala:471)
[info]   at org.apache.spark.sql.execution.KeyGroupedPartitionedScan.$anonfun$getOutputKeyGroupedPartitioning$5(KeyGroupedPartitionedScan.scala:58)
...

Unit tests affected by the change to the bucket() calculation logic are updated accordingly.

Was this patch authored or co-authored using generative AI tooling?

No.

pan3793 commented Feb 11, 2026

The Python UDF failures are likely unrelated to this change; #54263 attempts to fix them.

…when join keys are less than cluster keys

Closes apache#54182 from pan3793/spj-subset-joinkey-bug.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Peter Toth <peter.toth@gmail.com>

@peter-toth peter-toth left a comment

LGTM, pending CI.

peter-toth pushed a commit that referenced this pull request Feb 11, 2026
…when join keys are less than cluster keys

Closes #54260 from pan3793/SPARK-55411-4.0.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
@peter-toth peter-toth closed this Feb 11, 2026
@peter-toth

Thank you @pan3793 and @szehon-ho.

Merged to branch-4.0 (4.0.3)
