Commit 5db8778
[SPARK-43781][SQL] Fix IllegalStateException when cogrouping two datasets derived from the same source
### What changes were proposed in this pull request?
When cogrouping two datasets derived from the same source, e.g.:
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Assumes an active SparkSession named `spark` is in scope.
val inputType = StructType(Array(StructField("id", LongType, false),
  StructField("type", StringType, false)))
val keyType = StructType(Array(StructField("id", LongType, false)))
val inputRows = new java.util.ArrayList[Row]()
inputRows.add(Row(1L, "foo"))
inputRows.add(Row(1L, "bar"))
inputRows.add(Row(2L, "foo"))
val input = spark.createDataFrame(inputRows, inputType)
// Both grouped datasets are derived from the same `input` DataFrame.
val fooGroups = input.filter("type = 'foo'").groupBy("id").as(RowEncoder(keyType),
  RowEncoder(inputType))
val barGroups = input.filter("type = 'bar'").groupBy("id").as(RowEncoder(keyType),
  RowEncoder(inputType))
val result = fooGroups.cogroup(barGroups) { case (row, iterator, iterator1) =>
  iterator.toSeq ++ iterator1.toSeq
}(RowEncoder(inputType)).collect()
```
The following error is reported:
```
21:03:27.651 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.IllegalStateException: Couldn't find id#19L in [id#0L,type#1]
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
...
```
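For comparison, the rows the example should return when it runs without the exception can be modelled with plain Scala collections. This is only a sketch of cogroup semantics as used in the snippet above; `CogroupExpectation` and `Rec` are hypothetical names, no Spark API is involved, and the key ordering is imposed here for readability (Spark does not guarantee a row order).

```scala
// Plain-Scala model of the example's data; no Spark APIs are used here.
object CogroupExpectation {
  final case class Rec(id: Long, tpe: String)

  val input: Seq[Rec] = Seq(Rec(1L, "foo"), Rec(1L, "bar"), Rec(2L, "foo"))

  // Mirrors filter(...).groupBy("id") on each side.
  val fooGroups: Map[Long, Seq[Rec]] = input.filter(_.tpe == "foo").groupBy(_.id)
  val barGroups: Map[Long, Seq[Rec]] = input.filter(_.tpe == "bar").groupBy(_.id)

  // Mirrors cogroup: for every key present on either side,
  // emit the left-side rows followed by the right-side rows.
  val result: Seq[Rec] =
    (fooGroups.keySet ++ barGroups.keySet).toSeq.sorted.flatMap { key =>
      fooGroups.getOrElse(key, Seq.empty) ++ barGroups.getOrElse(key, Seq.empty)
    }
}
```

Under this model, key 1 contributes one `foo` row and one `bar` row, and key 2 contributes a single `foo` row.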
The reason is that `DeduplicateRelations` rewrites `LocalRelation` but cannot rewrite `left(right)Group` and `left(right)Attr` in `CoGroup`. In fact, `Join` faces the same situation, but `Join` regenerates the plan when it is invoked in order to avoid it. Please refer to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1089
This PR lets `DeduplicateRelations` handle the `CoGroup` case.
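The general failure mode can be illustrated with a toy model. The sketch below is not Spark's Catalyst code and not the patch itself: `Attr`, `Relation`, `CoGroupNode` and `DedupSketch` are hypothetical names, and the model only shows that when a child relation receives fresh attribute ids, any references held by the parent node must be remapped with the same mapping or they no longer bind.

```scala
// Toy model only: hypothetical types, NOT Spark's Catalyst classes.
final case class Attr(name: String, exprId: Long)
final case class Relation(output: Seq[Attr])

// A node that keeps its own references to its children's attributes,
// loosely modelled on the group attributes that CoGroup holds.
final case class CoGroupNode(
    leftGroup: Seq[Attr],
    rightGroup: Seq[Attr],
    left: Relation,
    right: Relation)

object DedupSketch {
  // References bind only if each side's child produces exactly those attributes.
  def binds(n: CoGroupNode): Boolean =
    n.leftGroup.forall(n.left.output.contains) &&
      n.rightGroup.forall(n.right.output.contains)

  // Assign fresh ids to a relation's attributes and return the old -> new mapping.
  def freshen(rel: Relation, firstFreeId: Long): (Relation, Map[Attr, Attr]) = {
    val mapping: Map[Attr, Attr] = rel.output.zipWithIndex.map {
      case (attr, i) => attr -> attr.copy(exprId = firstFreeId + i)
    }.toMap
    (Relation(rel.output.map(mapping)), mapping)
  }

  // Models the failure: the right child is rewritten, the node's references are not.
  def dedupChildOnly(n: CoGroupNode, firstFreeId: Long): CoGroupNode = {
    val (freshRight, _) = freshen(n.right, firstFreeId)
    n.copy(right = freshRight)
  }

  // Models the remedy: the same mapping is also applied to the right-side references.
  def dedupWithReferences(n: CoGroupNode, firstFreeId: Long): CoGroupNode = {
    val (freshRight, mapping) = freshen(n.right, firstFreeId)
    n.copy(
      right = freshRight,
      rightGroup = n.rightGroup.map(a => mapping.getOrElse(a, a)))
  }

  // Both sides derive from the same source, so they share attribute ids.
  val source: Relation = Relation(Seq(Attr("id", 0L), Attr("type", 1L)))
  val node: CoGroupNode =
    CoGroupNode(Seq(source.output.head), Seq(source.output.head), source, source)
}
```

In this model, `DedupSketch.dedupChildOnly(DedupSketch.node, 100L)` leaves `rightGroup` pointing at an attribute id that the rewritten right child no longer produces, while `dedupWithReferences` keeps the two in sync.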
### Why are the changes needed?
Fix IllegalStateException when cogrouping two datasets derived from the same source
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test.
Closes #41554 from Hisoka-X/SPARK-43781_cogrouping_two_datasets.
Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

1 parent 3164ff5, commit 5db8778
File tree: 2 files changed, +63 -2 lines changed
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis
- sql/core/src/test/scala/org/apache/spark/sql

Lines changed: 37 additions & 2 deletions
Lines changed: 26 additions & 0 deletions