Commit fb0894b
[SPARK-17698][SQL] Join predicates should not contain filter clauses
## What changes were proposed in this pull request?
Jira : https://issues.apache.org/jira/browse/SPARK-17698
`ExtractEquiJoinKeys` is incorrectly using filter predicates as the join condition for joins. `canEvaluate` [0] tries to see if the an `Expression` can be evaluated using output of a given `Plan`. In case of filter predicates (eg. `a.id='1'`), the `Expression` passed for the right hand side (ie. '1' ) is a `Literal` which does not have any attribute references. Thus `expr.references` is an empty set which theoretically is a subset of any set. This leads to `canEvaluate` returning `true` and `a.id='1'` is treated as a join predicate. While this does not lead to incorrect results but in case of bucketed + sorted tables, we might miss out on avoiding un-necessary shuffle + sort. See example below:
[0] : https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L91
eg.
```
val df = (1 until 10).toDF("id").coalesce(1)
hc.sql("DROP TABLE IF EXISTS table1").collect
df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1")
hc.sql("DROP TABLE IF EXISTS table2").collect
df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2")
sqlContext.sql("""
SELECT a.id, b.id
FROM table1 a
FULL OUTER JOIN table2 b
ON a.id = b.id AND a.id='1' AND b.id='1'
""").explain(true)
```
BEFORE: This is doing shuffle + sort over table scan outputs which is not needed as both tables are bucketed and sorted on the same columns and have same number of buckets. This should be a single stage job.
```
SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as double)], FullOuter
:- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200)
: +- *FileScan parquet default.table1[id#38] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
+- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200)
+- *FileScan parquet default.table2[id#39] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
```
AFTER :
```
SortMergeJoin [id#32], [id#33], FullOuter, ((cast(id#32 as double) = 1.0) && (cast(id#33 as double) = 1.0))
:- *FileScan parquet default.table1[id#32] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
+- *FileScan parquet default.table2[id#33] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
```
## How was this patch tested?
- Added a new test case for this scenario : `SPARK-17698 Join predicates should not contain filter clauses`
- Ran all the tests in `BucketedReadSuite`
Author: Tejas Patil <tejasp@fb.com>
Closes #15272 from tejasapatil/SPARK-17698_join_predicate_filter_clause.1 parent e895bc2 commit fb0894b
File tree
4 files changed
+109
-26
lines changed- sql
- catalyst/src/main/scala/org/apache/spark/sql/catalyst
- expressions
- optimizer
- planning
- hive/src/test/scala/org/apache/spark/sql/sources
4 files changed
+109
-26
lines changedLines changed: 3 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
84 | 84 | | |
85 | 85 | | |
86 | 86 | | |
87 | | - | |
88 | | - | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
89 | 90 | | |
90 | 91 | | |
91 | 92 | | |
| |||
Lines changed: 3 additions & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
65 | 65 | | |
66 | 66 | | |
67 | 67 | | |
68 | | - | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
69 | 71 | | |
70 | 72 | | |
71 | 73 | | |
| |||
Lines changed: 2 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
112 | 112 | | |
113 | 113 | | |
114 | 114 | | |
| 115 | + | |
115 | 116 | | |
116 | 117 | | |
117 | 118 | | |
| |||
125 | 126 | | |
126 | 127 | | |
127 | 128 | | |
| 129 | + | |
128 | 130 | | |
129 | 131 | | |
130 | 132 | | |
| |||
Lines changed: 101 additions & 23 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
235 | 235 | | |
236 | 236 | | |
237 | 237 | | |
238 | | - | |
| 238 | + | |
| 239 | + | |
239 | 240 | | |
240 | 241 | | |
241 | 242 | | |
| |||
268 | 269 | | |
269 | 270 | | |
270 | 271 | | |
271 | | - | |
| 272 | + | |
272 | 273 | | |
273 | 274 | | |
274 | 275 | | |
275 | 276 | | |
276 | | - | |
| 277 | + | |
277 | 278 | | |
278 | 279 | | |
279 | 280 | | |
| |||
297 | 298 | | |
298 | 299 | | |
299 | 300 | | |
300 | | - | |
| 301 | + | |
301 | 302 | | |
302 | 303 | | |
303 | 304 | | |
304 | 305 | | |
305 | 306 | | |
306 | | - | |
| 307 | + | |
| 308 | + | |
| 309 | + | |
| 310 | + | |
| 311 | + | |
| 312 | + | |
| 313 | + | |
307 | 314 | | |
308 | 315 | | |
309 | 316 | | |
310 | 317 | | |
311 | 318 | | |
312 | | - | |
| 319 | + | |
| 320 | + | |
| 321 | + | |
| 322 | + | |
| 323 | + | |
| 324 | + | |
| 325 | + | |
313 | 326 | | |
314 | 327 | | |
315 | 328 | | |
316 | 329 | | |
317 | | - | |
| 330 | + | |
| 331 | + | |
| 332 | + | |
| 333 | + | |
| 334 | + | |
| 335 | + | |
| 336 | + | |
318 | 337 | | |
319 | 338 | | |
320 | 339 | | |
321 | 340 | | |
322 | 341 | | |
323 | | - | |
| 342 | + | |
| 343 | + | |
| 344 | + | |
| 345 | + | |
| 346 | + | |
| 347 | + | |
| 348 | + | |
324 | 349 | | |
325 | 350 | | |
326 | 351 | | |
327 | 352 | | |
328 | 353 | | |
329 | | - | |
| 354 | + | |
| 355 | + | |
| 356 | + | |
| 357 | + | |
| 358 | + | |
| 359 | + | |
| 360 | + | |
330 | 361 | | |
331 | 362 | | |
332 | 363 | | |
333 | 364 | | |
334 | | - | |
| 365 | + | |
| 366 | + | |
| 367 | + | |
| 368 | + | |
| 369 | + | |
| 370 | + | |
| 371 | + | |
335 | 372 | | |
336 | 373 | | |
337 | 374 | | |
338 | 375 | | |
339 | 376 | | |
340 | | - | |
| 377 | + | |
| 378 | + | |
| 379 | + | |
| 380 | + | |
| 381 | + | |
| 382 | + | |
| 383 | + | |
341 | 384 | | |
342 | 385 | | |
343 | 386 | | |
344 | 387 | | |
345 | 388 | | |
346 | 389 | | |
347 | | - | |
348 | | - | |
349 | | - | |
| 390 | + | |
| 391 | + | |
| 392 | + | |
| 393 | + | |
| 394 | + | |
| 395 | + | |
| 396 | + | |
350 | 397 | | |
351 | 398 | | |
352 | 399 | | |
353 | 400 | | |
354 | 401 | | |
355 | 402 | | |
356 | 403 | | |
357 | | - | |
358 | | - | |
359 | | - | |
| 404 | + | |
| 405 | + | |
| 406 | + | |
| 407 | + | |
| 408 | + | |
| 409 | + | |
| 410 | + | |
360 | 411 | | |
361 | 412 | | |
362 | 413 | | |
363 | 414 | | |
364 | 415 | | |
365 | 416 | | |
366 | 417 | | |
367 | | - | |
368 | | - | |
369 | | - | |
| 418 | + | |
| 419 | + | |
| 420 | + | |
| 421 | + | |
| 422 | + | |
| 423 | + | |
| 424 | + | |
370 | 425 | | |
371 | 426 | | |
372 | 427 | | |
373 | 428 | | |
374 | 429 | | |
375 | 430 | | |
376 | 431 | | |
377 | | - | |
378 | | - | |
379 | | - | |
| 432 | + | |
| 433 | + | |
| 434 | + | |
| 435 | + | |
| 436 | + | |
| 437 | + | |
| 438 | + | |
380 | 439 | | |
381 | 440 | | |
382 | 441 | | |
| |||
408 | 467 | | |
409 | 468 | | |
410 | 469 | | |
| 470 | + | |
| 471 | + | |
| 472 | + | |
| 473 | + | |
| 474 | + | |
| 475 | + | |
| 476 | + | |
| 477 | + | |
| 478 | + | |
| 479 | + | |
| 480 | + | |
| 481 | + | |
| 482 | + | |
| 483 | + | |
| 484 | + | |
| 485 | + | |
| 486 | + | |
| 487 | + | |
| 488 | + | |
411 | 489 | | |
412 | 490 | | |
413 | 491 | | |
| |||
0 commit comments