-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unsupported Filter over SemiJoin #2877
Comments
CC: @rui-mo Rui, I'm curious how Gluten plans with query (TPC-DS q45). Does it use Spark's Java operators for the semi join? |
Masha, I checked both q45 and above query, and found they were similar. In Spark, it is planned as a ShuffledHashJoin with existence join type and a filter as below.
ExistenceJoin in Spark appends a column of boolean type after the left table to illustrate whether one row is joined or not. This join type is not supported in Gluten for now, so in TPC-DS q45, this join is executed by vanilla Spark and this filter is executed by Velox. But we are working on enabling more operators and functions to offload more computing into Velox. |
@rui-mo Rui, thank you for the clarifying. This is very helpful. ExistingJoin in Spark appears to work similar to SemiJoin in Presto. Both act as a projection by emitting all rows from the left side with an additional boolean column set to true if there is a match, false if there is no match and null if "unknown" (left key is null and build side is not empty or there is no match and build side has a null). In Velox, SemiJoin acts as a filter by emitting a subset of rows from the left side for which there is a match on the right side. There is a left semi join and right semi join to allow for using the smaller side as the build side. Both semi joins support filter pushdown from build to probe side. As @XuPingyong pointed out, the existing implementation of SemiJoin in Velox doesn’t support queries like TPC-DS q45. Presto’s and Spark's semantics for the semi join do not allow for dynamic filter pushdown. Hence, we may want to keep Velox’s implementation, but extend it to allow the queries above. One option is to extend the plan node to allow for specifying whether semi join should act as a filter, i.e. produce a subset of the left side rows, or as a project, i.e. emit all the left side rows with an extra boolean column. CC: @oerling |
…tor#3068) Summary: Semi joins are used to implement correlated IN and EXISTS subqueries. For example, the following query can be implemented as a semi join. In this case semi join acts as a filter. It returns a subset of the left-side rows that have a match on the right side. This is a kLeftSemiFilter join type. `SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key)` If left side is much smaller than the right side, it is more efficient to build hash table from the left side. This can be achieved by using kRightSemiFilter join type. kLeftSemiFilter and kRightSemiFilter joins are symmetrical, i.e. `kLeftSemiFilter(A, B) == kRightSemiFilter(B, A)`. Similar to how `kLeft(A, B) == kRight(B, A)`. A semi join may have an additional non-equi filter. `SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key AND u.x > t.y)` Furthermore, the semi join may be combined with another filter using a conjunct other than AND. `SELECT * FROM t WHERE t.y > 10 OR EXISTS (SELECT * FROM u WHERE u.key = t.key)` One way to implement this query is to use semi join as a project, i.e. return all left-side rows with an additional boolean flag that indicates whether there is a match on the right side, then use that boolean to evaluate the filter: `t.y > 10 OR match`. This is a kLeftSemiProject join type. Both Spark and Presto plan correlated subqueries this way. Similarly to kLeftSemiFilter join type, if left side is much smaller than the right side, it is more efficient to build hash table from the left side. This can be achieved by using kRightSemiProject join type. kLeftSemiProject and kRightSemiProject joins are symmetrical: `kLeftSemiProject(A, B) == kRightSemiProject(B, A)`. Fixes facebookincubator#2877 Pull Request resolved: facebookincubator#3068 Reviewed By: xiaoxmeng Differential Revision: D40939317 Pulled By: mbasmanova fbshipit-source-id: 63867aba9ddfb969b3a14edfcea0aa46d61bdde9
…tor#3068) Summary: Semi joins are used to implement correlated IN and EXISTS subqueries. For example, the following query can be implemented as a semi join. In this case semi join acts as a filter. It returns a subset of the left-side rows that have a match on the right side. This is a kLeftSemiFilter join type. `SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key)` If left side is much smaller than the right side, it is more efficient to build hash table from the left side. This can be achieved by using kRightSemiFilter join type. kLeftSemiFilter and kRightSemiFilter joins are symmetrical, i.e. `kLeftSemiFilter(A, B) == kRightSemiFilter(B, A)`. Similar to how `kLeft(A, B) == kRight(B, A)`. A semi join may have an additional non-equi filter. `SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key AND u.x > t.y)` Furthermore, the semi join may be combined with another filter using a conjunct other than AND. `SELECT * FROM t WHERE t.y > 10 OR EXISTS (SELECT * FROM u WHERE u.key = t.key)` One way to implement this query is to use semi join as a project, i.e. return all left-side rows with an additional boolean flag that indicates whether there is a match on the right side, then use that boolean to evaluate the filter: `t.y > 10 OR match`. This is a kLeftSemiProject join type. Both Spark and Presto plan correlated subqueries this way. Similarly to kLeftSemiFilter join type, if left side is much smaller than the right side, it is more efficient to build hash table from the left side. This can be achieved by using kRightSemiProject join type. kLeftSemiProject and kRightSemiProject joins are symmetrical: `kLeftSemiProject(A, B) == kRightSemiProject(B, A)`. Fixes facebookincubator#2877 Pull Request resolved: facebookincubator#3068 Reviewed By: xiaoxmeng Differential Revision: D40939317 Pulled By: mbasmanova fbshipit-source-id: adf55a8d1a6c067c4d6ecf4b4d510bede7955f1a
@XuPingyong I added support for semi join project in #3068. Would you try it out and let us know how it goes? |
Thanks very much. LGTM. |
@XuPingyong That's right. I didn't update the plan translation in Prestissimo yet to use the kSemiJoinProject join type. Let me send out a PR with these changes. |
@storm-dance Here is a PR to update plan translation in Prestissimo to use kLeftSemiProject join: |
Thanks. my problem is sloved and I appricate it. |
@storm-dance Thank you for confirming that these changes are sufficient to fix the original issue. |
Semi-joins with additional filters are not yet supported in Prestissimo / Velox:
From @XuPingyong
TPC-DS q45 fails with: Unsupported Filter over SemiJoin
Originally posted by @XuPingyong in #2820 (comment)
The text was updated successfully, but these errors were encountered: