Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsupported Filter over SemiJoin #2877

Closed
aditi-pandit opened this issue Oct 19, 2022 · 9 comments
Closed

Unsupported Filter over SemiJoin #2877

aditi-pandit opened this issue Oct 19, 2022 · 9 comments
Assignees
Labels
enhancement New feature or request operators

Comments

@aditi-pandit
Copy link
Collaborator

aditi-pandit commented Oct 19, 2022

Semi-joins with additional filters are not yet supported in Prestissimo / Velox:

SELECT * FROM lineitem 
WHERE orderkey < 10 OR orderkey IN (SELECT orderkey FROM orders WHERE (orderkey + custkey) % 2 = 0);

VeloxUserError:  Unsupported Filter over SemiJoin: ...

From @XuPingyong

TPC-DS q45 fails with: Unsupported Filter over SemiJoin

--TPC-DS Q45
select  ca_zip, ca_city, sum(ws_sales_price)
 from web_sales, customer, customer_address, date_dim, item
 where ws_bill_customer_sk = c_customer_sk
 	and c_current_addr_sk = ca_address_sk
 	and ws_item_sk = i_item_sk
 	and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475', '85392', '85460', '80348', '81792')
 	      or
 	      i_item_id in (select i_item_id
                             from item
                             where i_item_sk in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
                             )
 	    )
 	and ws_sold_date_sk = d_date_sk
 	and d_qoy = 2 and d_year = 2002
 group by ca_zip, ca_city
 order by ca_zip, ca_city
 limit 100;

Originally posted by @XuPingyong in #2820 (comment)

@mbasmanova mbasmanova added the enhancement New feature or request label Oct 19, 2022
@mbasmanova mbasmanova changed the title Semi-join with filter error Unsupported Filter over SemiJoin Oct 19, 2022
@mbasmanova
Copy link
Contributor

CC: @rui-mo Rui, I'm curious how Gluten plans with query (TPC-DS q45). Does it use Spark's Java operators for the semi join?

@rui-mo
Copy link
Collaborator

rui-mo commented Oct 19, 2022

CC: @rui-mo Rui, I'm curious how Gluten plans with query (TPC-DS q45). Does it use Spark's Java operators for the semi join?

Masha, I checked both q45 and above query, and found they were similar. In Spark, it is planned as a ShuffledHashJoin with existence join type and a filter as below.

(122) ShuffledHashJoin
Left keys [1]: [i_item_id#363]
Right keys [1]: [i_item_id#852]
Join condition: None

(123) Filter
Input [5]: [ws_sales_price#751, ca_city#244, ca_zip#247, i_item_id#363, exists#913]
Condition : (substr(ca_zip#247, 1, 5) IN (85669,86197,88274,83405,86475,85392,85460,80348,81792) OR exists#913)

ExistenceJoin in Spark appends a column of boolean type after the left table to illustrate whether one row is joined or not. This join type is not supported in Gluten for now, so in TPC-DS q45, this join is executed by vanilla Spark and this filter is executed by Velox. But we are working on enabling more operators and functions to offload more computing into Velox.

@mbasmanova
Copy link
Contributor

@rui-mo Rui, thank you for the clarifying. This is very helpful. ExistingJoin in Spark appears to work similar to SemiJoin in Presto. Both act as a projection by emitting all rows from the left side with an additional boolean column set to true if there is a match, false if there is no match and null if "unknown" (left key is null and build side is not empty or there is no match and build side has a null).

In Velox, SemiJoin acts as a filter by emitting a subset of rows from the left side for which there is a match on the right side. There is a left semi join and right semi join to allow for using the smaller side as the build side. Both semi joins support filter pushdown from build to probe side.

As @XuPingyong pointed out, the existing implementation of SemiJoin in Velox doesn’t support queries like TPC-DS q45.

Presto’s and Spark's semantics for the semi join do not allow for dynamic filter pushdown. Hence, we may want to keep Velox’s implementation, but extend it to allow the queries above. One option is to extend the plan node to allow for specifying whether semi join should act as a filter, i.e. produce a subset of the left side rows, or as a project, i.e. emit all the left side rows with an extra boolean column.

CC: @oerling

@mbasmanova mbasmanova self-assigned this Oct 19, 2022
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Nov 2, 2022
…tor#3068)

Summary:
Semi joins are used to implement correlated IN and EXISTS subqueries.

For example, the following query can be implemented as a semi join. In this case semi
join acts as a filter. It returns a subset of the left-side rows that have a
match on the right side. This is a kLeftSemiFilter join type.

`SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key)`

If left side is much smaller than the right side, it is more efficient to build
hash table from the left side. This can be achieved by using kRightSemiFilter
join type. kLeftSemiFilter and kRightSemiFilter joins are symmetrical, i.e.
`kLeftSemiFilter(A, B) == kRightSemiFilter(B, A)`. Similar to how
`kLeft(A, B) == kRight(B, A)`.

A semi join may have an additional non-equi filter.

`SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key AND u.x >
t.y)`

Furthermore, the semi join may be combined with another filter using a conjunct
other than AND.

`SELECT * FROM t WHERE t.y > 10 OR EXISTS (SELECT * FROM u WHERE u.key =
t.key)`

One way to implement this query is to use semi join as a project, i.e. return
all left-side rows with an additional boolean flag that indicates whether there is a
match on the right side, then use that boolean to evaluate the filter:
`t.y > 10 OR match`. This is a kLeftSemiProject join type. Both Spark and Presto
plan correlated subqueries this way.

Similarly to kLeftSemiFilter join type, if left side is much smaller than the right side,
it is more efficient to build hash table from the left side. This can be achieved by using
kRightSemiProject join type. kLeftSemiProject and kRightSemiProject joins are
symmetrical: `kLeftSemiProject(A, B) == kRightSemiProject(B, A)`.

Fixes facebookincubator#2877

Pull Request resolved: facebookincubator#3068

Reviewed By: xiaoxmeng

Differential Revision: D40939317

Pulled By: mbasmanova

fbshipit-source-id: 63867aba9ddfb969b3a14edfcea0aa46d61bdde9
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Nov 2, 2022
…tor#3068)

Summary:
Semi joins are used to implement correlated IN and EXISTS subqueries.

For example, the following query can be implemented as a semi join. In this case semi
join acts as a filter. It returns a subset of the left-side rows that have a
match on the right side. This is a kLeftSemiFilter join type.

`SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key)`

If left side is much smaller than the right side, it is more efficient to build
hash table from the left side. This can be achieved by using kRightSemiFilter
join type. kLeftSemiFilter and kRightSemiFilter joins are symmetrical, i.e.
`kLeftSemiFilter(A, B) == kRightSemiFilter(B, A)`. Similar to how
`kLeft(A, B) == kRight(B, A)`.

A semi join may have an additional non-equi filter.

`SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key AND u.x >
t.y)`

Furthermore, the semi join may be combined with another filter using a conjunct
other than AND.

`SELECT * FROM t WHERE t.y > 10 OR EXISTS (SELECT * FROM u WHERE u.key =
t.key)`

One way to implement this query is to use semi join as a project, i.e. return
all left-side rows with an additional boolean flag that indicates whether there is a
match on the right side, then use that boolean to evaluate the filter:
`t.y > 10 OR match`. This is a kLeftSemiProject join type. Both Spark and Presto
plan correlated subqueries this way.

Similarly to kLeftSemiFilter join type, if left side is much smaller than the right side,
it is more efficient to build hash table from the left side. This can be achieved by using
kRightSemiProject join type. kLeftSemiProject and kRightSemiProject joins are
symmetrical: `kLeftSemiProject(A, B) == kRightSemiProject(B, A)`.

Fixes facebookincubator#2877

Pull Request resolved: facebookincubator#3068

Reviewed By: xiaoxmeng

Differential Revision: D40939317

Pulled By: mbasmanova

fbshipit-source-id: adf55a8d1a6c067c4d6ecf4b4d510bede7955f1a
@mbasmanova
Copy link
Contributor

@XuPingyong I added support for semi join project in #3068. Would you try it out and let us know how it goes?

@storm-dance
Copy link

Thanks very much. LGTM.
But the error "Unsupported Filter over SemiJoin" still exists in prestodb/presto native-engine.

@mbasmanova
Copy link
Contributor

@XuPingyong That's right. I didn't update the plan translation in Prestissimo yet to use the kSemiJoinProject join type. Let me send out a PR with these changes.

@mbasmanova
Copy link
Contributor

mbasmanova commented Nov 10, 2022

@storm-dance Here is a PR to update plan translation in Prestissimo to use kLeftSemiProject join:

prestodb/presto#18616

@storm-dance
Copy link

@storm-dance Here is a PR to update plan translation in Prestissimo to use kLeftSemiProject join:

prestodb/presto#18616

Thanks. my problem is sloved and I appricate it.

@mbasmanova
Copy link
Contributor

@storm-dance Thank you for confirming that these changes are sufficient to fix the original issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request operators
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants