Unsupported Filter over SemiJoin #2877

aditi-pandit · 2022-10-19T00:50:29Z

Semi-joins with additional filters are not yet supported in Prestissimo / Velox:

SELECT * FROM lineitem 
WHERE orderkey < 10 OR orderkey IN (SELECT orderkey FROM orders WHERE (orderkey + custkey) % 2 = 0);

VeloxUserError:  Unsupported Filter over SemiJoin: ...

From @XuPingyong

TPC-DS q45 fails with: Unsupported Filter over SemiJoin

--TPC-DS Q45
select  ca_zip, ca_city, sum(ws_sales_price)
 from web_sales, customer, customer_address, date_dim, item
 where ws_bill_customer_sk = c_customer_sk
 	and c_current_addr_sk = ca_address_sk
 	and ws_item_sk = i_item_sk
 	and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475', '85392', '85460', '80348', '81792')
 	      or
 	      i_item_id in (select i_item_id
                             from item
                             where i_item_sk in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
                             )
 	    )
 	and ws_sold_date_sk = d_date_sk
 	and d_qoy = 2 and d_year = 2002
 group by ca_zip, ca_city
 order by ca_zip, ca_city
 limit 100;

Originally posted by @XuPingyong in #2820 (comment)

The text was updated successfully, but these errors were encountered:

mbasmanova · 2022-10-19T00:55:51Z

CC: @rui-mo Rui, I'm curious how Gluten plans with query (TPC-DS q45). Does it use Spark's Java operators for the semi join?

rui-mo · 2022-10-19T03:32:19Z

CC: @rui-mo Rui, I'm curious how Gluten plans with query (TPC-DS q45). Does it use Spark's Java operators for the semi join?

Masha, I checked both q45 and above query, and found they were similar. In Spark, it is planned as a ShuffledHashJoin with existence join type and a filter as below.

(122) ShuffledHashJoin
Left keys [1]: [i_item_id#363]
Right keys [1]: [i_item_id#852]
Join condition: None

(123) Filter
Input [5]: [ws_sales_price#751, ca_city#244, ca_zip#247, i_item_id#363, exists#913]
Condition : (substr(ca_zip#247, 1, 5) IN (85669,86197,88274,83405,86475,85392,85460,80348,81792) OR exists#913)

ExistenceJoin in Spark appends a column of boolean type after the left table to illustrate whether one row is joined or not. This join type is not supported in Gluten for now, so in TPC-DS q45, this join is executed by vanilla Spark and this filter is executed by Velox. But we are working on enabling more operators and functions to offload more computing into Velox.

mbasmanova · 2022-10-19T11:35:54Z

@rui-mo Rui, thank you for the clarifying. This is very helpful. ExistingJoin in Spark appears to work similar to SemiJoin in Presto. Both act as a projection by emitting all rows from the left side with an additional boolean column set to true if there is a match, false if there is no match and null if "unknown" (left key is null and build side is not empty or there is no match and build side has a null).

In Velox, SemiJoin acts as a filter by emitting a subset of rows from the left side for which there is a match on the right side. There is a left semi join and right semi join to allow for using the smaller side as the build side. Both semi joins support filter pushdown from build to probe side.

As @XuPingyong pointed out, the existing implementation of SemiJoin in Velox doesn’t support queries like TPC-DS q45.

Presto’s and Spark's semantics for the semi join do not allow for dynamic filter pushdown. Hence, we may want to keep Velox’s implementation, but extend it to allow the queries above. One option is to extend the plan node to allow for specifying whether semi join should act as a filter, i.e. produce a subset of the left side rows, or as a project, i.e. emit all the left side rows with an extra boolean column.

CC: @oerling

…tor#3068) Summary: Semi joins are used to implement correlated IN and EXISTS subqueries. For example, the following query can be implemented as a semi join. In this case semi join acts as a filter. It returns a subset of the left-side rows that have a match on the right side. This is a kLeftSemiFilter join type. `SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key)` If left side is much smaller than the right side, it is more efficient to build hash table from the left side. This can be achieved by using kRightSemiFilter join type. kLeftSemiFilter and kRightSemiFilter joins are symmetrical, i.e. `kLeftSemiFilter(A, B) == kRightSemiFilter(B, A)`. Similar to how `kLeft(A, B) == kRight(B, A)`. A semi join may have an additional non-equi filter. `SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key AND u.x > t.y)` Furthermore, the semi join may be combined with another filter using a conjunct other than AND. `SELECT * FROM t WHERE t.y > 10 OR EXISTS (SELECT * FROM u WHERE u.key = t.key)` One way to implement this query is to use semi join as a project, i.e. return all left-side rows with an additional boolean flag that indicates whether there is a match on the right side, then use that boolean to evaluate the filter: `t.y > 10 OR match`. This is a kLeftSemiProject join type. Both Spark and Presto plan correlated subqueries this way. Similarly to kLeftSemiFilter join type, if left side is much smaller than the right side, it is more efficient to build hash table from the left side. This can be achieved by using kRightSemiProject join type. kLeftSemiProject and kRightSemiProject joins are symmetrical: `kLeftSemiProject(A, B) == kRightSemiProject(B, A)`. Fixes facebookincubator#2877 Pull Request resolved: facebookincubator#3068 Reviewed By: xiaoxmeng Differential Revision: D40939317 Pulled By: mbasmanova fbshipit-source-id: 63867aba9ddfb969b3a14edfcea0aa46d61bdde9

…tor#3068) Summary: Semi joins are used to implement correlated IN and EXISTS subqueries. For example, the following query can be implemented as a semi join. In this case semi join acts as a filter. It returns a subset of the left-side rows that have a match on the right side. This is a kLeftSemiFilter join type. `SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key)` If left side is much smaller than the right side, it is more efficient to build hash table from the left side. This can be achieved by using kRightSemiFilter join type. kLeftSemiFilter and kRightSemiFilter joins are symmetrical, i.e. `kLeftSemiFilter(A, B) == kRightSemiFilter(B, A)`. Similar to how `kLeft(A, B) == kRight(B, A)`. A semi join may have an additional non-equi filter. `SELECT * FROM t WHERE EXISTS (SELECT * FROM u WHERE u.key = t.key AND u.x > t.y)` Furthermore, the semi join may be combined with another filter using a conjunct other than AND. `SELECT * FROM t WHERE t.y > 10 OR EXISTS (SELECT * FROM u WHERE u.key = t.key)` One way to implement this query is to use semi join as a project, i.e. return all left-side rows with an additional boolean flag that indicates whether there is a match on the right side, then use that boolean to evaluate the filter: `t.y > 10 OR match`. This is a kLeftSemiProject join type. Both Spark and Presto plan correlated subqueries this way. Similarly to kLeftSemiFilter join type, if left side is much smaller than the right side, it is more efficient to build hash table from the left side. This can be achieved by using kRightSemiProject join type. kLeftSemiProject and kRightSemiProject joins are symmetrical: `kLeftSemiProject(A, B) == kRightSemiProject(B, A)`. Fixes facebookincubator#2877 Pull Request resolved: facebookincubator#3068 Reviewed By: xiaoxmeng Differential Revision: D40939317 Pulled By: mbasmanova fbshipit-source-id: adf55a8d1a6c067c4d6ecf4b4d510bede7955f1a

mbasmanova · 2022-11-02T21:47:32Z

@XuPingyong I added support for semi join project in #3068. Would you try it out and let us know how it goes?

storm-dance · 2022-11-04T03:42:41Z

Thanks very much. LGTM.
But the error "Unsupported Filter over SemiJoin" still exists in prestodb/presto native-engine.

mbasmanova · 2022-11-04T11:29:21Z

@XuPingyong That's right. I didn't update the plan translation in Prestissimo yet to use the kSemiJoinProject join type. Let me send out a PR with these changes.

mbasmanova · 2022-11-10T22:54:58Z

@storm-dance Here is a PR to update plan translation in Prestissimo to use kLeftSemiProject join:

prestodb/presto#18616

storm-dance · 2022-11-11T02:14:31Z

@storm-dance Here is a PR to update plan translation in Prestissimo to use kLeftSemiProject join:

prestodb/presto#18616

Thanks. my problem is sloved and I appricate it.

mbasmanova · 2022-11-11T02:22:26Z

@storm-dance Thank you for confirming that these changes are sufficient to fix the original issue.

mbasmanova added the enhancement New feature or request label Oct 19, 2022

aditi-pandit mentioned this issue Oct 19, 2022

Unknown plan node type: com.facebook.presto.sql.planner.plan.WindowNode #2820

Closed

mbasmanova changed the title ~~Semi-join with filter error~~ Unsupported Filter over SemiJoin Oct 19, 2022

mbasmanova self-assigned this Oct 19, 2022

mbasmanova added the operators label Oct 19, 2022

mbasmanova mentioned this issue Nov 2, 2022

Add kLeftSemiProject and kRightSemiProject join types #3068

Closed

facebook-github-bot closed this as completed in d8ccca4 Nov 3, 2022

mbasmanova reopened this Nov 3, 2022

mbasmanova closed this as completed Nov 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unsupported Filter over SemiJoin #2877

Unsupported Filter over SemiJoin #2877

aditi-pandit commented Oct 19, 2022 •

edited by mbasmanova

Loading

mbasmanova commented Oct 19, 2022

rui-mo commented Oct 19, 2022

mbasmanova commented Oct 19, 2022

mbasmanova commented Nov 2, 2022

storm-dance commented Nov 4, 2022

mbasmanova commented Nov 4, 2022

mbasmanova commented Nov 10, 2022 •

edited

Loading

storm-dance commented Nov 11, 2022

mbasmanova commented Nov 11, 2022

Unsupported Filter over SemiJoin #2877

Unsupported Filter over SemiJoin #2877

Comments

aditi-pandit commented Oct 19, 2022 • edited by mbasmanova Loading

mbasmanova commented Oct 19, 2022

rui-mo commented Oct 19, 2022

mbasmanova commented Oct 19, 2022

mbasmanova commented Nov 2, 2022

storm-dance commented Nov 4, 2022

mbasmanova commented Nov 4, 2022

mbasmanova commented Nov 10, 2022 • edited Loading

storm-dance commented Nov 11, 2022

mbasmanova commented Nov 11, 2022

aditi-pandit commented Oct 19, 2022 •

edited by mbasmanova

Loading

mbasmanova commented Nov 10, 2022 •

edited

Loading