Implement physical plan for EXISTS subquery #123

Closed
alamb opened this issue Apr 26, 2021 · 4 comments
Labels
datafusion: Changes in the datafusion crate

Comments

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10819

The TPC-H queries use EXISTS, which tests whether a subquery returns any record. For example:

and exists (
    select
        *
    from
        lineitem
    where
        l_orderkey = o_orderkey
        and l_commitdate < l_receiptdate
)

alamb added the datafusion label Apr 26, 2021

alamb commented Apr 26, 2021

Comment from Andy Grove (andygrove) @ 2020-12-31T19:42:45.132+0000:

The example given here is a correlated subquery that can be translated into a join.

Here is a Stack Overflow discussion on this for reference (I have not reviewed it):

https://stackoverflow.com/questions/1772609/procedurally-transform-subquery-into-join
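
To make the subquery-to-join translation concrete, here is a minimal sketch. It runs in SQLite through Python's `sqlite3` module rather than in DataFusion, and the tiny `orders` and `lineitem` tables and their rows are made up for illustration. It only checks that the correlated EXISTS and a join against the deduplicated inner keys return the same orders. It does not show how DataFusion plans the query.

```python
import sqlite3

# Hypothetical miniature stand-ins for the TPC-H orders and lineitem tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INTEGER);
    CREATE TABLE lineitem (
        l_orderkey INTEGER, l_commitdate TEXT, l_receiptdate TEXT
    );
    INSERT INTO orders VALUES (1), (2), (3);
    INSERT INTO lineitem VALUES
        (1, '1995-01-01', '1995-01-05'),  -- received after commit date
        (1, '1995-02-01', '1995-02-09'),  -- same order, late again
        (2, '1995-03-10', '1995-03-01'),  -- received before commit date
        (3, '1995-04-01', '1995-04-02');  -- received after commit date
""")

# Correlated EXISTS, as in the TPC-H fragment above.
exists_sql = """
    SELECT o_orderkey FROM orders
    WHERE EXISTS (
        SELECT * FROM lineitem
        WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate
    )
    ORDER BY o_orderkey
"""

# Join form: deduplicate the inner keys first so that an order with several
# matching line items is still returned once (semi-join semantics).
join_sql = """
    SELECT o.o_orderkey FROM orders AS o
    JOIN (
        SELECT DISTINCT l_orderkey FROM lineitem
        WHERE l_commitdate < l_receiptdate
    ) AS late ON late.l_orderkey = o.o_orderkey
    ORDER BY o.o_orderkey
"""

exists_rows = conn.execute(exists_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(exists_rows, join_rows)
```

Without the `DISTINCT`, order 1 would appear twice in the join result, which is why a plain inner join is not a drop-in replacement for EXISTS.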

alamb commented Oct 21, 2022

In case anyone is curious -- we support correlated versions of these queries (via a join), but we do not support the uncorrelated form (which is not super useful):

❯ create table foo as select * from (values (1), (2), (NULL)) as sql
;
0 rows in set. Query took 0.022 seconds.
3 rows in set. Query took 0.007 seconds.
❯ create table bar as select * from (values (1), (2), (NULL)) as sql;
0 rows in set. Query took 0.000 seconds.
❯ select * from foo where exists (select column1 from bar);
NotImplemented("Physical plan does not support logical expression EXISTS (<subquery>)")
❯ select * from foo where exists (select column1 from bar where foo.column1 = bar.column1);
+---------+
| column1 |
+---------+
| 2       |
| 1       |
+---------+
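
For reference, the semantics these two queries are expected to have can be checked in another engine. The sketch below uses SQLite through Python's `sqlite3` module (an assumption made for runnability, not DataFusion) with the same `foo` and `bar` contents. An uncorrelated EXISTS keeps every outer row when the subquery is non-empty and none when it is empty. The correlated form drops the NULL row because `NULL = NULL` is not true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (column1 INTEGER);
    CREATE TABLE bar (column1 INTEGER);
    INSERT INTO foo VALUES (1), (2), (NULL);
    INSERT INTO bar VALUES (1), (2), (NULL);
""")

UNCORRELATED = "SELECT column1 FROM foo WHERE EXISTS (SELECT column1 FROM bar)"
CORRELATED = (
    "SELECT column1 FROM foo WHERE EXISTS "
    "(SELECT column1 FROM bar WHERE foo.column1 = bar.column1)"
)

# bar is non-empty, so the uncorrelated EXISTS is true for every row of foo.
uncorrelated = conn.execute(UNCORRELATED).fetchall()

# Only rows of foo with an equal, non-NULL partner in bar survive.
correlated = conn.execute(CORRELATED).fetchall()

# With bar emptied, the uncorrelated EXISTS is false for every row of foo.
conn.execute("DELETE FROM bar")
uncorrelated_empty = conn.execute(UNCORRELATED).fetchall()

print(len(uncorrelated), sorted(correlated), uncorrelated_empty)
```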

logan-keede commented Feb 9, 2025

Quoting alamb's comment above: In case anyone is curious -- we support correlated versions of these queries (via a join), but we do not support the uncorrelated form (which is not super useful):

❯ create table foo as select * from (values (1), (2), (NULL)) as sql
;
0 rows in set. Query took 0.022 seconds.
3 rows in set. Query took 0.007 seconds.
❯ create table bar as select * from (values (1), (2), (NULL)) as sql;
0 rows in set. Query took 0.000 seconds.
❯ select * from foo where exists (select column1 from bar);
NotImplemented("Physical plan does not support logical expression EXISTS (<subquery>)")
❯ select * from foo where exists (select column1 from bar where foo.column1 = bar.column1);
+---------+
| column1 |
+---------+
| 2       |
| 1       |
+---------+

> explain select * from foo where exists (select column1 from bar);
+---------------+-----------------------------------------------------+
| plan_type     | plan                                                |
+---------------+-----------------------------------------------------+
| logical_plan  | LeftSemi Join:                                      |
|               |   TableScan: foo projection=[column1]               |
|               |   SubqueryAlias: __correlated_sq_1                  |
|               |     TableScan: bar projection=[]                    |
| physical_plan | NestedLoopJoinExec: join_type=RightSemi             |
|               |   DataSourceExec: partitions=1, partition_sizes=[1] |
|               |   DataSourceExec: partitions=1, partition_sizes=[1] |
|               |                                                     |
+---------------+-----------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.007 seconds.

> select * from foo where exists (select column1 from bar);
+---------+
| column1 |
+---------+
| 1       |
| 2       |
| NULL    |
+---------+
3 row(s) fetched. 
Elapsed 0.006 seconds.

I get this on DataFusion 45.0.0. Perhaps this issue can be closed, or at least updated if there is still a case where the physical plan does not support EXISTS.
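
As a rough model of why the plan above returns all three rows, here is a sketch of a left semi join evaluated with nested loops. It is a toy written for this thread, not DataFusion's `NestedLoopJoinExec`, and the function and predicate names are hypothetical. A semi join with no join condition behaves like a predicate that is always true, so every left row is kept as long as the right side has at least one row.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Row = Tuple[Optional[int], ...]


def left_semi_nested_loop(
    left: Sequence[Row],
    right: Sequence[Row],
    predicate: Callable[[Row, Row], bool],
) -> List[Row]:
    """Keep each left row that has at least one right row satisfying predicate."""
    return [l for l in left if any(predicate(l, r) for r in right)]


def always_true(l: Row, r: Row) -> bool:
    # Models a join with no condition (the uncorrelated EXISTS case).
    return True


def keys_equal(l: Row, r: Row) -> bool:
    # Models SQL equality on the first column: NULL never equals anything.
    return l[0] is not None and r[0] is not None and l[0] == r[0]


foo: List[Row] = [(1,), (2,), (None,)]
bar: List[Row] = [(1,), (2,), (None,)]

uncorrelated = left_semi_nested_loop(foo, bar, always_true)
uncorrelated_empty = left_semi_nested_loop(foo, [], always_true)
correlated = left_semi_nested_loop(foo, bar, keys_equal)
print(uncorrelated, uncorrelated_empty, correlated)
```

The sketch follows the LeftSemi shape of the logical plan. The physical plan prints `join_type=RightSemi`, and that difference is not modelled here.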

alamb commented Feb 10, 2025

Looks good -- thanks for checking @logan-keede -- we can open a new issue if we find another hole

@alamb alamb closed this as completed Feb 10, 2025