Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for correlated subqueries #4017

Closed
kokosing opened this issue Nov 23, 2015 · 19 comments
Closed

Support for correlated subqueries #4017

kokosing opened this issue Nov 23, 2015 · 19 comments
Assignees

Comments

@kokosing
Copy link
Contributor

It relates to subqueries used in IN, EXISTS predicates and scalar subqueries.

@kokosing
Copy link
Contributor Author

Relates to #2878

@matthewwardrop
Copy link

Greetings all! I note that the referenced issue was closed for 0.132... just wondering if any progress has been made on supporting correlated subqueries?

@kbajda
Copy link

kbajda commented Feb 5, 2016

@matthewwardrop : yes, @kokosing is actively working on this. Expect PRs soon!

@matthewwardrop
Copy link

Excellent news! Thanks kokosing :). Looking forward to it being deployed :).

@kokosing
Copy link
Contributor Author

@martint FYI

Once the initial support correlated subqueries (see https://github.com/prestodb/presto/wiki/Correlated-subqueries) is done, our (roadmap) next goals are:

@kokosing
Copy link
Contributor Author

Also this is on our short term road map: #6638

@GrigorievNick
Copy link

GrigorievNick commented Jun 7, 2017

I have case like this

select 
(select username from cassandra.raw.players_by_brand as players where brand='brand_name' and players.bucket = buckets.bIndex)
from unnest(sequence(1, 5)) AS buckets (bIndex) limit 10;

It push me same exception.
Will this be fixed too?

@kokosing
Copy link
Contributor Author

kokosing commented Jun 7, 2017

Yes. Your case seems to be a great example to use LATERAL join which is also in flight. I would rewrite your query to something like:

select *
from unnest(sequence(1, 5)) AS buckets (bIndex), 
LATERAL (select username from cassandra.raw.players_by_brand as players where brand='brand_name' and players.bucket = buckets.bIndex)
 limit 10;

@kokosing kokosing self-assigned this Jun 7, 2017
@brandynabrams
Copy link

brandynabrams commented Aug 10, 2017

Hi All,

Out of curiosity, will a coorelated subquery join such as this work in the future as part of your ongoing support for coorelated subqueries?

from user_people up
left join order_applications oa on oa.id = (select id from order_applications where person_id = up.id order by id desc limit 1)

@kokosing
Copy link
Contributor Author

@brandynabrams Currently I do work on subqueries which are using LIMIT (and ORDER BY).

Generally, since all the subqueries in TPCH and TPCDS extending support for subqueries become less important and we switched to things that may affect broader user audience (like join reordering).

Anyway, still I feel somehow personally related and engaged with subqueries, push the support for them in my spare time. Once I finish #8435, I could start work on subqueries with LIMIT.

@brandynabrams
Copy link

Hey @kokosing,

Thanks for the reply! Yes, I agree tasks like join reordering (which would be awesome) are more important.

I'm very used to Postgres, and subqueries that use limit & order by was used very frequently by myself & my team.

Some of our micro services still use postgres, but our data warehouse (treasure data) uses hive & presto combo.. and this type of unsupported subquery makes its frustrating sometimes.. currently my work around is a window function w/ rank() & partition by or joining the table to itself with a max() function, but these type of subqueries that involve limit / order by seem to be way simpler and provide cleaner code.. (if you use other syntax that is more performant / better to use.. i'm happy to try it out)

But yah, long story short, if you are able to find time to add these types of subqueries.. that would be AWESOME! thanks man.

@kokosing
Copy link
Contributor Author

kokosing commented Jun 6, 2018

Todays support for correlated subqueries is decent in Presto. There are still some subquery patterns which are still not supported, but these should be tracked as separate issues.

@kokosing kokosing closed this as completed Jun 6, 2018
@aandis
Copy link
Contributor

aandis commented Oct 23, 2018

hey @kokosing which connectors support correlated subqueries in presto?

@kokosing
Copy link
Contributor Author

All. Because subqueries are supported by query planner and execution engine.

I kind of do not understand question. If you mean which support subquery pushdown then none.

@aandis
Copy link
Contributor

aandis commented Oct 24, 2018

@kokosing yeah. When I run queries like

select foo from table1 where bar in (select bar from table2 where qux = 'qux')

presto does a full table scan of table1. I am looking for ways to avoid that.

@kokosing
Copy link
Contributor Author

Can you please open separate issue for that and please attach output of the EXPLAIN (distributed) of the query

@findepi
Copy link
Contributor

findepi commented Oct 24, 2018

select foo from table1 where bar in (select bar from table2 where qux = 'qux')

presto does a full table scan of table1. I am looking for ways to avoid that.

@aandis Presto needs a full table scan because the condition bar = ... is not known upfront.

You have the options:

  1. you can split your query into 2 queries, i.e evaluate select bar from table2 where qux = 'qux' in your program and pass to second query as a parameter; this will allow any kind of optimizations for table1 filter
  2. this use-case will benefit from dynamic filtering Add support for dynamic filtering #7428, but still would not be as performant as the first option

(No need for new separate issue, since this is already covered by #7428 and #8680)

@aandis
Copy link
Contributor

aandis commented Oct 24, 2018

@findepi yeah I've tried option 1 in the past. It has it's own problems because the parameter list may exceed max query length/parameter length presto allows. So you have to split it in batches and run multiple queries which is additional non core application logic.
Additionally, more parameters means more planning time #10700 (comment) with a connector like cassandra.

It's nice to know #7428 will fix this. I'll keep an eye out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants