Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RFC for JDBC Join Push down #32

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Thanzeel-Hassan-IBM
Copy link

  • Add RFC for JDBC Join Push down

Internal review done in this PR : #31

* Add RFC for JDBC Join Push down

---------

Co-authored-by: Ajas M <Ajas.M@ibm.com>
Co-authored-by: Haritha K <HARITHA.K@ibm.com>
Co-authored-by: Glerin <Glerin.Pinhero@ibm.com>
Copy link
Contributor

@aaneja aaneja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed till high level design

Comment on lines 19 to 21
At present, when a query joins multiple tables, it creates a separate TableScanNode for each table. Each TableScanNode select all the records from that table. The join operation is then executed in-memory in Presto using a JOIN node by applying JoinCriteria, FilterPredicate and other criteria (like order by, limit, etc.).

However, if the query joins tables from the same JDBC datasource, it would be more efficient to let the datasource handle the join instead of creating a separate TableScanNode for each table and joining them in Presto. If we "Push down" or send these joins to remote JDBC datasource it increases the query performance. i.e., decreases the query execution time. We have seen improvements from 3x to 10x.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be more efficient

This is not always true. Pushing down an expanding Join may be worse for performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decreases the query execution time. We have seen improvements from 3x to 10x.

Can you reword this to be more specific and talk about the improvements in query latency, scanned data size, CPU/disk/network usage ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a table which shows the data i got from the presto UI. I dont think it shows Disk and network usage..
Can you take a look ? @aaneja

RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved
RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved
RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved

## Proposed Implementation

At present, if presto get a join query (from the CLI or UI) which is trying to join tables either from same datasource or from different datasource, it is received as a string formatted sql query. Presto validates the syntax and converts it to Query (Statement) object using presto parser and analyzer. This Query object is converted to presto internal reference architecture called Plan, using its logical and physical optimizers. Finally, this plan is executed by the executor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, instead of having the preamble here describing the basics of the parser and planner in this PR, and repeating them in multiple RFC's how about we create a dedicated section in https://prestodb.io/docs/current/ that we can link to ?

RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved
RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved
RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved
RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved
RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved
RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved
RFC-0009-jdbc-join-push-down.md Outdated Show resolved Hide resolved

**5. Enable presto Join pushdown capabilities by setting the session flag optimizer_inner_join_pushdown_enabled = true.**

## Low level Design
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rework this section or remove it. A lot of this is implementation detail. What we should be highlighting is

  1. How we would group tables by connector using the same mechanism in ReorderJoins. Describe how we would limit the grouping to only join sources that are either TableScan or Filter<-TableScan or Project<-Filter<-TableScan
  2. Describe the updates to the SPI to add a table handle or other data structure to represent a 'group' of individual table handles that are joined
  3. Describe how the connectors, and how the JDBC connector in particular can indicate that it wants to 'participate' in this creation of grouped-tables
  4. Describe how the JDBC connector will un-wrap these grouped-tables and build the SQL necessary to achieve the JOIN condition that it would use when querying the JDBC source
  5. Describe what the corner cases are - why we would need a way to add a table alias

Please create smaller sections for these as you see fit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, please add a section dedicated to SPI changes - you can add classes, interfaces and types that are being added or removed, and add code-comments on them to describe how they would be used

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed some things in that section.. Working on the rest

@Thanzeel-Hassan-IBM
Copy link
Author

Updating the RFC with the above review comments.. The main idea is to reduce the verbosity. This level of in depth details of the implementation is not required.

I will make the modifications and make it easier to follow.

@Thanzeel-Hassan-IBM Thanzeel-Hassan-IBM force-pushed the jdbc-join-pushdown branch 5 times, most recently from 50814df to a72a706 Compare November 13, 2024 10:41
Remove in depth implementation of GroupInnerJoinsByConnector Optimizer

Fix heading

Added explain analyze outputs

remove extra details

added section header

remove pics

Added table for performance difference
@prestodb-ci prestodb-ci added from:IBM PRs from IBM and removed from:IBM PRs from IBM labels Nov 13, 2024
@ethanyzhang ethanyzhang added the from:IBM PRs from IBM label Nov 13, 2024
@prestodb-ci
Copy link

Saved that user @Thanzeel-Hassan-IBM is from IBM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:IBM PRs from IBM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants