Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Support JOIN command in PPL #2913

Open
9 tasks
LantaoJin opened this issue Aug 8, 2024 · 8 comments
Open
9 tasks

[RFC] Support JOIN command in PPL #2913

LantaoJin opened this issue Aug 8, 2024 · 8 comments
Assignees
Labels
enhancement New feature or request PPL Piped processing language

Comments

@LantaoJin
Copy link
Member

LantaoJin commented Aug 8, 2024

Is your feature request related to a problem?
Current JOIN syntax is only available in SQL query due to the implementation of JOIN is in v1 engine only. To support JOIN command in PPL, we should implement it in v2 engine. One proposal is reusing legacy JOIN implementation in v2 engine, but the legacy JOIN implementation lacks scalability and hardly to enhance, #49 listed many requirements of enhancement.

What solution would you like?
To mitigate the impacts of customer SQL experience, an ideal process to support JOIN will be as following:

PPL Join Syntax Options: (Updated at Sep 14)

#### Option 1: Join clause as a source extension existing in Search command and should be at the beginning of PPL.

SEARCH source=<left-table> AS <leftAlias> [joinType] JOIN <right-table> AS <rightAlias> ON joinCriteria 
| <other piped command>
  • Join clause is not a new PPL command, it's a source extension.
  • Since Join clause is a part of source (search command), it must fill at the beginning of ppl command.
  • Only tableSource could be used in Join sides.
  • The table aliases can be used directly.
  • Join reorder optimization could be applied in future
  • This idea is from this paper

Here is an example:
Original SQL (TPC-H Q13 with table alias):

SELECT c_count, COUNT(*) AS custdist
FROM
  ( SELECT c.c_custkey, COUNT(o.o_orderkey) c_count
    FROM customer AS c
    LEFT OUTER JOIN orders AS o ON c.c_custkey = o.o_custkey
         AND o.o_comment NOT LIKE '%unusual%packages%'
    GROUP BY c.c_custkey
  ) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;

The rewriting PPL of Option 1:

SEARCH source=customer AS c LEFT OUTER JOIN orders AS o ON c.c_custkey = o.o_custkey AND o.o_comment NOT LIKE '%unusual%packages%'
| STATS count(o.o_orderkey) AS c_count BY c.c_custkey
| STATS count(*) AS custdist BY c_count
| SORT - custdist, - c_count

(Winner) Option 2: Join clause as a PPL command could be existing anywhere.

In Option 2, Join sides could be tableSource:

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN hint.left = <leftAlias>, hint.right = <rightAlias> ON joinCriteria <right-table>
| <other piped command>

Also, in Option 2, Join sides can be sub-search (sub-search itself unsupported yet)

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN hint.left = <leftAlias>, hint.right = <rightAlias> ON joinCriteria
  [ 
    SEARCH source=<right-table>
    | <other piped command>
  ]
| <other piped command>
  • Join clause is a new PPL command
  • Join command should start with | and exist anywhere.
  • Not only support tableSource but also support sub-search in Join sides.
  • The table aliases cannot be used directly. Aliases must be assigned in specific syntax (here we reuse hint syntax).
  • Hardly to apply Join reorder optimization if more than two sub-searches.
  • The syntax will be looks similar to SPL2 Join. Note: not SPL1 Join

The rewriting PPL of Option 2 for previous example:

SEARCH source=customer
| FIELDS c_custkey
| LEFT OUTER JOIN hint.left = c, hint.right = o ON c.c_custkey = o.o_custkey
   [
      SEARCH source=orders
      | WHERE o_comment NOT LIKE '%unusual%packages%'
      | FIELDS o_orderkey, o_custkey
   ]
| STATS count(o.o_orderkey) AS c_count BY c.c_custkey
| STATS count(*) AS custdist BY c_count
| SORT - custdist, - c_count

Syntax Milestones:

1.1 Support two tables join

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias> ON joinCriteria <relation>
| <other piped command>

joinType: INNER | LEFT [OUTER] | [LEFT] SEMI | [LEFT] ANTI
joinCriteria: <left-table>.<field>=<right-table>.<field> [AND <left-table>.<field>=<right-table>.<field>]
or <leftAlias>.<field>=<rightAlias>.<field> [AND <leftAlias>.<field>=<rightAlias>.<field>]

<relation> could be sub-search, but sub-search is unsupported yet.

1.2 Support multiple tables join

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias> ON joinCriteria1 <relation1>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias> ON joinCriteria2 <relation2>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias> ON joinCriteria3 <relation3>
| <other piped command>

<relationX> could be sub-search, but sub-search is unsupported yet.

1.5 Support join hint

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias>
   [hint.left.key1=value1[ hint.right.key2=value2]]
   ON joinCriteria <relation>
| <other piped command>

<relation> could be sub-search, but sub-search is unsupported yet.

1.6 Support more join types

joinType: INNER | LEFT [OUTER] | [LEFT] SEMI | [LEFT] ANTI | FULL [OUTER] | CROSS | ...

What alternatives have you considered?
No

Do you have any additional context?
Add any other context or screenshots about the feature request here.

@LantaoJin LantaoJin added enhancement New feature or request untriaged labels Aug 8, 2024
@LantaoJin LantaoJin changed the title [RFC]Support JOIN command in PPL [RFC] Support JOIN command in PPL Aug 8, 2024
@LantaoJin LantaoJin added the PPL Piped processing language label Aug 8, 2024
@dblock dblock removed the untriaged label Aug 26, 2024
@dblock
Copy link
Member

dblock commented Aug 26, 2024

[Catch All Triage - 1, 2, 3, 4, 5]

@penghuo
Copy link
Collaborator

penghuo commented Sep 6, 2024

  • +1 on 1 Re-implement the JOIN operator in SQL v2 engine and provide a the JOIN command in PPL only.
  • syntax looks good!
  • Do we consider different JOIN strategy? for instance, P0: support coordination-node join. P1: support broadcast join.
  • Can we colleberate with [RFC] Join support in OpenSearch OpenSearch#15185 to support broadcast join?

@LantaoJin
Copy link
Member Author

  • Do we consider different JOIN strategy? for instance, P0: support coordination-node join. P1: support broadcast join.
  • Can we colleberate with [RFC] Join support in OpenSearch OpenSearch to support broadcast join?

Thanks for bring the RFC of Core on table. All items are coordination-node joins. I will update this RFC once the opensearch-project/OpenSearch#15185 could provide (or mention in milestones) an API to leverage here to support BCJ.

@YANG-DB
Copy link
Member

YANG-DB commented Sep 9, 2024

@LantaoJin this looks great !
Are you also thinking about the planning step of the join order (in more than 2 tables) ?
If this is the case the query planner should employ a statistical based cost estimator, can you please add this into the design ?

@LantaoJin
Copy link
Member Author

I added two options of Join PPL syntax. Would you provide any comment based on the new updates? @penghuo @YANG-DB @dai-chen

@LantaoJin
Copy link
Member Author

@LantaoJin this looks great ! Are you also thinking about the planning step of the join order (in more than 2 tables) ? If this is the case the query planner should employ a statistical based cost estimator, can you please add this into the design ?

Join reorder maybe depend on which syntax option do we choose as well. See my latest updates (two options).

@YANG-DB
Copy link
Member

YANG-DB commented Sep 10, 2024

@LantaoJin thanks for the excellent review and suggestions - IMO option 2 if more corresponding to the pipeline style of syntax and also does look familiar from other pipe based languages...
My vote if with option #2
P.S - reordering the join is a separate aspect of the join syntax style and should allow query rewrite (performance driven) in any syntax option we select...

@brijos
Copy link

brijos commented Sep 11, 2024

We support INNER, OUTER LEFT, and CROSS joins in SQL so that should be our baseline in PPL as well. Option 2 it sounds like it will give us more flexibility to grow as well in the future which will be better for the community.

@YANG-DB YANG-DB moved this to In Progress in PPL Commands Sep 13, 2024
@YANG-DB YANG-DB moved this from In Progress to InReview in PPL Commands Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request PPL Piped processing language
Projects
Status: InReview
Development

No branches or pull requests

5 participants