Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JOIN support improvement #49

Open
dai-chen opened this issue May 13, 2021 · 5 comments
Open

JOIN support improvement #49

dai-chen opened this issue May 13, 2021 · 5 comments
Labels
enhancement New feature or request PPL Piped processing language SQL

Comments

@dai-chen
Copy link
Collaborator

dai-chen commented May 13, 2021

Is your feature request related to a problem? Please describe.

There is limited support for JOIN queries which needs improvement. In particular, post processing such as filtering and aggregation after inner/outer join is most wanted. This capability is required by PPL lookup command.

Here are the feature requests from community:

  1. Feature request: support select columns from right index in left join query opendistro-for-elasticsearch/sql#505: Left join
  2. Does not support INNER JOIN with 3 indexes opendistro-for-elasticsearch/sql#351: Multi-join
  3. Feature request: support group by field, and filter with left join index's attribute opendistro-for-elasticsearch/sql#301: Aggregate after join
  4. JOIN query with functions in SELECT field is not supported opendistro-for-elasticsearch/sql#289: Select function after join
  5. JOINs with aggregations in SELECT clause will return duplicate columns for each joined table opendistro-for-elasticsearch/sql#236: Aggregate after join
  6. Treat multiple tables separated by comma as COMMA JOIN (CROSS JOIN)  opendistro-for-elasticsearch/sql#221: Comma join
  7. LEFT JOIN with WHERE doesn't filter the result as expected opendistro-for-elasticsearch/sql#124: Left join with WHERE
  8. Aggregation support for JOIN on multiple indices opendistro-for-elasticsearch/sql#110: Aggregate after join
  9. Multiple indexes opendistro-for-elasticsearch/sql#987: Multi-join

Describe the solution you'd like
N/A

Describe alternatives you've considered
N/A

Additional context
Add any other context or screenshots about the feature request here.

@dai-chen dai-chen added enhancement New feature or request SQL PPL Piped processing language Priority-High labels May 13, 2021
@chloe-zh
Copy link
Contributor

JOIN with nested loops join algorithm in new engine
PoC: https://github.com/chloe-zh/opensearch-sql/tree/poc/join

@dai-chen
Copy link
Collaborator Author

Inquiry: #857

@dai-chen
Copy link
Collaborator Author

dai-chen commented Oct 6, 2022

Requirements in observability: #892

@dai-chen
Copy link
Collaborator Author

dai-chen commented Dec 9, 2022

Another option we can explore is to reuse legacy JOIN implementation in v2 engine directly by:

  1. ExprValueRow wraps an ExprValue into legacy Row interface.
  2. LegacyToV2Adapter adapts legacy join operator to behave like a V2 operator.
  3. V2ToLegacyAdapter adapts V2 operator for integrating with legacy join operator.

Untitled Diagram

  • Pros:
    • Minimum efforts to get JOIN work in v2 engine to a) support lookup in PPL; b) fix most of the issue above.
    • Legacy code was basically functional tested and performance tested.
  • Cons:
    • May have unknown issue with certain data type or functions.
    • Still single node implementation

The main blocker to make this happen is v2 engine doesn't support symbol resolution very well. For example, SELECT t1.name FROM index t1 ..., alias t1 will be removed after analysis. There is no alias in DSL query and value environment. Finally the value is named t1.name by project operator. This will cause problem in analysis and execution in JOIN case which has 2 or more table aliased (See the column names and null value in result below). Anyway, this is something to improve in v2 engine itself instead of problem in this quick workaround.

Test branch: https://github.com/dai-chen/sql-1/tree/test-legacy-join-port

PUT account-test/_doc/1
{ 
  "name": "John",
  "balance": 12000
}

PUT account-test/_doc/2
{ 
  "name": "John",
  "balance": 10000
}

PUT account-test/_doc/3
{ 
  "name": "Allen",
  "balance": 25000
}

PUT customer-test/_doc/1
{ 
  "name": "John",
  "age": 25,
  "city": "Seattle"
}

PUT customer-test/_doc/2
{ 
  "name": "Allen",
  "age": 30,
  "city": "Portland"
}
POST _plugins/_sql
{
  "query": """
    SELECT t1.name, t1.balance, t2.age, t2.city
    FROM account-test t1
    JOIN customer-test t2
    ON t1.name = t2.name
  """
}
{
  "schema": [
    {
      "name": "name",
      "type": "text"
    },
    {
      "name": "balance",
      "type": "long"
    },
    {
      "name": "age",
      "type": "long"
    },
    {
      "name": "city",
      "type": "text"
    }
  ],
  "datarows": [
    [
      "John",
      10000,
      25,
      "Seattle"
    ],
    [
      "John",
      12000,
      25,
      "Seattle"
    ],
    [
      "Allen",
      25000,
      30,
      "Portland"
    ]
  ],
  "total": 3,
  "size": 3,
  "status": 200
}

POST _plugins/_sql
{
  "query": """
    SELECT
     t2.city,
     AVG(t1.balance) AS avgBal
    FROM account-test t1
    JOIN customer-test t2
    ON t1.name = t2.name
    GROUP BY t2.city
    ORDER BY avgBal
  """
}
{
  "schema": [
    {
      "name": "city",
      "type": "text"
    },
    {
      "name": "AVG(t1.balance)",
      "alias": "avgBal",
      "type": "double"
    }
  ],
  "datarows": [
    [
      null,
      11000.0
    ],
    [
      null,
      25000.0
    ]
  ],
  "total": 2,
  "size": 2,
  "status": 200
}

@dai-chen
Copy link
Collaborator Author

#2873

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request PPL Piped processing language SQL
Projects
None yet
Development

No branches or pull requests

3 participants