JOIN support improvement #49

dai-chen · 2021-05-13T19:41:27Z

Is your feature request related to a problem? Please describe.

There is limited support for JOIN queries which needs improvement. In particular, post processing such as filtering and aggregation after inner/outer join is most wanted. This capability is required by PPL lookup command.

Here are the feature requests from community:

Feature request: support select columns from right index in left join query opendistro-for-elasticsearch/sql#505: Left join
Does not support INNER JOIN with 3 indexes opendistro-for-elasticsearch/sql#351: Multi-join
Feature request: support group by field, and filter with left join index's attribute opendistro-for-elasticsearch/sql#301: Aggregate after join
JOIN query with functions in SELECT field is not supported opendistro-for-elasticsearch/sql#289: Select function after join
JOINs with aggregations in SELECT clause will return duplicate columns for each joined table opendistro-for-elasticsearch/sql#236: Aggregate after join
Treat multiple tables separated by comma as COMMA JOIN (CROSS JOIN) opendistro-for-elasticsearch/sql#221: Comma join
LEFT JOIN with WHERE doesn't filter the result as expected opendistro-for-elasticsearch/sql#124: Left join with WHERE
Aggregation support for JOIN on multiple indices opendistro-for-elasticsearch/sql#110: Aggregate after join
Multiple indexes opendistro-for-elasticsearch/sql#987: Multi-join

Describe the solution you'd like
N/A

Describe alternatives you've considered
N/A

Additional context
Add any other context or screenshots about the feature request here.

chloe-zh · 2021-07-26T08:01:35Z

JOIN with nested loops join algorithm in new engine
PoC: https://github.com/chloe-zh/opensearch-sql/tree/poc/join

dai-chen · 2022-09-27T15:41:52Z

Inquiry: #857

dai-chen · 2022-10-06T16:37:53Z

Requirements in observability: #892

dai-chen · 2022-12-09T00:02:34Z

Another option we can explore is to reuse legacy JOIN implementation in v2 engine directly by:

ExprValueRow wraps an ExprValue into legacy Row interface.
LegacyToV2Adapter adapts legacy join operator to behave like a V2 operator.
V2ToLegacyAdapter adapts V2 operator for integrating with legacy join operator.

Pros:
- Minimum efforts to get JOIN work in v2 engine to a) support lookup in PPL; b) fix most of the issue above.
- Legacy code was basically functional tested and performance tested.
Cons:
- May have unknown issue with certain data type or functions.
- Still single node implementation

The main blocker to make this happen is v2 engine doesn't support symbol resolution very well. For example, SELECT t1.name FROM index t1 ..., alias t1 will be removed after analysis. There is no alias in DSL query and value environment. Finally the value is named t1.name by project operator. This will cause problem in analysis and execution in JOIN case which has 2 or more table aliased (See the column names and null value in result below). Anyway, this is something to improve in v2 engine itself instead of problem in this quick workaround.

Test branch: https://github.com/dai-chen/sql-1/tree/test-legacy-join-port

PUT account-test/_doc/1
{ 
  "name": "John",
  "balance": 12000
}

PUT account-test/_doc/2
{ 
  "name": "John",
  "balance": 10000
}

PUT account-test/_doc/3
{ 
  "name": "Allen",
  "balance": 25000
}

PUT customer-test/_doc/1
{ 
  "name": "John",
  "age": 25,
  "city": "Seattle"
}

PUT customer-test/_doc/2
{ 
  "name": "Allen",
  "age": 30,
  "city": "Portland"
}

POST _plugins/_sql
{
  "query": """
    SELECT t1.name, t1.balance, t2.age, t2.city
    FROM account-test t1
    JOIN customer-test t2
    ON t1.name = t2.name
  """
}
{
  "schema": [
    {
      "name": "name",
      "type": "text"
    },
    {
      "name": "balance",
      "type": "long"
    },
    {
      "name": "age",
      "type": "long"
    },
    {
      "name": "city",
      "type": "text"
    }
  ],
  "datarows": [
    [
      "John",
      10000,
      25,
      "Seattle"
    ],
    [
      "John",
      12000,
      25,
      "Seattle"
    ],
    [
      "Allen",
      25000,
      30,
      "Portland"
    ]
  ],
  "total": 3,
  "size": 3,
  "status": 200
}

POST _plugins/_sql
{
  "query": """
    SELECT
     t2.city,
     AVG(t1.balance) AS avgBal
    FROM account-test t1
    JOIN customer-test t2
    ON t1.name = t2.name
    GROUP BY t2.city
    ORDER BY avgBal
  """
}
{
  "schema": [
    {
      "name": "city",
      "type": "text"
    },
    {
      "name": "AVG(t1.balance)",
      "alias": "avgBal",
      "type": "double"
    }
  ],
  "datarows": [
    [
      null,
      11000.0
    ],
    [
      null,
      25000.0
    ]
  ],
  "total": 2,
  "size": 2,
  "status": 200
}

dai-chen · 2024-07-30T21:14:55Z

#2873

dai-chen added enhancement New feature or request SQL PPL Piped processing language Priority-High labels May 13, 2021

MaxKsyunz mentioned this issue Aug 30, 2022

[BUG] Runtime exception thrown if duplicate name in SELECT items #785

Open

anirudha removed the Priority-High label May 30, 2023

dai-chen mentioned this issue Jul 24, 2024

[FEATURE] Enhance materialized views to support streaming JOIN refresh opensearch-project/opensearch-spark#474

Open

LantaoJin mentioned this issue Aug 8, 2024

[RFC] Support join command in PPL #2913

Open

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JOIN support improvement #49

JOIN support improvement #49

dai-chen commented May 13, 2021 •

edited

Loading

chloe-zh commented Jul 26, 2021

dai-chen commented Sep 27, 2022

dai-chen commented Oct 6, 2022

dai-chen commented Dec 9, 2022 •

edited

Loading

dai-chen commented Jul 30, 2024

JOIN support improvement #49

JOIN support improvement #49

Comments

dai-chen commented May 13, 2021 • edited Loading

chloe-zh commented Jul 26, 2021

dai-chen commented Sep 27, 2022

dai-chen commented Oct 6, 2022

dai-chen commented Dec 9, 2022 • edited Loading

dai-chen commented Jul 30, 2024

dai-chen commented May 13, 2021 •

edited

Loading

dai-chen commented Dec 9, 2022 •

edited

Loading