Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(streaming): plan asof join #18683

Merged
merged 11 commits into from
Sep 30, 2024
Merged

feat(streaming): plan asof join #18683

merged 11 commits into from
Sep 30, 2024

Conversation

yuhao-su
Copy link
Contributor

@yuhao-su yuhao-su commented Sep 24, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

#17765

implement parser to parse new join types AsOfInner and AsofLeftOuter.
implement StreamAsOfJoin plan node, which should be created from LogicalJoin::to_stream_asof_join.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

ASOF JOIN

Having a set of event data, to find and join the nearest record in a reference table by the event time or any ordered properties is call ASOF JOIN.

Inner join

CREATE MATERIALIZED VIEW mv AS SELECT t1.v1 t1_v1 FROM t1 ASOF JOIN t2 ON t1.v1 = t2.v1 and t1.v2 <= t2.v2;

Outer join

CREATE MATERIALIZED VIEW mv AS SELECT t1.v1 t1_v1 FROM t1 ASOF LEFT JOIN t2 ON t1.v1 = t2.v1 and t1.v2 <= t2.v2;

Constraint

The join condition must contains at least 1 equal condition (e.g. t1.v1 = t2.v1) and exactly 1 inequality condition (>=, >, <=, <). The inequality condition applies for all data types that support inequality comparison while a time related type is more commonly used.

Note: only streaming AsOf join is supported by now. The batch asof join is WIP.

Example Scenario:

We have two tables:

  • stock_prices: Contains stock price data at certain timestamps.
  • market_data: Contains market sentiment data at different timestamps.

We want to join the stock prices with the nearest preceding market sentiment for each stock price based on time.

Tables:

stock_prices

stock_name stock_time price
TSLA 2024-09-24 09:30:00 250
TSLA 2024-09-24 10:30:00 252
TSLA 2024-09-24 11:30:00 255
AMZN 2024-09-24 09:30:00 3300
AMZN 2024-09-24 10:30:00 3310
AMZN 2024-09-24 11:30:00 3320
GOOG 2024-09-24 09:30:00 1400
GOOG 2024-09-24 10:30:00 1410
GOOG 2024-09-24 11:30:00 1420

market_data

stock_name market_time sentiment
TSLA 2024-09-24 09:00:00 0.7
TSLA 2024-09-24 10:00:00 0.8
TSLA 2024-09-24 11:00:00 0.9
AMZN 2024-09-24 09:00:00 0.6
AMZN 2024-09-24 10:00:00 0.65
AMZN 2024-09-24 11:00:00 0.7
NVDA 2024-09-24 09:00:00 0.55
NVDA 2024-09-24 10:00:00 0.6
NVDA 2024-09-24 11:00:00 0.65

We can use a ASOF JOIN to find the latest matching record in market_data where the market_time is less than or equal to the stock_time:

SELECT sp.stock_name, sp.stock_time, sp.price, md.sentiment
FROM stock_prices sp
ASOF JOIN market_data md 
ON sp.stock_name = md.stock_name 
AND md.market_time <= sp.stock_time;
stock_name stock_time price sentiment
TSLA 2024-09-24 09:30:00 250 0.7
TSLA 2024-09-24 10:30:00 252 0.8
TSLA 2024-09-24 11:30:00 255 0.9
AMZN 2024-09-24 09:30:00 3300 0.6
AMZN 2024-09-24 10:30:00 3310 0.65
AMZN 2024-09-24 11:30:00 3320 0.7

We can use a ASOF LEFT JOIN to output records in the left table that have no matches in the right table.

SELECT sp.stock_name, sp.stock_time, sp.price, md.sentiment
FROM stock_prices sp
ASOF LEFT JOIN market_data md 
ON sp.stock_name = md.stock_name 
AND md.market_time <= sp.stock_time;

Result:

stock_name stock_time price sentiment
TSLA 2024-09-24 09:30:00 250 0.7
TSLA 2024-09-24 10:30:00 252 0.8
TSLA 2024-09-24 11:30:00 255 0.9
AMZN 2024-09-24 09:30:00 3300 0.6
AMZN 2024-09-24 10:30:00 3310 0.65
AMZN 2024-09-24 11:30:00 3320 0.7
GOOG 2024-09-24 09:30:00 1400 NULL
GOOG 2024-09-24 10:30:00 1410 NULL
GOOG 2024-09-24 11:30:00 1420 NULL

explanation:

TSLA and AMZN have matching records in market_data, so they show the closest preceding sentiment.
GOOG has no corresponding data in market_data, so the sentiment column is NULL.

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@yuhao-su yuhao-su marked this pull request as ready for review September 25, 2024 06:51
@graphite-app graphite-app bot requested a review from a team September 25, 2024 07:09
@kwannoel
Copy link
Contributor

Can you add a PR description about semantics of ASOF JOIN, key implementation details etc...

@stdrc
Copy link
Member

stdrc commented Sep 25, 2024

Please add some description, thx.

@yuhao-su yuhao-su added the user-facing-changes Contains changes that are visible to users label Sep 25, 2024
create table t2 (v1 int, v2 int, v3 int primary key);

statement ok
create materialized view mv1 as SELECT t1.v1 t1_v1, t1.v2 t1_v2, t1.v3 t1_v3, t2.v1 t2_v1, t2.v2 t2_v2, t2.v3 t2_v3 FROM t1 ASOF LEFT JOIN t2 ON t1.v1 = t2.v1 and t1.v2 > t2.v2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to use "LEFT ASOF JOIN" instead of ASOF LEFT JOIN, because clickhouse uses this syntax.

https://clickhouse.com/docs/en/sql-reference/statements/select/join#:~:text=ASOF%20JOIN%20and-,LEFT%20ASOF%20JOIN,-%2C%20joining%20sequences%20with

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But duckdb use ASOF LEFT JOIN https://duckdb.org/docs/guides/sql_features/asof_join.html. @st1page any insight on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, then I have no preference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both can be easily supported, so no need to block merging on this 😆 . We can always support the other syntax later.

@chenzl25
Copy link
Contributor

Can we enhance the example for the release note? The current example contains only one id which is not enough to illustrate the ASOF join behavior. BTW, for inner and outer join, I think we can use the same example to illustrate the difference as how duckdb document does.

Copy link
Contributor

@kwannoel kwannoel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kwannoel
Copy link
Contributor

Could we also provide the output of the query in the release note? It will be referenced by users as an example.

SELECT sp.stock_name, sp.stock_time, sp.price, md.sentiment
FROM stock_prices sp
ASOF LEFT JOIN market_data md 
ON sp.stock_name = md.stock_name 
AND md.market_time <= sp.stock_time;

Copy link
Contributor

@st1page st1page left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM. Would you please add more test for those unsupported cases?
I found it will panic when their is only one inequal condition but no equal condition. We expect a error here.

Panicked when handling the request: assertion failed: predicate.has_eq()

@yuhao-su yuhao-su requested a review from stdrc September 30, 2024 09:47
Copy link
Member

@stdrc stdrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

use crate::utils::ColIndexMappingRewriteExt;
use crate::PlanRef;

pub struct StreamJoinCommon;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh seems a module is enough.

@yuhao-su yuhao-su added this pull request to the merge queue Sep 30, 2024
Merged via the queue into main with commit e82932f Sep 30, 2024
30 of 32 checks passed
@yuhao-su yuhao-su deleted the yuhao/plan-asof-join branch September 30, 2024 10:35
@st1page st1page mentioned this pull request Oct 8, 2024
4 tasks
Comment on lines +108 to +115
if left_input_ref.index() < left_input_len && right_input_ref.index() >= left_input_len
{
Ok(AsOfJoinDesc {
left_idx: left_input_ref.index() as u32,
right_idx: (right_input_ref.index() - left_input_len) as u32,
inequality_type: Self::expr_type_to_comparison_type(expr_type)?.into(),
})
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the expression must be like left_col [<|<=|>|>=] right_col rather than right_col [<|<=|>|>=] left_col? Here left/right_col means a column from left/right side of the join.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both should be valid expression

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/feature user-facing-changes Contains changes that are visible to users
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants