-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ASOF join support / Specialize Range Joins #318
Comments
Related issue solved by duckdb: duckdb/duckdb#3219 and duckdb/duckdb#3342 |
I haven’t evaluated the polar join_asof solution yet, but I believe it is also pertinent. |
There is a blog post about this from duckdbL https://duckdb.org/2022/05/27/iejoin.html |
We can treat asof join as a special kind of iejoin of course. Furthermore AsOf can stop searching when it finds the first match because there is at most one match.This leaves us room for optimization. I think it's ok to support asofjoin without iejoin (which can be done later). Here is another blog about AsOf join https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-lookups. I am willing to have a try if nobody is working on this. |
I don't know of anyone working on this. Maybe @Dandandan @liukun4515 or @ozankabak know of others who are (or are interested in helping) |
IMO the At Synnada, we handle time-series use cases like the ones |
BTW I think "Range Join" is pretty similar in spirit and does not require special SQL syntax https://www.vertica.com/blog/what-is-a-range-join-and-why-is-it-so-fastba-p223413/ I think there is some non trivial overlap with the notion of symmetric hash join that we already have (maybe we could extend the analysis used there) |
Looking forward to your sharing. |
This is on my mind, this week is very busy but I will circle back as soon as possible. |
If we plan to add ASOF JOIN /IEJOIN, I would like to keep an eye on the issue and be involved in its implementation. |
I would like to take part in this one. |
@ozankabak Can you please elaborate on that a bit? Which of the ~10 existing ASOF join implementations are you considering as not well-thought out and why?
It doesn't "try" to address anything, it solves a pain point, that is joining two datasets without matching timestamps efficiently. I'd love to see your "standard/consistent" SQL version though. |
This topic came up at the recent DataFusion meetup in San Franciso as something of interest. Is there anyone willing to help drive / lead this project? I can't offer to do it myself, but I can certainly offer to help review designs and implementations Thinking about how to break down the work, I think we will need:
Reading https://duckdb.org/docs/guides/sql_features/asof_join.html SELECT *
FROM holdings h
ASOF JOIN prices p USING (ticker, "when"); Given the limited semantics of this join (match the row with the largest value of "when") I suspect we could use a modified hash join and modify the "does the row match" clause to also check for "is when greater than the currently matched row" Another way to implement this would be with a slightly modified merge join
The idea being that you could drastically simplify the join if the input was already sorted by ticker as you just need to hold on to the most recently seen value of This is probably slower (as you have to sort the entire input including many values that will be filtered) but would allow for other behaviors other than just "the most rececnt value) |
@jirislav I'm not sure if you chose your tone intentionally to be this way but I have to say I didn't really appreciate the way it reads. Nevertheless I will try to answer as best as I can. My concern about the Consider a scenario where we are computing the most recent values of each position in a stock portfolio (akin to @alamb's example). Let's create some sample data via: CREATE TABLE holdings (
ts TIMESTAMP,
ticker VARCHAR(10),
amount INT
);
INSERT INTO holdings (ts, ticker, amount) VALUES
('2024-06-27 09:00:00', 'AAPL', '50'),
('2024-06-27 09:00:00', 'GOOGL', '30'),
('2024-06-27 09:00:00', 'MSFT', '20'),
('2024-06-28 09:00:00', 'AAPL', '60'),
('2024-06-28 09:00:00', 'GOOGL', '35'),
('2024-06-28 09:00:00', 'MSFT', '25');
CREATE TABLE prices (
ts TIMESTAMP,
ticker VARCHAR(10),
price DECIMAL(10, 2)
);
INSERT INTO prices (ts, ticker, price) VALUES
('2024-06-27 09:00:00', 'AAPL', '145.50'),
('2024-06-27 09:00:00', 'GOOGL', '2754.36'),
('2024-06-27 09:00:00', 'MSFT', '301.22'),
('2024-06-28 09:00:00', 'AAPL', '146.20'),
('2024-06-28 09:00:00', 'GOOGL', '2770.01'),
('2024-06-28 09:00:00', 'MSFT', '305.50'); Consider the following standard query: SELECT h.*, h.amount * LAST_VALUE(p.price ORDER BY p.ts)
FROM holdings as h, prices as p
WHERE h.ts >= p.ts AND h.ticker = p.ticker
GROUP BY h.ts, h.ticker, h.amount
ORDER BY h.ts, h.ticker We get the following (expected) result:
This solves the problem, it generalizes to arbitrary join conditions, and it already works in Given that we can solve this problem in a general way without any special syntax, I do not see any real benefit of adding another syntax for an already-solved problem in upstream DF. We are also improving built-in join operators constantly and adding new ones, so our plans will also get ever more efficient over time. Downstream users are obviously free to customize DF in any way they wish. Let's keep collecting ideas, creating examples and comparing approaches. At the end of the day, my current position may turn out to be wrong and adding this to upstream DF may prove to be the right choice. That is OK, we will all learn something regardless of the outcome of these discussions so let's keep contributing to it in good faith. I hope this helps. [1] @alamb, writing up this example makes me think it could be a good idea to file ticket to support order specification for memory tables too. |
FWIW this is the approach we took at Vertica many years ago (a specialized join implementation rather than specialized syntax). The basic idea is described in https://blogs.opentext.com/what-is-a-range-join-and-why-is-it-so-fastba-p223413/ |
I agree -- that would be a good improvement |
@ozankabak I'm deeply sorry, I didn't realize the tone is inappropriate. Thanks for pointing that out! Just to share my point of view — I didn't understand why is SQL standard in DataFusion more important than "proper" time-series support and it seemed to me like this issue was blocked quite some time by waiting for your reaction. It's just something I have to get used to in open-source projects I guess. Anyway, thanks for sharing the example! Though I see a few issues with the example you have shared:
SELECT *, amount * price
FROM holdings
ASOF JOIN prices USING (ticker, ts)
ORDER BY ts, ticker
|
We could contemplate implementing both 🤔 Specifically, if we got the DataFusion engine to be able to identify range joins in general and |
Thank you @jirislav, I'm sure we will find a reasonable way forward and make DF and its ecosystem even better for everybody.
I agree that some people may think this way, but this sounds like a conceptual issue on the part of the analyst. The query is declarative and doesn't specify how the calculation will be made. If someone is worried about the how, they can take a quick look at the plan to verify whether it is efficient enough for their needs. Obviously, the plan of the query today is not as efficient as it could be, but I'm optimistic that we will get there soon.
Right. The query is correct but I could have chosen my sample data better. -- I like @alamb's suggestion of starting with operators first and then thinking about various syntactic sugars. I anticipate that there will be other interesting constructs besides
|
FWIW this came up at InfluxData recently and we are considering investing more in this area. I will keep this ticket updated |
Leaving aside the asof join syntax. For range join, which contains non-equivalent join conditions, e.g. We can solve some of these scenarios by introducing iejoin, you can refer to this paper https://wayang.apache.org/assets/pdf/paper/iejoin.pdf, which I have previously implemented in databend, can achieve impressive performance gains, databendlabs/databend#11412. If you plan to implement this, feel free to invite me to participate in the review! |
There seems to be a draft PR for iejoin here #12754. It looks like the author is ready for an initial review |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In many timeseries workloads, the need to join one record to another based on recency or ordering is quite prevalent. This can be with two sensors that maintain different sampling rates, in handling market data (such as situations where you want to find the most recent quote for a given trade), or any other scenario where an ordering of events can be applied. In these situations, it is quite frequent that one set of data is at a much larger scale than the other, typically starting at an order of magnitude. It is also quite common that you would want to actually store your data in the appropriate order so as to minimize the effort when performing this kind of join operation (though that is not strictly the problem itself).
Describe the solution you'd like
Essentially the solution implemented by Clickhouse and KDB (please note that the Clickhouse solution allows for the full breadth of ordering conditions for closest match).
Describe alternatives you've considered
There is a general subquery solution that can be used to achieve the desired outcome, but it is typically not performant and can be fairly awkward to express. Provided implementation of #141 and sufficient query optimization of the subquery solution into sort-merge join, it may not be necessary as direct syntax.
Additional context
This likely depends on #141
The text was updated successfully, but these errors were encountered: