Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range/inequality joins are slow #8393

Open
simonvandel opened this issue Dec 1, 2023 · 10 comments · Fixed by #9676
Open

Range/inequality joins are slow #8393

simonvandel opened this issue Dec 1, 2023 · 10 comments · Fixed by #9676
Labels
bug Something isn't working

Comments

@simonvandel
Copy link
Contributor

Describe the bug

Joins where the ON filter are not equality, but rather inequalities like <, `> etc. seem slow. Atleast compared to DuckDB which seem like a direct "competitor".

The main difference between the DuckDB and Datafusion plans seem to be that Datafusion uses a NestedLoopJoinExec, while DuckDB uses a IEJoin.

Note that the query could be written better with a ASOF-join, but Datafusion does not support that (see issue #318).

To Reproduce

Create some test data with this SQL (saved as repro-dataset.sql) in DuckDB:

CREATE
OR REPLACE TABLE pricing AS
SELECT
    t,
    RANDOM() as v
FROM
    range(
        '2022-01-01' :: TIMESTAMP,
        '2023-01-01' :: TIMESTAMP,
        INTERVAL 30 DAY
    ) ts(t);

COPY pricing to 'pricing.parquet' (format 'parquet');

CREATE
OR REPLACE TABLE timestamps AS
SELECT
    t
FROM
    range(
        '2022-01-01' :: TIMESTAMP,
        '2023-01-01' :: TIMESTAMP,
        INTERVAL 10 SECOND
    ) ts(t);

COPY timestamps to 'timestamps.parquet' (format 'parquet');
$ duckdb < repro-dataset.sql

We will compare the performance of the following query in DuckDB and Datafusion. The query is saved as repro-range-query.sql.

WITH pricing_state AS (
    SELECT
        t as valid_from,
        COALESCE(
            LEAD(t, 1) OVER (
                ORDER BY
                    t
            ),
            '9999-12-31'
        ) as valid_to,
        v
    FROM
        'pricing.parquet'
)
SELECT
    t.t,
    p.v
FROM
    pricing_state p
    LEFT JOIN 'timestamps.parquet' t ON t.t BETWEEN p.valid_from
    AND p.valid_to;

DuckDB performance:

$ time duckdb < repro-range-query.sql
...
real    0m0.999s
user    0m6.070s
sys     0m3.600s

Datafusion performance:

$ time datafusion-cli -f repro-range-query.sql
...
real    0m8.269s
user    0m6.358s
sys     0m1.907s

Expected behavior

It would be nice if the above query (or something equivalent) would be faster in Datafusion.

If someone knows of a better way to express the query, then that could also be a workaround for me.

Additional context

Machine tested on:
CPU:Ryzen 3900x
OS: Ubuntu 22.04

Versions used:

$ duckdb --version
v0.9.2 3c695d7ba9
$ datafusion-cli --version
datafusion-cli 33.0.0
@simonvandel simonvandel added the bug Something isn't working label Dec 1, 2023
@simonvandel
Copy link
Contributor Author

I just noticed that what I really want is to actually do a RIGHT join. That is, if there is no matching pricing for a timestamp, it should give null.

Changing the query to that, Datafusion is much faster. I believe it's because with a RIGHT join, pricing becomes the outer table (single partition), while timestamps becomes the inner table (unspecified partitioning), which allows for greater parallelism (see https://github.com/apache/arrow-datafusion/blob/e19c669855baa8b78ff86755803944d2ddf65536/datafusion/physical-plan/src/joins/nested_loop_join.rs#L72-L77C4)

But I think the issue should still be open - the LEFT join is still slower

@alamb
Copy link
Contributor

alamb commented Dec 1, 2023

I think IEJoin is a form of RangeJoin (https://duckdb.org/2022/05/27/iejoin.html) -- I agree it would be neat to make this fast in DataFusion, but I think it is a pretty major project (it typically requires a specialized operator, as described in the DuckDB blog)

@alamb
Copy link
Contributor

alamb commented Dec 1, 2023

I stared trying to collect a list of various join improvments on #8398

@my-vegetable-has-exploded
Copy link
Contributor

I am interested in this ticket. Since it is a pretty major project, I will write a proposal first.

@alamb
Copy link
Contributor

alamb commented Mar 8, 2024

Thank you @my-vegetable-has-exploded -- that is a great idea

cc @korowa / @viirya / @metesynnada who have been involved in Join implementations recently and who may be interested as well

@korowa
Copy link
Contributor

korowa commented Mar 11, 2024

Disregarding IEJoin -- time output from the issue description seems to show that both DuckDB and DF spend +- same cputime (user + system) and the only difference is parallelism (shown by real time), which, how @simonvandel noticed, depends on left/right input + join type) -- this makes me think that the x8 slowdown is not related to how join performed internally, but more like caused by physical optimizer skipping join reordering for NestedLoopJoin.

So, if i'm not mistaken, this issue is mostly about covering NLJoin in join_selection.rs.

UPD: in addition, to make join reordering useful, it's also required to modify NLJoin, since currently it chooses build-side based on logical join type.

@my-vegetable-has-exploded
Copy link
Contributor

So, if i'm not mistaken, this issue is mostly about covering NLJoin in join_selection.rs.

I think it is a good idea to improve performance in this scenario. Your pr is also good for me. But I think it is also ok to keep old parallelism strategy. In my opinion, the old paralleism strategy should works, but the check in enforce_distribution.rs block the reparition of it whick would check the row number. In this query, pricing_state 's row numbers is less than batch_size and the RepartitionExec also just works for a batch a time.

https://github.com/apache/arrow-datafusion/blob/ad8d552b9f150c3c066b0764e84f72b667a649ff/datafusion/core/src/physical_optimizer/enforce_distribution.rs#L1099-L1106

I think it may another way to write a new enforce_distribution strategy for NestLoopJoin and CrossJoin. We can check repartition_beneficial_stats by the left table size multiply right partition size rather than just right partition size (take RIGHTJOIN for example).

@korowa
Copy link
Contributor

korowa commented Mar 20, 2024

the old paralleism strategy should works, but the check in enforce_distribution.rs block the reparition

I don't think it's proper way to go -- it'll give some benefits in terms of runtime, but it will be suboptimal in terms of memory utilization, and cputime (as we'll need to perform BuildSideRows * NumberOfPartitions filter evaluations instead of BuildSideRows * 1, where 1 is probe side input batches)

@Dandandan
Copy link
Contributor

I don't think this issue should be closed.

#9676 seems to take care of ordering but I think it doesn't improve range/inequality joins much?

@Dandandan Dandandan reopened this Apr 22, 2024
@korowa
Copy link
Contributor

korowa commented Apr 23, 2024

My intention was to fix NLJoin parallelism issue due to fixed build-side choice (since right join instead of left had acceptable performance, as it was claimed above), and in the same time we also have #318 for specialized operator implementation, so, I supposed #9676 to be enough.

Don't mind to keep it open, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants