Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Added a number of physical planning join benchmarks #13085

Merged
merged 2 commits into from
Oct 24, 2024

Conversation

mnorfolk03
Copy link
Contributor

Which issue does this PR close?

Based on discussion in #94 but unsure if it would close it.

Rationale for this change

Lack of join benchmarks for physical planning.

What changes are included in this PR?

Added a number of physical planning benchmarks mostly focused on JOIN queries.

  • Query plan should consider sort merge join
  • Theta join (a < b) where query plan should consider sort merge join
  • Query plan consisting of many self joins
  • Nested query which may be unnested to become a regular join query
  • Query plan using intersection which is equivalent to a join (and another for the same equivalent query plan but written using explicit join)

Are these changes tested?

n/a

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Oct 24, 2024
@mnorfolk03
Copy link
Contributor Author

This is my first PR for datafusion & first time writing in Rust, so any general feedback or comments would be greatly appreciated.

@mnorfolk03
Copy link
Contributor Author

@alamb if you have some time & could add this to your list of PRs to review that would be great. Additionally, if you could trigger the CI tasks that would be greatly appreciated!

@Dandandan
Copy link
Contributor

Hi @mnorfolk03
Thanks for the PR.

I think it will not close the issue. I see you added a number of benchmarks for planning queries with joins, however it seems the linked issue is about executing joins.

I suggest to:

  • create a new benchmark (e.g. join) that executes against some generated data
  • create a number of join queries (what would be interesting to benchmark at some point is):
    • different join types (inner, left, right, full, etc.)
    • different datatype of join keys
    • with/without filter
    • with/without equijoin condition (hash join/nested loop join)
    • compare hash join/sortmergejoin etc.)

@Dandandan
Copy link
Contributor

Let me know if you need any further help

@Dandandan
Copy link
Contributor

Ah I see @alamb suggested to start with physical planning in the ticket first, sorry for the noise :)

I do think some focused join execution benchmarks can be very useful too.

@alamb
Copy link
Contributor

alamb commented Oct 24, 2024

Ah I see @alamb suggested to start with physical planning in the ticket first, sorry for the noise :)

I do think some focused join execution benchmarks can be very useful too.

Yes indeed -- I think getting some better benchmark coverage for planning will help us make that area of DataFusion better.

Getting some better / more focused benchmarks for join execution would also be aweseom

@alamb
Copy link
Contributor

alamb commented Oct 24, 2024

Thank you @mnorfolk03! Welcome to DataFusion (and rust!) I think the CI test here

https://github.com/apache/datafusion/actions/runs/11490928733/job/31986697253?pr=13085

Would be fixed by running cargo fmt and checking in the result

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mnorfolk03 -- this is a nice improvement

I ran these benchmarks and they look good.

cargo bench --bench sql_planner -- physical_join_distinct

I also made some flamegraphs to see what might be going on which was quite instructive:

sudo flamegraph -- target/release/deps/sql_planner-65bd83b445d4016c --bench physical_join_distinct

flamegraph2

Screenshot 2024-10-24 at 7 35 28 AM

sudo flamegraph -- target/release/deps/sql_planner-65bd83b445d4016c --bench physical_many_self_joins

flamegraph3

Screenshot 2024-10-24 at 7 37 38 AM

So TLDR is I think these are a significant improvement to our coverage. Thank you 🙏

One thing that might be helpful to improve more would be adding a ParquetExec as well as queries that have sortedness

So the equivalent of planning against tables like this (docs here): https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table

CREATE EXTERNAL TABLE foo STORED AS PARQUET LOCATION '..'
CREATE EXTERNAL TABLE test (
    c1  VARCHAR NOT NULL,
    c2  INT NOT NULL,
    c3  SMALLINT NOT NULL,
    c4  SMALLINT NOT NULL,
    c5  INT NOT NULL,
    c6  BIGINT NOT NULL,
    c7  SMALLINT NOT NULL,
    c8  INT NOT NULL,
    c9  BIGINT NOT NULL,
    c10 VARCHAR NOT NULL,
    c11 FLOAT NOT NULL,
    c12 DOUBLE NOT NULL,
    c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH ORDER (c2 ASC, c5 + c8 DESC NULL FIRST)
LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true');

@mnorfolk03
Copy link
Contributor Author

mnorfolk03 commented Oct 24, 2024

@alamb I've run cargo fmt.

So TLDR is I think these are a significant improvement to our coverage. Thank you 🙏

One thing that might be helpful to improve more would be adding a ParquetExec as well as queries that have sortedness

So the equivalent of planning against tables like this (docs here): https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table

CREATE EXTERNAL TABLE foo STORED AS PARQUET LOCATION '..'
CREATE EXTERNAL TABLE test (
    c1  VARCHAR NOT NULL,
    c2  INT NOT NULL,
    c3  SMALLINT NOT NULL,
    c4  SMALLINT NOT NULL,
    c5  INT NOT NULL,
    c6  BIGINT NOT NULL,
    c7  SMALLINT NOT NULL,
    c8  INT NOT NULL,
    c9  BIGINT NOT NULL,
    c10 VARCHAR NOT NULL,
    c11 FLOAT NOT NULL,
    c12 DOUBLE NOT NULL,
    c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH ORDER (c2 ASC, c5 + c8 DESC NULL FIRST)
LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true');

I'll leave this for a future PR since I've run out of free time this week. Should I open an issue for this in that case?

Thanks!

@alamb
Copy link
Contributor

alamb commented Oct 24, 2024

I'll leave this for a future PR since I've run out of free time this week. Should I open an issue for this in that case?

Thanks for your help @mnorfolk03 -- that is great. Indeed an issue would be great -- I filed #13098 to track the idea.

We appreciate your help!

@alamb alamb merged commit 31701b8 into apache:main Oct 24, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants