-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Added a number of physical planning join benchmarks #13085
Conversation
This is my first PR for datafusion & first time writing in Rust, so any general feedback or comments would be greatly appreciated. |
@alamb if you have some time & could add this to your list of PRs to review that would be great. Additionally, if you could trigger the CI tasks that would be greatly appreciated! |
Hi @mnorfolk03 I think it will not close the issue. I see you added a number of benchmarks for planning queries with joins, however it seems the linked issue is about executing joins. I suggest to:
|
Let me know if you need any further help |
Ah I see @alamb suggested to start with physical planning in the ticket first, sorry for the noise :) I do think some focused join execution benchmarks can be very useful too. |
Yes indeed -- I think getting some better benchmark coverage for planning will help us make that area of DataFusion better. Getting some better / more focused benchmarks for join execution would also be aweseom |
Thank you @mnorfolk03! Welcome to DataFusion (and rust!) I think the CI test here https://github.com/apache/datafusion/actions/runs/11490928733/job/31986697253?pr=13085 Would be fixed by running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mnorfolk03 -- this is a nice improvement
I ran these benchmarks and they look good.
cargo bench --bench sql_planner -- physical_join_distinct
I also made some flamegraphs to see what might be going on which was quite instructive:
sudo flamegraph -- target/release/deps/sql_planner-65bd83b445d4016c --bench physical_join_distinct
sudo flamegraph -- target/release/deps/sql_planner-65bd83b445d4016c --bench physical_many_self_joins
So TLDR is I think these are a significant improvement to our coverage. Thank you 🙏
One thing that might be helpful to improve more would be adding a ParquetExec as well as queries that have sortedness
So the equivalent of planning against tables like this (docs here): https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
CREATE EXTERNAL TABLE foo STORED AS PARQUET LOCATION '..'
CREATE EXTERNAL TABLE test (
c1 VARCHAR NOT NULL,
c2 INT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT NOT NULL,
c5 INT NOT NULL,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 BIGINT NOT NULL,
c10 VARCHAR NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH ORDER (c2 ASC, c5 + c8 DESC NULL FIRST)
LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true');
@alamb I've run cargo fmt.
I'll leave this for a future PR since I've run out of free time this week. Should I open an issue for this in that case? Thanks! |
Thanks for your help @mnorfolk03 -- that is great. Indeed an issue would be great -- I filed #13098 to track the idea. We appreciate your help! |
Which issue does this PR close?
Based on discussion in #94 but unsure if it would close it.
Rationale for this change
Lack of join benchmarks for physical planning.
What changes are included in this PR?
Added a number of physical planning benchmarks mostly focused on JOIN queries.
Are these changes tested?
n/a
Are there any user-facing changes?
No