-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Update to arrow/parquet 55.2.0 #16575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Update to released version
01)CoalesceBatchesExec: target_batch_size=8192 | ||
02)--FilterExec: column1@0 != 42 | ||
03)----DataSourceExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..137], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:137..274], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:274..411], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:411..547]]}, projection=[column1], file_type=parquet, predicate=column1@0 != 42, pruning_predicate=column1_null_count@2 != row_count@3 AND (column1_min@0 != 42 OR 42 != column1_max@1), required_guarantees=[column1 not in (42)] | ||
03)----DataSourceExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..141], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:141..282], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:282..423], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:423..563]]}, projection=[column1], file_type=parquet, predicate=column1@0 != 42, pruning_predicate=column1_null_count@2 != row_count@3 AND (column1_min@0 != 42 OR 42 != column1_max@1), required_guarantees=[column1 not in (42)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some files can get (slightly) larger due to
07)│ DataSourceExec │ | ||
08)│ -------------------- │ | ||
09)│ bytes: 1072 │ | ||
09)│ bytes: 1040 │ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these files are now slightly smaller due to
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @alamb , there are some performance regression for clickbench, but i may be noise?
Hmm looks non trivial -- I will rerun |
This comment was marked as outdated.
This comment was marked as outdated.
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
🤖 |
Hmm, next up will be to try and reproduce the slowdown locally 🤔 |
🤖: Benchmark completed Details
|
I am preparing to compare locally with 3649dc8 (the (venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ git checkout `git merge-base alamb/alamb/arrow_55.2.0_upgrade_real apache/main`
Note: switching to '3649dc80cef2cdcd2df750e90e6ffd94ae6ae808'.
....
cargo build --profile=profiling -p datafusion-cli && cp target/profiling/datafusion-cli ~/Downloads/datafusion-cli-arrow-55.1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @alamb lgtm
and thanks for test diff explanation
I couldn't really reproduce anything locally but the variation is pretty noisy with q21
When looking at that query however, I found some other performance improvement: |
I also don't really see any difference with q4 (another that was reported 10% slower) (venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ ./datafusion-cli-alamb_arrow_55.2.0_upgrade_real -f q4-times-10.sql | grep Elapsed
Elapsed 0.312 seconds.
Elapsed 0.271 seconds.
Elapsed 0.266 seconds.
Elapsed 0.265 seconds.
Elapsed 0.273 seconds.
Elapsed 0.277 seconds.
Elapsed 0.267 seconds.
Elapsed 0.273 seconds.
Elapsed 0.269 seconds.
Elapsed 0.276 seconds.
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ ./datafusion-cli-arrow-55.1.0 -f q4-times-10.sql |grep Elapsed
Elapsed 0.324 seconds.
Elapsed 0.289 seconds.
Elapsed 0.266 seconds.
Elapsed 0.269 seconds.
Elapsed 0.263 seconds.
Elapsed 0.272 seconds.
Elapsed 0.267 seconds.
Elapsed 0.275 seconds.
Elapsed 0.268 seconds.
Elapsed 0.262 seconds. I will next see if I can reproduce the same thing on the actual benchmark machine |
Here are my benchmark results on the same benchmark machine: q4:
q21
(basically no change ...) So I am stumped |
An update here is I am working on some scripts to start measuring performance over time of datafusion so we can get a better handle on how performance is changing over time |
This is great for us to monitor performance! |
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
I am going to merge this PR in to unblock other things (like the PR for external parquet indexes) I am not really sure about the performance impact as I see some non trivial fluctuation. However, I have been working on a system to record and report DataFusion performance overtime. See #5504 for more detail or get a sneak peek here: https://github.com/alamb/datafusion-benchmarking If we have lost any performance let's plan to get it back over time |
🚀 |
Which issue does this PR close?
55.2.2
#16492Rationale for this change
I would like to use new features in this minor version of arrow
What changes are included in this PR?
Are these changes tested?
By existing CI tests
Are there any user-facing changes?
No