Blog post for DataFusion 51.0.0 #124
Merged
Commits (17):
- 39b4091 Add blog post for DataFusion 51.0.0 (alamb)
- 1d6fad9 Rough draft from codex (alamb)
- e3bae12 add credits (alamb)
- 28628a9 Updates (alamb)
- 7cf0cfe update (alamb)
- f64f760 updates (alamb)
- 13d4793 update (alamb)
- 78ecb7c update (alamb)
- f65d9e9 updates (alamb)
- 6bcdeb2 more (alamb)
- ffeba66 comments (alamb)
- eb632ff Apply suggestions from code review (alamb)
- 0967256 Update performance chart (alamb)
- 84315f0 another pass (alamb)
- 780ad11 update (alamb)
- 2fb719f tweaks (alamb)
- 33e4375 Consolidate redundant sections (alamb)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,333 @@ | ||
| --- | ||
| layout: post | ||
| title: Apache DataFusion 51.0.0 Released | ||
| date: 2025-11-25 | ||
| author: pmc | ||
| categories: [release] | ||
| --- | ||
|
|
||
| <!-- | ||
| {% comment %} | ||
| Licensed to the Apache Software Foundation (ASF) under one or more | ||
| contributor license agreements. See the NOTICE file distributed with | ||
| this work for additional information regarding copyright ownership. | ||
| The ASF licenses this file to you under the Apache License, Version 2.0 | ||
| (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| {% endcomment %} | ||
| --> | ||
|
|
||
| [TOC] | ||
|
|
||
| ## Introduction | ||
|
|
||
| We are proud to announce the release of [DataFusion 51.0.0]. This post highlights | ||
| some of the major improvements since [DataFusion 50.0.0]. The complete list of | ||
| changes is available in the [changelog]. Thanks to the [128 contributors] for | ||
| making this release possible. | ||
|
|
||
| [DataFusion 51.0.0]: https://crates.io/crates/datafusion/51.0.0 | ||
| [DataFusion 50.0.0]: https://datafusion.apache.org/blog/2025/09/29/datafusion-50.0.0/ | ||
| [changelog]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md | ||
| [128 contributors]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md#credits | ||
|
|
||
| ## Performance Improvements 🚀 | ||
| We continue to make significant performance improvements in DataFusion, both in | ||
| the core engine and in the Parquet reader. | ||
|
|
||
| <img | ||
| src="/blog/images/datafusion-51.0.0/performance_over_time_clickbench.png" | ||
| width="100%" | ||
| class="img-responsive" | ||
| alt="Performance over time" | ||
| /> | ||
|
|
||
| **Figure 1**: Average and median normalized query execution times for ClickBench queries for DataFusion 51.0.0 compared to previous releases. | ||
| Query times are normalized using the ClickBench definition. See the | ||
| [DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/) | ||
| for more details. | ||
|
|
||
| ### Faster `CASE` expression evaluation | ||
|
|
||
| This release builds on the [CASE performance epic] with significant improvements. | ||
| Expressions short‑circuit earlier, reuse partial results, and avoid unnecessary | ||
| scattering, speeding up common ETL patterns. Thanks to [pepijnve], [chenkovsky], | ||
| and [petern48] for leading this effort. We hope to share more details on our | ||
| implementation in a future post. | ||
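
For readers unfamiliar with the pattern, the query below is a minimal sketch of the kind of searched `CASE` expression this work targets. The inline `VALUES` data, the alias `v(x)`, and the bucket labels are made up for illustration and do not come from the DataFusion benchmarks.

```sql
-- Hypothetical data: classify values into buckets with a searched CASE
SELECT
  x,
  CASE
    WHEN x < 10  THEN 'small'
    WHEN x < 100 THEN 'medium'
    ELSE 'large'
  END AS bucket
FROM (VALUES (1), (50), (500)) AS v(x);
```

Each row only needs to be tested against later branches if the earlier ones did not match, which is where earlier short-circuiting can save work.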
|
|
||
| [pepijnve]: https://github.com/pepijnve | ||
| [chenkovsky]: https://github.com/chenkovsky | ||
| [petern48]: https://github.com/petern48 | ||
|
|
||
| ### Better Defaults for Remote Parquet Reads | ||
|
|
||
| By default, DataFusion now always fetches the last 512 KB (configurable) of [Apache Parquet] files, | ||
| which usually includes the footer and metadata ([#18118]). This | ||
| change typically avoids 2 I/O requests for each Parquet file. While this | ||
| setting has existed in DataFusion for many years, it was not previously enabled | ||
| by default. Users can tune the number of bytes fetched in the initial I/O | ||
| request via the `datafusion.execution.parquet.metadata_size_hint` [config setting]. Thanks to | ||
| [zhuqi-lucas] for leading this effort. | ||
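
As a sketch of how the hint might be tuned from SQL, for example in `datafusion-cli`: the setting name is the one given above, while the 1 MiB value is only an illustrative choice, not a recommendation.

```sql
-- Raise the initial footer fetch to 1 MiB (1048576 bytes); the value is illustrative
SET datafusion.execution.parquet.metadata_size_hint = 1048576;

-- Inspect the current value (assumes a session such as datafusion-cli where SHOW is available)
SHOW datafusion.execution.parquet.metadata_size_hint;
```

A larger hint may help when files carry unusually large footers, at the cost of fetching more bytes per file up front.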
|
|
||
| [config setting]: https://datafusion.apache.org/user-guide/configs.html | ||
| [apache parquet]: https://parquet.apache.org/ | ||
|
|
||
| ### Faster Parquet metadata parsing | ||
|
|
||
| DataFusion 51 also includes the latest Parquet reader from | ||
| [Arrow Rust 57.0.0], which parses Parquet metadata significantly faster. This is | ||
| especially beneficial for workloads with many small Parquet files and scenarios | ||
| where startup time or low latency is important. You can read more about the upstream work by | ||
| [etseidl] and [jhorstmann] that enabled these improvements in the [Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser] blog post. | ||
|
|
||
| <img | ||
| src="/blog/images/datafusion-51.0.0/arrow-57-metadata-parsing.png" | ||
| width="100%" | ||
| class="img-responsive" | ||
| alt="Metadata Parsing Performance Improvements in Arrow/Parquet 57" | ||
| /> | ||
|
|
||
| **Figure 2**: Metadata parsing performance improvements in Arrow/Parquet 57.0.0. | ||
|
|
||
| [Arrow Rust 57.0.0]: https://arrow.apache.org/blog/2025/10/30/arrow-rs-57.0.0/ | ||
| [Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser]: https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/ | ||
|
|
||
|
|
||
|
|
||
| ## New Features ✨ | ||
|
|
||
| ### Decimal32/Decimal64 support | ||
|
|
||
| The new Arrow types `Decimal32` and `Decimal64` are now supported in DataFusion | ||
|
Contributor review comment: oh nice!
|
||
| ([#17501]), including aggregations such as `SUM`, `AVG`, `MIN/MAX`, and window | ||
| functions. Thanks to [AdamGS] for leading this effort. | ||
|
|
||
|
|
||
| ### SQL Pipe Operators | ||
|
|
||
| DataFusion now supports the SQL pipe operator syntax | ||
| ([#17278]), enabling inline transforms such as: | ||
|
|
||
| ```sql | ||
| SELECT * FROM t | ||
| |> WHERE a > 10 | ||
| |> ORDER BY b | ||
| |> LIMIT 5; | ||
| ``` | ||
|
|
||
| This syntax, [popularized by Google BigQuery], keeps multi-step transformations concise while preserving regular | ||
| SQL semantics. Thanks to [simonvandel] for leading this effort. | ||
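
For comparison, the piped query above should correspond to the following conventional form. This is a sketch that reuses the same table `t` and columns `a` and `b` from the example.

```sql
-- Conventional equivalent of the piped query above
SELECT *
FROM t
WHERE a > 10
ORDER BY b
LIMIT 5;
```

The pipe form reads top to bottom in the order the steps are applied, which tends to matter more as the number of steps grows.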
|
|
||
| [popularized by Google BigQuery]: https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/pipe-syntax | ||
|
|
||
| ### I/O Profiling in `datafusion-cli` | ||
|
|
||
| [datafusion-cli] now has built-in instrumentation to trace object store calls | ||
| ([#17207]). Toggle profiling | ||
| with the [\object_store_profiling command] and inspect the exact `GET`/`LIST` requests issued during | ||
| query execution: | ||
|
|
||
| [datafusion-cli]: https://datafusion.apache.org/user-guide/cli/ | ||
| [\object_store_profiling command]: https://datafusion.apache.org/user-guide/cli/usage.html#commands | ||
|
|
||
| ```sql | ||
| DataFusion CLI v51.0.0 | ||
| > \object_store_profiling trace | ||
| ObjectStore Profile mode set to Trace | ||
| > select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'; | ||
| +----------+ | ||
| | count(*) | | ||
| +----------+ | ||
| | 1000000 | | ||
| +----------+ | ||
| 1 row(s) fetched. | ||
| Elapsed 0.367 seconds. | ||
|
|
||
| Object Store Profiling | ||
| Instrumented Object Store: instrument_mode: Trace, inner: HttpStore | ||
| 2025-11-19T21:10:43.476121+00:00 operation=Head duration=0.069763s path=hits_compatible/athena_partitioned/hits_1.parquet | ||
| 2025-11-19T21:10:43.545903+00:00 operation=Head duration=0.025859s path=hits_compatible/athena_partitioned/hits_1.parquet | ||
| 2025-11-19T21:10:43.571768+00:00 operation=Head duration=0.025684s path=hits_compatible/athena_partitioned/hits_1.parquet | ||
| 2025-11-19T21:10:43.597463+00:00 operation=Get duration=0.034194s size=524288 range: bytes=174440756-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet | ||
| 2025-11-19T21:10:43.705821+00:00 operation=Head duration=0.022029s path=hits_compatible/athena_partitioned/hits_1.parquet | ||
|
|
||
| Summaries: | ||
| +-----------+----------+-----------+-----------+-----------+-----------+-------+ | ||
| | Operation | Metric | min | max | avg | sum | count | | ||
| +-----------+----------+-----------+-----------+-----------+-----------+-------+ | ||
| | Get | duration | 0.034194s | 0.034194s | 0.034194s | 0.034194s | 1 | | ||
| | Get | size | 524288 B | 524288 B | 524288 B | 524288 B | 1 | | ||
| | Head | duration | 0.022029s | 0.069763s | 0.035834s | 0.143335s | 4 | | ||
| | Head | size | | | | | 4 | | ||
| +-----------+----------+-----------+-----------+-----------+-----------+-------+ | ||
| ``` | ||
|
|
||
| This makes it far easier to diagnose slow remote scans and validate caching | ||
| strategies. Thanks to [BlakeOrth] for leading this effort. | ||
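
As a quick sanity check on the trace above, the single `Get` request covers exactly 512 KB, matching the default footer prefetch size described earlier in this post. The sketch below assumes the byte range bounds are inclusive, as in HTTP range requests.

```sql
-- Byte range of the single Get request in the trace above (inclusive bounds)
SELECT
  174965043 - 174440756 + 1 AS bytes_requested,    -- 524288
  512 * 1024                AS default_hint_bytes; -- 524288
```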
|
|
||
| ### `DESCRIBE <query>` | ||
|
|
||
| `DESCRIBE` now works on arbitrary queries, returning the schema instead | ||
| of being an alias for `EXPLAIN` ([#18234](https://github.com/apache/datafusion/issues/18234)). This brings DataFusion in line with engines | ||
| like DuckDB and makes it easy to inspect the output schema of queries | ||
| without executing them. Thanks to [djanderson] for leading this effort. | ||
|
|
||
| [djanderson]: https://github.com/djanderson | ||
|
|
||
| For example: | ||
|
|
||
| ```sql | ||
| DataFusion CLI v51.0.0 | ||
| > create table t(a int, b varchar, c float) as values (1, 'a', 2.0); | ||
| 0 row(s) fetched. | ||
| Elapsed 0.002 seconds. | ||
|
|
||
| > DESCRIBE SELECT a, b, SUM(c) FROM t GROUP BY a, b; | ||
|
|
||
| +-------------+-----------+-------------+ | ||
| | column_name | data_type | is_nullable | | ||
| +-------------+-----------+-------------+ | ||
| | a | Int32 | YES | | ||
| | b | Utf8View | YES | | ||
| | sum(t.c) | Float64 | YES | | ||
| +-------------+-----------+-------------+ | ||
| 3 row(s) fetched. | ||
| ``` | ||
|
|
||
|
|
||
| ### Named arguments in SQL functions | ||
|
|
||
| DataFusion now understands [PostgreSQL-style named arguments] (`param => value`) | ||
| for scalar, aggregate, and window functions ([#17379](https://github.com/apache/datafusion/issues/17379)). You can mix positional and named | ||
| arguments in any order, and error messages now list parameter names to make | ||
| diagnostics clearer. UDF authors can also expose parameter names so their | ||
| functions benefit from the same syntax. Thanks to [timsaucer] and [bubulalabu] for leading this effort. | ||
|
|
||
| [PostgreSQL-style named arguments]: https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html | ||
|
|
||
| For example, you can pass arguments to functions like this: | ||
| ```sql | ||
| SELECT power(exponent => 3.0, base => 2.0); | ||
| ``` | ||
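
For comparison, here is a sketch of the same call written positionally and with the named arguments in declaration order. It assumes the parameter order is `base` then `exponent`, as the names in the example above suggest.

```sql
-- Positional call: arguments follow the assumed declaration order (base, exponent)
SELECT power(2.0, 3.0);

-- Named call with arguments in declaration order
SELECT power(base => 2.0, exponent => 3.0);
```

Both forms, like the reordered call above, should compute 2.0 raised to the power 3.0.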
|
|
||
| [timsaucer]: https://github.com/timsaucer | ||
| [bubulalabu]: https://github.com/bubulalabu | ||
|
|
||
| ### Metrics improvements | ||
|
|
||
| The output of [EXPLAIN ANALYZE] has been improved to include more metrics | ||
| about execution time and memory usage of each operator ([#18217]). | ||
| You can learn more about these new metrics in the [metrics user guide]. Thanks to | ||
| [2010YOUY01] for leading this effort. | ||
|
|
||
|
|
||
| [#18217]: https://github.com/apache/datafusion/issues/18217 | ||
| [2010YOUY01]: https://github.com/2010YOUY01 | ||
|
|
||
| The `51.0.0` release adds: | ||
|
|
||
| - **Configuration**: adds a new option `datafusion.explain.analyze_level`, which can be set to `summary` for a concise output or `dev` for the full set of metrics (the previous default). | ||
| - **For all major operators**: adds `output_bytes`, reporting how many bytes of data each operator produces. | ||
| - **FilterExec**: adds a `selectivity` metric (`output_rows / input_rows`) to show how effective the filter is. | ||
| - **AggregateExec**: | ||
|   - adds detailed timing metrics for group-ID computation, aggregate argument evaluation, aggregation work, and emitting final results. | ||
|   - adds a `reduction_factor` metric (`output_rows / input_rows`) to show how much grouping reduces the data. | ||
| - **NestedLoopJoinExec**: adds a `selectivity` metric (`output_rows / (left_rows * right_rows)`) to show how many combinations actually pass the join condition. | ||
| - Several display formatting improvements were added to make `EXPLAIN ANALYZE` output easier to read. | ||
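
To make the ratio definitions in the list above concrete, here is a small worked example. All row counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from a real plan.

```sql
SELECT
  250.0 / 1000.0        AS filter_selectivity,  -- 250 of 1000 input rows pass the filter: 0.25
  10.0 / 1000.0         AS reduction_factor,    -- 1000 input rows collapse into 10 groups: 0.01
  50.0 / (100.0 * 20.0) AS nlj_selectivity;     -- 50 of 100 x 20 row pairs pass the join condition: 0.025
```

For all three metrics, values near 1 mean the operator discards little of its input, while values near 0 mean it discards most of it.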
|
|
||
|
||
| [EXPLAIN ANALYZE]: https://datafusion.apache.org/user-guide/sql/explain.html#explain-analyze | ||
| [metrics user guide]: https://datafusion.apache.org/user-guide/metrics.html | ||
|
|
||
| For example, the following query: | ||
| ```sql | ||
| set datafusion.explain.analyze_level = summary; | ||
|
|
||
| explain analyze | ||
| select count(*) | ||
| from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' | ||
| where "URL" <> ''; | ||
| ``` | ||
|
|
||
| now shows easier-to-understand metrics such as: | ||
|
|
||
| ```text | ||
| metrics=[ | ||
| output_rows=1000000, | ||
| elapsed_compute=16ns, | ||
| output_bytes=222.5 MB, | ||
| files_ranges_pruned_statistics=16 total → 16 matched, | ||
| row_groups_pruned_statistics=3 total → 3 matched, | ||
| row_groups_pruned_bloom_filter=3 total → 3 matched, | ||
| page_index_rows_pruned=0 total → 0 matched, | ||
| bytes_scanned=33661364, | ||
| metadata_load_time=4.243098ms, | ||
| ] | ||
| ``` | ||
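
To switch back to the full set of metrics afterwards, the same option can be set to `dev`. This is a sketch using only the two values named in the configuration bullet above.

```sql
-- Restore the detailed metrics output
SET datafusion.explain.analyze_level = dev;
```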
|
|
||
| ## Upgrade Guide and Changelog | ||
|
|
||
| Upgrading to 51.0.0 should be straightforward for most users. Please review the | ||
| [Upgrade Guide] | ||
| for details on breaking changes and code snippets to help with the transition. | ||
| For a comprehensive list of all changes, please refer to the [changelog]. | ||
|
|
||
| ## About DataFusion | ||
|
|
||
| [Apache DataFusion] is an extensible query engine, written in [Rust], that uses | ||
| [Apache Arrow] as its in-memory format. DataFusion is used by developers to | ||
| create new, fast, data-centric systems such as databases, dataframe libraries, | ||
| and machine learning and streaming applications. While [DataFusion’s primary | ||
| design goal] is to accelerate the creation of other data-centric systems, it | ||
| provides a reasonable experience directly out of the box as a [dataframe | ||
| library], [Python library], and [command-line SQL tool]. | ||
|
|
||
| [apache datafusion]: https://datafusion.apache.org/ | ||
| [rust]: https://www.rust-lang.org/ | ||
| [apache arrow]: https://arrow.apache.org | ||
| [DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals | ||
| [dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html | ||
| [python library]: https://datafusion.apache.org/python/ | ||
| [command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/ | ||
| [Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html | ||
| [zhuqi-lucas]: https://github.com/zhuqi-lucas | ||
| [AdamGS]: https://github.com/AdamGS | ||
| [simonvandel]: https://github.com/simonvandel | ||
| [BlakeOrth]: https://github.com/BlakeOrth | ||
| [CASE performance epic]: https://github.com/apache/datafusion/issues/18075 | ||
| [#18118]: https://github.com/apache/datafusion/issues/18118 | ||
| [#17501]: https://github.com/apache/datafusion/pull/17501 | ||
| [#17278]: https://github.com/apache/datafusion/pull/17278 | ||
| [#17207]: https://github.com/apache/datafusion/issues/17207 | ||
| [#17379]: https://github.com/apache/datafusion/issues/17379 | ||
| [etseidl]: https://github.com/etseidl | ||
| [jhorstmann]: https://github.com/jhorstmann | ||
|
|
||
| DataFusion's core thesis is that, as a community, together we can build much | ||
| more advanced technology than any of us as individuals or companies could build | ||
| alone. Without DataFusion, highly performant vectorized query engines would | ||
| remain the domain of a few large companies and world-class research | ||
| institutions. With DataFusion, we can all build on top of a shared foundation | ||
| and focus on what makes our projects unique. | ||
|
|
||
| ## How to Get Involved | ||
|
|
||
| DataFusion is not a project built or driven by a single person, company, or | ||
| foundation. Rather, our community of users and contributors works together to | ||
| build a shared technology that none of us could have built alone. | ||
|
|
||
| If you are interested in joining us, we would love to have you. You can try out | ||
| DataFusion on some of your own data and projects and let us know how it goes, | ||
| contribute suggestions, documentation, bug reports, or a PR with documentation, | ||
| tests, or code. A list of open issues suitable for beginners is [here], and you | ||
| can find out how to reach us on the [communication doc]. | ||
|
|
||
| [here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 | ||
| [communication doc]: https://datafusion.apache.org/contributor-guide/communication.html | ||
Binary file added: content/images/datafusion-51.0.0/performance_over_time_clickbench.png (+60.5 KB)
128, wow
Indeed -- I think this is the part of this blog post I am most proud of