Skip to content
333 changes: 333 additions & 0 deletions content/blog/2025-11-25-datafusion-51.0.0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,333 @@
---
layout: post
title: Apache DataFusion 51.0.0 Released
date: 2025-11-25
author: pmc
categories: [release]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

[TOC]

## Introduction

We are proud to announce the release of [DataFusion 51.0.0]. This post highlights
some of the major improvements since [DataFusion 50.0.0]. The complete list of
changes is available in the [changelog]. Thanks to the [128 contributors] for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

128, wow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed -- I think this is the part of this blog post I am most proud of

making this release possible.

[DataFusion 51.0.0]: https://crates.io/crates/datafusion/51.0.0
[DataFusion 50.0.0]: https://datafusion.apache.org/blog/2025/09/29/datafusion-50.0.0/
[changelog]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md
[128 contributors]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md#credits

## Performance Improvements 🚀
We continue to make significant performance improvements in DataFusion, both in
the core engine and in the Parquet reader.

<img
src="/blog/images/datafusion-51.0.0/performance_over_time_clickbench.png"
width="100%"
class="img-responsive"
alt="Performance over time"
/>

**Figure 1**: Average and median normalized query execution times for ClickBench queries for DataFusion 51.0.0 compared to previous releases.
Query times are normalized using the ClickBench definition. See the
[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/)
for more details.

### Faster `CASE` expression evaluation

This release builds on the [CASE performance epic] with significant improvements.
Expressions short‑circuit earlier, reuse partial results, and avoid unnecessary
scattering, speeding up common ETL patterns. Thanks to [pepijnve], [chenkovsky],
and [petern48] for leading this effort. We hope to share more details on our
implementation in a future post.

[pepijnve]: https://github.com/pepijnve
[chenkovsky]: https://github.com/chenkovsky
[petern48]: https://github.com/petern48

### Better Defaults for Remote Parquet Reads

By default, DataFusion now always fetches the last 512KB (configurable) of [Apache Parquet] files
which usually includes the footer and metadata ([#18118]). This
change typically avoids 2 I/O requests for each Parquet. While this
setting has existed in DataFusion for many years, it was not previously enabled
by default. Users can tune the number of bytes fetched in the initial I/O
request via the `datafusion.execution.parquet.metadata_size_hint` [config setting]. Thanks to
[zhuqi-lucas] for leading this effort.

[config setting]: https://datafusion.apache.org/user-guide/configs.html
[apache parquet]: https://parquet.apache.org/

### Faster Parquet metadata parsing

DataFusion 51 also includes the latest Parquet reader from
[Arrow Rust 57.0.0], which parses Parquet metadata significantly faster. This is
especially beneficial for workloads with many small Parquet files and scenarios
where startup time or low latency is important. You can read more about the upstream work by
[etseidl] and [jhorstmann] that enabled these improvements in the [Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser] blog.

<img
src="/blog/images/datafusion-51.0.0/arrow-57-metadata-parsing.png"
width="100%"
class="img-responsive"
alt="Metadata Parsing Performance Improvements in Arrow/Parquet 57"
/>

**Figure 2**: Metadata parsing performance improvements in Arrow/Parquet 57.0.0.

[Arrow Rust 57.0.0]: https://arrow.apache.org/blog/2025/10/30/arrow-rs-57.0.0/
[Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser]: https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/



## New Features ✨

### Decimal32/Decimal64 support

The new Arrow types `Decimal32` and `Decimal64` are now supported in DataFusion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice!

([#17501]), including aggregations such as `SUM`, `AVG`, `MIN/MAX`, and window
functions. Thanks to [AdamGS] for leading this effort.


### SQL Pipe Operators

DataFusion now supports the SQL pipe operator syntax
([#17278]), enabling inline transforms such as:

```sql
SELECT * FROM t
|> WHERE a > 10
|> ORDER BY b
|> LIMIT 5;
```

This syntax, [popularized by Google BigQuery], keeps multi-step transformations concise while preserving regular
SQL semantics. Thanks to [simonvandel] for leading this effort.

[popularized by Google BigQuery]: https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/pipe-syntax

### I/O Profiling in `datafusion-cli`

[datafusion-cli] now has built-in instrumentation to trace object store calls
([#17207]). Toggle profiling
with the [\object_store_profiling command] and inspect the exact `GET`/`LIST` requests issued during
query execution:

[datafusion-cli]: https://datafusion.apache.org/user-guide/cli/
[\object_store_profiling command]: https://datafusion.apache.org/user-guide/cli/usage.html#commands

```sql
DataFusion CLI v51.0.0
> \object_store_profiling trace
ObjectStore Profile mode set to Trace
> select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row(s) fetched.
Elapsed 0.367 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
2025-11-19T21:10:43.476121+00:00 operation=Head duration=0.069763s path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.545903+00:00 operation=Head duration=0.025859s path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.571768+00:00 operation=Head duration=0.025684s path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.597463+00:00 operation=Get duration=0.034194s size=524288 range: bytes=174440756-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.705821+00:00 operation=Head duration=0.022029s path=hits_compatible/athena_partitioned/hits_1.parquet

Summaries:
+-----------+----------+-----------+-----------+-----------+-----------+-------+
| Operation | Metric | min | max | avg | sum | count |
+-----------+----------+-----------+-----------+-----------+-----------+-------+
| Get | duration | 0.034194s | 0.034194s | 0.034194s | 0.034194s | 1 |
| Get | size | 524288 B | 524288 B | 524288 B | 524288 B | 1 |
| Head | duration | 0.022029s | 0.069763s | 0.035834s | 0.143335s | 4 |
| Head | size | | | | | 4 |
+-----------+----------+-----------+-----------+-----------+-----------+-------+
```

This makes it far easier to diagnose slow remote scans and validate caching
strategies. Thanks to [BlakeOrth] for leading this effort.

### `DESCRIBE <query>`

`DESCRIBE` now works on arbitrary queries, returning the schema instead
of being an alias for `EXPLAIN` ([#18234](https://github.com/apache/datafusion/issues/18234)). This brings DataFusion in line with engines
like DuckDB and makes it easy to inspect the output schema of queries
without executing them. Thanks to [djanderson] for leading this effort.

[djanderson]: https://github.com/djanderson

For example:

```sql
DataFusion CLI v51.0.0
> create table t(a int, b varchar, c float) as values (1, 'a', 2.0);
0 row(s) fetched.
Elapsed 0.002 seconds.

> DESCRIBE SELECT a, b, SUM(c) FROM t GROUP BY a, b;

+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| a | Int32 | YES |
| b | Utf8View | YES |
| sum(t.c) | Float64 | YES |
+-------------+-----------+-------------+
3 row(s) fetched.
```


### Named arguments in SQL functions

DataFusion now understands [PostgreSQL-style named arguments] (`param => value`)
for scalar, aggregate, and window functions ([#17379](https://github.com/apache/datafusion/issues/17379)). You can mix positional and named
arguments in any order, and error messages now list parameter names to make
diagnostics clearer. UDF authors can also expose parameter names so their
functions benefit from the same syntax. Thanks to [timsaucer] and [bubulalabu] for leading this effort.

[PostgreSQL-style named arguments]: https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html

For example, you can pass arguments to functions like this:
```sql
SELECT power(exponent => 3.0, base => 2.0);
```

[timsaucer]: https://github.com/timsaucer
[bubulalabu]: https://github.com/bubulalabu

### Metrics improvements

The output of [EXPLAIN ANALYZE] has been improved to include more metrics
about execution time and memory usage of each operator ([#18217]).
You can learn more about these new metrics in the [metrics user guide]. Thanks to
[2010YOUY01] for leading this effort.


[#18217]: https://github.com/apache/datafusion/issues/18217
[2010YOUY01]: https://github.com/2010YOUY01

The `51.0.0` release adds:

- **Configuration**: adds a new option `datafusion.explain.analyze_level`, which can be set to `summary` for a concise output or `dev` for the full set of metrics (the previous default).
- **For all major operators**: adds `output_bytes`, reporting how many bytes of data each operator produces.
- **FilterExec**: adds a `selectivity` metric (`output_rows / input_rows`) to show how effective the filter is.
- **AggregateExec**:
- adds detailed timing metrics for group-ID computation, aggregate argument evaluation, aggregation work, and emitting final results.
- adds a `reduction_factor` metric (`output_rows / input_rows`) to show how much grouping reduces the data.
- **NestedLoopJoinExec**: adds a `selectivity` metric (`output_rows / (left_rows * right_rows)`) to show how many combinations actually pass the join condition.
- Several display formatting improvements were added to make `EXPLAIN ANALYZE` output easier to read.

[EXPLAIN ANALYZE]: https://datafusion.apache.org/user-guide/sql/explain.html#explain-analyze
[metrics user guide]: https://datafusion.apache.org/user-guide/metrics.html

For example, the following query:
```sql
set datafusion.explain.analyze_level = summary

explain analyze
select count(*)
from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'
where "URL" <> '';
```

Now shows easier-to-understand metrics such as:

```text
metrics=[
output_rows=1000000,
elapsed_compute=16ns,
output_bytes=222.5 MB,
files_ranges_pruned_statistics=16 total → 16 matched,
row_groups_pruned_statistics=3 total → 3 matched,
row_groups_pruned_bloom_filter=3 total → 3 matched,
page_index_rows_pruned=0 total → 0 matched,
bytes_scanned=33661364,
metadata_load_time=4.243098ms,
]
```

## Upgrade Guide and Changelog

Upgrading to 51.0.0 should be straightforward for most users. Please review the
[Upgrade Guide]
for details on breaking changes and code snippets to help with the transition.
For a comprehensive list of all changes, please refer to the [changelog].

## About DataFusion

[Apache DataFusion] is an extensible query engine, written in [Rust], that uses
[Apache Arrow] as its in-memory format. DataFusion is used by developers to
create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications. While [DataFusion’s primary
design goal] is to accelerate the creation of other data-centric systems, it
provides a reasonable experience directly out of the box as a [dataframe
library], [Python library], and [command-line SQL tool].

[apache datafusion]: https://datafusion.apache.org/
[rust]: https://www.rust-lang.org/
[apache arrow]: https://arrow.apache.org
[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals
[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html
[python library]: https://datafusion.apache.org/python/
[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/
[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html
[zhuqi-lucas]: https://github.com/zhuqi-lucas
[AdamGS]: https://github.com/AdamGS
[simonvandel]: https://github.com/simonvandel
[BlakeOrth]: https://github.com/BlakeOrth
[CASE performance epic]: https://github.com/apache/datafusion/issues/18075
[#18118]: https://github.com/apache/datafusion/issues/18118
[#17501]: https://github.com/apache/datafusion/pull/17501
[#17278]: https://github.com/apache/datafusion/pull/17278
[#17207]: https://github.com/apache/datafusion/issues/17207
[#17379]: https://github.com/apache/datafusion/issues/17379
[etseidl]: https://github.com/etseidl
[jhorstmann]: https://github.com/jhorstmann

DataFusion's core thesis is that, as a community, together we can build much
more advanced technology than any of us as individuals or companies could build
alone. Without DataFusion, highly performant vectorized query engines would
remain the domain of a few large companies and world-class research
institutions. With DataFusion, we can all build on top of a shared foundation
and focus on what makes our projects unique.

## How to Get Involved

DataFusion is not a project built or driven by a single person, company, or
foundation. Rather, our community of users and contributors works together to
build a shared technology that none of us could have built alone.

If you are interested in joining us, we would love to have you. You can try out
DataFusion on some of your own data and projects and let us know how it goes,
contribute suggestions, documentation, bug reports, or a PR with documentation,
tests, or code. A list of open issues suitable for beginners is [here], and you
can find out how to reach us on the [communication doc].

[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.