From 68abcfb57d7251c7ca52919b0c1e842b9be3144b Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 8 Jan 2026 17:41:52 -0500 Subject: [PATCH 01/26] Initial draft (coded with codex) --- content/blog/2026-01-08-datafusion-52.0.0.md | 337 +++++++++++++++++++ 1 file changed, 337 insertions(+) create mode 100644 content/blog/2026-01-08-datafusion-52.0.0.md diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md new file mode 100644 index 00000000..52ff7fa7 --- /dev/null +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -0,0 +1,337 @@ +--- +layout: post +title: Apache DataFusion 52.0.0 Released +date: 2026-01-08 +author: pmc +categories: [release] +--- + + + +[TOC] + +## Introduction + +We are proud to announce the release of [DataFusion 52.0.0]. This post highlights +some of the major improvements since [DataFusion 51.0.0]. The complete list of +changes is available in the [changelog]. Thanks to the [120 contributors] for +making this release possible. + +TODO: confirm the release date for 52.0.0 and update the front matter if needed. + +[DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0 +[DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md +[120 contributors]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits + +## Performance Improvements 🚀 + +We continue to make significant performance improvements in DataFusion, both in +the core engine and in the Parquet reader. This release includes faster `CASE` +expressions, better hash performance for string types, and continued string +function optimizations. + +### Performance Chart (TODO) + +TODO: add the 52.0.0 performance chart and update the caption. + + + +**Figure 1**: TODO: update caption for 52.0.0 benchmarking results. + +## Major Features ✨ + +### Arrow IPC Stream file support + +DataFusion can now read Arrow IPC stream files ([#18457]). This expands +interoperability with systems that emit Arrow streams directly, making it +simpler to ingest Arrow-native data without conversion. + +Example (TODO: confirm exact syntax for IPC stream format selection): + +```sql +-- TODO: confirm whether the format name is `arrow`, `ipc_stream`, or implicit. +CREATE EXTERNAL TABLE ipc_events +STORED AS ARROW +LOCATION 's3://bucket/events.arrow'; +``` + +Related PRs: [#18457] + +[#18457]: https://github.com/apache/datafusion/pull/18457 + +### Faster `CASE` expression evaluation + +DataFusion 52 completes major work from the CASE performance epic ([#18075]). +Lookup-table based evaluation avoids repeated expression evaluation and reduces +branching overhead, accelerating common ETL patterns. + +Example: + +```sql +SELECT + CASE + WHEN status IN ('NEW', 'READY', 'STAGED') THEN 'PENDING' + WHEN status IN ('DONE', 'COMPLETE') THEN 'FINISHED' + ELSE 'OTHER' + END AS status_bucket, + count(*) +FROM jobs +GROUP BY 1; +``` + +Related PRs: [#18183] + +[#18075]: https://github.com/apache/datafusion/issues/18075 +[#18183]: https://github.com/apache/datafusion/pull/18183 + +### Extensible SQL planning with relation planner extensions + +DataFusion now supports relation planner extensions for custom SQL syntax and +planning logic ([#17824], [#17843]). This lets downstream projects inject their +own planning behavior without forking the SQL planner, which is critical for +dialect extensions and custom table references. + +Diagram: + +``` +SQL text + | (custom relation planner extension) + v +Logical plan + | (DataFusion optimizers) + v +Physical plan +``` + +TODO: include a short Rust snippet showing how to register a relation planner +extension once the final API example is confirmed. + +Related PRs: [#17843] + +[#17824]: https://github.com/apache/datafusion/issues/17824 +[#17843]: https://github.com/apache/datafusion/pull/17843 + +### ListingTable object store usage improvements + +ListingTable improvements continue to reduce object store I/O and planning +latency for partitioned datasets ([#17214]). DataFusion now normalizes partition +and flat listings, enables a memory-bound list-files cache by default, and +makes the cache prefix-aware for partition pruning. + +Diagram: + +``` +Object store LIST + | (normalized listing + cache) + v +Partitioned files + | (planner) + v +Execution plan +``` + +Related PRs: [#18146], [#18855], [#19366], [#19298], [#18971] + +[#17214]: https://github.com/apache/datafusion/issues/17214 +[#18146]: https://github.com/apache/datafusion/pull/18146 +[#18855]: https://github.com/apache/datafusion/pull/18855 +[#19366]: https://github.com/apache/datafusion/pull/19366 +[#19298]: https://github.com/apache/datafusion/pull/19298 +[#18971]: https://github.com/apache/datafusion/pull/18971 + +### Statistics cache improvements + +The statistics cache has been improved to make pruning and planning more +reliable in repeated workloads ([#19051]). DataFusion now exposes a +`statistics_cache` function and improves cache memory behavior for listing +workflows, making it easier to diagnose cache contents and reduce repeated I/O. + +Example (TODO: confirm the function signature and output schema): + +```sql +-- TODO: confirm the function name and arguments. +SELECT * FROM statistics_cache('my_table'); +``` + +Related PRs: [#19054], [#18855], [#18971] + +[#19051]: https://github.com/apache/datafusion/issues/19051 +[#19054]: https://github.com/apache/datafusion/pull/19054 + +### Pushdown expression evaluation via PhysicalExprAdapter + +DataFusion now pushes down expression evaluation into TableProviders using the +PhysicalExprAdapter, replacing the older SchemaAdapter approach ([#14993], +[#16800]). This enables richer pushdown (expressions and projections) and +improves consistency between logical and physical planning. + +Diagram: + +``` +SQL filter/projection + | (PhysicalExprAdapter) + v +TableProvider pushdown + | (scan) + v +Reduced data +``` + +Related PRs: [#18998], [#19345] + +[#14993]: https://github.com/apache/datafusion/issues/14993 +[#16800]: https://github.com/apache/datafusion/issues/16800 +[#18998]: https://github.com/apache/datafusion/pull/18998 +[#19345]: https://github.com/apache/datafusion/pull/19345 + +### Hash join build-side pushdown + +DataFusion can now push down build-side hash tables from HashJoinExec into scans +([#17171]). When the build side is small, DataFusion converts the hash table to +an `IN` list or hash lookup that can be evaluated during scans, reducing the +join input size early. + +Example: + +```sql +SELECT * +FROM orders o +JOIN small_dim d +ON o.dim_id = d.id; +``` + +TODO: include a physical plan snippet that shows the pushdown filter once a +canonical example is selected. + +Related PRs: [#18393] + +[#17171]: https://github.com/apache/datafusion/issues/17171 +[#18393]: https://github.com/apache/datafusion/pull/18393 + +### Sort pushdown to sources + +DataFusion now supports sort pushdown into data sources, allowing scans to +return sorted data or leverage reversed row groups when possible ([#10433], +[#19064]). This reduces memory pressure and can eliminate explicit sort stages +for partitioned or pre-sorted data. + +Example: + +```sql +SELECT * +FROM parquet_table +ORDER BY event_time DESC; +``` + +Related PRs: [#19064] + +[#10433]: https://github.com/apache/datafusion/issues/10433 +[#19064]: https://github.com/apache/datafusion/pull/19064 + +### DELETE/UPDATE hooks in TableProvider + +TableProvider now includes DELETE and UPDATE hooks, with MemTable providing the +first implementation ([#19142]). This is an important step toward fully +featured DML support and enables downstream storage engines to plug in their +own mutation logic. + +Example: + +```sql +DELETE FROM mem_table WHERE status = 'obsolete'; +``` + +Related PRs: [#19142] + +[#19142]: https://github.com/apache/datafusion/pull/19142 + +### CoalesceBatchesExec removal and integrated batch coalescing + +DataFusion continues the work to remove the standalone CoalesceBatchesExec +operator ([#18779]). Batch coalescing is now integrated into multiple operators, +reducing plan complexity and avoiding unnecessary batch materialization. + +Diagram: + +``` +Before: + Scan -> CoalesceBatches -> Filter -> CoalesceBatches -> Join + +After: + Scan -> Filter (coalesce inline) -> Join (coalesce inline) +``` + +Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239] + +[#18779]: https://github.com/apache/datafusion/issues/18779 +[#18540]: https://github.com/apache/datafusion/pull/18540 +[#18604]: https://github.com/apache/datafusion/pull/18604 +[#18630]: https://github.com/apache/datafusion/pull/18630 +[#18972]: https://github.com/apache/datafusion/pull/18972 +[#19002]: https://github.com/apache/datafusion/pull/19002 +[#19342]: https://github.com/apache/datafusion/pull/19342 +[#19239]: https://github.com/apache/datafusion/pull/19239 + +## Upgrade Guide and Changelog + +Upgrading to 52.0.0 should be straightforward for most users. Please review the +[Upgrade Guide] +for details on breaking changes and code snippets to help with the transition. +For a comprehensive list of all changes, please refer to the [changelog]. + +## About DataFusion + +[Apache DataFusion] is an extensible query engine, written in [Rust], that uses +[Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast, data-centric systems such as databases, dataframe libraries, +and machine learning and streaming applications. While [DataFusion's primary +design goal] is to accelerate the creation of other data-centric systems, it +provides a reasonable experience directly out of the box as a [dataframe +library], [Python library], and [command-line SQL tool]. + +[apache datafusion]: https://datafusion.apache.org/ +[rust]: https://www.rust-lang.org/ +[apache arrow]: https://arrow.apache.org +[DataFusion's primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals +[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html +[python library]: https://datafusion.apache.org/python/ +[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/ +[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html + +## How to Get Involved + +DataFusion is not a project built or driven by a single person, company, or +foundation. Rather, our community of users and contributors works together to +build a shared technology that none of us could have built alone. + +If you are interested in joining us, we would love to have you. You can try out +DataFusion on some of your own data and projects and let us know how it goes, +contribute suggestions, documentation, bug reports, or a PR with documentation, +tests, or code. A list of open issues suitable for beginners is [here], and you +can find out how to reach us on the [communication doc]. + +[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 +[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html From 1686945ee443f9aaec5c3d5e07697f5a9ee9da0e Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 10 Jan 2026 14:55:49 -0500 Subject: [PATCH 02/26] updates --- content/blog/2026-01-08-datafusion-52.0.0.md | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 52ff7fa7..7fdf9016 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -270,9 +270,14 @@ Related PRs: [#19142] ### CoalesceBatchesExec removal and integrated batch coalescing -DataFusion continues the work to remove the standalone CoalesceBatchesExec -operator ([#18779]). Batch coalescing is now integrated into multiple operators, -reducing plan complexity and avoiding unnecessary batch materialization. +DataFusion continues the work from the CoalesceBatchesExec epic ([#18779]). The +standalone `CoalesceBatchesExec` operator existed to ensure batches were large +enough for vectorized execution, and it was inserted after filter-like +operators such as `FilterExec`, `HashJoinExec`, and `RepartitionExec`. However, +it also blocked other optimizations (like pushing limits through joins) and +made optimizer rules more complex. This release integrates coalescing into the +operators themselves and relies on Arrow's coalesce kernels, reducing plan +complexity while keeping batch sizes efficient. Diagram: @@ -285,6 +290,8 @@ After: ``` Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239] +Thanks to [Tim-53], [Dandandan], [jizezhang], and [feniljain] for implementing +this feature. [#18779]: https://github.com/apache/datafusion/issues/18779 [#18540]: https://github.com/apache/datafusion/pull/18540 @@ -294,6 +301,10 @@ Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239 [#19002]: https://github.com/apache/datafusion/pull/19002 [#19342]: https://github.com/apache/datafusion/pull/19342 [#19239]: https://github.com/apache/datafusion/pull/19239 +[Tim-53]: https://github.com/Tim-53 +[Dandandan]: https://github.com/Dandandan +[jizezhang]: https://github.com/jizezhang +[feniljain]: https://github.com/feniljain ## Upgrade Guide and Changelog From 3c7dd6a5e9715a015f0d0307ee57a4c387424b37 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 18 Jan 2026 22:30:44 -0500 Subject: [PATCH 03/26] updates --- content/blog/2026-01-08-datafusion-52.0.0.md | 195 ++++++++++++------- 1 file changed, 120 insertions(+), 75 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 7fdf9016..5a7f720b 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -27,11 +27,9 @@ limitations under the License. [TOC] -## Introduction - We are proud to announce the release of [DataFusion 52.0.0]. This post highlights some of the major improvements since [DataFusion 51.0.0]. The complete list of -changes is available in the [changelog]. Thanks to the [120 contributors] for +changes is available in the [changelog]. Thanks to the [121 contributors] for making this release possible. TODO: confirm the release date for 52.0.0 and update the front matter if needed. @@ -43,10 +41,12 @@ TODO: confirm the release date for 52.0.0 and update the front matter if needed. ## Performance Improvements 🚀 -We continue to make significant performance improvements in DataFusion, both in -the core engine and in the Parquet reader. This release includes faster `CASE` -expressions, better hash performance for string types, and continued string -function optimizations. +We continue to make significant performance improvements in DataFusion. This +release includes faster `CASE` expressions (see below), a new SortMergeJoin, +automatic caching of metadata, statistics, and listing results for ListingTable, +improved hashing and grouping performance for string types, and string function +optimizations. + ### Performance Chart (TODO) @@ -61,26 +61,6 @@ alt="Performance over time" **Figure 1**: TODO: update caption for 52.0.0 benchmarking results. -## Major Features ✨ - -### Arrow IPC Stream file support - -DataFusion can now read Arrow IPC stream files ([#18457]). This expands -interoperability with systems that emit Arrow streams directly, making it -simpler to ingest Arrow-native data without conversion. - -Example (TODO: confirm exact syntax for IPC stream format selection): - -```sql --- TODO: confirm whether the format name is `arrow`, `ipc_stream`, or implicit. -CREATE EXTERNAL TABLE ipc_events -STORED AS ARROW -LOCATION 's3://bucket/events.arrow'; -``` - -Related PRs: [#18457] - -[#18457]: https://github.com/apache/datafusion/pull/18457 ### Faster `CASE` expression evaluation @@ -106,80 +86,145 @@ Related PRs: [#18183] [#18075]: https://github.com/apache/datafusion/issues/18075 [#18183]: https://github.com/apache/datafusion/pull/18183 +[#18487]: https://github.com/apache/datafusion/issues/18487 +[#18875]: https://github.com/apache/datafusion/pull/18875 -### Extensible SQL planning with relation planner extensions +### Rewritten merge join -DataFusion now supports relation planner extensions for custom SQL syntax and -planning logic ([#17824], [#17843]). This lets downstream projects inject their -own planning behavior without forking the SQL planner, which is critical for -dialect extensions and custom table references. +DataFusion 52 includes a rewrite of the sort-merge join output buffering to +avoid excessive `concat_batches` work and to use `BatchCoalescer` internally and +for final output. This change targets pathological slowdowns like the reported +LeftAnti join case in [#18487], which also affected Comet workloads that rely on +SMJ. Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (moving from +minutes to milliseconds) while leaving most other queries unchanged or modestly +faster, and the update is fully internal with no user-facing API changes. -Diagram: -``` -SQL text - | (custom relation planner extension) - v -Logical plan - | (DataFusion optimizers) - v -Physical plan -``` +### Caching Improvements -TODO: include a short Rust snippet showing how to register a relation planner -extension once the final API example is confirmed. +DataFusion also includes several additional caching improvements in this release. -Related PRs: [#17843] +First it includes a new statistics cache for Parquet Metadata that avoids repeatedly +calculating statistics for Parquet backed files. This significantly improves +planning time for certain queries. You can see the contents of the new cache using the +[statistics_cache] function in the CLI: -[#17824]: https://github.com/apache/datafusion/issues/17824 -[#17843]: https://github.com/apache/datafusion/pull/17843 +[statistics_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#statistics-cache -### ListingTable object store usage improvements -ListingTable improvements continue to reduce object store I/O and planning -latency for partitioned datasets ([#17214]). DataFusion now normalizes partition -and flat listings, enables a memory-bound list-files cache by default, and -makes the cache prefix-aware for partition pruning. +```sql +select * from statistics_cache(); ++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+ +| path | file_modified | file_size_bytes | e_tag | version | num_rows | num_columns | table_size_bytes | statistics_size_bytes | ++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+ +| .../hits.parquet | 2022-06-25T22:22:22 | 14779976446 | 0-5e24d1ee16380-370f48 | NULL | Exact(99997497) | 105 | Exact(36445943240) | 0 | ++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+ +``` +Related PRs: [#18971], [#19054] -Diagram: +[#18971]: https://github.com/apache/datafusion/pull/18971 +[#19054]: https://github.com/apache/datafusion/pull/19054 -``` -Object store LIST - | (normalized listing + cache) - v -Partitioned files - | (planner) - v -Execution plan + +DataFusion and includes a memory-bound, prefix aware list-files cache by +default. You can see the contents of the new cache using the [list_files_cache] +function in the CLI: + +[list_files_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#list-files-cache + +```sql +create external table overturemaps +stored as parquet +location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure'; +0 row(s) fetched. +> select table, path, metadata_size_bytes, expires_in, unnest(metadata_list)['file_size_bytes'] as file_size_bytes, unnest(metadata_list)['e_tag'] as e_tag from list_files_cache() limit 10; ++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+ +| table | path | metadata_size_bytes | expires_in | file_size_bytes | e_tag | ++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+ +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 999055952 | "35fc8fbe8400960b54c66fbb408c48e8-60" | +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 975592768 | "8a16e10b722681cdc00242564b502965-59" | +... +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 1016732378 | "6d70857a0473ed9ed3fc6e149814168b-61" | +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 991363784 | "c9cafb42fcbb413f851691c895dd7c2b-60" | +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 1032469715 | "7540252d0d67158297a67038a3365e0f-62" | ++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+ ``` -Related PRs: [#18146], [#18855], [#19366], [#19298], [#18971] +Related PRs: [#18146], [#18855], [#19366], [#19298], -[#17214]: https://github.com/apache/datafusion/issues/17214 +[Epic #17214]: https://github.com/apache/datafusion/issues/17214 [#18146]: https://github.com/apache/datafusion/pull/18146 [#18855]: https://github.com/apache/datafusion/pull/18855 [#19366]: https://github.com/apache/datafusion/pull/19366 [#19298]: https://github.com/apache/datafusion/pull/19298 -[#18971]: https://github.com/apache/datafusion/pull/18971 -### Statistics cache improvements -The statistics cache has been improved to make pruning and planning more -reliable in repeated workloads ([#19051]). DataFusion now exposes a -`statistics_cache` function and improves cache memory behavior for listing -workflows, making it easier to diagnose cache contents and reduce repeated I/O. -Example (TODO: confirm the function signature and output schema): +## Major Features ✨ + +### Arrow IPC Stream file support + +DataFusion can now read Arrow IPC stream files ([#18457]). This expands +interoperability with systems that emit Arrow streams directly, making it +simpler to ingest Arrow-native data without conversion. Thanks to [corasaurus-hex] +for implementing this feature. ```sql --- TODO: confirm the function name and arguments. -SELECT * FROM statistics_cache('my_table'); +CREATE EXTERNAL TABLE ipc_events +STORED AS ARROW +LOCATION 's3://bucket/events.arrow'; ``` -Related PRs: [#19054], [#18855], [#18971] +Related PRs: [#18457] -[#19051]: https://github.com/apache/datafusion/issues/19051 -[#19054]: https://github.com/apache/datafusion/pull/19054 +[#18457]: https://github.com/apache/datafusion/pull/18457 +[corasaurus-hex]: https://github.com/corasaurus-hex + +### Extensible SQL planning with relation planner extensions + +DataFusion now supports relation planner extensions for custom SQL syntax and +planning logic ([#17824], [#17843]). This lets downstream projects inject their +own planning behavior without forking the SQL planner. As explained in the +[Extending SQL in DataFusion Blog], you can now customize DataFusion with +support for almost any SQL syntax, such as: + +```sql +-- Postgres-style JSON operators +SELECT payload->'user'->>'id' FROM logs; +-- MySQL-specific types +SELECT DATETIME '2001-01-01 18:00:00'; +-- Statistical sampling +SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT); +``` +[Extending SQL in DataFusion Blog]: https://datafusion.apache.org/blog/2026/01/12/extending-sql/ + +Thanks to [geoffreyclaude] for implementing relation planner extensions, and to +[theirix], [alamb], [NGA-TRAN], and [gabotechs] for reviews and feedback that +shaped the design. + + +
+ DataFusion SQL processing pipeline: SQL String flows through Parser to AST, then SqlToRel (with Extension Planners) to LogicalPlan, then PhysicalPlanner to ExecutionPlan +
+ Figure 1: + SQL processing pipeline with relation planner extensions from the + Extending SQL in DataFusion Blog. +
+
+ + + + + +Related PRs: [#17843] + +[#17824]: https://github.com/apache/datafusion/issues/17824 +[#17843]: https://github.com/apache/datafusion/pull/17843 +[geoffreyclaude]: https://github.com/geoffreyclaude +[theirix]: https://github.com/theirix +[alamb]: https://github.com/alamb +[NGA-TRAN]: https://github.com/NGA-TRAN +[gabotechs]: https://github.com/gabotechs ### Pushdown expression evaluation via PhysicalExprAdapter From 2cc1f59b5d82c0984cd90035b816699759fc6b87 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Mon, 19 Jan 2026 13:48:26 -0500 Subject: [PATCH 04/26] Update sql planning --- content/blog/2026-01-08-datafusion-52.0.0.md | 25 +++++++------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 5a7f720b..2464d909 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -182,11 +182,10 @@ Related PRs: [#18457] ### Extensible SQL planning with relation planner extensions -DataFusion now supports relation planner extensions for custom SQL syntax and -planning logic ([#17824], [#17843]). This lets downstream projects inject their -own planning behavior without forking the SQL planner. As explained in the -[Extending SQL in DataFusion Blog], you can now customize DataFusion with -support for almost any SQL syntax, such as: +DataFusion now has an API for extending the SQL planner for relations. As +explained in the [Extending SQL in DataFusion Blog], with this new API you can +now customize DataFusion with support for almost any SQL syntax, such as: + ```sql -- Postgres-style JSON operators @@ -198,11 +197,6 @@ SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT); ``` [Extending SQL in DataFusion Blog]: https://datafusion.apache.org/blog/2026/01/12/extending-sql/ -Thanks to [geoffreyclaude] for implementing relation planner extensions, and to -[theirix], [alamb], [NGA-TRAN], and [gabotechs] for reviews and feedback that -shaped the design. - -
DataFusion SQL processing pipeline: SQL String flows through Parser to AST, then SqlToRel (with Extension Planners) to LogicalPlan, then PhysicalPlanner to ExecutionPlan
@@ -212,13 +206,10 @@ shaped the design.
+Thanks to [geoffreyclaude] for implementing relation planner extensions, and to +[theirix], [alamb], [NGA-TRAN], and [gabotechs] for reviews and feedback on the +design. Related PRs: [#17843] - - - -Related PRs: [#17843] - -[#17824]: https://github.com/apache/datafusion/issues/17824 [#17843]: https://github.com/apache/datafusion/pull/17843 [geoffreyclaude]: https://github.com/geoffreyclaude [theirix]: https://github.com/theirix @@ -226,7 +217,7 @@ Related PRs: [#17843] [NGA-TRAN]: https://github.com/NGA-TRAN [gabotechs]: https://github.com/gabotechs -### Pushdown expression evaluation via PhysicalExprAdapter +### Pushdown Expression Evaluation via PhysicalExprAdapter DataFusion now pushes down expression evaluation into TableProviders using the PhysicalExprAdapter, replacing the older SchemaAdapter approach ([#14993], From ccc5d4296951810f48e133fe70948d34c4b4f9bd Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Tue, 20 Jan 2026 12:25:03 -0500 Subject: [PATCH 05/26] Apply suggestions from code review Co-authored-by: Matt Butrovich --- content/blog/2026-01-08-datafusion-52.0.0.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 5a7f720b..51b4693d 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -42,7 +42,7 @@ TODO: confirm the release date for 52.0.0 and update the front matter if needed. ## Performance Improvements 🚀 We continue to make significant performance improvements in DataFusion. This -release includes faster `CASE` expressions (see below), a new SortMergeJoin, +release includes faster `CASE` expressions (see below), SortMergeJoin buffering optimizations, automatic caching of metadata, statistics, and listing results for ListingTable, improved hashing and grouping performance for string types, and string function optimizations. @@ -64,7 +64,7 @@ alt="Performance over time" ### Faster `CASE` expression evaluation -DataFusion 52 completes major work from the CASE performance epic ([#18075]). +DataFusion 52 completes major work from the `CASE` performance epic ([#18075]). Lookup-table based evaluation avoids repeated expression evaluation and reduces branching overhead, accelerating common ETL patterns. @@ -91,7 +91,7 @@ Related PRs: [#18183] ### Rewritten merge join -DataFusion 52 includes a rewrite of the sort-merge join output buffering to +DataFusion 52 includes a rewrite of the sort-merge join (SMJ) output buffering to avoid excessive `concat_batches` work and to use `BatchCoalescer` internally and for final output. This change targets pathological slowdowns like the reported LeftAnti join case in [#18487], which also affected Comet workloads that rely on From 81954c512cefe8a9f9a2b16d45e059b88e1832d5 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 22 Jan 2026 18:45:06 -0500 Subject: [PATCH 06/26] Updates --- content/blog/2026-01-08-datafusion-52.0.0.md | 158 ++++++++++--------- 1 file changed, 80 insertions(+), 78 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 2464d909..acc12b6c 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -41,71 +41,54 @@ TODO: confirm the release date for 52.0.0 and update the front matter if needed. ## Performance Improvements 🚀 -We continue to make significant performance improvements in DataFusion. This -release includes faster `CASE` expressions (see below), a new SortMergeJoin, -automatic caching of metadata, statistics, and listing results for ListingTable, -improved hashing and grouping performance for string types, and string function -optimizations. - - -### Performance Chart (TODO) - -TODO: add the 52.0.0 performance chart and update the caption. - - - -**Figure 1**: TODO: update caption for 52.0.0 benchmarking results. - +We continue to make significant performance improvements in DataFusion as explained below. ### Faster `CASE` expression evaluation -DataFusion 52 completes major work from the CASE performance epic ([#18075]). -Lookup-table based evaluation avoids repeated expression evaluation and reduces -branching overhead, accelerating common ETL patterns. - -Example: +DataFusion 52 now has lookup-table based evaluation for certain `CASE` +expressions to avoid repeated evaluation for accelerating common ETL patterns such +as ```sql -SELECT - CASE - WHEN status IN ('NEW', 'READY', 'STAGED') THEN 'PENDING' - WHEN status IN ('DONE', 'COMPLETE') THEN 'FINISHED' - ELSE 'OTHER' - END AS status_bucket, - count(*) -FROM jobs -GROUP BY 1; +CASE company + WHEN 1 THEN 'Apple' + WHEN 5 THEN 'Samsung' + WHEN 2 THEN 'Motorola' + WHEN 3 THEN 'LG' + ELSE 'Other' +END ``` -Related PRs: [#18183] +This is the final contains the final work in our CASE performance epic +([#18075]). Related PRs [#18183]. Thanks to [rluvaton] and [pepijnve] for +the implementation. + +[rluvaton]: https://github.com/rluvaton +[pepijnve]: https://github.com/pepijnve + [#18075]: https://github.com/apache/datafusion/issues/18075 [#18183]: https://github.com/apache/datafusion/pull/18183 -[#18487]: https://github.com/apache/datafusion/issues/18487 -[#18875]: https://github.com/apache/datafusion/pull/18875 -### Rewritten merge join +### New Merge Join -DataFusion 52 includes a rewrite of the sort-merge join output buffering to -avoid excessive `concat_batches` work and to use `BatchCoalescer` internally and -for final output. This change targets pathological slowdowns like the reported -LeftAnti join case in [#18487], which also affected Comet workloads that rely on -SMJ. Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (moving from -minutes to milliseconds) while leaving most other queries unchanged or modestly -faster, and the update is fully internal with no user-facing API changes. +DataFusion 52 includes a rewrite of the sort-merge join, leading to speedups of three orders of magnitude in some cases. +This change targets pathological slowdowns like the reported +LeftAnti join case in [#18487], which also affected [Apache Comet] workloads. +Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (moving from +minutes to milliseconds) while leaving other queries unchanged or modestly +faster. +[#18487]: https://github.com/apache/datafusion/issues/18487 +[#18875]: https://github.com/apache/datafusion/pull/18875 +[Apache Comet]: https://datafusion.apache.org/comet/ ### Caching Improvements -DataFusion also includes several additional caching improvements in this release. +This release also includes several additional caching improvements. First it includes a new statistics cache for Parquet Metadata that avoids repeatedly -calculating statistics for Parquet backed files. This significantly improves +(re) calculating statistics for Parquet backed files. This significantly improves planning time for certain queries. You can see the contents of the new cache using the [statistics_cache] function in the CLI: @@ -126,9 +109,19 @@ Related PRs: [#18971], [#19054] [#19054]: https://github.com/apache/datafusion/pull/19054 -DataFusion and includes a memory-bound, prefix aware list-files cache by -default. You can see the contents of the new cache using the [list_files_cache] -function in the CLI: +It also includes a prefix aware list-files cache by default which accelerates +evaluating partition predicates for HIVE partitioned tables. + +```sql +-- Read the hive partitioned dataset from Overture Maps (100s of Parquet files) +CREATE EXTERNAL TABLE overturemaps +STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/'; +-- Find all files where the path contains `theme=base without requiring another LIST call +select count(*) from overturemaps where theme='base'; +``` + +You can see the +contents of the new cache using the [list_files_cache] function in the CLI: [list_files_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#list-files-cache @@ -159,6 +152,34 @@ Related PRs: [#18146], [#18855], [#19366], [#19298], [#19298]: https://github.com/apache/datafusion/pull/19298 +### Filtering with Hash Join pushdown + +DataFusion can now push down build-side hash tables from HashJoinExec into scans +([#17171]). This is sometimes referred to "sideways information passing" (TODO +reference). The initial DataFusion implement supports pushdown when the build +side is small . In this case, DataFusion converts the hash table to an `IN` list +that is pushed down to TableProviders reducing the join input size early. + +Example: + +```sql +SELECT * +FROM orders o +JOIN small_dim d +ON o.dim_id = d.id; +``` + +TODO: include a physical plan snippet that shows the pushdown filter once a +canonical example is selected. + +Thanks to [adriangb] for implementing hash join pushdown. +Related PRs: [#18393] + +[#17171]: https://github.com/apache/datafusion/issues/17171 +[#18393]: https://github.com/apache/datafusion/pull/18393 +[adriangb]: https://github.com/adriangb + + ## Major Features ✨ @@ -182,10 +203,10 @@ Related PRs: [#18457] ### Extensible SQL planning with relation planner extensions -DataFusion now has an API for extending the SQL planner for relations. As -explained in the [Extending SQL in DataFusion Blog], with this new API you can -now customize DataFusion with support for almost any SQL syntax, such as: - +DataFusion now has an API for extending the SQL planner for relations, as +explained in the [Extending SQL in DataFusion Blog]. With this new API, you can +customize DataFusion to support almost any SQL syntax, such as the following +(which are not supported by default) : ```sql -- Postgres-style JSON operators @@ -236,6 +257,7 @@ TableProvider pushdown Reduced data ``` +Thanks to [adriangb] for implementing PhysicalExprAdapter pushdown support. Related PRs: [#18998], [#19345] [#14993]: https://github.com/apache/datafusion/issues/14993 @@ -243,30 +265,6 @@ Related PRs: [#18998], [#19345] [#18998]: https://github.com/apache/datafusion/pull/18998 [#19345]: https://github.com/apache/datafusion/pull/19345 -### Hash join build-side pushdown - -DataFusion can now push down build-side hash tables from HashJoinExec into scans -([#17171]). When the build side is small, DataFusion converts the hash table to -an `IN` list or hash lookup that can be evaluated during scans, reducing the -join input size early. - -Example: - -```sql -SELECT * -FROM orders o -JOIN small_dim d -ON o.dim_id = d.id; -``` - -TODO: include a physical plan snippet that shows the pushdown filter once a -canonical example is selected. - -Related PRs: [#18393] - -[#17171]: https://github.com/apache/datafusion/issues/17171 -[#18393]: https://github.com/apache/datafusion/pull/18393 - ### Sort pushdown to sources DataFusion now supports sort pushdown into data sources, allowing scans to @@ -282,10 +280,12 @@ FROM parquet_table ORDER BY event_time DESC; ``` +Thanks to [zhuqi-lucas] for implementing sort pushdown. Related PRs: [#19064] [#10433]: https://github.com/apache/datafusion/issues/10433 [#19064]: https://github.com/apache/datafusion/pull/19064 +[zhuqi-lucas]: https://github.com/zhuqi-lucas ### DELETE/UPDATE hooks in TableProvider @@ -300,9 +300,11 @@ Example: DELETE FROM mem_table WHERE status = 'obsolete'; ``` +Thanks to [ethan-tyler] for implementing the TableProvider DELETE/UPDATE hooks. Related PRs: [#19142] [#19142]: https://github.com/apache/datafusion/pull/19142 +[ethan-tyler]: https://github.com/ethan-tyler ### CoalesceBatchesExec removal and integrated batch coalescing From 781cd621f7e37d9ca89318e7b78cf598b576f485 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 22 Jan 2026 18:47:21 -0500 Subject: [PATCH 07/26] acknowledgments --- content/blog/2026-01-08-datafusion-52.0.0.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index acc12b6c..e895c052 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -78,10 +78,12 @@ LeftAnti join case in [#18487], which also affected [Apache Comet] workloads. Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (moving from minutes to milliseconds) while leaving other queries unchanged or modestly faster. +Thanks to [mbutrovich] for implementing the merge join rewrite. [#18487]: https://github.com/apache/datafusion/issues/18487 [#18875]: https://github.com/apache/datafusion/pull/18875 [Apache Comet]: https://datafusion.apache.org/comet/ +[mbutrovich]: https://github.com/mbutrovich ### Caching Improvements @@ -103,10 +105,13 @@ select * from statistics_cache(); | .../hits.parquet | 2022-06-25T22:22:22 | 14779976446 | 0-5e24d1ee16380-370f48 | NULL | Exact(99997497) | 105 | Exact(36445943240) | 0 | +------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+ ``` +Thanks to [bharath-techie] and [nuno-faria] for implementing the statistics cache. Related PRs: [#18971], [#19054] [#18971]: https://github.com/apache/datafusion/pull/18971 [#19054]: https://github.com/apache/datafusion/pull/19054 +[bharath-techie]: https://github.com/bharath-techie +[nuno-faria]: https://github.com/nuno-faria It also includes a prefix aware list-files cache by default which accelerates @@ -143,6 +148,7 @@ location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infra +--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+ ``` +Thanks to [BlakeOrth] and [Yuvraj-cyborg] for implementing the list-files cache work. Related PRs: [#18146], [#18855], [#19366], [#19298], [Epic #17214]: https://github.com/apache/datafusion/issues/17214 @@ -150,6 +156,8 @@ Related PRs: [#18146], [#18855], [#19366], [#19298], [#18855]: https://github.com/apache/datafusion/pull/18855 [#19366]: https://github.com/apache/datafusion/pull/19366 [#19298]: https://github.com/apache/datafusion/pull/19298 +[BlakeOrth]: https://github.com/BlakeOrth +[Yuvraj-cyborg]: https://github.com/Yuvraj-cyborg ### Filtering with Hash Join pushdown @@ -317,15 +325,6 @@ made optimizer rules more complex. This release integrates coalescing into the operators themselves and relies on Arrow's coalesce kernels, reducing plan complexity while keeping batch sizes efficient. -Diagram: - -``` -Before: - Scan -> CoalesceBatches -> Filter -> CoalesceBatches -> Join - -After: - Scan -> Filter (coalesce inline) -> Join (coalesce inline) -``` Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239] Thanks to [Tim-53], [Dandandan], [jizezhang], and [feniljain] for implementing From 790b65859731309059a0d821d479c0029f99ba3d Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 22 Jan 2026 19:02:55 -0500 Subject: [PATCH 08/26] update --- content/blog/2026-01-08-datafusion-52.0.0.md | 33 +++++++------------- 1 file changed, 12 insertions(+), 21 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index e895c052..dbe2117b 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -160,35 +160,26 @@ Related PRs: [#18146], [#18855], [#19366], [#19298], [Yuvraj-cyborg]: https://github.com/Yuvraj-cyborg -### Filtering with Hash Join pushdown +### Improved Hash Join Filter Pushdown -DataFusion can now push down build-side hash tables from HashJoinExec into scans -([#17171]). This is sometimes referred to "sideways information passing" (TODO -reference). The initial DataFusion implement supports pushdown when the build -side is small . In this case, DataFusion converts the hash table to an `IN` list -that is pushed down to TableProviders reducing the join input size early. +Starting in DataFusion 51, filtering information from `HashJoinExec` is passed +dynamically to scans, as explained in the [Dynamic Filtering blog post] using a +technique referred to as [Sideways Information Passing] in Database research +literature. The initial implementation passed min/max values for the join keys. +DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the +build size is small such as when the join is very selective. The `IN` list is +pushed down to the probe side scan and is used to prune files, row groups, and +individual rows. Thanks to [adriangb] for implementing this feature. -Example: -```sql -SELECT * -FROM orders o -JOIN small_dim d -ON o.dim_id = d.id; -``` - -TODO: include a physical plan snippet that shows the pushdown filter once a -canonical example is selected. - -Thanks to [adriangb] for implementing hash join pushdown. -Related PRs: [#18393] +[Sideways Information Passing]: https://dl.acm.org/doi/10.1109/ICDE.2008.4497486 +[Dynamic Filtering blog post]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters [#17171]: https://github.com/apache/datafusion/issues/17171 [#18393]: https://github.com/apache/datafusion/pull/18393 [adriangb]: https://github.com/adriangb - ## Major Features ✨ ### Arrow IPC Stream file support @@ -314,7 +305,7 @@ Related PRs: [#19142] [#19142]: https://github.com/apache/datafusion/pull/19142 [ethan-tyler]: https://github.com/ethan-tyler -### CoalesceBatchesExec removal and integrated batch coalescing +### `CoalesceBatchesExec` Removed DataFusion continues the work from the CoalesceBatchesExec epic ([#18779]). The standalone `CoalesceBatchesExec` operator existed to ensure batches were large From d63a4de797b1c1a7ef48be2eb593dd8cd9620545 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 22 Jan 2026 19:20:16 -0500 Subject: [PATCH 09/26] updates --- content/blog/2026-01-08-datafusion-52.0.0.md | 111 +++++++++---------- 1 file changed, 52 insertions(+), 59 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index dbe2117b..11bf96a5 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -78,7 +78,8 @@ LeftAnti join case in [#18487], which also affected [Apache Comet] workloads. Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (moving from minutes to milliseconds) while leaving other queries unchanged or modestly faster. -Thanks to [mbutrovich] for implementing the merge join rewrite. +Thanks to [mbutrovich] for implementing the merge join rewrite, with reviews +from [Dandandan]. [#18487]: https://github.com/apache/datafusion/issues/18487 [#18875]: https://github.com/apache/datafusion/pull/18875 @@ -105,13 +106,16 @@ select * from statistics_cache(); | .../hits.parquet | 2022-06-25T22:22:22 | 14779976446 | 0-5e24d1ee16380-370f48 | NULL | Exact(99997497) | 105 | Exact(36445943240) | 0 | +------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+ ``` -Thanks to [bharath-techie] and [nuno-faria] for implementing the statistics cache. +Thanks to [bharath-techie] and [nuno-faria] for implementing the statistics cache, +with reviews from [martin-g], [alamb], and [alchemist51]. Related PRs: [#18971], [#19054] [#18971]: https://github.com/apache/datafusion/pull/18971 [#19054]: https://github.com/apache/datafusion/pull/19054 [bharath-techie]: https://github.com/bharath-techie [nuno-faria]: https://github.com/nuno-faria +[martin-g]: https://github.com/martin-g +[alchemist51]: https://github.com/alchemist51 It also includes a prefix aware list-files cache by default which accelerates @@ -148,7 +152,8 @@ location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infra +--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+ ``` -Thanks to [BlakeOrth] and [Yuvraj-cyborg] for implementing the list-files cache work. +Thanks to [BlakeOrth] and [Yuvraj-cyborg] for implementing the list-files cache work, +with reviews from [gabotechs], [alamb], [alchemist51], [martin-g], and [BlakeOrth]. Related PRs: [#18146], [#18855], [#19366], [#19298], [Epic #17214]: https://github.com/apache/datafusion/issues/17214 @@ -163,21 +168,25 @@ Related PRs: [#18146], [#18855], [#19366], [#19298], ### Improved Hash Join Filter Pushdown Starting in DataFusion 51, filtering information from `HashJoinExec` is passed -dynamically to scans, as explained in the [Dynamic Filtering blog post] using a +dynamically to scans, as explained in the [Dynamic Filtering Blog] using a technique referred to as [Sideways Information Passing] in Database research literature. The initial implementation passed min/max values for the join keys. DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the build size is small such as when the join is very selective. The `IN` list is pushed down to the probe side scan and is used to prune files, row groups, and -individual rows. Thanks to [adriangb] for implementing this feature. +individual rows. Thanks to [adriangb] for implementing this feature, with +reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich]. [Sideways Information Passing]: https://dl.acm.org/doi/10.1109/ICDE.2008.4497486 -[Dynamic Filtering blog post]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters +[Dynamic Filtering blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters [#17171]: https://github.com/apache/datafusion/issues/17171 [#18393]: https://github.com/apache/datafusion/pull/18393 [adriangb]: https://github.com/adriangb +[LiaCastaneda]: https://github.com/LiaCastaneda +[asolimando]: https://github.com/asolimando +[comphead]: https://github.com/comphead ## Major Features ✨ @@ -199,8 +208,12 @@ Related PRs: [#18457] [#18457]: https://github.com/apache/datafusion/pull/18457 [corasaurus-hex]: https://github.com/corasaurus-hex +[Jefffrey]: https://github.com/Jefffrey +[jdcasale]: https://github.com/jdcasale +[2010YOUY01]: https://github.com/2010YOUY01 +[timsaucer]: https://github.com/timsaucer -### Extensible SQL planning with relation planner extensions +### More Extensible SQL Planning with `RelationPlanner` DataFusion now has an API for extending the SQL planner for relations, as explained in the [Extending SQL in DataFusion Blog]. With this new API, you can @@ -217,15 +230,6 @@ SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT); ``` [Extending SQL in DataFusion Blog]: https://datafusion.apache.org/blog/2026/01/12/extending-sql/ -
- DataFusion SQL processing pipeline: SQL String flows through Parser to AST, then SqlToRel (with Extension Planners) to LogicalPlan, then PhysicalPlanner to ExecutionPlan -
- Figure 1: - SQL processing pipeline with relation planner extensions from the - Extending SQL in DataFusion Blog. -
-
- Thanks to [geoffreyclaude] for implementing relation planner extensions, and to [theirix], [alamb], [NGA-TRAN], and [gabotechs] for reviews and feedback on the design. Related PRs: [#17843] @@ -237,25 +241,13 @@ design. Related PRs: [#17843] [NGA-TRAN]: https://github.com/NGA-TRAN [gabotechs]: https://github.com/gabotechs -### Pushdown Expression Evaluation via PhysicalExprAdapter +### Expression Evaluation Pushdown to Scans DataFusion now pushes down expression evaluation into TableProviders using the PhysicalExprAdapter, replacing the older SchemaAdapter approach ([#14993], [#16800]). This enables richer pushdown (expressions and projections) and improves consistency between logical and physical planning. -Diagram: - -``` -SQL filter/projection - | (PhysicalExprAdapter) - v -TableProvider pushdown - | (scan) - v -Reduced data -``` - Thanks to [adriangb] for implementing PhysicalExprAdapter pushdown support. Related PRs: [#18998], [#19345] @@ -263,35 +255,30 @@ Related PRs: [#18998], [#19345] [#16800]: https://github.com/apache/datafusion/issues/16800 [#18998]: https://github.com/apache/datafusion/pull/18998 [#19345]: https://github.com/apache/datafusion/pull/19345 +[kosiew]: https://github.com/kosiew -### Sort pushdown to sources - -DataFusion now supports sort pushdown into data sources, allowing scans to -return sorted data or leverage reversed row groups when possible ([#10433], -[#19064]). This reduces memory pressure and can eliminate explicit sort stages -for partitioned or pre-sorted data. +### Sort Pushdown to Scans -Example: - -```sql -SELECT * -FROM parquet_table -ORDER BY event_time DESC; -``` - -Thanks to [zhuqi-lucas] for implementing sort pushdown. -Related PRs: [#19064] +DataFusion can now push sorts all the way data sources, allowing scans to return +sorted data or leverage reversed row groups when possible ([#10433], [#19064]). +This allows table provider implementations to take better advantage of existing sort +information such as to reorder files or row groups to satisfy `LIMIT` clauses more +efficiently. Thanks to [zhuqi-lucas] for this feature. [#10433]: https://github.com/apache/datafusion/issues/10433 [#19064]: https://github.com/apache/datafusion/pull/19064 [zhuqi-lucas]: https://github.com/zhuqi-lucas -### DELETE/UPDATE hooks in TableProvider +### `TableProvider` supports `DELETE` and `UPDATE` statements + +The [TableProvider] trait now includes hooks for `DELETE` and `UPDATE` +statements and the basic MemTable implements them ([#19142]). This lets +downstream implementations and storage engines plug in their own mutation logic. +See [TableProvider::delete_from] and [TableProvider::update] for more details -TableProvider now includes DELETE and UPDATE hooks, with MemTable providing the -first implementation ([#19142]). This is an important step toward fully -featured DML support and enables downstream storage engines to plug in their -own mutation logic. +[TableProvider]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html +[TableProvider::delete_from]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html#method.delete_from +[TableProvider::update]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html#method.update Example: @@ -299,7 +286,8 @@ Example: DELETE FROM mem_table WHERE status = 'obsolete'; ``` -Thanks to [ethan-tyler] for implementing the TableProvider DELETE/UPDATE hooks. +Thanks to [ethan-tyler] for implementing the TableProvider DELETE/UPDATE hooks, +with reviews from [alamb] and [adriangb]. Related PRs: [#19142] [#19142]: https://github.com/apache/datafusion/pull/19142 @@ -307,19 +295,21 @@ Related PRs: [#19142] ### `CoalesceBatchesExec` Removed -DataFusion continues the work from the CoalesceBatchesExec epic ([#18779]). The -standalone `CoalesceBatchesExec` operator existed to ensure batches were large -enough for vectorized execution, and it was inserted after filter-like +The standalone `CoalesceBatchesExec` operator existed to ensure batches were +large enough for vectorized execution, and it was inserted after filter-like operators such as `FilterExec`, `HashJoinExec`, and `RepartitionExec`. However, -it also blocked other optimizations (like pushing limits through joins) and -made optimizer rules more complex. This release integrates coalescing into the -operators themselves and relies on Arrow's coalesce kernels, reducing plan -complexity while keeping batch sizes efficient. +it also blocked other optimizations (like pushing limits through joins) and made +optimizer rules more complex. In this release, we have completed integrating +coalescing into the operators themselves ([#18779]) using Arrow's [coalesce kernel], +reducing plan complexity while keeping batch sizes efficient. This also allows +additional focused optimization work in the kernel, such as [Dandandan]'s recent +work with filtering in [arrow-rs/#8951]. Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239] Thanks to [Tim-53], [Dandandan], [jizezhang], and [feniljain] for implementing -this feature. +this feature, with reviews from [Jefffrey], [alamb], [martin-g], +[geoffreyclaude], [milenkovicm], and [jizezhang]. [#18779]: https://github.com/apache/datafusion/issues/18779 [#18540]: https://github.com/apache/datafusion/pull/18540 @@ -333,6 +323,9 @@ this feature. [Dandandan]: https://github.com/Dandandan [jizezhang]: https://github.com/jizezhang [feniljain]: https://github.com/feniljain +[milenkovicm]: https://github.com/milenkovicm +[coalesce kernel]: https://docs.rs/arrow/57.2.0/arrow/compute/kernels/coalesce/ +[arrow-rs/#8951]: https://github.com/apache/arrow-rs/pull/8951 ## Upgrade Guide and Changelog From d38d99f153633ca3352c2c1db54d8c6111119e91 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 22 Jan 2026 19:25:15 -0500 Subject: [PATCH 10/26] update --- content/blog/2026-01-08-datafusion-52.0.0.md | 32 +++++++++----------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 11bf96a5..cd06c39c 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -43,11 +43,10 @@ TODO: confirm the release date for 52.0.0 and update the front matter if needed. We continue to make significant performance improvements in DataFusion as explained below. -### Faster `CASE` expression evaluation +### Faster `CASE` Expressions -DataFusion 52 now has lookup-table based evaluation for certain `CASE` -expressions to avoid repeated evaluation for accelerating common ETL patterns such -as +DataFusion 52 has lookup-table based evaluation for certain `CASE` expressions +to avoid repeated evaluation for accelerating common ETL patterns such as ```sql CASE company @@ -59,9 +58,9 @@ CASE company END ``` -This is the final contains the final work in our CASE performance epic -([#18075]). Related PRs [#18183]. Thanks to [rluvaton] and [pepijnve] for -the implementation. +This is the final work in our `CASE` performance epic ([#18075]) which has +improved `CASE` evaluation significantly. Related PRs [#18183]. Thanks to +[rluvaton] and [pepijnve] for the implementation. [rluvaton]: https://github.com/rluvaton [pepijnve]: https://github.com/pepijnve @@ -72,14 +71,12 @@ the implementation. ### New Merge Join -DataFusion 52 includes a rewrite of the sort-merge join, leading to speedups of three orders of magnitude in some cases. -This change targets pathological slowdowns like the reported -LeftAnti join case in [#18487], which also affected [Apache Comet] workloads. -Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (moving from -minutes to milliseconds) while leaving other queries unchanged or modestly -faster. -Thanks to [mbutrovich] for implementing the merge join rewrite, with reviews -from [Dandandan]. +DataFusion 52 includes a rewrite of the sort-merge join operator, with speedups +of three orders of magnitude in some pathological cases such the case in +[#18487], which also affected [Apache Comet] workloads. Benchmarks in [#18875] +show dramatic gains for TPC-H Q21 (minutes to milliseconds) while leaving other +queries unchanged or modestly faster. Thanks to [mbutrovich] for the implementation +and reviews from [Dandandan]. [#18487]: https://github.com/apache/datafusion/issues/18487 [#18875]: https://github.com/apache/datafusion/pull/18875 @@ -196,7 +193,8 @@ reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich]. DataFusion can now read Arrow IPC stream files ([#18457]). This expands interoperability with systems that emit Arrow streams directly, making it simpler to ingest Arrow-native data without conversion. Thanks to [corasaurus-hex] -for implementing this feature. +for implementing this feature, with reviews from [martin-g], [Jefffrey], +[jdcasale], [2010YOUY01], and [timsaucer]. ```sql CREATE EXTERNAL TABLE ipc_events @@ -329,7 +327,7 @@ this feature, with reviews from [Jefffrey], [alamb], [martin-g], ## Upgrade Guide and Changelog -Upgrading to 52.0.0 should be straightforward for most users. Please review the +As always, upgrading to 52.0.0 should be straightforward for most users. Please review the [Upgrade Guide] for details on breaking changes and code snippets to help with the transition. For a comprehensive list of all changes, please refer to the [changelog]. From b47c50dd8d927a65d8f0fbe505b84feec5c69c88 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 22 Jan 2026 19:27:23 -0500 Subject: [PATCH 11/26] typos --- content/blog/2026-01-08-datafusion-52.0.0.md | 28 ++++++++++---------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index cd06c39c..7784ea90 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -37,7 +37,7 @@ TODO: confirm the release date for 52.0.0 and update the front matter if needed. [DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0 [DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/ [changelog]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md -[120 contributors]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits +[121 contributors]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits ## Performance Improvements 🚀 @@ -45,7 +45,7 @@ We continue to make significant performance improvements in DataFusion as explai ### Faster `CASE` Expressions -DataFusion 52 has lookup-table based evaluation for certain `CASE` expressions +DataFusion 52 has lookup-table-based evaluation for certain `CASE` expressions to avoid repeated evaluation for accelerating common ETL patterns such as ```sql @@ -58,7 +58,7 @@ CASE company END ``` -This is the final work in our `CASE` performance epic ([#18075]) which has +This is the final work in our `CASE` performance epic ([#18075]), which has improved `CASE` evaluation significantly. Related PRs [#18183]. Thanks to [rluvaton] and [pepijnve] for the implementation. @@ -72,7 +72,7 @@ improved `CASE` evaluation significantly. Related PRs [#18183]. Thanks to ### New Merge Join DataFusion 52 includes a rewrite of the sort-merge join operator, with speedups -of three orders of magnitude in some pathological cases such the case in +of three orders of magnitude in some pathological cases such as the case in [#18487], which also affected [Apache Comet] workloads. Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (minutes to milliseconds) while leaving other queries unchanged or modestly faster. Thanks to [mbutrovich] for the implementation @@ -88,7 +88,7 @@ and reviews from [Dandandan]. This release also includes several additional caching improvements. First it includes a new statistics cache for Parquet Metadata that avoids repeatedly -(re) calculating statistics for Parquet backed files. This significantly improves +(re)calculating statistics for Parquet backed files. This significantly improves planning time for certain queries. You can see the contents of the new cache using the [statistics_cache] function in the CLI: @@ -115,8 +115,8 @@ Related PRs: [#18971], [#19054] [alchemist51]: https://github.com/alchemist51 -It also includes a prefix aware list-files cache by default which accelerates -evaluating partition predicates for HIVE partitioned tables. +It also includes a prefix-aware list-files cache by default which accelerates +evaluating partition predicates for Hive partitioned tables. ```sql -- Read the hive partitioned dataset from Overture Maps (100s of Parquet files) @@ -166,7 +166,7 @@ Related PRs: [#18146], [#18855], [#19366], [#19298], Starting in DataFusion 51, filtering information from `HashJoinExec` is passed dynamically to scans, as explained in the [Dynamic Filtering Blog] using a -technique referred to as [Sideways Information Passing] in Database research +technique referred to as [Sideways Information Passing] in Database research literature. The initial implementation passed min/max values for the join keys. DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the build size is small such as when the join is very selective. The `IN` list is @@ -216,7 +216,7 @@ Related PRs: [#18457] DataFusion now has an API for extending the SQL planner for relations, as explained in the [Extending SQL in DataFusion Blog]. With this new API, you can customize DataFusion to support almost any SQL syntax, such as the following -(which are not supported by default) : +(which are not supported by default): ```sql -- Postgres-style JSON operators @@ -230,7 +230,7 @@ SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT); Thanks to [geoffreyclaude] for implementing relation planner extensions, and to [theirix], [alamb], [NGA-TRAN], and [gabotechs] for reviews and feedback on the -design. Related PRs: [#17843] +design. Related PRs: [#17843] [#17843]: https://github.com/apache/datafusion/pull/17843 [geoffreyclaude]: https://github.com/geoffreyclaude @@ -239,7 +239,7 @@ design. Related PRs: [#17843] [NGA-TRAN]: https://github.com/NGA-TRAN [gabotechs]: https://github.com/gabotechs -### Expression Evaluation Pushdown to Scans +### Expression Evaluation Pushdown to Scans DataFusion now pushes down expression evaluation into TableProviders using the PhysicalExprAdapter, replacing the older SchemaAdapter approach ([#14993], @@ -257,7 +257,7 @@ Related PRs: [#18998], [#19345] ### Sort Pushdown to Scans -DataFusion can now push sorts all the way data sources, allowing scans to return +DataFusion can now push sorts all the way to data sources, allowing scans to return sorted data or leverage reversed row groups when possible ([#10433], [#19064]). This allows table provider implementations to take better advantage of existing sort information such as to reorder files or row groups to satisfy `LIMIT` clauses more @@ -272,7 +272,7 @@ efficiently. Thanks to [zhuqi-lucas] for this feature. The [TableProvider] trait now includes hooks for `DELETE` and `UPDATE` statements and the basic MemTable implements them ([#19142]). This lets downstream implementations and storage engines plug in their own mutation logic. -See [TableProvider::delete_from] and [TableProvider::update] for more details +See [TableProvider::delete_from] and [TableProvider::update] for more details. [TableProvider]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html [TableProvider::delete_from]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html#method.delete_from @@ -299,7 +299,7 @@ operators such as `FilterExec`, `HashJoinExec`, and `RepartitionExec`. However, it also blocked other optimizations (like pushing limits through joins) and made optimizer rules more complex. In this release, we have completed integrating coalescing into the operators themselves ([#18779]) using Arrow's [coalesce kernel], -reducing plan complexity while keeping batch sizes efficient. This also allows +reducing plan complexity while keeping batch sizes efficient. This also allows additional focused optimization work in the kernel, such as [Dandandan]'s recent work with filtering in [arrow-rs/#8951]. From 1f5b91e431917eea054e3f3f8a66504dbc35c741 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 22 Jan 2026 19:36:53 -0500 Subject: [PATCH 12/26] refine --- content/blog/2026-01-08-datafusion-52.0.0.md | 43 ++++++++++---------- 1 file changed, 21 insertions(+), 22 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 7784ea90..3c888053 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -241,27 +241,27 @@ design. Related PRs: [#17843] ### Expression Evaluation Pushdown to Scans -DataFusion now pushes down expression evaluation into TableProviders using the -PhysicalExprAdapter, replacing the older SchemaAdapter approach ([#14993], -[#16800]). This enables richer pushdown (expressions and projections) and -improves consistency between logical and physical planning. - -Thanks to [adriangb] for implementing PhysicalExprAdapter pushdown support. -Related PRs: [#18998], [#19345] +DataFusion now pushes down expression evaluation into TableProviders using +[PhysicalExprAdapter], replacing the older SchemaAdapter approach ([#14993], +[#16800]). This work means predicates and expressions can be customized for each +individual file schema, opening additional optimization such as support for +[Variant shredding]. Thanks to [adriangb] for implementing PhysicalExprAdapter +and reworking pushdown to use it. Related PRs: [#18998], [#19345] [#14993]: https://github.com/apache/datafusion/issues/14993 [#16800]: https://github.com/apache/datafusion/issues/16800 [#18998]: https://github.com/apache/datafusion/pull/18998 [#19345]: https://github.com/apache/datafusion/pull/19345 [kosiew]: https://github.com/kosiew +[Variant shredding]: https://github.com/apache/datafusion/issues/16116 +[PhysicalExprAdapter]: https://docs.rs/datafusion/52.0.0/datafusion/physical_expr_adapter/trait.PhysicalExprAdapter.html ### Sort Pushdown to Scans -DataFusion can now push sorts all the way to data sources, allowing scans to return -sorted data or leverage reversed row groups when possible ([#10433], [#19064]). +DataFusion can now push sorts all the way to data sources ([#10433], [#19064]). This allows table provider implementations to take better advantage of existing sort information such as to reorder files or row groups to satisfy `LIMIT` clauses more -efficiently. Thanks to [zhuqi-lucas] for this feature. +efficiently. Thanks to [zhuqi-lucas] for this feature. [#10433]: https://github.com/apache/datafusion/issues/10433 [#19064]: https://github.com/apache/datafusion/pull/19064 @@ -284,9 +284,8 @@ Example: DELETE FROM mem_table WHERE status = 'obsolete'; ``` -Thanks to [ethan-tyler] for implementing the TableProvider DELETE/UPDATE hooks, -with reviews from [alamb] and [adriangb]. -Related PRs: [#19142] +Thanks to [ethan-tyler] for the implementation and [alamb] and [adriangb] for +reviews. [#19142]: https://github.com/apache/datafusion/pull/19142 [ethan-tyler]: https://github.com/ethan-tyler @@ -294,15 +293,15 @@ Related PRs: [#19142] ### `CoalesceBatchesExec` Removed The standalone `CoalesceBatchesExec` operator existed to ensure batches were -large enough for vectorized execution, and it was inserted after filter-like -operators such as `FilterExec`, `HashJoinExec`, and `RepartitionExec`. However, -it also blocked other optimizations (like pushing limits through joins) and made -optimizer rules more complex. In this release, we have completed integrating -coalescing into the operators themselves ([#18779]) using Arrow's [coalesce kernel], -reducing plan complexity while keeping batch sizes efficient. This also allows -additional focused optimization work in the kernel, such as [Dandandan]'s recent -work with filtering in [arrow-rs/#8951]. - +large enough for subsequent vectorized execution, and was inserted after +filter-like operators such as `FilterExec`, `HashJoinExec`, and +`RepartitionExec`. However, using a separate operator also blocks other +optimizations such as pushing `LIMIT` through joins and made optimizer rules +more complex. In this release, we integrated the coalescing into the operators +themselves ([#18779]) using Arrow's [coalesce kernel]. This reduces plan +complexity while keeping batch sizes efficient, and allows additional focused +optimization work in the Arrow kernel, such as [Dandandan]'s recent work with +filtering in [arrow-rs/#8951]. Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239] Thanks to [Tim-53], [Dandandan], [jizezhang], and [feniljain] for implementing From 2823de5e3f3fa502c0b1975e09192f256002501e Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 22 Jan 2026 19:50:18 -0500 Subject: [PATCH 13/26] clean --- content/blog/2026-01-08-datafusion-52.0.0.md | 23 ++++++++++---------- 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 6d394db9..c4dba3f4 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -80,10 +80,8 @@ the implementation and reviews from [Dandandan]. [#18487]: https://github.com/apache/datafusion/issues/18487 [#18875]: https://github.com/apache/datafusion/pull/18875 -<<<<<<< HEAD [Apache Comet]: https://datafusion.apache.org/comet/ [mbutrovich]: https://github.com/mbutrovich -======= ### Rewritten merge join @@ -95,15 +93,14 @@ SMJ. Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (moving from minutes to milliseconds) while leaving most other queries unchanged or modestly faster, and the update is fully internal with no user-facing API changes. ->>>>>>> ccc5d4296951810f48e133fe70948d34c4b4f9bd ### Caching Improvements This release also includes several additional caching improvements. -First it includes a new statistics cache for Parquet Metadata that avoids repeatedly -(re)calculating statistics for Parquet backed files. This significantly improves -planning time for certain queries. You can see the contents of the new cache using the +A new statistics cache for Parquet Metadata avoids repeatedly (re)calculating +statistics for Parquet backed files. This significantly improves planning time +for certain queries. You can see the contents of the new cache using the [statistics_cache] function in the CLI: [statistics_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#statistics-cache @@ -129,8 +126,8 @@ Related PRs: [#18971], [#19054] [alchemist51]: https://github.com/alchemist51 -It also includes a prefix-aware list-files cache by default which accelerates -evaluating partition predicates for Hive partitioned tables. +A prefix-aware list-files cache accelerates evaluating partition predicates for +Hive partitioned tables. ```sql -- Read the hive partitioned dataset from Overture Maps (100s of Parquet files) @@ -257,7 +254,7 @@ design. Related PRs: [#17843] DataFusion now pushes down expression evaluation into TableProviders using [PhysicalExprAdapter], replacing the older SchemaAdapter approach ([#14993], -[#16800]). This work means predicates and expressions can be customized for each +[#16800]). Predicates and expressions can now be customized for each individual file schema, opening additional optimization such as support for [Variant shredding]. Thanks to [adriangb] for implementing PhysicalExprAdapter and reworking pushdown to use it. Related PRs: [#18998], [#19345] @@ -272,14 +269,16 @@ and reworking pushdown to use it. Related PRs: [#18998], [#19345] ### Sort Pushdown to Scans -DataFusion can now push sorts all the way to data sources ([#10433], [#19064]). +DataFusion can now push sorts into data sources ([#10433], [#19064]). This allows table provider implementations to take better advantage of existing sort -information such as to reorder files or row groups to satisfy `LIMIT` clauses more -efficiently. Thanks to [zhuqi-lucas] for this feature. +information based on the query pattern, such as to reorder files or row groups to +satisfy `LIMIT` clauses more +efficiently. Thanks to [zhuqi-lucas] and [xudong963] for this feature. [#10433]: https://github.com/apache/datafusion/issues/10433 [#19064]: https://github.com/apache/datafusion/pull/19064 [zhuqi-lucas]: https://github.com/zhuqi-lucas +[xudong963]: https://github.com/xudong963 ### `TableProvider` supports `DELETE` and `UPDATE` statements From 1c9dadf5ee06acfd22ab4b779209f2299b1879c5 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jan 2026 06:59:03 -0500 Subject: [PATCH 14/26] Update content/blog/2026-01-08-datafusion-52.0.0.md Co-authored-by: Martin Grigorov --- content/blog/2026-01-08-datafusion-52.0.0.md | 1 - 1 file changed, 1 deletion(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index c4dba3f4..f3c134e7 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -32,7 +32,6 @@ some of the major improvements since [DataFusion 51.0.0]. The complete list of changes is available in the [changelog]. Thanks to the [121 contributors] for making this release possible. -TODO: confirm the release date for 52.0.0 and update the front matter if needed. [DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0 [DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/ From 615affddb5ee93070c0999893d9ae514b01b8d3b Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jan 2026 07:02:52 -0500 Subject: [PATCH 15/26] Clarify RelationPlanner syntax --- content/blog/2026-01-08-datafusion-52.0.0.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index c4dba3f4..7f944c36 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -224,10 +224,14 @@ Related PRs: [#18457] ### More Extensible SQL Planning with `RelationPlanner` + + DataFusion now has an API for extending the SQL planner for relations, as -explained in the [Extending SQL in DataFusion Blog]. With this new API, you can -customize DataFusion to support almost any SQL syntax, such as the following -(which are not supported by default): +explained in the [Extending SQL in DataFusion Blog]. In addition to the existing +expression and types extension points, this new API now allows extending `FROM` +clauses. Using these APIs it is straightforward to provide SQL support for +almost any dialect, including vendor-specific syntax. Example use cases include: + ```sql -- Postgres-style JSON operators From 1345bfba5d1d2e91f976b8c55052dc5735e244d5 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jan 2026 07:05:20 -0500 Subject: [PATCH 16/26] remove extra section --- content/blog/2026-01-08-datafusion-52.0.0.md | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index bec481bc..37c7d5c5 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -82,16 +82,6 @@ the implementation and reviews from [Dandandan]. [Apache Comet]: https://datafusion.apache.org/comet/ [mbutrovich]: https://github.com/mbutrovich -### Rewritten merge join - -DataFusion 52 includes a rewrite of the sort-merge join (SMJ) output buffering to -avoid excessive `concat_batches` work and to use `BatchCoalescer` internally and -for final output. This change targets pathological slowdowns like the reported -LeftAnti join case in [#18487], which also affected Comet workloads that rely on -SMJ. Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (moving from -minutes to milliseconds) while leaving most other queries unchanged or modestly -faster, and the update is fully internal with no user-facing API changes. - ### Caching Improvements @@ -223,8 +213,6 @@ Related PRs: [#18457] ### More Extensible SQL Planning with `RelationPlanner` - - DataFusion now has an API for extending the SQL planner for relations, as explained in the [Extending SQL in DataFusion Blog]. In addition to the existing expression and types extension points, this new API now allows extending `FROM` From 63b571c4948712d820f0ab330522f64f2bbcb2eb Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jan 2026 07:06:26 -0500 Subject: [PATCH 17/26] Update content/blog/2026-01-08-datafusion-52.0.0.md Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com> --- content/blog/2026-01-08-datafusion-52.0.0.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 37c7d5c5..3e05f822 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -169,9 +169,12 @@ dynamically to scans, as explained in the [Dynamic Filtering Blog] using a technique referred to as [Sideways Information Passing] in Database research literature. The initial implementation passed min/max values for the join keys. DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the -build size is small such as when the join is very selective. The `IN` list is -pushed down to the probe side scan and is used to prune files, row groups, and -individual rows. Thanks to [adriangb] for implementing this feature, with +build size is small such as when the join is very selective or a reference to the build side hash map when the build side is larger. +These new expressions are pushed down to the probe side scan and is used to prune files, row groups, and +individual rows. +When the build side is small enough (<=20 rows but configurable) the pushed down filters can even participate in statistics pruning to avoid even reading the join keys from row groups that will not match. + +Thanks to [adriangb] for implementing this feature, with reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich]. From 7984011d33abeed667907a5b2565fef75e9d6243 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jan 2026 07:14:30 -0500 Subject: [PATCH 18/26] Refine wording --- content/blog/2026-01-08-datafusion-52.0.0.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 3e05f822..89c9f7f9 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -168,18 +168,20 @@ Starting in DataFusion 51, filtering information from `HashJoinExec` is passed dynamically to scans, as explained in the [Dynamic Filtering Blog] using a technique referred to as [Sideways Information Passing] in Database research literature. The initial implementation passed min/max values for the join keys. -DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the -build size is small such as when the join is very selective or a reference to the build side hash map when the build side is larger. -These new expressions are pushed down to the probe side scan and is used to prune files, row groups, and +DataFusion 52 extends the optimization ([#17171] / [#18393]) to pass the contents +of the build side hash map. +These filters are evaluated on the probe side scan to prune files, row groups, and individual rows. -When the build side is small enough (<=20 rows but configurable) the pushed down filters can even participate in statistics pruning to avoid even reading the join keys from row groups that will not match. - +When the build side contains `20` or fewer rows (configurable) the contents of the hash map are +transformed to an `IN` expression and used for [statistics-based pruning] which +can avoid reading entire files or row groups that contain no matching join keys. Thanks to [adriangb] for implementing this feature, with reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich]. [Sideways Information Passing]: https://dl.acm.org/doi/10.1109/ICDE.2008.4497486 [Dynamic Filtering blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters +[statistics-based pruning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html [#17171]: https://github.com/apache/datafusion/issues/17171 [#18393]: https://github.com/apache/datafusion/pull/18393 From 66a9dcbaee8d58816480e5b19c4f8223108f5a18 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jan 2026 07:14:43 -0500 Subject: [PATCH 19/26] reflow --- content/blog/2026-01-08-datafusion-52.0.0.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 89c9f7f9..8977dce1 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -168,15 +168,14 @@ Starting in DataFusion 51, filtering information from `HashJoinExec` is passed dynamically to scans, as explained in the [Dynamic Filtering Blog] using a technique referred to as [Sideways Information Passing] in Database research literature. The initial implementation passed min/max values for the join keys. -DataFusion 52 extends the optimization ([#17171] / [#18393]) to pass the contents -of the build side hash map. -These filters are evaluated on the probe side scan to prune files, row groups, and -individual rows. -When the build side contains `20` or fewer rows (configurable) the contents of the hash map are +DataFusion 52 extends the optimization ([#17171] / [#18393]) to pass the +contents of the build side hash map. These filters are evaluated on the probe +side scan to prune files, row groups, and individual rows. When the build side +contains `20` or fewer rows (configurable) the contents of the hash map are transformed to an `IN` expression and used for [statistics-based pruning] which can avoid reading entire files or row groups that contain no matching join keys. -Thanks to [adriangb] for implementing this feature, with -reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich]. +Thanks to [adriangb] for implementing this feature, with reviews from +[LiaCastaneda], [asolimando], [comphead], and [mbutrovich]. [Sideways Information Passing]: https://dl.acm.org/doi/10.1109/ICDE.2008.4497486 From e9308d49687e89fc216f02ed4d255623c0eed5de Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jan 2026 07:16:39 -0500 Subject: [PATCH 20/26] Metadata cache is general --- content/blog/2026-01-08-datafusion-52.0.0.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 8977dce1..95961f2a 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -87,8 +87,8 @@ the implementation and reviews from [Dandandan]. This release also includes several additional caching improvements. -A new statistics cache for Parquet Metadata avoids repeatedly (re)calculating -statistics for Parquet backed files. This significantly improves planning time +A new statistics cache for File Metadata avoids repeatedly (re)calculating +statistics for files. This significantly improves planning time for certain queries. You can see the contents of the new cache using the [statistics_cache] function in the CLI: From 4e24b1ff4317ebd0d436e84ef3a5553cf66f8814 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jan 2026 07:53:35 -0500 Subject: [PATCH 21/26] Add section on Min/max dynamic filters --- content/blog/2026-01-08-datafusion-52.0.0.md | 31 ++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 95961f2a..d3048f90 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -68,6 +68,37 @@ improved `CASE` evaluation significantly. Related PRs [#18183]. Thanks to [#18075]: https://github.com/apache/datafusion/issues/18075 [#18183]: https://github.com/apache/datafusion/pull/18183 +### `MIN`/`MAX` Aggregate Dynamic Filters + +DataFusion now creates dynamic filters for queries with `MIN`/`MAX` aggregates +that have filters, but no `GROUP BY`. These dynamic filters are used during scan +to prune files and rows as tighter bounds are discovered during execution, as +explained in the [Dynamic Filtering blog]. For example, the following query: + +```sql +SELECT min(l_shipdate) +FROM lineitem +WHERE l_returnflag = 'R'; +``` + +Is now executed like this +```sql +SELECT min(l_shipdate) +FROM lineitem +-- '__current_min' is updated dynamically during execution +WHERE l_returnflag = 'R' AND l_shipdate > __current_min; +``` + + +Thanks to [2010YOUY01] for implementing this feature, with reviews from +[martin-g], [adriangb], and [LiaCastaneda]. Related PRs: [#18644] + +[#18644]: https://github.com/apache/datafusion/pull/18644 +[2010YOUY01]: https://github.com/2010YOUY01 +[martin-g]: https://github.com/martin-g +[adriangb]: https://github.com/adriangb +[LiaCastaneda]: https://github.com/LiaCastaneda + ### New Merge Join DataFusion 52 includes a rewrite of the sort-merge join (SMJ) operator, with From d62dfab29dd514e8738dfc046198b30d17a761ba Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 25 Jan 2026 08:56:42 -0500 Subject: [PATCH 22/26] Update content/blog/2026-01-08-datafusion-52.0.0.md Co-authored-by: Yongting You <2010youy01@gmail.com> --- content/blog/2026-01-08-datafusion-52.0.0.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index d3048f90..7371fbae 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -86,7 +86,7 @@ Is now executed like this SELECT min(l_shipdate) FROM lineitem -- '__current_min' is updated dynamically during execution -WHERE l_returnflag = 'R' AND l_shipdate > __current_min; +WHERE l_returnflag = 'R' AND l_shipdate < __current_min; ``` From 1a77f710c5dc37fd3d576272b54074837a6d295b Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 25 Jan 2026 08:57:25 -0500 Subject: [PATCH 23/26] whitespace --- content/blog/2026-01-08-datafusion-52.0.0.md | 1 - 1 file changed, 1 deletion(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index d3048f90..b73670fe 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -89,7 +89,6 @@ FROM lineitem WHERE l_returnflag = 'R' AND l_shipdate > __current_min; ``` - Thanks to [2010YOUY01] for implementing this feature, with reviews from [martin-g], [adriangb], and [LiaCastaneda]. Related PRs: [#18644] From 37b12dd14255b8302a0325144d60378c17689794 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 25 Jan 2026 09:08:48 -0500 Subject: [PATCH 24/26] Improve sort pushdown description --- content/blog/2026-01-08-datafusion-52.0.0.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index a5e31fc3..33c1aceb 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -295,10 +295,15 @@ and reworking pushdown to use it. Related PRs: [#18998], [#19345] ### Sort Pushdown to Scans DataFusion can now push sorts into data sources ([#10433], [#19064]). -This allows table provider implementations to take better advantage of existing sort -information based on the query pattern, such as to reorder files or row groups to -satisfy `LIMIT` clauses more -efficiently. Thanks to [zhuqi-lucas] and [xudong963] for this feature. +This allows table provider implementations to optimize based on +sort knowledge for certain query patterns. For example, the provided Parquet +datasource now reverses the scan order of row groups and files when queried in +for the opposite of the file's natural sort (e.g. `DESC` when the files are sorted `ASC`). +This reversal, combined with dynamic filtering allows TopK queries with `LIMIT` +on pre-sorted data to find the requested rows very quickly, pruning more files and row groups +without even scanning them. We have seen a ~30x performance improvement on +benchmark queries with pre-sorted data. +Thanks to [zhuqi-lucas] and [xudong963] for this feature. [#10433]: https://github.com/apache/datafusion/issues/10433 [#19064]: https://github.com/apache/datafusion/pull/19064 From 91f71f3c420000f481595db1d7208d8ddb7e2eb9 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 25 Jan 2026 09:12:14 -0500 Subject: [PATCH 25/26] capitalizaton --- content/blog/2026-01-08-datafusion-52.0.0.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 33c1aceb..5781071a 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -73,7 +73,7 @@ improved `CASE` evaluation significantly. Related PRs [#18183]. Thanks to DataFusion now creates dynamic filters for queries with `MIN`/`MAX` aggregates that have filters, but no `GROUP BY`. These dynamic filters are used during scan to prune files and rows as tighter bounds are discovered during execution, as -explained in the [Dynamic Filtering blog]. For example, the following query: +explained in the [Dynamic Filtering Blog]. For example, the following query: ```sql SELECT min(l_shipdate) @@ -209,7 +209,7 @@ Thanks to [adriangb] for implementing this feature, with reviews from [Sideways Information Passing]: https://dl.acm.org/doi/10.1109/ICDE.2008.4497486 -[Dynamic Filtering blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters +[Dynamic Filtering Blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters [statistics-based pruning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html [#17171]: https://github.com/apache/datafusion/issues/17171 From 040b85c199c9226242d67ca4ae119b2d1f7b2922 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 25 Jan 2026 09:16:37 -0500 Subject: [PATCH 26/26] wordsmith --- content/blog/2026-01-08-datafusion-52.0.0.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md index 5781071a..bb2d7d0f 100644 --- a/content/blog/2026-01-08-datafusion-52.0.0.md +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -297,13 +297,14 @@ and reworking pushdown to use it. Related PRs: [#18998], [#19345] DataFusion can now push sorts into data sources ([#10433], [#19064]). This allows table provider implementations to optimize based on sort knowledge for certain query patterns. For example, the provided Parquet -datasource now reverses the scan order of row groups and files when queried in -for the opposite of the file's natural sort (e.g. `DESC` when the files are sorted `ASC`). -This reversal, combined with dynamic filtering allows TopK queries with `LIMIT` +data source now reverses the scan order of row groups and files when queried +for the opposite of the file's natural sort (e.g. `DESC` when the files are sorted `ASC`). +This reversal, combined with dynamic filtering, allows top-K queries with `LIMIT` on pre-sorted data to find the requested rows very quickly, pruning more files and row groups without even scanning them. We have seen a ~30x performance improvement on benchmark queries with pre-sorted data. -Thanks to [zhuqi-lucas] and [xudong963] for this feature. +Thanks to [zhuqi-lucas] and [xudong963] for this feature, with reviews from +[martin-g], [adriangb], and [alamb]. [#10433]: https://github.com/apache/datafusion/issues/10433 [#19064]: https://github.com/apache/datafusion/pull/19064