diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md new file mode 100644 index 00000000..bb2d7d0f --- /dev/null +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -0,0 +1,410 @@ +--- +layout: post +title: Apache DataFusion 52.0.0 Released +date: 2026-01-08 +author: pmc +categories: [release] +--- + + + +[TOC] + +We are proud to announce the release of [DataFusion 52.0.0]. This post highlights +some of the major improvements since [DataFusion 51.0.0]. The complete list of +changes is available in the [changelog]. Thanks to the [121 contributors] for +making this release possible. + + +[DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0 +[DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md +[121 contributors]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits + +## Performance Improvements 🚀 + +We continue to make significant performance improvements in DataFusion as explained below. + +### Faster `CASE` Expressions + +DataFusion 52 adds lookup-table-based evaluation for certain `CASE` expressions, +which avoids repeated evaluation and accelerates common ETL patterns such as: + +```sql +CASE company + WHEN 1 THEN 'Apple' + WHEN 5 THEN 'Samsung' + WHEN 2 THEN 'Motorola' + WHEN 3 THEN 'LG' + ELSE 'Other' +END +``` + +This is the final work in our `CASE` performance epic ([#18075]), which has +improved `CASE` evaluation significantly. Related PRs: [#18183]. Thanks to +[rluvaton] and [pepijnve] for the implementation. 
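
As a rough illustration of the lookup-table idea (a conceptual sketch in Python, not DataFusion's Rust implementation), the `WHEN`/`THEN` literals can be hashed once per expression so that each row costs a single probe instead of a chain of comparisons. The helper names `build_case_lookup` and `eval_case` are hypothetical.

```python
def build_case_lookup(branches):
    """Build the lookup table once per expression, not once per row."""
    table = {}
    for when_value, then_value in branches:
        # CASE returns the first matching branch, so keep the earliest entry.
        table.setdefault(when_value, then_value)
    return table


def eval_case(column, table, else_value):
    """Evaluate the CASE for a whole column with one hash probe per row."""
    # NULL (None) never equals a WHEN literal, so it falls through to ELSE.
    return [else_value if v is None else table.get(v, else_value) for v in column]


# The branches of the CASE expression shown above.
branches = [(1, "Apple"), (5, "Samsung"), (2, "Motorola"), (3, "LG")]
lookup = build_case_lookup(branches)
result = eval_case([1, 3, 7, None, 5], lookup, "Other")
```

The cost per row no longer grows with the number of `WHEN` branches, which is where a lookup table would be expected to help most.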
+ +[rluvaton]: https://github.com/rluvaton +[pepijnve]: https://github.com/pepijnve + + +[#18075]: https://github.com/apache/datafusion/issues/18075 +[#18183]: https://github.com/apache/datafusion/pull/18183 + +### `MIN`/`MAX` Aggregate Dynamic Filters + +DataFusion now creates dynamic filters for queries with `MIN`/`MAX` aggregates +that have filters, but no `GROUP BY`. These dynamic filters are used during scan +to prune files and rows as tighter bounds are discovered during execution, as +explained in the [Dynamic Filtering Blog]. For example, the following query: + +```sql +SELECT min(l_shipdate) +FROM lineitem +WHERE l_returnflag = 'R'; +``` + +is now executed like this: +```sql +SELECT min(l_shipdate) +FROM lineitem +-- '__current_min' is updated dynamically during execution +WHERE l_returnflag = 'R' AND l_shipdate < __current_min; +``` + +Thanks to [2010YOUY01] for implementing this feature, with reviews from +[martin-g], [adriangb], and [LiaCastaneda]. Related PRs: [#18644] + +[#18644]: https://github.com/apache/datafusion/pull/18644 +[2010YOUY01]: https://github.com/2010YOUY01 +[martin-g]: https://github.com/martin-g +[adriangb]: https://github.com/adriangb +[LiaCastaneda]: https://github.com/LiaCastaneda + +### New Merge Join + +DataFusion 52 includes a rewrite of the sort-merge join (SMJ) operator, with +speedups of three orders of magnitude in some pathological cases such as the +case in [#18487], which also affected [Apache Comet] workloads. Benchmarks in +[#18875] show dramatic gains for TPC-H Q21 (minutes to milliseconds) while +leaving other queries unchanged or modestly faster. Thanks to [mbutrovich] for +the implementation and reviews from [Dandandan]. 
+ +[#18487]: https://github.com/apache/datafusion/issues/18487 +[#18875]: https://github.com/apache/datafusion/pull/18875 +[Apache Comet]: https://datafusion.apache.org/comet/ +[mbutrovich]: https://github.com/mbutrovich + + +### Caching Improvements + +This release also includes several additional caching improvements. + +A new statistics cache for File Metadata avoids repeatedly (re)calculating +statistics for files. This significantly improves planning time +for certain queries. You can see the contents of the new cache using the +[statistics_cache] function in the CLI: + +[statistics_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#statistics-cache + + +```sql +select * from statistics_cache(); ++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+ +| path | file_modified | file_size_bytes | e_tag | version | num_rows | num_columns | table_size_bytes | statistics_size_bytes | ++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+ +| .../hits.parquet | 2022-06-25T22:22:22 | 14779976446 | 0-5e24d1ee16380-370f48 | NULL | Exact(99997497) | 105 | Exact(36445943240) | 0 | ++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+ +``` +Thanks to [bharath-techie] and [nuno-faria] for implementing the statistics cache, +with reviews from [martin-g], [alamb], and [alchemist51]. 
+Related PRs: [#18971], [#19054] + +[#18971]: https://github.com/apache/datafusion/pull/18971 +[#19054]: https://github.com/apache/datafusion/pull/19054 +[bharath-techie]: https://github.com/bharath-techie +[nuno-faria]: https://github.com/nuno-faria +[martin-g]: https://github.com/martin-g +[alchemist51]: https://github.com/alchemist51 + + +A prefix-aware list-files cache accelerates evaluating partition predicates for +Hive partitioned tables. + +```sql +-- Read the hive partitioned dataset from Overture Maps (100s of Parquet files) +CREATE EXTERNAL TABLE overturemaps +STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/'; +-- Find all files where the path contains `theme=base` without requiring another LIST call +select count(*) from overturemaps where theme='base'; +``` + +You can see the +contents of the new cache using the [list_files_cache] function in the CLI: + +[list_files_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#list-files-cache + +```sql +create external table overturemaps +stored as parquet +location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure'; +0 row(s) fetched. 
+> select table, path, metadata_size_bytes, expires_in, unnest(metadata_list)['file_size_bytes'] as file_size_bytes, unnest(metadata_list)['e_tag'] as e_tag from list_files_cache() limit 10; ++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+ +| table | path | metadata_size_bytes | expires_in | file_size_bytes | e_tag | ++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+ +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 999055952 | "35fc8fbe8400960b54c66fbb408c48e8-60" | +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 975592768 | "8a16e10b722681cdc00242564b502965-59" | +... +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 1016732378 | "6d70857a0473ed9ed3fc6e149814168b-61" | +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 991363784 | "c9cafb42fcbb413f851691c895dd7c2b-60" | +| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750 | 0 days 0 hours 0 mins 25.264 secs | 1032469715 | "7540252d0d67158297a67038a3365e0f-62" | ++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+ +``` + +Thanks to [BlakeOrth] and [Yuvraj-cyborg] for implementing the list-files cache work, +with reviews from [gabotechs], [alamb], [alchemist51], [martin-g], and [BlakeOrth]. 
+Related PRs: [#18146], [#18855], [#19366], [#19298] + +[Epic #17214]: https://github.com/apache/datafusion/issues/17214 +[#18146]: https://github.com/apache/datafusion/pull/18146 +[#18855]: https://github.com/apache/datafusion/pull/18855 +[#19366]: https://github.com/apache/datafusion/pull/19366 +[#19298]: https://github.com/apache/datafusion/pull/19298 +[BlakeOrth]: https://github.com/BlakeOrth +[Yuvraj-cyborg]: https://github.com/Yuvraj-cyborg + + +### Improved Hash Join Filter Pushdown + +Starting in DataFusion 51, filtering information from `HashJoinExec` is passed +dynamically to scans, as explained in the [Dynamic Filtering Blog], using a +technique referred to as [Sideways Information Passing] in database research +literature. The initial implementation passed min/max values for the join keys. +DataFusion 52 extends the optimization ([#17171] / [#18393]) to pass the +contents of the build side hash map. These filters are evaluated on the probe +side scan to prune files, row groups, and individual rows. When the build side +contains `20` or fewer rows (configurable), the contents of the hash map are +transformed to an `IN` expression and used for [statistics-based pruning], which +can avoid reading entire files or row groups that contain no matching join keys. +Thanks to [adriangb] for implementing this feature, with reviews from +[LiaCastaneda], [asolimando], [comphead], and [mbutrovich]. 
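
To make the pruning step concrete, here is a minimal Python sketch (not DataFusion code) of how an `IN` list built from a small build side can prune more than min/max bounds alone. `RowGroupStats`, `build_filter`, `may_match`, and `IN_LIST_MAX_ROWS` are hypothetical names, and the sketch assumes integer join keys and a non-empty build side.

```python
from dataclasses import dataclass

# Hypothetical constant mirroring the "20 or fewer rows" threshold in the text.
IN_LIST_MAX_ROWS = 20


@dataclass
class RowGroupStats:
    """Min/max statistics for the join key in one probe-side row group."""
    name: str
    min_key: int
    max_key: int


def build_filter(build_keys):
    """Summarize a non-empty build side as (min, max, optional IN list)."""
    in_list = sorted(set(build_keys)) if len(build_keys) <= IN_LIST_MAX_ROWS else None
    return min(build_keys), max(build_keys), in_list


def may_match(stats, lo, hi, in_list):
    """Return False only when the row group provably holds no join key."""
    if stats.max_key < lo or stats.min_key > hi:
        return False  # outside the build side's min/max bounds
    if in_list is not None:
        # Keep the row group only if some build key falls inside its range.
        return any(stats.min_key <= key <= stats.max_key for key in in_list)
    return True


row_groups = [
    RowGroupStats("A", 0, 99),
    RowGroupStats("B", 100, 400),
    RowGroupStats("C", 450, 600),
    RowGroupStats("D", 700, 900),
]
lo, hi, in_list = build_filter([10, 500])
bounds_only = [rg.name for rg in row_groups if may_match(rg, lo, hi, None)]
with_in_list = [rg.name for rg in row_groups if may_match(rg, lo, hi, in_list)]
```

In this toy data, row group `B` lies inside the min/max bounds (10 to 500) but contains neither build key, so only the `IN` list can rule it out.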
+ + +[Sideways Information Passing]: https://dl.acm.org/doi/10.1109/ICDE.2008.4497486 +[Dynamic Filtering Blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters +[statistics-based pruning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html + +[#17171]: https://github.com/apache/datafusion/issues/17171 +[#18393]: https://github.com/apache/datafusion/pull/18393 +[adriangb]: https://github.com/adriangb +[LiaCastaneda]: https://github.com/LiaCastaneda +[asolimando]: https://github.com/asolimando +[comphead]: https://github.com/comphead + + +## Major Features ✨ + +### Arrow IPC Stream file support + +DataFusion can now read Arrow IPC stream files ([#18457]). This expands +interoperability with systems that emit Arrow streams directly, making it +simpler to ingest Arrow-native data without conversion. Thanks to [corasaurus-hex] +for implementing this feature, with reviews from [martin-g], [Jefffrey], +[jdcasale], [2010YOUY01], and [timsaucer]. + +```sql +CREATE EXTERNAL TABLE ipc_events +STORED AS ARROW +LOCATION 's3://bucket/events.arrow'; +``` + +Related PRs: [#18457] + +[#18457]: https://github.com/apache/datafusion/pull/18457 +[corasaurus-hex]: https://github.com/corasaurus-hex +[Jefffrey]: https://github.com/Jefffrey +[jdcasale]: https://github.com/jdcasale +[2010YOUY01]: https://github.com/2010YOUY01 +[timsaucer]: https://github.com/timsaucer + +### More Extensible SQL Planning with `RelationPlanner` + +DataFusion now has an API for extending the SQL planner for relations, as +explained in the [Extending SQL in DataFusion Blog]. In addition to the existing +expression and types extension points, this new API now allows extending `FROM` +clauses. Using these APIs it is straightforward to provide SQL support for +almost any dialect, including vendor-specific syntax. 
Example use cases include: + + +```sql +-- Postgres-style JSON operators +SELECT payload->'user'->>'id' FROM logs; +-- MySQL-specific types +SELECT DATETIME '2001-01-01 18:00:00'; +-- Statistical sampling +SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT); +``` +[Extending SQL in DataFusion Blog]: https://datafusion.apache.org/blog/2026/01/12/extending-sql/ + +Thanks to [geoffreyclaude] for implementing relation planner extensions, and to +[theirix], [alamb], [NGA-TRAN], and [gabotechs] for reviews and feedback on the +design. Related PRs: [#17843] + +[#17843]: https://github.com/apache/datafusion/pull/17843 +[geoffreyclaude]: https://github.com/geoffreyclaude +[theirix]: https://github.com/theirix +[alamb]: https://github.com/alamb +[NGA-TRAN]: https://github.com/NGA-TRAN +[gabotechs]: https://github.com/gabotechs + +### Expression Evaluation Pushdown to Scans + +DataFusion now pushes down expression evaluation into TableProviders using +[PhysicalExprAdapter], replacing the older SchemaAdapter approach ([#14993], +[#16800]). Predicates and expressions can now be customized for each +individual file schema, opening up additional optimizations such as support for +[Variant shredding]. Thanks to [adriangb] for implementing PhysicalExprAdapter +and reworking pushdown to use it. Related PRs: [#18998], [#19345] + +[#14993]: https://github.com/apache/datafusion/issues/14993 +[#16800]: https://github.com/apache/datafusion/issues/16800 +[#18998]: https://github.com/apache/datafusion/pull/18998 +[#19345]: https://github.com/apache/datafusion/pull/19345 +[kosiew]: https://github.com/kosiew +[Variant shredding]: https://github.com/apache/datafusion/issues/16116 +[PhysicalExprAdapter]: https://docs.rs/datafusion/52.0.0/datafusion/physical_expr_adapter/trait.PhysicalExprAdapter.html + +### Sort Pushdown to Scans + +DataFusion can now push sorts into data sources ([#10433], [#19064]). 
+This allows table provider implementations to optimize based on +sort knowledge for certain query patterns. For example, the provided Parquet +data source now reverses the scan order of row groups and files when queried +for the opposite of the file's natural sort (e.g. `DESC` when the files are sorted `ASC`). +This reversal, combined with dynamic filtering, allows top-K queries with `LIMIT` +on pre-sorted data to find the requested rows very quickly, pruning more files and row groups +without even scanning them. We have seen a ~30x performance improvement on +benchmark queries with pre-sorted data. +Thanks to [zhuqi-lucas] and [xudong963] for this feature, with reviews from +[martin-g], [adriangb], and [alamb]. + +[#10433]: https://github.com/apache/datafusion/issues/10433 +[#19064]: https://github.com/apache/datafusion/pull/19064 +[zhuqi-lucas]: https://github.com/zhuqi-lucas +[xudong963]: https://github.com/xudong963 + +### `TableProvider` supports `DELETE` and `UPDATE` statements + +The [TableProvider] trait now includes hooks for `DELETE` and `UPDATE` +statements and the basic MemTable implements them ([#19142]). This lets +downstream implementations and storage engines plug in their own mutation logic. +See [TableProvider::delete_from] and [TableProvider::update] for more details. + +[TableProvider]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html +[TableProvider::delete_from]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html#method.delete_from +[TableProvider::update]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html#method.update + +Example: + +```sql +DELETE FROM mem_table WHERE status = 'obsolete'; +``` + +Thanks to [ethan-tyler] for the implementation and [alamb] and [adriangb] for +reviews. 
+ +[#19142]: https://github.com/apache/datafusion/pull/19142 +[ethan-tyler]: https://github.com/ethan-tyler + +### `CoalesceBatchesExec` Removed + +The standalone `CoalesceBatchesExec` operator existed to ensure batches were +large enough for subsequent vectorized execution, and was inserted after +filter-like operators such as `FilterExec`, `HashJoinExec`, and +`RepartitionExec`. However, using a separate operator also blocked other +optimizations such as pushing `LIMIT` through joins and made optimizer rules +more complex. In this release, we integrated the coalescing into the operators +themselves ([#18779]) using Arrow's [coalesce kernel]. This reduces plan +complexity while keeping batch sizes efficient, and allows additional focused +optimization work in the Arrow kernel, such as [Dandandan]'s recent work with +filtering in [arrow-rs/#8951]. + +Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239]. +Thanks to [Tim-53], [Dandandan], [jizezhang], and [feniljain] for implementing +this feature, with reviews from [Jefffrey], [alamb], [martin-g], +[geoffreyclaude], [milenkovicm], and [jizezhang]. 
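
The idea of coalescing inside an operator can be sketched in a few lines of Python. This is a conceptual illustration only, using plain lists in place of Arrow record batches; `filter_with_coalescing` and `target_batch_size` are hypothetical names, not DataFusion or Arrow APIs.

```python
def filter_with_coalescing(batches, predicate, target_batch_size):
    """Yield filtered batches, buffering rows until target_batch_size is reached."""
    buffer = []
    for batch in batches:
        # Filtering usually shrinks a batch, so accumulate the surviving rows.
        buffer.extend(row for row in batch if predicate(row))
        while len(buffer) >= target_batch_size:
            yield buffer[:target_batch_size]
            buffer = buffer[target_batch_size:]
    if buffer:
        yield buffer  # flush the final, possibly short, batch


# Five input batches of four rows each: [0..3], [4..7], ..., [16..19].
batches = [list(range(i, i + 4)) for i in range(0, 20, 4)]
out = list(filter_with_coalescing(batches, lambda x: x % 2 == 0, 4))
```

Without the buffer, this filter would emit five batches of two rows each. With it, the operator emits two full batches and one short final batch, and no separate coalescing operator is needed in the plan.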
+ +[#18779]: https://github.com/apache/datafusion/issues/18779 +[#18540]: https://github.com/apache/datafusion/pull/18540 +[#18604]: https://github.com/apache/datafusion/pull/18604 +[#18630]: https://github.com/apache/datafusion/pull/18630 +[#18972]: https://github.com/apache/datafusion/pull/18972 +[#19002]: https://github.com/apache/datafusion/pull/19002 +[#19342]: https://github.com/apache/datafusion/pull/19342 +[#19239]: https://github.com/apache/datafusion/pull/19239 +[Tim-53]: https://github.com/Tim-53 +[Dandandan]: https://github.com/Dandandan +[jizezhang]: https://github.com/jizezhang +[feniljain]: https://github.com/feniljain +[milenkovicm]: https://github.com/milenkovicm +[coalesce kernel]: https://docs.rs/arrow/57.2.0/arrow/compute/kernels/coalesce/ +[arrow-rs/#8951]: https://github.com/apache/arrow-rs/pull/8951 + +## Upgrade Guide and Changelog + +As always, upgrading to 52.0.0 should be straightforward for most users. Please review the +[Upgrade Guide] +for details on breaking changes and code snippets to help with the transition. +For a comprehensive list of all changes, please refer to the [changelog]. + +## About DataFusion + +[Apache DataFusion] is an extensible query engine, written in [Rust], that uses +[Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast, data-centric systems such as databases, dataframe libraries, +and machine learning and streaming applications. While [DataFusion's primary +design goal] is to accelerate the creation of other data-centric systems, it +provides a reasonable experience directly out of the box as a [dataframe +library], [Python library], and [command-line SQL tool]. 
+ +[apache datafusion]: https://datafusion.apache.org/ +[rust]: https://www.rust-lang.org/ +[apache arrow]: https://arrow.apache.org +[DataFusion's primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals +[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html +[python library]: https://datafusion.apache.org/python/ +[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/ +[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html + +## How to Get Involved + +DataFusion is not a project built or driven by a single person, company, or +foundation. Rather, our community of users and contributors works together to +build a shared technology that none of us could have built alone. + +If you are interested in joining us, we would love to have you. You can try out +DataFusion on some of your own data and projects and let us know how it goes, +contribute suggestions, documentation, bug reports, or a PR with documentation, +tests, or code. A list of open issues suitable for beginners is [here], and you +can find out how to reach us on the [communication doc]. + +[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 +[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html