chore: DataFusion 52 migration #3470
base: main
Conversation
* DataFusion 52 migration
…3471) DataFusion 52's arrow-arith kernels only support Date32 +/- Interval types, not raw integers. When Spark sends Date32 + Int8/Int16/Int32 arithmetic, the planner now routes these operations to the Spark date_add/date_sub UDFs, which handle integer types directly.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
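As a rough illustration of the routing condition (not the actual Comet planner code; `needs_spark_date_udf` is a hypothetical helper), the decision boils down to a type check like this:

```rust
use arrow::datatypes::{DataType, IntervalUnit};

// Hypothetical helper: true when Date32 +/- integer arithmetic should be routed
// to the Spark date_add/date_sub UDFs instead of the arrow-arith kernels, which
// in DataFusion 52 only accept Interval operands for dates.
fn needs_spark_date_udf(left: &DataType, right: &DataType) -> bool {
    matches!(left, DataType::Date32)
        && matches!(right, DataType::Int8 | DataType::Int16 | DataType::Int32)
}

fn main() {
    assert!(needs_spark_date_udf(&DataType::Date32, &DataType::Int32));
    // Interval arithmetic stays on the arrow-arith kernels.
    assert!(!needs_spark_date_udf(
        &DataType::Date32,
        &DataType::Interval(IntervalUnit::DayTime)
    ));
}
```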
Some array function tests fail on
DataFusion 52's default PhysicalExprAdapter can fail when casting complex nested types (List<Struct>, Map) between physical and logical schemas. This adds a fallback path in SparkPhysicalExprAdapter that wraps type-mismatched columns with CometCastColumnExpr, using spark_parquet_convert for the actual conversion.

Changes to CometCastColumnExpr:
- Add optional SparkParquetOptions for complex nested type conversions
- Use == instead of equals_datatype to detect field name differences in nested types (Struct, List, Map)
- Add relabel_array for types that differ only in field names (e.g., List element "item" vs "element", Map "key_value" vs "entries")
- Fall back to spark_parquet_convert for structural nested type changes

Changes to SparkPhysicalExprAdapter:
- Try the default adapter first, and fall back to wrap_all_type_mismatches when it fails on complex nested types
- Route Struct/List/Map casts to CometCastColumnExpr instead of Spark Cast, which does not handle nested type rewriting

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
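A minimal sketch of the detection step described above, assuming arrow-rs semantics where `DataType::equals_datatype` ignores nested field names while `==` does not (`differs_only_in_field_names` is an illustrative helper, not the Comet code):

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field};

// Illustrative helper: true when two types share the same physical layout but
// differ only in nested field names, the case handled by relabel_array.
fn differs_only_in_field_names(physical: &DataType, logical: &DataType) -> bool {
    physical != logical && physical.equals_datatype(logical)
}

fn main() {
    // List element named "item" in the file vs "element" in the logical schema.
    let file_type = DataType::List(Arc::new(Field::new("item", DataType::Int32, true)));
    let table_type = DataType::List(Arc::new(Field::new("element", DataType::Int32, true)));
    assert!(differs_only_in_field_names(&file_type, &table_type));
}
```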
@sqlbenchmark run tpch --iterations 3
…3493) * fix: make relabel_array recursive for nested type mismatches

The shallow ArrayData type swap in relabel_array caused panics when Arrow's ArrayData::build() validated child types recursively. This rebuilds arrays from typed constructors (ListArray, LargeListArray, MapArray, StructArray) so nested field name and metadata differences are handled correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: run cargo fmt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
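For the List case, the rebuild-from-typed-constructors approach looks roughly like the sketch below (simplified: the real fix also recurses into the child values and handles LargeList, Map, and Struct; `relabel_list` is an illustrative name):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, ListArray};
use arrow::datatypes::FieldRef;

// Illustrative sketch: rebuild a ListArray with the target element field via
// the typed constructor instead of swapping the DataType inside ArrayData, so
// Arrow validates the relabeled type against the actual child data.
fn relabel_list(list: &ListArray, target_element: FieldRef) -> ArrayRef {
    Arc::new(ListArray::new(
        target_element,
        list.offsets().clone(),
        list.values().clone(),
        list.nulls().cloned(),
    ))
}
```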
Benchmarks failed with OOM on q19
…_convert (#3494) INT96 Parquet timestamps are coerced to Timestamp(us, None) by DataFusion, but the logical schema expects Timestamp(us, Some("UTC")). The schema adapter was routing this mismatch through Spark's Cast expression, which incorrectly treats None-timezone values as TimestampNTZ (local time) and applies a timezone conversion. This caused results to be shifted by the session timezone offset (e.g., -5h45m for Asia/Kathmandu).

Route Timestamp->Timestamp mismatches through CometCastColumnExpr, which delegates to spark_parquet_convert, handling this as a metadata-only timezone relabel without modifying the underlying values.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
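The "metadata-only relabel" amounts to changing the timezone annotation on the array's type while leaving the stored microsecond values untouched, roughly as in this sketch (illustrative, not the spark_parquet_convert code):

```rust
use arrow::array::TimestampMicrosecondArray;

// Illustrative sketch: attach a UTC timezone to a Timestamp(us, None) array.
// Only the DataType changes; the underlying i64 values are not shifted.
fn relabel_as_utc(array: TimestampMicrosecondArray) -> TimestampMicrosecondArray {
    array.with_timezone("UTC")
}
```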
…apter (#3495) The DefaultPhysicalExprAdapter uses exact, case-sensitive name matching (Arrow's field_with_name/index_of) to resolve columns. When a Parquet file has lowercase "a" but the table schema has uppercase "A", the lookup fails and columns are filled with nulls.

Fix by remapping physical schema field names to match logical names (case-insensitively) before passing to the default adapter, then restoring the original physical names in the rewritten expressions so that downstream reassign_expr_columns can find columns in the actual Parquet stream schema.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
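A sketch of the remapping idea under stated assumptions (`remap_physical_names` is an illustrative helper; restoring the original physical names afterwards, as the actual fix does, is omitted here):

```rust
use arrow::datatypes::{Field, Schema};

// Illustrative sketch: rename physical fields whose names match a logical field
// case-insensitively to the logical spelling, so the default adapter's
// exact-name lookup succeeds.
fn remap_physical_names(physical: &Schema, logical: &Schema) -> Schema {
    let fields: Vec<Field> = physical
        .fields()
        .iter()
        .map(|pf| {
            match logical
                .fields()
                .iter()
                .find(|lf| lf.name().eq_ignore_ascii_case(pf.name()))
            {
                Some(lf) => pf.as_ref().clone().with_name(lf.name().clone()),
                None => pf.as_ref().clone(),
            }
        })
        .collect();
    Schema::new(fields)
}
```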
…3473) DataFusion 52 changed how FilterExec's batch coalescer works: streams now return Poll::Pending when accumulating input instead of blocking on a channel. Update test_unpack_dictionary_primitive and test_unpack_dictionary_string to poll the stream directly and send EOF on Pending, rather than using a separate mpsc channel/spawned task to feed batches.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
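The polling pattern described above, stripped of the Comet test-harness specifics, looks roughly like this (`drain` and `send_eof` are illustrative names, not actual test APIs):

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::task::noop_waker;
use futures::Stream;

// Generic sketch: poll the stream directly with a no-op waker; when it returns
// Poll::Pending (the coalescer is still accumulating input), signal end-of-input
// so buffered batches are flushed and subsequent polls return Ready.
fn drain<S: Stream + Unpin>(stream: &mut S, mut send_eof: impl FnMut()) -> Vec<S::Item> {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut out = Vec::new();
    loop {
        match Pin::new(&mut *stream).poll_next(&mut cx) {
            Poll::Ready(Some(item)) => out.push(item),
            Poll::Ready(None) => break,
            Poll::Pending => send_eof(),
        }
    }
    out
}
```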
CI Failure Analysis

Summary: 17 CI jobs are failing across 6 distinct root causes. Here is the breakdown:
| # | Issue | Jobs | Severity |
|---|---|---|---|
| 1 | width_bucket Int64→Int32 cast | 11 | High — likely DF52 API change |
| 2 | Schema pruning complex types | 3 | High — ~44 tests per job |
| 3 | Timestamp nanos precision | 4 | Medium |
| 4 | Default column values | 1 | Medium |
| 5 | Miri stale cache | 1 | Low — CI infra |
| 6 | Spark 4.0 shredding + flaky rename | 2 | Low |
This analysis was generated with the assistance of AI (Claude Code). Failure logs were retrieved and analyzed programmatically — manual verification of root causes is recommended.
When Spark's `LEGACY_PARQUET_NANOS_AS_LONG=true` converts TIMESTAMP(NANOS) to LongType, the PhysicalExprAdapter detects a type mismatch between the file's Timestamp(Nanosecond) and the logical Int64. The DefaultAdapter creates a CastColumnExpr, which SparkPhysicalExprAdapter then replaces with Spark's Cast expression.

Spark's Cast postprocess for Timestamp→Int64 unconditionally divides by MICROS_PER_SECOND (10^6), assuming microsecond precision. But the values are nanoseconds, so the raw value 1668537129123534758 becomes 1668537129123 — losing sub-millisecond precision.

Fix: route Timestamp→Int64 casts through CometCastColumnExpr (which uses spark_parquet_convert → Arrow cast) instead of Spark Cast. Arrow's cast correctly reinterprets the raw i64 value without any division.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
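To make the behavioral difference concrete, Arrow's cast from Timestamp(Nanosecond) to Int64 reinterprets the stored value, as in this small check (illustrative, reusing the value from the commit message):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array, TimestampNanosecondArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    // Raw nanosecond value from the example above.
    let ts: ArrayRef =
        Arc::new(TimestampNanosecondArray::from(vec![1_668_537_129_123_534_758i64]));
    // Arrow's cast keeps the underlying i64 as-is; no division is applied.
    let as_int = cast(&ts, &DataType::Int64).unwrap();
    let as_int = as_int.as_any().downcast_ref::<Int64Array>().unwrap();
    assert_eq!(as_int.value(0), 1_668_537_129_123_534_758);
}
```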
* fix: [df52] schema pruning crash on complex nested types
When `data_schema` is provided but `projection_vector` is None (the
NativeBatchReader / native_iceberg_compat path), the base schema was
incorrectly set to the pruned `required_schema`. This caused DataFusion
to think the table had only the pruned columns, leading to column index
misalignment in PhysicalExprAdapter. For example, reading "friends" at
logical index 0 would map to physical index 0 ("id") instead of the
correct index 4.
Fix: when `data_schema` is provided without a `projection_vector`,
compute the projection by mapping required field names to their indices
in the full data schema. Also harden `wrap_all_type_mismatches` to use
name-based lookup for physical fields instead of positional index.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: handle field ID mapping in projection computation
When computing a name-based projection from required_schema to
data_schema, fall back to using required_schema directly when not
all fields can be matched by name. This handles Parquet field ID
mapping where column names differ between the read schema and file
schema.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
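A compact sketch of the name-based projection with the fall-back described in the second commit (`projection_by_name` is an illustrative helper; returning None corresponds to falling back to required_schema directly):

```rust
use arrow::datatypes::Schema;

// Illustrative sketch: map each required field name to its index in the full
// data schema. If any field cannot be matched by name (e.g. Parquet field ID
// mapping renamed columns), return None so the caller can fall back to using
// required_schema directly.
fn projection_by_name(required: &Schema, data: &Schema) -> Option<Vec<usize>> {
    required
        .fields()
        .iter()
        .map(|field| data.index_of(field.name()).ok())
        .collect()
}
```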
Add IgnoreCometSuite to ParquetVariantShreddingSuite in the 4.0.1 diff. VariantType shredding is a Spark 4.0 feature that Comet does not yet support (#2209). VariantShreddingSuite was already skipped but ParquetVariantShreddingSuite was missed, causing test failures in CI.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


Which issue does this PR close?
Closes #3046.
This PR is on a shared branch and replaces #3052.
Rationale for this change
What changes are included in this PR?
How are these changes tested?