chore: DataFusion 52 migration #3470
base: main
Conversation
* DataFusion 52 migration
…3471) DataFusion 52's arrow-arith kernels only support Date32 +/- Interval types, not raw integers. When Spark sends Date32 + Int8/Int16/Int32 arithmetic, the planner now routes these operations to the Spark date_add/date_sub UDFs, which handle integer types directly.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
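As a rough illustration of the routing condition (not the actual Comet planner code; `needs_spark_date_udf` is a hypothetical helper), the decision boils down to a type check like this:

```rust
use arrow::datatypes::{DataType, IntervalUnit};

// Hypothetical helper: true when Date32 +/- integer arithmetic should be routed
// to the Spark date_add/date_sub UDFs instead of the arrow-arith kernels, which
// in DataFusion 52 only accept Interval operands for dates.
fn needs_spark_date_udf(left: &DataType, right: &DataType) -> bool {
    matches!(left, DataType::Date32)
        && matches!(right, DataType::Int8 | DataType::Int16 | DataType::Int32)
}

fn main() {
    assert!(needs_spark_date_udf(&DataType::Date32, &DataType::Int32));
    // Interval arithmetic stays on the arrow-arith kernels.
    assert!(!needs_spark_date_udf(
        &DataType::Date32,
        &DataType::Interval(IntervalUnit::DayTime)
    ));
}
```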
Some array function tests fail on
DataFusion 52's default PhysicalExprAdapter can fail when casting complex nested types (List<Struct>, Map) between physical and logical schemas. This adds a fallback path in SparkPhysicalExprAdapter that wraps type-mismatched columns with CometCastColumnExpr, using spark_parquet_convert for the actual conversion.

Changes to CometCastColumnExpr:
- Add optional SparkParquetOptions for complex nested type conversions
- Use == instead of equals_datatype to detect field name differences in nested types (Struct, List, Map)
- Add relabel_array for types that differ only in field names (e.g., List element "item" vs "element", Map "key_value" vs "entries")
- Fall back to spark_parquet_convert for structural nested type changes

Changes to SparkPhysicalExprAdapter:
- Try the default adapter first, and fall back to wrap_all_type_mismatches when it fails on complex nested types
- Route Struct/List/Map casts to CometCastColumnExpr instead of Spark Cast, which does not handle nested type rewriting

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
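A minimal sketch of the detection step described above, assuming arrow-rs semantics where `DataType::equals_datatype` ignores nested field names while `==` does not (`differs_only_in_field_names` is an illustrative helper, not the Comet code):

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field};

// Illustrative helper: true when two types share the same physical layout but
// differ only in nested field names, the case handled by relabel_array.
fn differs_only_in_field_names(physical: &DataType, logical: &DataType) -> bool {
    physical != logical && physical.equals_datatype(logical)
}

fn main() {
    // List element named "item" in the file vs "element" in the logical schema.
    let file_type = DataType::List(Arc::new(Field::new("item", DataType::Int32, true)));
    let table_type = DataType::List(Arc::new(Field::new("element", DataType::Int32, true)));
    assert!(differs_only_in_field_names(&file_type, &table_type));
}
```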
@sqlbenchmark run tpch --iterations 3
…3493) * fix: make relabel_array recursive for nested type mismatches

The shallow ArrayData type swap in relabel_array caused panics when Arrow's ArrayData::build() validated child types recursively. This rebuilds arrays from typed constructors (ListArray, LargeListArray, MapArray, StructArray) so nested field name and metadata differences are handled correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: run cargo fmt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
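For the List case, the rebuild-from-typed-constructors approach looks roughly like the sketch below (simplified: the real fix also recurses into the child values and handles LargeList, Map, and Struct; `relabel_list` is an illustrative name):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, ListArray};
use arrow::datatypes::FieldRef;

// Illustrative sketch: rebuild a ListArray with the target element field via
// the typed constructor instead of swapping the DataType inside ArrayData, so
// Arrow validates the relabeled type against the actual child data.
fn relabel_list(list: &ListArray, target_element: FieldRef) -> ArrayRef {
    Arc::new(ListArray::new(
        target_element,
        list.offsets().clone(),
        list.values().clone(),
        list.nulls().cloned(),
    ))
}
```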
Benchmarks failed with OOM on q19
…_convert (#3494) INT96 Parquet timestamps are coerced to Timestamp(us, None) by DataFusion, but the logical schema expects Timestamp(us, Some("UTC")). The schema adapter was routing this mismatch through Spark's Cast expression, which incorrectly treats None-timezone values as TimestampNTZ (local time) and applies a timezone conversion. This caused results to be shifted by the session timezone offset (e.g., -5h45m for Asia/Kathmandu).

Route Timestamp->Timestamp mismatches through CometCastColumnExpr, which delegates to spark_parquet_convert, handling this as a metadata-only timezone relabel without modifying the underlying values.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
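The "metadata-only relabel" amounts to changing the timezone annotation on the array's type while leaving the stored microsecond values untouched, roughly as in this sketch (illustrative, not the spark_parquet_convert code):

```rust
use arrow::array::TimestampMicrosecondArray;

// Illustrative sketch: attach a UTC timezone to a Timestamp(us, None) array.
// Only the DataType changes; the underlying i64 values are not shifted.
fn relabel_as_utc(array: TimestampMicrosecondArray) -> TimestampMicrosecondArray {
    array.with_timezone("UTC")
}
```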
…apter (#3495) The DefaultPhysicalExprAdapter uses exact, case-sensitive name matching (Arrow's field_with_name/index_of) to resolve columns. When a Parquet file has lowercase "a" but the table schema has uppercase "A", the lookup fails and columns are filled with nulls.

Fix by remapping physical schema field names to match logical names (case-insensitively) before passing to the default adapter, then restoring the original physical names in the rewritten expressions so that downstream reassign_expr_columns can find columns in the actual Parquet stream schema.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
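A sketch of the remapping idea under stated assumptions (`remap_physical_names` is an illustrative helper; restoring the original physical names afterwards, as the actual fix does, is omitted here):

```rust
use arrow::datatypes::{Field, Schema};

// Illustrative sketch: rename physical fields whose names match a logical field
// case-insensitively to the logical spelling, so the default adapter's
// exact-name lookup succeeds.
fn remap_physical_names(physical: &Schema, logical: &Schema) -> Schema {
    let fields: Vec<Field> = physical
        .fields()
        .iter()
        .map(|pf| {
            match logical
                .fields()
                .iter()
                .find(|lf| lf.name().eq_ignore_ascii_case(pf.name()))
            {
                Some(lf) => pf.as_ref().clone().with_name(lf.name().clone()),
                None => pf.as_ref().clone(),
            }
        })
        .collect();
    Schema::new(fields)
}
```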
…3473) DataFusion 52 changed how FilterExec's batch coalescer works: streams now return Poll::Pending when accumulating input instead of blocking on a channel. Update test_unpack_dictionary_primitive and test_unpack_dictionary_string to poll the stream directly and send EOF on Pending, rather than using a separate mpsc channel/spawned task to feed batches.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
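The polling pattern described above, stripped of the Comet test-harness specifics, looks roughly like this (`drain` and `send_eof` are illustrative names, not actual test APIs):

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::task::noop_waker;
use futures::Stream;

// Generic sketch: poll the stream directly with a no-op waker; when it returns
// Poll::Pending (the coalescer is still accumulating input), signal end-of-input
// so buffered batches are flushed and subsequent polls return Ready.
fn drain<S: Stream + Unpin>(stream: &mut S, mut send_eof: impl FnMut()) -> Vec<S::Item> {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut out = Vec::new();
    loop {
        match Pin::new(&mut *stream).poll_next(&mut cx) {
            Poll::Ready(Some(item)) => out.push(item),
            Poll::Ready(None) => break,
            Poll::Pending => send_eof(),
        }
    }
    out
}
```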
CI Failure Analysis

Summary: 17 CI jobs are failing across 6 distinct root causes. Here is the breakdown:
| # | Issue | Jobs | Severity |
|---|---|---|---|
| 1 | width_bucket Int64→Int32 cast | 11 | High — likely DF52 API change |
| 2 | Schema pruning complex types | 3 | High — ~44 tests per job |
| 3 | Timestamp nanos precision | 4 | Medium |
| 4 | Default column values | 1 | Medium |
| 5 | Miri stale cache | 1 | Low — CI infra |
| 6 | Spark 4.0 shredding + flaky rename | 2 | Low |
This analysis was generated with the assistance of AI (Claude Code). Failure logs were retrieved and analyzed programmatically — manual verification of root causes is recommended.
When Spark's `LEGACY_PARQUET_NANOS_AS_LONG=true` converts TIMESTAMP(NANOS) to LongType, the PhysicalExprAdapter detects a type mismatch between the file's Timestamp(Nanosecond) and the logical Int64. The DefaultAdapter creates a CastColumnExpr, which SparkPhysicalExprAdapter then replaces with Spark's Cast expression.

Spark's Cast postprocess for Timestamp→Int64 unconditionally divides by MICROS_PER_SECOND (10^6), assuming microsecond precision. But the values are nanoseconds, so the raw value 1668537129123534758 becomes 1668537129123 — losing sub-millisecond precision.

Fix: route Timestamp→Int64 casts through CometCastColumnExpr (which uses spark_parquet_convert → Arrow cast) instead of Spark Cast. Arrow's cast correctly reinterprets the raw i64 value without any division.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
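To make the behavioral difference concrete, Arrow's cast from Timestamp(Nanosecond) to Int64 reinterprets the stored value, as in this small check (illustrative, reusing the value from the commit message):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array, TimestampNanosecondArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    // Raw nanosecond value from the example above.
    let ts: ArrayRef =
        Arc::new(TimestampNanosecondArray::from(vec![1_668_537_129_123_534_758i64]));
    // Arrow's cast keeps the underlying i64 as-is; no division is applied.
    let as_int = cast(&ts, &DataType::Int64).unwrap();
    let as_int = as_int.as_any().downcast_ref::<Int64Array>().unwrap();
    assert_eq!(as_int.value(0), 1_668_537_129_123_534_758);
}
```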
* fix: [df52] schema pruning crash on complex nested types
When `data_schema` is provided but `projection_vector` is None (the
NativeBatchReader / native_iceberg_compat path), the base schema was
incorrectly set to the pruned `required_schema`. This caused DataFusion
to think the table had only the pruned columns, leading to column index
misalignment in PhysicalExprAdapter. For example, reading "friends" at
logical index 0 would map to physical index 0 ("id") instead of the
correct index 4.
Fix: when `data_schema` is provided without a `projection_vector`,
compute the projection by mapping required field names to their indices
in the full data schema. Also harden `wrap_all_type_mismatches` to use
name-based lookup for physical fields instead of positional index.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: handle field ID mapping in projection computation
When computing a name-based projection from required_schema to
data_schema, fall back to using required_schema directly when not
all fields can be matched by name. This handles Parquet field ID
mapping where column names differ between the read schema and file
schema.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
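A compact sketch of the name-based projection with the fall-back described in the second commit (`projection_by_name` is an illustrative helper; returning None corresponds to falling back to required_schema directly):

```rust
use arrow::datatypes::Schema;

// Illustrative sketch: map each required field name to its index in the full
// data schema. If any field cannot be matched by name (e.g. Parquet field ID
// mapping renamed columns), return None so the caller can fall back to using
// required_schema directly.
fn projection_by_name(required: &Schema, data: &Schema) -> Option<Vec<usize>> {
    required
        .fields()
        .iter()
        .map(|field| data.index_of(field.name()).ok())
        .collect()
}
```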
Add IgnoreCometSuite to ParquetVariantShreddingSuite in the 4.0.1 diff. VariantType shredding is a Spark 4.0 feature that Comet does not yet support (#2209). VariantShreddingSuite was already skipped but ParquetVariantShreddingSuite was missed, causing test failures in CI.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


Which issue does this PR close?
Closes #3046.
This PR is on a shared branch and replaces #3052.
Rationale for this change
What changes are included in this PR?
How are these changes tested?