[native_datafusion] [Spark SQL Tests] Schema incompatibility tests expect exceptions that native_datafusion handles gracefully #3311

Description

Summary

Several Spark SQL tests fail because they expect a SparkException when reading Parquet with mismatched schemas, but native_datafusion either handles type widening/mismatches gracefully or throws errors with different exception types/messages than Spark expects.

Failing Tests

Original 8 tests (from initial report)

| Test | Suite | Root Cause | Difficulty |
| --- | --- | --- | --- |
| SPARK-35640: read binary as timestamp should throw schema incompatible error | ParquetIOSuite | Native reader doesn't reject binary→timestamp | Easy — add type check in schema adapter |
| SPARK-35640: int as long should throw schema incompatible error | ParquetIOSuite | Native reader allows INT32→INT64 widening | Easy — add type check in schema adapter |
| SPARK-36182: can't read TimestampLTZ as TimestampNTZ | ParquetQuerySuite | INT96 timestamps don't carry timezone info in the Parquet schema, so the native reader can't detect the LTZ→NTZ mismatch | Hard — would need INT96-specific handling |
| SPARK-34212: Parquet should read decimals correctly | ParquetQuerySuite | Different decimal precision/scale handling | Medium — needs decimal-specific validation |
| SPARK-45604 / schema mismatch failure error message | ParquetSchemaSuite | Tests check for SchemaColumnConvertNotSupportedException, a Spark-internal type | Ignore — can't match Spark-internal exception types from native |
| Spark native readers should respect spark.sql.caseSensitive | FileBasedDataSourceSuite | Tests check for SparkRuntimeException with a specific error class | Ignore — can't match Spark error classes from native |
| row group skipping doesn't overflow when reading into larger type | ParquetReadSuite | Already fixed on main | N/A |

Additional tests discovered during PR #3416

| Test | Suite | Root Cause |
| --- | --- | --- |
| SPARK-25207: exception when duplicate fields in case-insensitive mode | ParquetFilterSuite (×2: V1 and V2) | Exception wrapping differs — Comet wraps in SparkException, but the cause is not the RuntimeException the test expects |
| SPARK-26709: OptimizeMetadataOnlyQuery does not handle empty records correctly | SQLQuerySuite | Schema validation rejects INT32→INT64 coercion (the empty file is written with lit(1) as INT32, but range() writes INT64, and the combined read fails) |

Analysis from PR #3416

PR #3416 attempted to add Spark-compatible schema validation in the native schema adapter. Key learnings:

Approach taken

  • Added validation in SparkSchemaAdapter.map_schema() (Rust) that rejects the type coercions Spark's vectorized reader rejects (sketched after this list)
  • New config spark.comet.parquet.schemaValidation.enabled (default: true)
  • Wrapped native schema errors in SparkException with compatible error messages in CometExecIterator
  • Passed schema_evolution_enabled to native side via proto
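
For illustration, here is a minimal Rust sketch of what that validation hook might look like. The struct, field names, and error text below are simplified assumptions, not the PR's actual code; the real adapter implements DataFusion's SchemaAdapter trait and covers more coercion cases than the two shown.

```rust
use arrow_schema::{DataType, Schema};

/// Simplified, hypothetical stand-in for the adapter described above.
struct SparkSchemaAdapter {
    /// Mirrors spark.comet.parquet.schemaValidation.enabled.
    schema_validation_enabled: bool,
}

impl SparkSchemaAdapter {
    /// Reject file->table coercions that Spark's vectorized reader rejects.
    fn map_schema(&self, file_schema: &Schema, table_schema: &Schema) -> Result<(), String> {
        if !self.schema_validation_enabled {
            return Ok(());
        }
        for table_field in table_schema.fields() {
            // Match file columns by name; case-sensitivity handling is elided.
            if let Ok(file_field) = file_schema.field_with_name(table_field.name()) {
                let (from, to) = (file_field.data_type(), table_field.data_type());
                // The two "easy" cases from the table above; the PR checked more.
                if matches!(
                    (from, to),
                    (DataType::Binary, DataType::Timestamp(_, _))
                        | (DataType::Int32, DataType::Int64)
                ) {
                    return Err(format!(
                        "Parquet column cannot be converted. Column: {}, Expected: {to}, Found: {from}",
                        table_field.name()
                    ));
                }
            }
        }
        Ok(())
    }
}
```

The JVM side would then pattern-match on messages like this in CometExecIterator and re-throw as a SparkException, per the bullet above.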

What worked

  • 2 tests were unignored (SPARK-35640 binary→timestamp and int→long)
  • Schema validation correctly catches most type mismatches

What didn't work (and why the PR was closed)

  • More tests were ignored (3) than unignored (2), making it a net negative
  • INT96 timestamp limitation: arrow-rs represents INT96 timestamps as Timestamp(Nanosecond, None) — the same representation as TimestampNTZ. The native reader cannot distinguish "this was originally LTZ written as INT96" from "this is NTZ", making SPARK-36182 unfixable without INT96-specific metadata tracking (illustrated after this list)
  • Exception type mismatch: Spark tests check for specific exception cause types (RuntimeException, SchemaColumnConvertNotSupportedException, SparkRuntimeException with error classes) that can't be produced from native Rust code
  • Schema evolution interaction: Some tests write data with mixed types across partitions (INT32 from lit(1) + INT64 from range()), and the validation correctly rejects INT32→INT64 on Spark 3.x where schema evolution is off — but this breaks tests that relied on silent widening
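
The INT96 point above can be shown directly with arrow-schema types. This is an illustration of the representation clash, not Comet code:

```rust
use arrow_schema::{DataType, TimeUnit};

fn main() {
    // How arrow-rs surfaces an INT96 timestamp: nanoseconds, no timezone.
    // Any "this was LTZ" information from Spark is already gone here.
    let decoded_int96 = DataType::Timestamp(TimeUnit::Nanosecond, None);

    // Spark's TimestampNTZ maps to exactly the same Arrow type.
    let timestamp_ntz = DataType::Timestamp(TimeUnit::Nanosecond, None);

    // Indistinguishable, so the schema adapter has nothing to reject.
    assert_eq!(decoded_int96, timestamp_ntz);
}
```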

Bug found and fixed

The fileNotFoundPattern regex in CometExecIterator had a bug: the pattern included the ^External: prefix, but the code stripped that prefix from the message before matching, so the pattern could never match. As a result, file-not-found errors were not wrapped in a SparkException with a FileNotFoundException cause, and 4 HiveMetadataCacheSuite tests failed. Fix: 320cc02.
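
The actual code is Scala in CometExecIterator; the Rust sketch below, with an invented pattern and message used purely for illustration, reproduces the logic error and the shape of the fix:

```rust
use regex::Regex;

fn main() {
    // Buggy: the pattern anchors on the "External: " prefix...
    let buggy = Regex::new(r"^External: .*(No such file or directory|not found)").unwrap();
    let msg = "External: ParquetError: foo.parquet: No such file or directory";

    // ...but the code strips that prefix before matching, so it never matches.
    let stripped = msg.strip_prefix("External: ").unwrap_or(msg);
    assert!(!buggy.is_match(stripped));

    // Fixed: drop the prefix from the pattern so it matches the stripped message.
    let fixed = Regex::new(r"(No such file or directory|not found)").unwrap();
    assert!(fixed.is_match(stripped));
}
```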

Recommended approach

Rather than adding schema validation on the native side, it may be simpler to:

  1. Ignore tests that check Spark-internal exception types — these can never be matched from native code
  2. Add targeted type checks for the 2 easy cases (binary→timestamp, int→long) without a full validation framework
  3. Fix the CometExecIterator regex bug independently (the fix from PR #3416 should be cherry-picked)
  4. Accept that INT96 timestamp LTZ/NTZ detection is a known limitation of the native reader
