## Summary
Several Spark SQL tests fail because they expect a `SparkException` when reading Parquet with mismatched schemas, but `native_datafusion` either handles type widening/mismatches gracefully or throws errors with different exception types/messages than Spark expects.
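For context, the failing tests follow roughly this pattern (a sketch, not the actual suite code; it assumes a ScalaTest suite with a `spark` session, and `path` is a placeholder):

```scala
import org.apache.spark.SparkException

// Write a column as INT32, then read it back with a LONG schema. Spark's vectorized
// reader throws a SparkException for this schema mismatch (SPARK-35640); with
// native_datafusion the read either widens INT32 -> INT64 silently or fails with a
// different exception type/message, so intercept[SparkException] does not pass.
val path = "/tmp/parquet-schema-mismatch"
spark.range(10).selectExpr("CAST(id AS INT) AS c").write.mode("overwrite").parquet(path)

val e = intercept[SparkException] {
  spark.read.schema("c LONG").parquet(path).collect()
}
```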
## Failing Tests
### Original 8 tests (from initial report)
| Test | Suite | Root Cause | Difficulty |
|---|---|---|---|
| SPARK-35640: read binary as timestamp should throw schema incompatible error | ParquetIOSuite | Native reader doesn't reject binary→timestamp | Easy: add type check in schema adapter |
| SPARK-35640: int as long should throw schema incompatible error | ParquetIOSuite | Native reader allows INT32→INT64 widening | Easy: add type check in schema adapter |
| SPARK-36182: can't read TimestampLTZ as TimestampNTZ | ParquetQuerySuite | INT96 timestamps don't carry timezone info in the Parquet schema, so the native reader can't detect the LTZ→NTZ mismatch (sketched below) | Hard: would need INT96-specific handling |
| SPARK-34212: Parquet should read decimals correctly | ParquetQuerySuite | Different decimal precision/scale handling | Medium: needs decimal-specific validation |
| SPARK-45604 / schema mismatch failure error message | ParquetSchemaSuite | Tests check for `SchemaColumnConvertNotSupportedException`, a Spark-internal type | Ignore: can't match Spark-internal exception types from native |
| Spark native readers should respect spark.sql.caseSensitive | FileBasedDataSourceSuite | Tests check for `SparkRuntimeException` with a specific error class | Ignore: can't match Spark error classes from native |
| row group skipping doesn't overflow when reading into larger type | ParquetReadSuite | Already fixed on main | N/A |
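To make the SPARK-36182 row concrete, here is a hedged sketch of the scenario (`path` and the values are placeholders, not the suite's code). INT96 carries no logical-type annotation, so nothing in the file schema records that the data was written as TIMESTAMP_LTZ:

```scala
import spark.implicits._

// Write an LTZ timestamp using the INT96 physical type (no logical-type annotation).
spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
Seq(java.sql.Timestamp.valueOf("2021-01-01 00:00:00")).toDF("ts")
  .write.mode("overwrite").parquet(path)

// Spark's reader rejects reading the column back as TIMESTAMP_NTZ; the native reader only
// sees a timezone-less nanosecond timestamp, so it cannot detect the LTZ -> NTZ mismatch.
spark.read.schema("ts TIMESTAMP_NTZ").parquet(path).collect()
```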
### Additional tests discovered during PR #3416
| Test | Suite | Root Cause |
|---|---|---|
| SPARK-25207: exception when duplicate fields in case-insensitive mode | ParquetFilterSuite (x2: V1 and V2) | Exception wrapping differs: Comet wraps in `SparkException`, but the cause is not the `RuntimeException` the test expects |
| SPARK-26709: OptimizeMetadataOnlyQuery does not handle empty records correctly | SQLQuerySuite | Schema validation rejects INT32→INT64 coercion (an empty file is written with `lit(1)` as INT32 while `range()` writes INT64, so the combined read fails; sketched below) |
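A rough reconstruction of the SPARK-26709 layout (paths are placeholders): the same column ends up as INT32 in one file and INT64 in another, so the combined read needs exactly the widening the validation rejects.

```scala
import org.apache.spark.sql.functions.lit

// Empty file whose footer schema declares the column as INT32 (lit(1) is an IntegerType).
spark.range(0).select(lit(1).as("id")).write.parquet(path + "/part=a")
// Non-empty file where the same column is INT64 (range() produces a LongType "id").
spark.range(10).write.parquet(path + "/part=b")

// Reading the whole directory mixes INT32 and INT64 files for "id"; with the PR #3416
// validation enabled, the INT32 -> INT64 coercion is rejected and the read fails.
spark.read.parquet(path).collect()
```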
## Analysis from PR #3416
PR #3416 attempted to add Spark-compatible schema validation in the native schema adapter. Key learnings:
### Approach taken
- Added validation in `SparkSchemaAdapter.map_schema()` (Rust) that rejects type coercions Spark's vectorized reader would reject (see the sketch after this list)
- New config `spark.comet.parquet.schemaValidation.enabled` (default: true)
- Wrapped native schema errors in `SparkException` with compatible error messages in `CometExecIterator`
- Passed `schema_evolution_enabled` to the native side via proto
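The actual validation lives in Rust inside `SparkSchemaAdapter.map_schema()`; the snippet below is only a hypothetical Scala illustration of the rule it enforces (the helper name is made up, and the real coercion table is larger):

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: would the coercion from the Parquet file type to the requested
// Spark type be accepted by Spark's vectorized reader? Only the cases named in this
// issue are shown; everything else falls through to "reject".
def isSparkCompatibleCoercion(fileType: DataType, requestedType: DataType): Boolean =
  (fileType, requestedType) match {
    case (a, b) if a == b            => true   // exact match is always allowed
    case (IntegerType, LongType)     => false  // SPARK-35640: int read as long must fail
    case (BinaryType, TimestampType) => false  // SPARK-35640: binary read as timestamp must fail
    case _                           => false  // placeholder: the real rule table is larger
  }
```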
### What worked
- 2 tests were unignored (SPARK-35640 binary→timestamp and int→long)
- Schema validation correctly catches most type mismatches
### What didn't work (and why the PR was closed)
- More tests were ignored (3) than unignored (2), making it a net negative
- INT96 timestamp limitation: Arrow-rs represents INT96 timestamps as `Timestamp(Nanosecond, None)`, the same representation as TimestampNTZ. The native reader cannot distinguish "this was originally LTZ written as INT96" from "this is NTZ", making SPARK-36182 unfixable without INT96-specific metadata tracking
- Exception type mismatch: Spark tests check for specific exception cause types (`RuntimeException`, `SchemaColumnConvertNotSupportedException`, `SparkRuntimeException` with error classes) that can't be produced from native Rust code (see the sketch after this list)
- Schema evolution interaction: some tests write data with mixed types across partitions (INT32 from `lit(1)` plus INT64 from `range()`), and the validation correctly rejects INT32→INT64 on Spark 3.x where schema evolution is off, but this breaks tests that relied on silent widening
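For the exception-type point, this is the shape of assertion the affected suites make (a sketch with placeholder `readSchema`/`path`, assuming a ScalaTest suite); the expected cause is a JVM class that an error surfaced from native code cannot be:

```scala
import org.apache.spark.SparkException

// The test intercepts a SparkException and then inspects the *cause*. Comet also raises a
// SparkException, but its cause originates in native code and is not the RuntimeException
// (or SchemaColumnConvertNotSupportedException) the test expects, so the assertion fails.
val e = intercept[SparkException] {
  spark.read.schema(readSchema).parquet(path).collect()
}
assert(e.getCause.isInstanceOf[RuntimeException])
```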
### Bug found and fixed
The `fileNotFoundPattern` regex in `CometExecIterator` had a bug: it included the `^External:` prefix in the pattern, but the code stripped that prefix before matching, so the pattern could never match. This caused 4 `HiveMetadataCacheSuite` tests to fail because file-not-found errors weren't being wrapped in `SparkException` with `FileNotFoundException`. Fix: 320cc02.
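A minimal sketch of the bug (the real pattern and message text in `CometExecIterator` differ; this only shows why the anchor could never match):

```scala
// The pattern keeps the "External: " prefix, but matching happens against a message from
// which that prefix has already been stripped, so findFirstIn always returns None.
val fileNotFoundPattern = "^External: .*No such file or directory.*".r   // illustrative pattern
val rawMessage = "External: Parquet error: No such file or directory: /tmp/t/part-0.parquet"
val stripped   = rawMessage.stripPrefix("External: ")

fileNotFoundPattern.findFirstIn(stripped)                  // None: the ^External: anchor never matches
".*No such file or directory.*".r.findFirstIn(stripped)    // Some(...): the fix (320cc02) drops the prefix
```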
## Recommended approach
Rather than adding schema validation on the native side, it may be simpler to:
- Ignore tests that check Spark-internal exception types; these can never be matched from native code
- Add targeted type checks for the 2 easy cases (binary→timestamp, int→long) without a full validation framework
- Fix the `CometExecIterator` regex bug independently (the fix from PR #3416 should be cherry-picked)
- Accept that INT96 timestamp LTZ/NTZ detection is a known limitation of the native reader
## Related
- PR #3416 (closed), "fix: Add Spark-compatible schema validation for native_datafusion scan": the full schema validation implementation
- Discovered in CI for #3307, "feat: enable native_datafusion in auto scan mode [WIP] [IGNORE]"