## Summary
Several Spark SQL tests fail because they expect a `SparkException` when reading Parquet with mismatched schemas, but `native_datafusion` either handles type widening/mismatches gracefully or throws errors with different exception types/messages than Spark expects.
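For context, the failing tests follow roughly this pattern (a sketch, not the actual suite code; it assumes a ScalaTest suite with a `spark` session, and `path` is a placeholder):

```scala
import org.apache.spark.SparkException

// Write a column as INT32, then read it back with a LONG schema. Spark's vectorized
// reader throws a SparkException for this schema mismatch (SPARK-35640); with
// native_datafusion the read either widens INT32 -> INT64 silently or fails with a
// different exception type/message, so intercept[SparkException] does not pass.
val path = "/tmp/parquet-schema-mismatch"
spark.range(10).selectExpr("CAST(id AS INT) AS c").write.mode("overwrite").parquet(path)

val e = intercept[SparkException] {
  spark.read.schema("c LONG").parquet(path).collect()
}
```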
## Failing Tests
### Original 8 tests (from initial report)
| Test | Suite | Root Cause | Difficulty |
|---|---|---|---|
| SPARK-35640: read binary as timestamp should throw schema incompatible error | ParquetIOSuite | Native reader doesn't reject binary→timestamp | Easy: add type check in schema adapter |
| SPARK-35640: int as long should throw schema incompatible error | ParquetIOSuite | Native reader allows INT32→INT64 widening | Easy: add type check in schema adapter |
| SPARK-36182: can't read TimestampLTZ as TimestampNTZ | ParquetQuerySuite | INT96 timestamps don't carry timezone info in the Parquet schema, so the native reader can't detect the LTZ→NTZ mismatch (sketched below) | Hard: would need INT96-specific handling |
| SPARK-34212: Parquet should read decimals correctly | ParquetQuerySuite | Different decimal precision/scale handling | Medium: needs decimal-specific validation |
| SPARK-45604 / schema mismatch failure error message | ParquetSchemaSuite | Tests check for `SchemaColumnConvertNotSupportedException`, a Spark-internal type | Ignore: can't match Spark-internal exception types from native |
| Spark native readers should respect spark.sql.caseSensitive | FileBasedDataSourceSuite | Tests check for `SparkRuntimeException` with a specific error class | Ignore: can't match Spark error classes from native |
| row group skipping doesn't overflow when reading into larger type | ParquetReadSuite | Already fixed on main | N/A |
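To make the SPARK-36182 row concrete, here is a hedged sketch of the scenario (`path` and the values are placeholders, not the suite's code). INT96 carries no logical-type annotation, so nothing in the file schema records that the data was written as TIMESTAMP_LTZ:

```scala
import spark.implicits._

// Write an LTZ timestamp using the INT96 physical type (no logical-type annotation).
spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
Seq(java.sql.Timestamp.valueOf("2021-01-01 00:00:00")).toDF("ts")
  .write.mode("overwrite").parquet(path)

// Spark's reader rejects reading the column back as TIMESTAMP_NTZ; the native reader only
// sees a timezone-less nanosecond timestamp, so it cannot detect the LTZ -> NTZ mismatch.
spark.read.schema("ts TIMESTAMP_NTZ").parquet(path).collect()
```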
### Additional tests discovered during PR #3416
| Test | Suite | Root Cause |
|---|---|---|
| SPARK-25207: exception when duplicate fields in case-insensitive mode | ParquetFilterSuite (x2: V1 and V2) | Exception wrapping differs: Comet wraps in `SparkException`, but the cause is not the `RuntimeException` the test expects |
| SPARK-26709: OptimizeMetadataOnlyQuery does not handle empty records correctly | SQLQuerySuite | Schema validation rejects INT32→INT64 coercion (an empty file is written with `lit(1)` as INT32 while `range()` writes INT64, so the combined read fails; sketched below) |
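A rough reconstruction of the SPARK-26709 layout (paths are placeholders): the same column ends up as INT32 in one file and INT64 in another, so the combined read needs exactly the widening the validation rejects.

```scala
import org.apache.spark.sql.functions.lit

// Empty file whose footer schema declares the column as INT32 (lit(1) is an IntegerType).
spark.range(0).select(lit(1).as("id")).write.parquet(path + "/part=a")
// Non-empty file where the same column is INT64 (range() produces a LongType "id").
spark.range(10).write.parquet(path + "/part=b")

// Reading the whole directory mixes INT32 and INT64 files for "id"; with the PR #3416
// validation enabled, the INT32 -> INT64 coercion is rejected and the read fails.
spark.read.parquet(path).collect()
```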
## Analysis from PR #3416
PR #3416 attempted to add Spark-compatible schema validation in the native schema adapter. Key learnings:
### Approach taken
- Added validation in `SparkSchemaAdapter.map_schema()` (Rust) that rejects type coercions Spark's vectorized reader would reject (see the sketch after this list)
- New config `spark.comet.parquet.schemaValidation.enabled` (default: true)
- Wrapped native schema errors in `SparkException` with compatible error messages in `CometExecIterator`
- Passed `schema_evolution_enabled` to the native side via proto
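The actual validation lives in Rust inside `SparkSchemaAdapter.map_schema()`; the snippet below is only a hypothetical Scala illustration of the rule it enforces (the helper name is made up, and the real coercion table is larger):

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: would the coercion from the Parquet file type to the requested
// Spark type be accepted by Spark's vectorized reader? Only the cases named in this
// issue are shown; everything else falls through to "reject".
def isSparkCompatibleCoercion(fileType: DataType, requestedType: DataType): Boolean =
  (fileType, requestedType) match {
    case (a, b) if a == b            => true   // exact match is always allowed
    case (IntegerType, LongType)     => false  // SPARK-35640: int read as long must fail
    case (BinaryType, TimestampType) => false  // SPARK-35640: binary read as timestamp must fail
    case _                           => false  // placeholder: the real rule table is larger
  }
```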
### What worked
- 2 tests were unignored (SPARK-35640 binary→timestamp and int→long)
- Schema validation correctly catches most type mismatches
### What didn't work (and why the PR was closed)
- More tests were ignored (3) than unignored (2), making it a net negative
- INT96 timestamp limitation: Arrow-rs represents INT96 timestamps as `Timestamp(Nanosecond, None)`, the same representation as TimestampNTZ. The native reader cannot distinguish "this was originally LTZ written as INT96" from "this is NTZ", making SPARK-36182 unfixable without INT96-specific metadata tracking
- Exception type mismatch: Spark tests check for specific exception cause types (`RuntimeException`, `SchemaColumnConvertNotSupportedException`, `SparkRuntimeException` with error classes) that can't be produced from native Rust code (see the sketch after this list)
- Schema evolution interaction: some tests write data with mixed types across partitions (INT32 from `lit(1)` plus INT64 from `range()`), and the validation correctly rejects INT32→INT64 on Spark 3.x where schema evolution is off, but this breaks tests that relied on silent widening
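For the exception-type point, this is the shape of assertion the affected suites make (a sketch with placeholder `readSchema`/`path`, assuming a ScalaTest suite); the expected cause is a JVM class that an error surfaced from native code cannot be:

```scala
import org.apache.spark.SparkException

// The test intercepts a SparkException and then inspects the *cause*. Comet also raises a
// SparkException, but its cause originates in native code and is not the RuntimeException
// (or SchemaColumnConvertNotSupportedException) the test expects, so the assertion fails.
val e = intercept[SparkException] {
  spark.read.schema(readSchema).parquet(path).collect()
}
assert(e.getCause.isInstanceOf[RuntimeException])
```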
### Bug found and fixed
The `fileNotFoundPattern` regex in `CometExecIterator` had a bug: it included the `^External:` prefix in the pattern, but the code stripped that prefix before matching, so the pattern could never match. This caused 4 `HiveMetadataCacheSuite` tests to fail because file-not-found errors weren't being wrapped in `SparkException` with `FileNotFoundException`. Fix: 320cc02.
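A minimal sketch of the bug (the real pattern and message text in `CometExecIterator` differ; this only shows why the anchor could never match):

```scala
// The pattern keeps the "External: " prefix, but matching happens against a message from
// which that prefix has already been stripped, so findFirstIn always returns None.
val fileNotFoundPattern = "^External: .*No such file or directory.*".r   // illustrative pattern
val rawMessage = "External: Parquet error: No such file or directory: /tmp/t/part-0.parquet"
val stripped   = rawMessage.stripPrefix("External: ")

fileNotFoundPattern.findFirstIn(stripped)                  // None: the ^External: anchor never matches
".*No such file or directory.*".r.findFirstIn(stripped)    // Some(...): the fix (320cc02) drops the prefix
```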
## Recommended approach
Rather than adding schema validation on the native side, it may be simpler to:
- Ignore tests that check Spark-internal exception types; these can never be matched from native code
- Add targeted type checks for the 2 easy cases (binary→timestamp, int→long) without a full validation framework
- Fix the `CometExecIterator` regex bug independently (the fix from PR #3416 should be cherry-picked)
- Accept that INT96 timestamp LTZ/NTZ detection is a known limitation of the native reader
## Related
- PR #3416 (closed), "fix: Add Spark-compatible schema validation for native_datafusion scan": the full schema validation implementation
- Discovered in CI for #3307, "feat: enable native_datafusion in auto scan mode [WIP] [IGNORE]"