[native_datafusion] [Spark SQL Tests] Parquet field ID matching not supported #3316

@andygrove

Summary

The six Spark SQL tests listed below fail because native_datafusion does not match Parquet columns by field ID.

Failing Tests

  • ParquetFieldIdIOSuite: "Parquet reads infer fields using field ids correctly"
  • ParquetFieldIdIOSuite: "absence of field ids"
  • ParquetFieldIdIOSuite: "SPARK-38094: absence of field ids: reading nested schema"
  • ParquetFieldIdIOSuite: "multiple id matches"
  • ParquetFieldIdIOSuite: "read parquet file without ids"
  • ParquetFieldIdIOSuite: "global read/write flag should work correctly"

Root Cause

native_datafusion resolves columns by name or position rather than by Parquet field ID, so it produces wrong results when spark.sql.parquet.fieldId.read.enabled is true.
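
To make the mismatch concrete, here is a minimal sketch in plain Python. It does not use Spark, Comet, or any Parquet library: the dicts, column names, IDs, values, and the two `resolve_*` helpers are all hypothetical stand-ins, chosen only to show how name-based and ID-based resolution can return different data for the same read schema.

```python
# Illustrative only: plain dicts stand in for a Parquet file and a Spark read schema.
# Column names, field IDs, and values are made up for this sketch.
FILE_COLUMNS = [
    {"name": "a", "field_id": 1, "values": [1, 2]},
    {"name": "b", "field_id": 2, "values": ["x", "y"]},
]

# The reader asks for a column called "b" that carries field ID 1,
# i.e. the column that the file stores under the name "a".
READ_SCHEMA = [{"name": "b", "field_id": 1}]


def resolve_by_name(file_columns, read_schema):
    """Match each requested field to the file column with the same name."""
    by_name = {col["name"]: col["values"] for col in file_columns}
    return {f["name"]: by_name.get(f["name"]) for f in read_schema}


def resolve_by_field_id(file_columns, read_schema):
    """Match each requested field to the file column with the same field ID."""
    by_id = {col["field_id"]: col["values"] for col in file_columns}
    return {f["name"]: by_id.get(f["field_id"]) for f in read_schema}


print(resolve_by_name(FILE_COLUMNS, READ_SCHEMA))      # name match picks file column "b"
print(resolve_by_field_id(FILE_COLUMNS, READ_SCHEMA))  # ID match picks file column "a"
```

In this toy case the name-based lookup silently returns the wrong column instead of failing, which is the kind of incorrect result the tests above are designed to catch.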

Possible Fix

In CometScanRule.nativeDataFusionScan(), detect when spark.sql.parquet.fieldId.read.enabled is true and fall back to native_iceberg_compat.
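
The decision logic could look roughly like the sketch below. It is written in Python purely for illustration and is not Comet's actual code: `choose_scan_impl` is a hypothetical helper, the configuration is modeled as a plain dict of strings, and the default of "false" is an assumption based on Spark's documented default for this setting.

```python
# Hypothetical sketch of the proposed fallback; not the real CometScanRule code.
FIELD_ID_READ_KEY = "spark.sql.parquet.fieldId.read.enabled"


def choose_scan_impl(conf):
    """Pick a scan implementation name from a dict of string config values.

    Falls back to native_iceberg_compat when field ID reads are enabled,
    otherwise keeps native_datafusion.
    """
    # Assumption: the setting defaults to "false" when it is not set.
    raw = conf.get(FIELD_ID_READ_KEY, "false")
    field_id_read_enabled = raw.strip().lower() == "true"
    return "native_iceberg_compat" if field_id_read_enabled else "native_datafusion"


print(choose_scan_impl({}))
print(choose_scan_impl({FIELD_ID_READ_KEY: "true"}))
```

Keying the fallback on the configuration flag alone is the simplest option. A narrower check, such as falling back only when the requested schema actually carries field IDs, would keep more scans on native_datafusion, but it would require inspecting the read schema.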

Related

Discovered in CI for #3307 (enable native_datafusion in auto scan mode).
