docs: Update Parquet scan documentation #3433
Conversation
Fix grammar, add encryption fallback and native_iceberg_compat hard-coded config limitations, clarify S3 section applies to both scan implementations, and remove orphaned link references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarify which limitations fall back to Spark vs which may produce incorrect results. Add missing documented limitations for native_datafusion (DPP, input_file_name, metadata columns). Fix misleading wording for ignoreCorruptFiles/ignoreMissingFiles. Note that auto mode currently always selects native_iceberg_compat. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The section intro already states all limitations fall back to Spark, so individual bullet points don't need to repeat it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure shared and per-scan limitation lists into two clear categories: features that fall back to Spark (safe) and issues that may produce incorrect results without falling back. Remove redundant "Comet falls back to Spark" from individual bullets where the section intro already states it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Suggested change:

- Comet currently has two distinct implementations of the Parquet scan operator.
- The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
+ Comet currently has the following distinct implementations of the Parquet scan operator:
+ - `native_datafusion`
+ - `native_iceberg_compat`
mbutrovich left a comment:

Some suggested changes.
Suggested change:

- If the data originates from `native_comet` scan (deprecated, will be removed in a future release) or from
- `native_iceberg_compat` in some cases, then ownership is not transferred to native and the JVM may re-use the
- underlying buffers in the future.
+ If the data originates from a scan that uses mutable buffers (such as Iceberg scans using the Iceberg Java integration path),

"Iceberg Java" is a bit ambiguous, since the native reader still needs Iceberg Java for planning. We could link to the hybrid reader here:
https://datafusion.apache.org/comet/user-guide/latest/iceberg.html#hybrid-reader

I am interested in standardizing terminology by referring to this codepath as a legacy path.
> The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
> `DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and
> `native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that

This sentence is hard to follow with the subject switching back and forth, making it unclear what "but has some performance overhead due to crossing the JVM/Rust boundary." is actually referring to. Suggest breaking it up.
> The configuration property
> `spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which
> currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting.
> However, it is possible to force Comet to try and use a particular implementation for all scan operations by setting

"to try and" is not doing anything here. Suggest removing.
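For illustration, overriding the scan implementation at submit time might look like the sketch below. The config key and the value `native_datafusion` come from the quoted docs; the plugin class follows the usual Comet setup, and the application jar name is a placeholder:

```shell
# Sketch: force a specific Parquet scan implementation (assumed setup).
# The default (auto) currently always selects native_iceberg_compat, so an
# override is mainly useful when testing the fully native path.
spark-submit \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.scan.impl=native_datafusion \
  my-app.jar
```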
> Julian/Gregorian calendar), dates/timestamps will be read as if they were written using the Proleptic Gregorian
> calendar. This may produce incorrect results for dates before October 15, 1582.
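To see why dates before the cutoff can shift, here is a small standalone sketch (not Comet code) using the standard Julian Day Number formulas: the Julian-calendar date 1582-10-05 and the Gregorian-calendar date 1582-10-15 label the same physical day, so a raw day count written under the legacy hybrid calendar and reinterpreted as proleptic Gregorian moves by 10 days near the cutover.

```python
def jdn_gregorian(y, m, d):
    # Julian Day Number of a proleptic Gregorian calendar date.
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - y2 // 100 + y2 // 400 - 32045

def jdn_julian(y, m, d):
    # Julian Day Number of a (proleptic) Julian calendar date.
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - 32083

# The same physical day carries different labels in the two calendars:
# Julian 1582-10-05 == Gregorian 1582-10-15 (JDN 2299161), a 10-day shift.
assert jdn_julian(1582, 10, 5) == jdn_gregorian(1582, 10, 15) == 2299161
```

The same nominal label, read under the wrong calendar, lands 10 days away, which is the kind of silent error the quoted limitation warns about.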
> The `native_datafusion` scan has some additional limitations. All of these cause Comet to fall back to Spark.

Suggestion: "mostly related to Parquet metadata columns."
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
Overview

This PR removes all references to the deprecated `native_comet` scan implementation from the documentation and configuration, and improves the accuracy and clarity of the Parquet scan documentation.
Changed Files
`common/src/main/scala/org/apache/comet/CometConf.scala`
- Moved `spark.comet.scan.impl` from `CATEGORY_SCAN` to `CATEGORY_PARQUET`
- Documented `native_datafusion` and `native_iceberg_compat` without referencing `native_comet`
- Removed the `.internal()` marker, making this configuration visible to users

`docs/source/contributor-guide/parquet_scans.md`
Major rewrite of the Parquet scan documentation:
- Removed the `native_comet` scan (previously listed as one of three implementations)
- Removed references to `native_comet` and the "benefits over native_comet" section
- Removed the `native_comet` S3 section (which described Hadoop-AWS-based S3 access)
- Clarified that the S3 section applies to both `native_datafusion` and `native_iceberg_compat` (previously only referenced `native_datafusion`)
- Noted that `auto` mode currently always selects `native_iceberg_compat`
- Documented the encryption fallback (… results with reduced performance) for both scans, and hard-coded config defaults for `native_iceberg_compat`
- Added `native_datafusion` limitations that cause fallback:
  - `input_file_name()`, `input_file_block_start()`, `input_file_block_length()` SQL functions
  - metadata columns (`_metadata.file_path`)
- `ignoreMissingFiles`/`ignoreCorruptFiles` (previously said "not compatible with Spark", now clarifies it falls back to Spark)
- Removed orphaned link references (#1545, #1758) that referenced old `native_datafusion` issues

`docs/source/contributor-guide/ffi.md`
- Replaced `native_comet` with a general description of scans that use mutable buffers

`docs/source/contributor-guide/roadmap.md`
- Updated the `native_comet` to `native_iceberg_compat` transition