Conversation

@gaborkaszab
Collaborator

No description provided.

@gaborkaszab
Collaborator Author

This is needed for a follow-up step in which the partition stats Scan API will filter the stats. InternalData is used to read the partition stats, and filtering is to be introduced as a next step; however, InternalData does not support filtering yet.

@gaborkaszab
Collaborator Author

Another way to implement this is to introduce a SupportsFiltering interface that the Parquet reader can implement. At the creation site we would then check whether the reader is an instance of SupportsFiltering and, if so, add the filter.
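
To make the alternative concrete, here is a minimal sketch of the opt-in interface idea. It is not the actual Iceberg code: `RowReadBuilder`, `FilteringBuilder`, `PlainBuilder`, and `ReaderFactory` are hypothetical names, `java.util.function.Predicate` stands in for `Expression`, and the filtering builder applies the filter exactly to each row, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a read builder; not the actual Iceberg interface.
interface RowReadBuilder<T> {
  List<T> build();
}

// Hypothetical opt-in capability interface, as proposed in the comment above.
interface SupportsFiltering<T> {
  void filter(Predicate<T> rowFilter);
}

// A builder that accepts a filter (plays the role of the Parquet reader).
class FilteringBuilder<T> implements RowReadBuilder<T>, SupportsFiltering<T> {
  private final List<T> rows;
  private Predicate<T> rowFilter = row -> true;

  FilteringBuilder(List<T> rows) {
    this.rows = rows;
  }

  @Override
  public void filter(Predicate<T> newFilter) {
    this.rowFilter = newFilter;
  }

  @Override
  public List<T> build() {
    List<T> result = new ArrayList<>();
    for (T row : rows) {
      if (rowFilter.test(row)) {
        result.add(row);
      }
    }
    return result;
  }
}

// A builder without filtering support (for example, a format with no pushdown).
class PlainBuilder<T> implements RowReadBuilder<T> {
  private final List<T> rows;

  PlainBuilder(List<T> rows) {
    this.rows = rows;
  }

  @Override
  public List<T> build() {
    return new ArrayList<>(rows);
  }
}

final class ReaderFactory {
  private ReaderFactory() {}

  // Creation site: add the filter only when the builder opts in.
  static <T> List<T> read(RowReadBuilder<T> builder, Predicate<T> rowFilter) {
    if (builder instanceof SupportsFiltering) {
      @SuppressWarnings("unchecked")
      SupportsFiltering<T> filtering = (SupportsFiltering<T>) builder;
      filtering.filter(rowFilter);
    }
    return builder.build();
  }
}
```

With this shape, a builder that does not implement the interface silently returns unfiltered rows, so the caller cannot rely on the filter having been applied.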

@gaborkaszab requested a review from pvary November 19, 2025 15:32

/** Set a custom class for in-memory objects at the given field ID. */
ReadBuilder setCustomType(int fieldId, Class<? extends StructLike> structClass);

/** Set a filter to apply on result rows if applicable. */
Contributor

It is very important to mention that the caller still needs to do residual filtering, because some filters might not be supported, and some formats might not support filtering at all.
Something like this:


  /**
   * Pushes down the {@link Expression} filter for the reader to prevent reading unnecessary
   * records. Some readers may not support filtering, or may only support filtering for certain
   * expressions. In these cases the reader might return unfiltered or partially filtered rows. It
   * is the caller's responsibility to apply the filter again.
   *
   * @param filter the filter to set
   */
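
As an illustration of the contract described in this Javadoc, the following sketch shows a caller re-applying the filter to whatever the reader returns. All names here (`ResidualFiltering`, `readBestEffort`, `readFiltered`) are hypothetical, and a `Predicate` stands in for `Expression`; the reader in this sketch ignores the pushed-down filter entirely, which models a format without filtering support.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class ResidualFiltering {
  private ResidualFiltering() {}

  // Hypothetical best-effort reader: it accepts a filter but is free to ignore it.
  // Here it returns every row, modelling a format with no pushdown support.
  static List<Integer> readBestEffort(List<Integer> rows, Predicate<Integer> pushedDown) {
    return rows;
  }

  // Caller side: push the filter down, then apply the same filter again (residual filtering).
  static List<Integer> readFiltered(List<Integer> rows, Predicate<Integer> filter) {
    return readBestEffort(rows, filter).stream()
        .filter(filter)
        .collect(Collectors.toList());
  }
}
```

Because the residual step uses the same predicate, the result is correct whether the reader filtered fully, partially, or not at all.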

ReadBuilder setCustomType(int fieldId, Class<? extends StructLike> structClass);

/** Set a filter to apply on result rows if applicable. */
default ReadBuilder filter(Expression newFilter) {
Contributor

nit: maybe call the parameter `filter`

@gaborkaszab
Collaborator Author

Took another look, and apparently our internal Parquet reader can only do row group filtering. For InternalData, which reads metadata (including partition stats), such filtering probably won't bring any additional value, because this kind of data rarely spans multiple row groups.
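
The point about row groups can be sketched with a toy model. This is not the Parquet reader's implementation: `RowGroup` and `RowGroupPruning` are hypothetical, and pruning here uses only a per-group maximum as a stand-in for row group statistics. A group is either skipped entirely or returned entirely, so a file with a single row group gains nothing as soon as one row might match.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of a row group: its rows plus min/max statistics.
final class RowGroup {
  final List<Integer> values;
  final int min;
  final int max;

  // Assumes a non-empty list; Collections.min/max throw on an empty one.
  RowGroup(List<Integer> values) {
    this.values = values;
    this.min = Collections.min(values);
    this.max = Collections.max(values);
  }
}

final class RowGroupPruning {
  private RowGroupPruning() {}

  // Evaluates "value >= lowerBound" at row group granularity only:
  // a group is skipped when its max rules out any match, otherwise all its rows are returned.
  static List<Integer> readAtLeast(List<RowGroup> groups, int lowerBound) {
    List<Integer> result = new ArrayList<>();
    for (RowGroup group : groups) {
      if (group.max >= lowerBound) {
        result.addAll(group.values);
      }
    }
    return result;
  }
}
```

With two groups holding `[1, 2, 3]` and `[7, 8, 9]` and a lower bound of 8, the first group is skipped and the second is returned whole, including the non-matching 7. With a single group holding `[1, 5, 9]`, every row comes back.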
