Optimize the parquet RecordReader implementation when: A. filter predicate is pushed down, B. filter predicate is pushed down on a flat schema #1640

Description

The current RecordReader implementation reads all of the columns before applying the filter predicate and deciding whether to keep or discard the row.
Instead, the RecordReader could first assemble only the columns on which filters are applied (usually just a few), apply the filter, and then either assemble the remaining columns or skip them, depending on the result.
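A minimal sketch of this two-phase assembly, using a hypothetical ColumnReader interface rather than the actual parquet-mr API:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical per-column reader abstraction; not the real parquet-mr API.
interface ColumnReader {
    Object readValue();  // assemble the current row's value for this column
    void skipValue();    // advance past the current row's value without assembling it
}

/** Assemble filter columns first, evaluate the predicate, then assemble or skip the rest. */
class FilteringRecordReader {
    private final List<ColumnReader> filterColumns;
    private final List<ColumnReader> remainingColumns;
    private final Predicate<Object[]> rowPredicate;

    FilteringRecordReader(List<ColumnReader> filterColumns,
                          List<ColumnReader> remainingColumns,
                          Predicate<Object[]> rowPredicate) {
        this.filterColumns = filterColumns;
        this.remainingColumns = remainingColumns;
        this.rowPredicate = rowPredicate;
    }

    /** Returns the assembled row, or null if the predicate rejected it. */
    Object[] readRow() {
        // Phase 1: assemble only the (usually few) columns the filter needs.
        Object[] filterValues = new Object[filterColumns.size()];
        for (int i = 0; i < filterColumns.size(); i++) {
            filterValues[i] = filterColumns.get(i).readValue();
        }
        if (!rowPredicate.test(filterValues)) {
            // Row rejected: skip the remaining columns instead of assembling them.
            for (ColumnReader c : remainingColumns) {
                c.skipValue();
            }
            return null;
        }
        // Phase 2: row kept, so assemble the remaining columns as well.
        Object[] row = new Object[filterColumns.size() + remainingColumns.size()];
        System.arraycopy(filterValues, 0, row, 0, filterValues.length);
        for (int i = 0; i < remainingColumns.size(); i++) {
            row[filterValues.length + i] = remainingColumns.get(i).readValue();
        }
        return row;
    }
}
```

The saving comes from the skip path: for rejected rows, the non-filter columns are advanced without ever being materialized into records.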

Also, for applications like Spark SQL, the schema is usually flat, with no repeated or nested columns. In such cases it's better to have a lightweight, faster RecordReader.
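A sketch of the flat-schema fast path, again with a hypothetical per-column reader: with no repeated or nested fields, every row contributes exactly one value per column, so assembly reduces to a plain loop with no repetition-level state machine:

```java
import java.util.List;

// Hypothetical per-column reader, as in the sketch above.
interface FlatColumnReader {
    Object readValue();
}

/**
 * Flat schemas have a maximum repetition level of 0, so each row is exactly
 * one value per column and no repetition-level bookkeeping is required.
 */
class FlatRecordReader {
    private final List<FlatColumnReader> columns;

    FlatRecordReader(List<FlatColumnReader> columns) {
        this.columns = columns;
    }

    Object[] readRow() {
        Object[] row = new Object[columns.size()];
        for (int i = 0; i < columns.size(); i++) {
            row[i] = columns.get(i).readValue();
        }
        return row;
    }
}
```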

The performance improvement from this change is significant, and is greatest when filtering returns a small number of rows (which is usually the case) and the schema has many columns.

Reporter: Yash Datta / @saucam

Note: This issue was originally created as PARQUET-128. Please see the migration documentation for further details.
