Add api to read dictionary from each column chunk for predicate pushdown #1887

@asfimport

Description

A Parquet file's dictionaries could be used for predicate pushdown.
For example, the SQL query:
select * from table where column = 10;
could skip reading the whole row group if the dictionary for column has values [5, 11, 17, 20], because 10 is not among them.
This could save IO and improve performance.
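
The skip decision for an equality predicate can be sketched roughly as below. This is only an illustration: `DictionaryEqualityFilter` and `canSkipRowGroup` are hypothetical names, not part of Parquet or Presto, and the dictionary is modeled as a plain set of long values.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not an existing Parquet or Presto class.
final class DictionaryEqualityFilter {

    /**
     * Decides whether a row group can be skipped for the predicate "column = target".
     * dictionaryValues is null when no complete dictionary is available for the column.
     */
    static boolean canSkipRowGroup(Set<Long> dictionaryValues, long target) {
        if (dictionaryValues == null) {
            // Without a dictionary we know nothing about the values, so we must read.
            return false;
        }
        // Every value in the row group comes from the dictionary, so if the target
        // is absent from the dictionary, no row can match.
        return !dictionaryValues.contains(target);
    }

    public static void main(String[] args) {
        Set<Long> dictionary = new HashSet<>(Arrays.asList(5L, 11L, 17L, 20L));
        System.out.println(canSkipRowGroup(dictionary, 10L)); // 10 is not in the dictionary
        System.out.println(canSkipRowGroup(dictionary, 11L)); // 11 is in the dictionary
    }
}
```

With the dictionary from the example, the first call reports that the row group can be skipped and the second reports that it must be read.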

We implemented dictionary-based predicate pushdown in Presto for Parquet files, and benchmarks show up to a 40x speedup for selective queries.

We need to add an API to ParquetFileReader that returns the dictionaries for the requested columns.
If the column is not dictionary encoded in this row group, return null.
If not all of the column's pages are dictionary encoded in this row group, return null, since the dictionary would then not cover every value in the column chunk.
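
A minimal sketch of the return rules above, assuming a simplified in-memory model of a column chunk. `ColumnChunkSketch`, `DictionaryReaderSketch`, and `readDictionary` are hypothetical names chosen for illustration; they are not the ParquetFileReader API, whose eventual shape this issue leaves open.

```java
// Hypothetical model of one column chunk in a row group; not a Parquet class.
final class ColumnChunkSketch {
    final long[] dictionary;                  // null if the chunk has no dictionary page
    final boolean[] pageIsDictionaryEncoded;  // one flag per data page in the chunk

    ColumnChunkSketch(long[] dictionary, boolean[] pageIsDictionaryEncoded) {
        this.dictionary = dictionary;
        this.pageIsDictionaryEncoded = pageIsDictionaryEncoded;
    }
}

// Hypothetical reader helper implementing the two "return null" rules.
final class DictionaryReaderSketch {

    /** Returns the chunk's dictionary, or null if it cannot be used for pushdown. */
    static long[] readDictionary(ColumnChunkSketch chunk) {
        if (chunk.dictionary == null) {
            return null; // the column is not dictionary encoded in this row group
        }
        for (boolean encoded : chunk.pageIsDictionaryEncoded) {
            if (!encoded) {
                return null; // some page is not dictionary encoded, so the dictionary is incomplete
            }
        }
        return chunk.dictionary.clone(); // defensive copy of the complete dictionary
    }

    public static void main(String[] args) {
        long[] values = {5L, 11L, 17L, 20L};
        ColumnChunkSketch full = new ColumnChunkSketch(values, new boolean[] {true, true});
        ColumnChunkSketch mixed = new ColumnChunkSketch(values, new boolean[] {true, false});
        System.out.println(readDictionary(full) != null);   // dictionary is usable
        System.out.println(readDictionary(mixed) == null);  // dictionary is rejected
    }
}
```

Returning null in the mixed case keeps callers from skipping a row group whose non-dictionary pages might still contain the value being searched for.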

Reporter: Zhenxiao Luo / @zhenxiao
Assignee: Zhenxiao Luo / @zhenxiao

Related issues:

Note: This issue was originally created as PARQUET-374. Please see the migration documentation for further details.
