parquet: Add an option to not parse the Page Index on each query #12547

@progval

Is your feature request related to a problem or challenge?

CREATE TABLE does not parse the Page Index, and SELECT does not cache it, so it is parsed again on every query. This can make queries on large Parquet datasets take a long time even when they return only a few rows.

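For example, a simple SELECT int_column, other_int_column WHERE int_column=123456 on a table with 184 billion rows (so about 9 million Page Index items, given the default page size of 20k rows) reports the metrics below. Opening the files takes about 96 s (time_elapsed_opening), while scanning itself takes about 17 ms (time_elapsed_scanning_total):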

output_rows=0, elapsed_compute=96ns, num_predicate_creation_errors=0, page_index_rows_filtered=0, predicate_evaluation_errors=0, row_groups_pruned_bloom_filter=21050, row_groups_matched_bloom_filter=0, file_open_errors=0, file_scan_errors=0, bytes_scanned=25023432248, row_groups_matched_statistics=21050, pushdown_rows_filtered=0, row_groups_pruned_statistics=173576, time_elapsed_scanning_total=16.763964ms, page_index_eval_time=3.153918ms, time_elapsed_scanning_until_data=16.745759ms, time_elapsed_processing=61.531313027s, time_elapsed_opening=96.012649352s, pushdown_eval_time=382ns
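The Page Index size quoted above can be checked with a short calculation. This is only a back-of-the-envelope sketch: it assumes every page holds exactly the default 20,000 rows and counts one entry per page for a single column. The variable names are illustrative and do not come from DataFusion.

```python
# Rough estimate of Page Index entries for one column.
# Assumption: every page is filled to the default limit of 20,000 rows.
total_rows = 184_000_000_000   # rows in the table, from the report above
rows_per_page = 20_000         # default page size in rows

entries_per_column = total_rows // rows_per_page
print(f"{entries_per_column:,} Page Index entries per column")
```

Under these assumptions the result is 9.2 million entries per column, which matches the "about 9 million" figure.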

Describe the solution you'd like

Parse the Page Index once and keep it cached, either eagerly on CREATE TABLE or lazily as SELECT queries read the files. (Note that for partitioned tables, the first SELECT may not read every file.)
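To make the two variants concrete, here is a minimal sketch of the idea in Python. It is not DataFusion code: `PageIndexCache`, `fake_parse`, and the file names are hypothetical, and the parse step is a stand-in for the expensive Page Index decoding. `get` is the lazy variant (parse on first read), and `warm` is the eager variant (parse everything up front, as CREATE TABLE could).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class PageIndexCache:
    """Hypothetical cache that parses each file's Page Index at most once."""

    parse: Callable[[str], dict]  # stand-in for the expensive parse step
    _entries: Dict[str, dict] = field(default_factory=dict)

    def get(self, path: str) -> dict:
        # Lazy variant: parse only the first time a file is read.
        if path not in self._entries:
            self._entries[path] = self.parse(path)
        return self._entries[path]

    def warm(self, paths: Iterable[str]) -> None:
        # Eager variant: parse every listed file up front.
        for path in paths:
            self.get(path)


calls = []


def fake_parse(path: str) -> dict:
    # Records each invocation so the number of real parses is visible.
    calls.append(path)
    return {"path": path}


cache = PageIndexCache(parse=fake_parse)
cache.get("part-0.parquet")   # first query: parses part-0
cache.get("part-0.parquet")   # second query: served from the cache
cache.warm(["part-0.parquet", "part-1.parquet"])  # only part-1 is parsed
print(len(calls))  # 2
```

A real implementation would also need to invalidate entries when a file changes (for example by keying on size or last-modified time) and to bound memory use. Both are left out of this sketch.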

Describe alternatives you've considered

The approach shown in https://github.com/apache/datafusion/blob/3b93cc952b889cec2364ad2490ae18ecddb3ca49/datafusion-examples/examples/advanced_parquet_index.rs is an option, but it requires using the low-level API and is not available through the SQL or Python interfaces.

Additional context

No response

Labels: enhancement
