Description
Is your feature request related to a problem or challenge?
CREATE TABLE does not parse the Page Index, and SELECT does not cache it. As a result, queries on large Parquet datasets can take a long time even when they return only a small number of rows.
For example, with a simple SELECT int_column, other_int_column WHERE int_column=123456 on a table with 184 billion rows (so about 9 million Page Index items, given the default 20k page size), the scan metrics are:
output_rows=0, elapsed_compute=96ns, num_predicate_creation_errors=0, page_index_rows_filtered=0, predicate_evaluation_errors=0, row_groups_pruned_bloom_filter=21050, row_groups_matched_bloom_filter=0, file_open_errors=0, file_scan_errors=0, bytes_scanned=25023432248, row_groups_matched_statistics=21050, pushdown_rows_filtered=0, row_groups_pruned_statistics=173576, time_elapsed_scanning_total=16.763964ms, page_index_eval_time=3.153918ms, time_elapsed_scanning_until_data=16.745759ms, time_elapsed_processing=61.531313027s, time_elapsed_opening=96.012649352s, pushdown_eval_time=382ns
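As a rough check of the "about 9 million" figure, here is the arithmetic as a small Python snippet. It assumes every page holds exactly 20,000 rows and counts entries for a single column; real files will differ somewhat, and the variable names are only illustrative.

```python
# Back-of-the-envelope estimate of Page Index entries per column.
# Assumption: every data page is full at the default limit of 20,000 rows.
total_rows = 184_000_000_000
rows_per_page = 20_000

page_index_entries = total_rows // rows_per_page
print(f"{page_index_entries:,} Page Index entries per column")
```

With two columns in the projection, the number of entries to decode would be roughly double that, under the same assumptions.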
Describe the solution you'd like
Parse it once and for all, either on CREATE TABLE or lazily as SELECT queries read the files. (Note that in the case of partitioned tables, not all files may be read by the first SELECT.)
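To illustrate the lazy variant, here is a minimal sketch in plain Python. It is not DataFusion's API: `LazyPageIndexCache`, `fake_parse`, and the file paths are all hypothetical names, and the parse function is a stand-in for the expensive Page Index decoding. The point is only that each file is parsed the first time a query touches it and reused afterwards.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class LazyPageIndexCache(Generic[T]):
    """Hypothetical per-file cache: parse each file's Page Index at most once."""

    def __init__(self, parse: Callable[[str], T]) -> None:
        self._parse = parse
        self._entries: Dict[str, T] = {}
        self.parse_calls = 0  # counts how often the expensive parse actually ran

    def get(self, path: str) -> T:
        # Parse only on the first access to a given file, then reuse the result.
        if path not in self._entries:
            self.parse_calls += 1
            self._entries[path] = self._parse(path)
        return self._entries[path]


def fake_parse(path: str) -> dict:
    # Stand-in for decoding the Page Index of one Parquet file (assumption).
    return {"path": path, "pages": []}


cache = LazyPageIndexCache(fake_parse)
cache.get("part=1/a.parquet")  # first SELECT: parsed
cache.get("part=1/a.parquet")  # second SELECT: served from the cache
cache.get("part=2/b.parquet")  # other partition: parsed on first touch
print(cache.parse_calls)
```

A real implementation would also need to bound memory use and invalidate entries when a file changes, which this sketch leaves out.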
Describe alternatives you've considered
but it requires using the low-level API, and is not available through the SQL or Python interfaces.
Additional context
No response