Advanced example for building an external index for Row Groups within parquet files #10580

alamb · 2024-05-20T15:56:20Z

Is your feature request related to a problem or challenge?

It is common in databases and other analytic systems to have additional external "indexes" (perhaps stored in the "metadata catalog", perhaps stored alongside the data files, perhaps embedded in the files, perhaps elsewhere).

These indexes are used to speed up queries by "pruning": evaluating a predicate on the index and then reading only the portions of files that could pass the filters in the query. In #10546 we showed how to create an index for entire files.

I would also like to create an example of how to build such an index for row groups within a file (showing how to read the file without re-reading the metadata each time).
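To make the idea concrete, here is a minimal sketch of a row-group-level index using only the Rust standard library. It is not the DataFusion or parquet API: `RowGroupStats`, `RowGroupIndex`, and the single `i64` column are hypothetical names and simplifications chosen for illustration. The statistics are stored once per file and then consulted for each query, so nothing needs to be re-read from the file footer to decide which row groups to scan.

```rust
use std::collections::HashMap;

/// Min/max of one column for a single row group (hypothetical layout).
#[derive(Debug, Clone, Copy, PartialEq)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// External index: file name -> per-row-group statistics.
/// Built once (for example when the file is written) and kept in memory.
#[derive(Default)]
struct RowGroupIndex {
    files: HashMap<String, Vec<RowGroupStats>>,
}

impl RowGroupIndex {
    fn add_file(&mut self, name: &str, stats: Vec<RowGroupStats>) {
        self.files.insert(name.to_string(), stats);
    }

    /// For the predicate `column = value`, return the ordinals of the row
    /// groups that might contain a match. `None` means the file is not in
    /// the index, so the caller cannot prune and must scan every row group.
    fn row_groups_matching_eq(&self, name: &str, value: i64) -> Option<Vec<usize>> {
        let stats = self.files.get(name)?;
        Some(
            stats
                .iter()
                .enumerate()
                .filter(|(_, s)| s.min <= value && value <= s.max)
                .map(|(i, _)| i)
                .collect(),
        )
    }
}

fn main() {
    let mut index = RowGroupIndex::default();
    index.add_file(
        "a.parquet",
        vec![
            RowGroupStats { min: 0, max: 99 },
            RowGroupStats { min: 100, max: 199 },
            RowGroupStats { min: 200, max: 299 },
        ],
    );
    // Only row groups whose [min, max] range covers 150 need to be read.
    println!("{:?}", index.row_groups_matching_eq("a.parquet", 150));
}
```

Min/max pruning is conservative: a row group that survives may still contain no matching rows, but a row group that is pruned cannot contain any.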

To complete this example, I think we need:

The API from @NGA-TRAN in "[EPIC] Efficiently and correctly extract parquet statistics into ArrayRefs" #10453
The API described in "API in ParquetExec to pass in RowSelections to ParquetExec (enable custom indexes, finer grained pushdown)" #9929
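For intuition about what the second item enables, the sketch below turns a list of surviving row groups into alternating skip/select runs over the rows of a file. This is an illustration under stated assumptions only: `Run` and `runs_for_row_groups` are hypothetical names, and this is not the API proposed in #9929.

```rust
/// A run of consecutive rows that is either skipped or read
/// (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Given the row count of each row group and the ordinals of the row groups
/// to keep, produce a compact list of runs covering the whole file.
/// Adjacent row groups with the same decision are merged into one run.
fn runs_for_row_groups(row_counts: &[usize], keep: &[usize]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for (i, &n) in row_counts.iter().enumerate() {
        if n == 0 {
            continue; // empty row groups contribute no rows
        }
        let selected = keep.contains(&i);
        // Extend the previous run if it has the same decision.
        let merged = match (runs.last_mut(), selected) {
            (Some(Run::Select(count)), true) | (Some(Run::Skip(count)), false) => {
                *count += n;
                true
            }
            _ => false,
        };
        if !merged {
            runs.push(if selected { Run::Select(n) } else { Run::Skip(n) });
        }
    }
    runs
}

fn main() {
    // Four row groups; an index decided that only groups 1 and 2 can match.
    let runs = runs_for_row_groups(&[100, 50, 50, 200], &[1, 2]);
    println!("{:?}", runs);
}
```

A finer-grained index could emit runs that start and stop inside a row group; this sketch only works at row group boundaries.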

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

This is a follow-on to #10546.


alamb · 2024-06-17T14:24:54Z

A PR for this example is now ready for review: #10701

alamb self-assigned this May 20, 2024

alamb mentioned this issue May 28, 2024

Add advanced_parquet_index.rs example of index in into parquet files #10701

Merged

This was referenced Jun 11, 2024

DataFusion weekly project plan (Andrew Lamb) - June 10, 2024 #10869

Closed

Docs: clarify when the parquet reader will read from object store when using cached metadata #10909

Merged

alamb mentioned this issue Jun 17, 2024

DataFusion weekly project plan (Andrew Lamb) - June 17, 2024 #10955

Closed


alamb closed this as completed in #10701 Jun 22, 2024

alamb mentioned this issue Jun 24, 2024

DataFusion weekly project plan (Andrew Lamb) - June 24, 2024 #11106

Closed

