Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Suggested by @crepererum in #4169 (comment)
Some systems, such as IOx, store parquet files in a particular sorted order and then use the fact that the data is sorted for a variety of sort-related optimizations.
Storing sorted data in parquet is often a key performance technique, as it "clusters" data in interesting ways that can make predicate evaluation and other query techniques faster.
The BasicEnforcement rule added in #4122 by @mingmwang allows DataFusion to take advantage of known information about the sort order.
One contrived example: if your parquet file is sorted by `price` and your query is `select * from data order by price limit 10`, DataFusion can avoid scanning the entire file.
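To make the `limit` example concrete, here is a minimal sketch of why known sort order helps. It does not use DataFusion's API: `Chunk`, `limit_on_sorted`, and `limit_on_unsorted` are hypothetical names, and each chunk stands in for a portion of a file (for example a row group) under the assumption that the chunks are globally sorted by `price`.

```rust
/// Hypothetical stand-in for one chunk (e.g. a row group) of `price` values.
type Chunk = Vec<i64>;

/// Assumes the chunks are globally sorted ascending by price.
/// Returns the first `limit` prices and the number of chunks that were read.
fn limit_on_sorted(chunks: &[Chunk], limit: usize) -> (Vec<i64>, usize) {
    let mut out = Vec::with_capacity(limit);
    let mut chunks_read = 0;
    for chunk in chunks {
        // Stop as soon as enough rows were collected: later chunks
        // cannot contain smaller prices.
        if out.len() >= limit {
            break;
        }
        chunks_read += 1;
        let remaining = limit - out.len();
        out.extend(chunk.iter().take(remaining).copied());
    }
    (out, chunks_read)
}

/// Without sort information, every chunk has to be read and sorted
/// before the limit can be applied.
fn limit_on_unsorted(chunks: &[Chunk], limit: usize) -> (Vec<i64>, usize) {
    let mut all: Vec<i64> = chunks.iter().flatten().copied().collect();
    all.sort_unstable();
    all.truncate(limit);
    (all, chunks.len())
}
```

With three chunks of three rows each and a limit of 4, the sorted variant touches only two chunks, while the unsorted variant has to touch all three.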
Another, more interesting example could be using the sorted order to reorder pushdown filters, or using a sort-merge join without actually sorting.
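The sort-merge join case can be sketched as follows. This is an illustrative, standalone inner join over two inputs that are assumed to be already sorted ascending by key, so no sort step is needed. `merge_join` and the tuple row layout are hypothetical and are not DataFusion's implementation.

```rust
use std::cmp::Ordering;

/// Inner join of two inputs that are ALREADY sorted ascending by key.
/// Returns (key, left_value, right_value) for every matching pair.
fn merge_join<'a>(
    left: &[(i64, &'a str)],
    right: &[(i64, &'a str)],
) -> Vec<(i64, &'a str, &'a str)> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        match left[i].0.cmp(&right[j].0) {
            // Advance whichever side currently has the smaller key.
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                let key = left[i].0;
                // Find the end of the run of equal keys on the right side.
                let mut j_end = j;
                while j_end < right.len() && right[j_end].0 == key {
                    j_end += 1;
                }
                // Pair every left row with this key against the right run.
                while i < left.len() && left[i].0 == key {
                    for r in &right[j..j_end] {
                        out.push((key, left[i].1, r.1));
                    }
                    i += 1;
                }
                j = j_end;
            }
        }
    }
    out
}
```

Each input is walked once from start to end, which is only correct because both sides arrive in key order. That is exactly the information a known file sort order would provide.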
I have a question: if we detect the sort information when initializing the physical plan, would it cause a performance regression, since we would need to read the metadata of all the parquet files?
It depends on where you place the parquet metadata. We likely don't want to pre-fetch metadata when constructing the physical plan. However, you could store the metadata in some catalog or cache, in which case it could be available during planning.
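A rough sketch of the cache idea, under stated assumptions: `SortKey`, `SortOrderCache`, and `get_or_load` are hypothetical names, not DataFusion or parquet APIs, and the `read_footer` closure stands in for whatever code would actually read the file metadata.

```rust
use std::collections::HashMap;

/// Hypothetical description of one sort key: column name plus direction.
#[derive(Clone, Debug, PartialEq)]
struct SortKey {
    column: String,
    descending: bool,
}

/// Hypothetical planner-side cache mapping a file path to its sort order.
#[derive(Default)]
struct SortOrderCache {
    entries: HashMap<String, Vec<SortKey>>,
    /// Counts how often the (expensive) footer read was actually performed.
    footer_reads: usize,
}

impl SortOrderCache {
    /// Returns the cached sort order, calling `read_footer` only on a miss.
    fn get_or_load<F>(&mut self, path: &str, read_footer: F) -> Vec<SortKey>
    where
        F: FnOnce(&str) -> Vec<SortKey>,
    {
        if let Some(order) = self.entries.get(path) {
            return order.clone();
        }
        self.footer_reads += 1;
        let order = read_footer(path);
        self.entries.insert(path.to_string(), order.clone());
        order
    }
}
```

With something like this, planning pays the metadata cost at most once per file. Repeated plans over the same files would find the sort order already available.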
Some parts of the parquet file metadata are already read as part of physical planning (e.g. fetching the statistics). I don't quite remember how it is all hooked up but you can trace it back from
Describe the solution you'd like
`SortingColumn` when reading and writing parquet metadata arrow-rs#3090
Describe alternatives you've considered
Don't do it
Additional context
Here is a ticket that tracks allowing users of DataFusion to manually specify the sort order: #4169