Use Row Format in SortExec #7053
Labels
enhancement
New feature or request
Comments
AFAIK this is implemented
Hm, looking at the code, this doesn't seem to be the case
take
I probably need an efficient way to spill Rows to disk. Right now I am using a dumb way to spill Rows (len + row_data).
Current design is:
    pub enum RowOrColumn {
        Row(Rows),
        Column(RecordBatch),
    }
    /// Contains a Rows or a RecordBatch
    pub type RowOrColumnStream = Pin<Box<dyn Stream<Item = Result<RowOrColumn>> + Send>>;
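The "len + row_data" spill layout mentioned in the comment above could look roughly like the following. This is only a sketch under assumptions: rows are modelled as plain `Vec<u8>` byte strings rather than arrow's `Rows`, the length prefix is assumed to be a 4-byte little-endian `u32`, and `spill_rows` / `read_spilled_rows` are hypothetical helper names, not DataFusion APIs.

```rust
use std::io::{self, Read, Write};

/// Write each row as a 4-byte little-endian length followed by the row bytes.
/// (Hypothetical helper; the prefix width is an assumption of this sketch.)
fn spill_rows<W: Write>(out: &mut W, rows: &[Vec<u8>]) -> io::Result<()> {
    for row in rows {
        if row.len() > u32::MAX as usize {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "row too large"));
        }
        out.write_all(&(row.len() as u32).to_le_bytes())?;
        out.write_all(row)?;
    }
    Ok(())
}

/// Read rows back until the input is exhausted.
/// EOF while reading a length prefix is treated as end of stream; this sketch
/// does not distinguish a truncated prefix from a clean end.
fn read_spilled_rows<R: Read>(input: &mut R) -> io::Result<Vec<Vec<u8>>> {
    let mut rows = Vec::new();
    let mut len_buf = [0u8; 4];
    loop {
        match input.read_exact(&mut len_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        let len = u32::from_le_bytes(len_buf) as usize;
        let mut row = vec![0u8; len];
        // A truncated row body is reported as an error.
        input.read_exact(&mut row)?;
        rows.push(row);
    }
    Ok(rows)
}
```

Any `Write` / `Read` pair works here, for example a `Vec<u8>` buffer for testing or a buffered file for an actual spill. The per-row prefix is simple but costs one small write per row, which is presumably part of why a more efficient format is wanted.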
Is your feature request related to a problem or challenge?
Currently SortExec::sort_batch_stream uses lexsort_to_indices to sort the produced RecordBatch. For multi-column sorts this makes use of LexicographicalComparator. The branching and dynamic dispatch involved in this comparator are relatively expensive. Converting to the row format first, and comparing these rows, has been found to offer significant performance advantages in similar applications - #3386.
Describe the solution you'd like
SortExec should use sort_to_indices to sort the input batches.
Describe alternatives you've considered
No response
Additional context
This is likely not a good first issue, and I do not recommend people pick it up; I am creating it primarily for tracking purposes. I will likely pick it up at some point in the near-ish future.
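To illustrate the idea behind the request, here is a small plain-Rust sketch of why a row format helps with multi-column sorts. It is not arrow's actual row encoding: it assumes just two non-null columns (an `i32` and a `u32`), and `encode_i32`, `encode_rows`, and `sort_indices_by_row_bytes` are hypothetical names made up for this example. Each row is encoded once into a byte string whose byte-wise order matches the lexicographic order of the columns, so the sort itself only compares byte slices instead of dispatching per column.

```rust
/// Encode an i32 so that unsigned, byte-wise comparison matches numeric order:
/// flip the sign bit, then store big-endian.
fn encode_i32(v: i32) -> [u8; 4] {
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

/// Encode two columns into one byte string per row (column `a` first, then `b`).
fn encode_rows(a: &[i32], b: &[u32]) -> Vec<Vec<u8>> {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(x, y)| {
            let mut row = Vec::with_capacity(8);
            row.extend_from_slice(&encode_i32(*x));
            row.extend_from_slice(&y.to_be_bytes());
            row
        })
        .collect()
}

/// Return the permutation that sorts the rows, comparing raw bytes only.
fn sort_indices_by_row_bytes(rows: &[Vec<u8>]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..rows.len()).collect();
    // Vec<u8> ordering is lexicographic over bytes, so no per-column logic is needed.
    indices.sort_by(|&i, &j| rows[i].cmp(&rows[j]));
    indices
}
```

With columns `a = [2, -1, 2, -1]` and `b = [5, 9, 1, 3]`, sorting the encoded rows yields the indices `[3, 1, 2, 0]`, the same order a comparator over `(a, b)` tuples would produce. A real implementation also has to handle nulls, sort options, and variable-length types, which this sketch leaves out.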