Replace Lexicographic Kernels With Row Format #2871
Comments
I just wonder why not use row format in
If you wanted to make a start migrating DataFusion's SortExec over to using the row format, that would be amazing. I'm currently working on some benchmarks for lexsort vs row format, and I fully expect this to turn up some performance issues to fix. Therefore, if you are able to start migrating DF over in parallel, that would be awesome.
@tustvold I'd appreciate any details on how to use the row format in SortExec.
This was already implemented in apache/datafusion#6163. I will file an upstream ticket for removing the last use of lexsort in DataFusion.
Thanks, please link the upstream ticket. I'd like to contribute on that one, as lexsort affects our use case.
I have filed apache/datafusion#7053, but I would strongly discourage you from picking it up. There is a reason I avoided doing it when I last played with SortExec: it is very fiddly to do well.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
`lexsort`, `lexsort_to_indices`, `lexicographical_partition_ranges`, etc. make use of `LexicographicalComparator` to compare rows. The branching and dynamic dispatch involved in this comparator are relatively expensive. Converting to the row format first, and then comparing these rows, has been found to offer significant performance advantages in similar applications (apache/datafusion#3386).
Describe the solution you'd like
We should provide examples, and potentially some utilities if necessary, showing how to use the row format for these use cases instead. We can then deprecate the lexicographic kernels and eventually remove them.
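To illustrate the idea behind the proposal, here is a minimal std-only sketch of sorting by an order-preserving byte encoding instead of a per-column comparator. It is not the arrow-rs row format or its API (that lives in `arrow::row`); the helper names `encode_i32`, `encode_str`, `encode_row`, `sort_indices_by_rows`, and `sort_indices_by_columns` are hypothetical, and the string encoding assumes values contain no NUL bytes and ignores nulls and sort options.

```rust
/// Encode an i32 so that unsigned byte-wise comparison matches signed order:
/// flip the sign bit, then write the value big-endian.
fn encode_i32(v: i32, out: &mut Vec<u8>) {
    out.extend_from_slice(&((v as u32) ^ 0x8000_0000).to_be_bytes());
}

/// Encode a string followed by a 0x00 terminator so a prefix sorts first.
/// Assumption: the string itself contains no NUL bytes.
fn encode_str(s: &str, out: &mut Vec<u8>) {
    out.extend_from_slice(s.as_bytes());
    out.push(0);
}

/// Build one comparable byte row from a (name, score) pair.
fn encode_row(name: &str, score: i32) -> Vec<u8> {
    let mut row = Vec::with_capacity(name.len() + 5);
    encode_str(name, &mut row);
    encode_i32(score, &mut row);
    row
}

/// Row-based approach: encode every row once, then sort indices with a
/// plain byte comparison (no per-column branching inside the sort).
fn sort_indices_by_rows(names: &[&str], scores: &[i32]) -> Vec<usize> {
    let rows: Vec<Vec<u8>> = names
        .iter()
        .zip(scores)
        .map(|(n, s)| encode_row(n, *s))
        .collect();
    let mut idx: Vec<usize> = (0..rows.len()).collect();
    idx.sort_by(|&a, &b| rows[a].cmp(&rows[b]));
    idx
}

/// Comparator-based approach for reference: compare column by column
/// on every comparison, as a lexicographic comparator would.
fn sort_indices_by_columns(names: &[&str], scores: &[i32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..names.len()).collect();
    idx.sort_by(|&a, &b| names[a].cmp(names[b]).then(scores[a].cmp(&scores[b])));
    idx
}
```

Both functions should return the same permutation for the same input; the row-based version pays the encoding cost once up front, which is the trade-off this issue proposes to make.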
Describe alternatives you've considered
We could leverage the row format within the existing kernels, however, this has a few drawbacks:
Additional context