-
Notifications
You must be signed in to change notification settings - Fork 815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add lexicographically partition points and ranges #424
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. I left an optional comment; it can be moved to a separate PR 👍
/// | ||
/// The returned vec would be of size k+1 where k is cardinality of the sorted values; the first and | ||
/// last value would be 0 and n. | ||
fn lexicographical_partition_points(columns: &[SortColumn]) -> Result<Vec<usize>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be possible to write the below as an iterator, thereby avoiding the allocation of the vector.
The general idea: create a struct with a method try_new
. That method initializes the struct's content and errors if something is wrong (e.g. empty columns). It also initializes lexicographical_comparator
.
Next, implement Iterator<Item=usize>
for that struct where Iterator::next
yields previous_partition_point
and increments it as written in this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tracked in #437
BTW I hope to get this into Arrow 4.3 -- I plan to build a release candidate for that on Thursday or Friday this week and release early next week. Once it gets released it will then be available in DataFusion |
Which issue does this PR close?
Closes #428
Rationale for this change
in order to support order by and partition by within window functions, we'll need to find out partition points on already sorted columns (lexicographically), and with binary search the time complexity can be O(log(n)). Here we'll utilize the lexicographical comparator extracted from #423 and add two functions.
What changes are included in this PR?
Are there any user-facing changes?