Optimized sorting for ordered dictionaries #1048
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1048      +/-   ##
==========================================
+ Coverage   83.48%   83.50%   +0.02%
==========================================
  Files         196      196
  Lines       55923    56047     +124
==========================================
+ Hits        46686    46802     +116
- Misses       9237     9245       +8
Continue to review full report at Codecov.
Updated benchmark results for single-array and lexicographical sorting.

Single array, length 1_000_000, 50% nulls:
- Cardinality 1_000
- Cardinality 10_000
- Cardinality 100_000
- Cardinality 250_000
- Cardinality 500_000

The threshold where converting to sorted dictionaries is no longer beneficial seems to be around a cardinality that is 25% of the array length.

Two arrays, length 1_000_000, 50% nulls:
- Cardinalities 1_000 (both)
- Cardinalities 10_000 (both)
- Cardinalities 100_000 (both)
- Cardinalities 250_000 (both)
- Cardinalities 500_000 (both)
- Cardinalities 1_000_000 (both)

Surprisingly there does not seem to be a threshold for the lexicographical sort; even with nearly unique values, sorting the dictionary first seems beneficial.
The integration test failure https://github.com/apache/arrow-rs/runs/4552743883?check_suite_focus=true looked to be related to some infrastructure problem. I restarted the tests and hopefully they will pass this time.
This is really neat @jhorstmann -- thank you. I think it will be valuable for a lot of the community (including IOx).
I don't know if you feel the PR is ready for review, but I reviewed it anyway ;) I think my biggest suggestion is avoiding the change to SortOptions, which I think will actually make the feature easier to use (and test).
arrow/src/compute/kernels/sort.rs
Outdated
@@ -403,6 +439,10 @@ pub struct SortOptions {
    pub descending: bool,
    /// Whether to sort nulls first
    pub nulls_first: bool,
    /// Whether dictionary arrays can be sorted by their keys instead of values.
There are already some hooks in the arrow codebase for is_ordered -- like DictionaryArray::is_ordered. What do you think about using those hooks rather than a new assume_sorted_dictionaries option on SortOptions -- that would make it harder to pick the wrong option.
Perhaps we could add a function like

impl DictionaryArray {
    fn sorted(self) -> Self {
        // check if the dictionary is already sorted,
        // otherwise sort it, then set the flag
        Self {
            is_ordered: true,
            ..self
        }
    }
}

With an unsafe variant to skip the validation.
Avoiding a new field in SortOptions would also likely reduce the size of this PR in terms of number of lines changed, as well as keep the change API-compatible.
arrow/benches/partition_kernels.rs
Outdated
{
    let mut rng = seedable_rng();

    let key_array: ArrayRef =
It would be cool if we could add an option to StringDictionaryBuilder to ensure the resulting dictionary was sorted.
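There is no such option today; as a rough sketch (using the arrow 6-era builder API, whose constructor signatures have changed in later releases), a test or benchmark could get a sorted dictionary by seeding the builder with the distinct values in sorted order via new_with_dictionary:

```rust
use arrow::array::{DictionaryArray, PrimitiveBuilder, StringArray, StringDictionaryBuilder};
use arrow::datatypes::Int32Type;
use arrow::error::Result;

// Sketch only: seed the dictionary with the distinct values in sorted order, so
// that key order matches value order, then append the actual data rows.
fn sorted_dictionary(values: &[&str]) -> Result<DictionaryArray<Int32Type>> {
    let mut distinct: Vec<&str> = values.to_vec();
    distinct.sort_unstable();
    distinct.dedup();

    let dictionary_values = StringArray::from(distinct);
    let mut builder = StringDictionaryBuilder::new_with_dictionary(
        PrimitiveBuilder::<Int32Type>::new(values.len()),
        &dictionary_values,
    )?;

    // appending reuses the pre-seeded (sorted) dictionary entries
    for v in values {
        builder.append(*v)?;
    }
    Ok(builder.finish())
}
```

A builder option could then simply enforce this ordering and mark the resulting array as ordered.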
Thanks for the review @alamb, I will go through the individual comments later. I was still doing some profiling because I expected a bigger speedup. It seems the
The reason for extending the
Now that I'm thinking about it, for that use case we could also pretend that the dictionary is sorted by creating a new DictionaryArray, pointing to the same data, but with the is_ordered flag set. Renaming the flag to
👍 Makes sense. Maybe the sort options flag could be named something like "sort_dictionary_by_key_value" to make it clear that the request is to sort the data such that the same values are contiguous but not necessarily sorted by value.
Then we could have a follow-on PR that also sorts the dictionary by keys if the
I changed to using the is_ordered flag. So this code would only become useful when we also add a way to efficiently transform a DictionaryArray to being sorted.
@@ -151,8 +152,16 @@ fn partition_validity(array: &ArrayRef) -> (Vec<u32>, Vec<u32>) {
        // faster path
        0 => ((0..(array.len() as u32)).collect(), vec![]),
        _ => {
            let indices = 0..(array.len() as u32);
            indices.partition(|index| array.is_valid(*index as usize))
            let validity = array.data().null_buffer().unwrap();
Noticed this as a hotspot while profiling. The is_valid function does not seem to get inlined and contains a branch, and we can also initialize both vectors with the right capacities.
Looks correct to me. This is how is_valid is implemented internally. Just not sure if it is good to expose the details here. is_valid(i) is basically:

let buffer = array.null_buffer().unwrap();
bit_util::get_bit_raw(buffer.as_ptr(), offset + i)
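Putting the two points together, a self-contained sketch of the optimized partition loop being described (illustrative names, using the arrow 6-era Array::data() accessor; like the real code it assumes the array actually has a null buffer):

```rust
use arrow::array::{Array, ArrayRef};
use arrow::util::bit_util;

// Sketch: read the validity bits directly and pre-size both output vectors,
// instead of calling is_valid per element.
fn partition_validity_sketch(array: &ArrayRef) -> (Vec<u32>, Vec<u32>) {
    let len = array.len();
    let null_count = array.data().null_count();
    let mut valid = Vec::with_capacity(len - null_count);
    let mut nulls = Vec::with_capacity(null_count);

    let validity = array.data().null_buffer().unwrap();
    let offset = array.offset();
    for i in 0..len {
        // equivalent to array.is_valid(i), but without the extra branch on
        // whether a null buffer exists at all
        if unsafe { bit_util::get_bit_raw(validity.as_ptr(), offset + i) } {
            valid.push(i as u32);
        } else {
            nulls.push(i as u32);
        }
    }
    (valid, nulls)
}
```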
Maybe we can add a #[inline] or #[inline(always)] annotation to the various locations? There don't seem to be any such annotations yet:
https://sourcegraph.com/github.com/apache/arrow-rs/-/blob/arrow/src/array/data.rs?L420:12&subtree=true
https://sourcegraph.com/github.com/apache/arrow-rs/-/blob/arrow/src/array/array.rs?L178:26&subtree=true
FWIW looping through the validity bitmap using get_bit_raw is likely significantly slower than using UnalignedBitChunkIterator as used by filter::IndexIterator. Perhaps we should expose this as a method on bitmap 🤔 - I'll create a ticket.
I think this is looking quite cool @jhorstmann. I had a comment about the API and a comment about the tests, but otherwise this is looking quite sweet.
/// Returns a DictionaryArray referencing the same data
/// with the [DictionaryArray::is_ordered] flag set to `true`.
/// Note that this does not actually reorder the values in the dictionary.
pub fn as_ordered(&self) -> Self {
I wonder if this API should be marked as unsafe as it relies on the user "doing the right thing"? Perhaps we could call it something like pub unsafe fn as_ordered_unchecked()?
Good point. I don't think it would be possible to trigger undefined behavior with this method, it would only sort differently. But it certainly makes an assumption that the compiler cannot verify. Maybe the method could also have a better name; my intention was something like assume_can_be_sorted_by_keys. Even if the dictionary is not actually sorted, setting the flag allows useful behavior, like sorting by keys and then using lexicographical_partition_ranges with the same key-based comparator.
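As a usage illustration of that workflow, a sketch against the arrow 6-era kernel APIs (the function and variable names here are made up; whether the comparison goes through keys or values is decided inside the kernels):

```rust
use std::ops::Range;
use std::sync::Arc;
use arrow::array::{ArrayRef, DictionaryArray};
use arrow::compute::kernels::partition::lexicographical_partition_ranges;
use arrow::compute::kernels::sort::{lexsort, SortColumn, SortOptions};
use arrow::datatypes::Int32Type;
use arrow::error::Result;

// Sort a dictionary column, then split the sorted result into ranges of equal values.
fn sort_then_partition(dict: DictionaryArray<Int32Type>) -> Result<Vec<Range<usize>>> {
    let input = SortColumn {
        values: Arc::new(dict) as ArrayRef,
        options: Some(SortOptions::default()),
    };
    // lexsort also works with a single column; it returns the reordered arrays
    let sorted = lexsort(&[input], None)?.remove(0);

    // the same comparator is then used to find the partition boundaries
    let sorted_columns = [SortColumn {
        values: sorted,
        options: Some(SortOptions::default()),
    }];
    let ranges = lexicographical_partition_ranges(&sorted_columns)?.collect();
    Ok(ranges)
}
```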
} else {
    // validate up front that we can do conversions from/to usize for the whole range of keys
    // this allows using faster unchecked conversions below
    K::Native::from_usize(values.len())
👍
ArrayData::new_unchecked(
    self.data_type().clone(),
    self.len(),
    Some(self.data.null_count()),
It makes sense that the nullness is the same before and after sorting. I assume it was faster to do this than to create the new values buffer directly from an iterator of Option<K::Native>?
I don't think I benchmarked this separately, but it should be quite a bit faster. In the most common case of the offset being zero this would be zero-copy, while creating a PrimitiveArray from an iterator has some overhead per element, checking whether the buffers need to grow and setting individual bits in the validity bitmap. If there is no validity, the iterator API would also construct one first. I might try looking into the performance of the iterator-based APIs in a separate issue.
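To make the comparison concrete, a toy illustration of the two construction paths (plain Int32 keys instead of the generic K::Native; the buffer-reuse path is only approximated here, since the real code carries the existing null buffer over via ArrayData::new_unchecked):

```rust
use arrow::array::Int32Array;

fn main() {
    let remapped_keys: Vec<Option<i32>> = vec![Some(2), None, Some(0), Some(1)];

    // Iterator path: every element is matched and the validity bitmap is built
    // bit by bit while the buffers grow incrementally.
    let from_iter: Int32Array = remapped_keys.iter().copied().collect();

    // Values-first path: build the values buffer in one pass; the null buffer
    // would then be reused from the original array instead of rebuilt.
    let raw_values: Vec<i32> = remapped_keys.iter().map(|k| k.unwrap_or(0)).collect();
    let from_values = Int32Array::from(raw_values);

    assert_eq!(from_iter.len(), from_values.len());
}
```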
    left.cmp(right)
})
// only compare by keys if both arrays actually point to the same value buffers
if left.is_ordered() && ArrayData::ptr_eq(left_values.data(), right_values.data()) {
I don't understand why we also need to check left.is_ordered() if we know the arrays actually point to the same underlying values.
I might be missing something, but I think this is the only place where we check whether to sort by keys. We could probably make this even stricter and check that left and right are the same pointer; then both would be guaranteed to have the same is_ordered value. The way it is now, in theory right could be marked as is_ordered but use the same dictionary values as another array which is not marked.
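A sketch of the check being discussed (illustrative names, not the PR's actual code; it uses Arc::ptr_eq on the values arrays rather than the diff's ArrayData::ptr_eq, and also requires both sides to be marked ordered, as suggested above):

```rust
use std::cmp::Ordering;
use std::sync::Arc;
use arrow::array::{Array, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

fn compare_dict_values(
    left: &DictionaryArray<Int32Type>,
    right: &DictionaryArray<Int32Type>,
    i: usize,
    j: usize,
) -> Ordering {
    let left_values = left.values().as_any().downcast_ref::<StringArray>().unwrap();
    let right_values = right.values().as_any().downcast_ref::<StringArray>().unwrap();

    // comparing keys is only valid when both sides share the same, ordered dictionary
    let compare_keys = left.is_ordered()
        && right.is_ordered()
        && Arc::ptr_eq(left.values(), right.values());

    if compare_keys {
        // ordered dictionary: key order matches value order
        left.keys().value(i).cmp(&right.keys().value(j))
    } else {
        // otherwise compare the values the keys point to
        let l = left_values.value(left.keys().value(i) as usize);
        let r = right_values.value(right.keys().value(j) as usize);
        l.cmp(r)
    }
}
```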
let mut array = data.into_iter().collect::<DictionaryArray<T>>();

if ordered {
    array = array.as_ordered();
Is this supposed to call array.make_ordered()? I don't understand why the test would pass in a dictionary whose values appear to be unsorted and then mark them as_sorted. Sorry if I am missing something obvious.
One open issue (that kind of existed already before) is that the
For what it is worth I may have time to help get this PR over the line in the next few weeks (as sorting dictionary arrays is currently the tall pole in certain parts of my project).
Marking this as draft as it has bitrotted significantly -- @jhorstmann let us know what you want to do with this one.
Closing this as I believe similar functionality is now available with the row format.
Which issue does this PR close?
Closes #980.
As opposed to the initial ticket description, we use the is_ordered flag of DictionaryArray and add a function make_ordered that sorts the dictionary values and remaps the keys, so that afterwards the dictionary can be sorted by keys only. There is also a function as_ordered which just sets the flag (assuming that the values are already ordered).

Rationale for this change

Sorting by comparing dictionary keys is faster than comparing strings. The benefit should be bigger for smaller numbers of distinct strings in the dictionary.

What changes are included in this PR?

- as_ordered and make_ordered on DictionaryArray (still open for suggestions for better names)
- The sort kernels compare dictionary arrays by their keys when the is_ordered flag is set.

Are there any user-facing changes?

There should be no breaking API changes in this PR. The decision whether to convert dictionary arrays before sorting should happen on a higher level, perhaps based on the cardinality of the dictionary.
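A sketch of what that higher-level decision could look like, using the 25% cardinality threshold from the benchmark comment earlier in this thread as a heuristic. make_ordered is the method proposed (but not merged) in this PR, so its signature is assumed and the conversion itself is only shown as a comment:

```rust
use arrow::array::{Array, DictionaryArray};
use arrow::datatypes::Int32Type;

// Heuristic only: convert when the dictionary is small relative to the array length.
fn should_make_ordered(array: &DictionaryArray<Int32Type>) -> bool {
    let cardinality = array.values().len();
    !array.is_ordered() && cardinality * 4 < array.len()
}

// Intended call site (make_ordered as proposed in this PR, signature assumed):
//
// if should_make_ordered(&dict) {
//     dict = dict.make_ordered();
// }
```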