Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized sorting for ordered dictionaries #1048

Conversation

jhorstmann
Copy link
Contributor

@jhorstmann jhorstmann commented Dec 16, 2021

Which issue does this PR close?

Closes #980.

As opposed to the initial ticket description, we use the is_ordered flag of DictionaryArray and add a function make_ordered that sorts the dictionary values and remaps the keys, so that afterwards the dictionary can be sorted by keys only. There is also a function as_ordered which just the flags (assuming that values are already ordered).

Rationale for this change

Sorting by comparing dictionary keys is faster than comparing strings. The benefit should be bigger for smaller numbers of distinct strings in the dictionary.

What changes are included in this PR?

  • Functions as_ordered and make_ordered on DictionaryArray (still open for suggestions for better names)
  • The sort kernel and lexicographical comparator make use of the is_ordered flag.

Are there any user-facing changes?

There should be no breaking API changes in this PR. The decision whether to convert dictionary arrays before sorting should happen on a higher level, perhaps based on the cardinality of the dictionary.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 16, 2021
@codecov-commenter
Copy link

codecov-commenter commented Dec 16, 2021

Codecov Report

Merging #1048 (07389a2) into master (6e6a9e1) will increase coverage by 0.02%.
The diff coverage is 90.79%.

@@            Coverage Diff             @@
##           master    #1048      +/-   ##
==========================================
+ Coverage   83.48%   83.50%   +0.02%     
==========================================
  Files         196      196              
  Lines       55923    56047     +124     
==========================================
+ Hits        46686    46802     +116     
- Misses       9237     9245       +8     
Impacted Files Coverage Δ
arrow/src/array/array_dictionary.rs 91.78% <87.71%> (-0.14%) ⬇️
arrow/src/compute/kernels/sort.rs 95.83% <90.80%> (+0.15%) ⬆️
arrow/src/array/ord.rs 71.69% <100.00%> (+2.52%) ⬆️
arrow/src/datatypes/datatype.rs 65.42% <0.00%> (-0.38%) ⬇️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.23%) ⬇️
arrow/src/array/transform/mod.rs 86.74% <0.00%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e6a9e1...07389a2. Read the comment docs.

@jhorstmann
Copy link
Contributor Author

jhorstmann commented Dec 16, 2021

Updated benchmark results for single-array and lexicographical sorting

Single array, length 1_000_000, 50% nulls

Cardinality 1_000

dict string sort nulls
time:   [34.597 ms 34.639 ms 34.686 ms]

make_ordered dict string sort nulls
time:   [29.566 ms 29.596 ms 29.628 ms]

presorted dict string sort nulls
time:   [12.669 ms 12.686 ms 12.704 ms]

Cardinality 10_000

dict string sort nulls
time:   [44.222 ms 44.267 ms 44.314 ms]

make_ordered dict string sort nulls
time:   [33.521 ms 33.624 ms 33.732 ms]

presorted dict string sort nulls
time:   [15.844 ms 16.181 ms 16.574 ms]

Cardinality 100_000

dict string sort nulls
time:   [62.620 ms 62.763 ms 62.919 ms]

make_ordered dict string sort nulls
time:   [51.165 ms 51.245 ms 51.329 ms]

presorted dict string sort nulls
time:   [21.753 ms 21.780 ms 21.809 ms]

Cardinality 250_000

dict string sort nulls
time:   [66.361 ms 66.485 ms 66.620 ms]

make_ordered dict string sort nulls
time:   [67.663 ms 67.719 ms 67.780 ms]

presorted dict string sort nulls
time:   [22.449 ms 22.473 ms 22.499 ms]

Cardinality 500_000

dict string sort nulls
time:   [68.842 ms 68.939 ms 69.044 ms]

make_ordered dict string sort nulls
time:   [84.778 ms 85.016 ms 85.308 ms]

presorted dict string sort nulls
time:   [22.677 ms 22.740 ms 22.811 ms]

The threshold where converting to sorted dictionaries is no longer beneficial seems to be around a cardinality that is 25% of the array length.

Two arrays, length 1_000_000, 50% nulls

Cardinalities 1_000 (both)

dict string sort nulls
time:   [164.84 ms 167.72 ms 170.79 ms]

make_ordered dict string sort nulls
time:   [136.31 ms 139.17 ms 142.40 ms]

presorted dict string sort nulls
time:   [115.07 ms 117.37 ms 120.10 ms]

Cardinalities 10_000 (both)

dict string sort nulls
time:   [198.13 ms 200.17 ms 202.24 ms]

make_ordered dict string sort nulls
time:   [142.33 ms 143.21 ms 144.18 ms]

presorted dict string sort nulls
time:   [122.91 ms 123.66 ms 124.50 ms]

Cardinalities 100_000 (both)

dict string sort nulls
time:   [277.07 ms 278.39 ms 279.87 ms]

make_ordered dict string sort nulls
time:   [208.17 ms 208.85 ms 209.56 ms]

presorted dict string sort nulls
time:   [162.91 ms 164.82 ms 166.81 ms]

Cardinalities 250_000 (both)

dict string sort nulls
time:   [290.72 ms 291.26 ms 291.85 ms]

make_ordered dict string sort nulls
time:   [217.11 ms 217.44 ms 217.78 ms]

presorted dict string sort nulls
time:   [138.00 ms 138.36 ms 138.75 ms]

Cardinalities 500_000 (both)

dict string sort nulls
time:   [336.32 ms 341.53 ms 347.01 ms]

make_ordered dict string sort nulls
time:   [277.58 ms 279.05 ms 280.55 ms]

presorted dict string sort nulls
time:   [150.96 ms 151.97 ms 153.07 ms]

Cardinalities 1_000_000 (both)

dict string sort nulls
time:   [314.28 ms 316.31 ms 318.52 ms]

make_ordered dict string sort nulls
time:   [295.75 ms 296.77 ms 297.79 ms]

presorted dict string sort nulls
time:   [146.40 ms 146.76 ms 147.14 ms]

Surprisingly there does not seem to be a threshold for the lexicographical sort, even with nearly unique values sorting the dictionary first seems beneficial.

@alamb
Copy link
Contributor

alamb commented Dec 20, 2021

The integration test failure https://github.com/apache/arrow-rs/runs/4552743883?check_suite_focus=true looked to be related to some insfrastructure problem. I restarted the tests and hopefuly they will pass this time.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really neat @jhorstmann -- thank you. I think it will be valuable for a lot of the community (including IOx).

I don't know if you feel the PR is ready for review, but I reviewed it anyways ;) I think my biggest suggestion is avoiding the change to SortOptions which I think will actually make the feature easier to use (and test)

@@ -403,6 +439,10 @@ pub struct SortOptions {
pub descending: bool,
/// Whether to sort nulls first
pub nulls_first: bool,
/// Whether dictionary arrays can be sorted by their keys instead of values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are already some hooks in the arrow codebase for is_ordered -- like DictionaryArray::is_ordered

What do you think about using those hooks rather than a new assume_sorted_dictionaries option on SortOptions -- that would make it harder to pick the wrong option

https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+is_ordered&patternType=literal

Perhaps we could add a function like

impl DictionaryArray {

fn sorted(self) -> Self { 
  // check if dictionary is already sorted, 
  // otherwise sort it
  Self { 
    is_ordered: true
    self,
  }
}

With an unsafe variant to skip the validation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoiding a new field in SortOptions would also likely reduce the size of this PR in terms of number of lines changed as well as keep the change API compatible.

arrow/src/compute/kernels/sort.rs Outdated Show resolved Hide resolved
arrow/src/array/ord.rs Outdated Show resolved Hide resolved
arrow/src/array/ord.rs Show resolved Hide resolved
arrow/src/array/ord.rs Outdated Show resolved Hide resolved
arrow/src/array/ord.rs Outdated Show resolved Hide resolved
{
let mut rng = seedable_rng();

let key_array: ArrayRef =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cool if we could add an option to StringDictionaryBuilder to ensure the resulting dictionary was sorted

arrow/src/compute/kernels/sort.rs Show resolved Hide resolved
@jhorstmann
Copy link
Contributor Author

jhorstmann commented Dec 20, 2021

Thanks for the review @alamb, I will go through the individual comments later. I was still doing some profiling because I expected a bigger speedup. It seems the DynComparator and LexicographicalComparator take longer than the difference between comparing strings and integers. I was also experimenting with trying to inline is_valid calls on another branch to see if that is the bottleneck.

The reason for extending the SortOptions is that this functionality helps for two separate usecases. The name of the flag is a bit misleading and another reason why the PR is still in draft. First usecase is of course for faster sorting if the dictionaries are already sorted, for that the is_ordered flag on DictionaryArray would work fine. The other usecase is for partitioning, as used for example for window functions. In an expression like SUM(a) OVER (PARTITION BY b ORDER BY c) Datafusion will sort by b, c, with this change we could track that b is only used for partitioning in the logical plan and then set the new flag in SortOptions, regardless of whether the dictionary is actually sorted.

Now that I'm thinking about, for that usecase we could also pretend that the dictionary is sorted by creating a new DictionaryArray, pointing to the same data, but with the is_ordered flag set, and sort by that array. And that is probably what you meant by the first comment.

Renaming the flag to sort_for_partitioning or sort_by_dictionary_keys could be an option to make the purpose clearer. And the is_ordered flag should also be taken into account, regardless of the SortOption flag.

@alamb
Copy link
Contributor

alamb commented Dec 21, 2021

The other usecase is for partitioning, as used for example for window functions. In an expression like SUM(a) OVER (PARTITION BY b ORDER BY c) Datafusion will sort by b, c, with this change we could track that b is only used for partitioning in the logical plan and then set the new flag in SortOptions, regardless of whether the dictionary is actually sorted.

👍 Makes sense

Make sense. Maybe the sort options flag could be named something like "sort_dictionary_by_key_value" to make it clear that the request is to sort the data such that the same values that are contiguous but not necessarily sorted by value.

@alamb
Copy link
Contributor

alamb commented Dec 21, 2021

then we could have a follow on PR that also sorts the dictionary by keys if the is_ordered flag is set.

@jhorstmann
Copy link
Contributor Author

I changed to using the is_ordered flag as initially proposed and a fn as_ordered(&self, is_ordered: bool) -> Self that would return a DictionaryArray with the flag set (without actually sorting or checking the ordering). Now testing this in our engine I realized that instead of as_ordered for partitioning I could also just get the keys array directly and partition on that, as_ordered doesn't really simplify that part of the code.

So this code would only become useful when we also add a way to efficiently transform a DictionaryArray to being sorted.

@jhorstmann jhorstmann changed the title Implement option to sort by dictionary keys in sort and partition kernels Optimized sorting for ordered dictionaries May 29, 2022
@@ -151,8 +152,16 @@ fn partition_validity(array: &ArrayRef) -> (Vec<u32>, Vec<u32>) {
// faster path
0 => ((0..(array.len() as u32)).collect(), vec![]),
_ => {
let indices = 0..(array.len() as u32);
indices.partition(|index| array.is_valid(*index as usize))
let validity = array.data().null_buffer().unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed this as a hotspot while profiling. The is_valid function does not seem to get inlined and contains a branch, and we can also initialize both vectors with the right capacities.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold / @viirya can you help verify this change?

Copy link
Member

@viirya viirya Jun 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct to me. This is how is_valid implemented internally. Just not sure if it is good to expose the details here.

is_valid(i) is basically:

let buffer = array.null_buffer().unwrap();
bit_util::get_bit_raw(buffer.as_ptr(), offset + i)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add a #inline or #[inline(always)] annotation to the various locations?

There doesn't seem to be any such annotations yet
https://sourcegraph.com/github.com/apache/arrow-rs/-/blob/arrow/src/array/data.rs?L420:12&subtree=true
https://sourcegraph.com/github.com/apache/arrow-rs/-/blob/arrow/src/array/array.rs?L178:26&subtree=true

Copy link
Contributor

@tustvold tustvold Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW looping through the validity bitmap using get_bit_raw is likely significantly slower than using UnalignedBitChunkIterator as used by filter::IndexIterator. Perhaps we should expose this as a method on bitmap 🤔 - I'll create a ticket

#1864

@jhorstmann jhorstmann marked this pull request as ready for review May 29, 2022 18:34
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking quite cool @jhorstmann

I had a comment about the API and a comment about the tests, but otherwise this is looking quite sweet.

cc @tustvold @viirya do you have

/// Returns a DictionaryArray referencing the same data
/// with the [DictionaryArray::is_ordered] flag set to `true`.
/// Note that this does not actually reorder the values in the dictionary.
pub fn as_ordered(&self) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this API should be marked as unsafe as it relies on the user "doing the right thing"?

Perhaps we could call it something like pub unsafe fn as_orderd_unchecked() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I don't think it would be possible to trigger undefined behavior with this method, it would only sort differently. But it certainly makes an assumption that the compiler can not verify.

Maybe the method could also have a better name, my intention was something like assume_can_be_sorted_by_keys. Even if the dictionary is not actually sorted, setting the flag allows useful behavior, like sorting by keys and then using lexicographical_partition_ranges with the same key-based comparator.

} else {
// validate up front that we can do conversions from/to usize for the whole range of keys
// this allows using faster unchecked conversions below
K::Native::from_usize(values.len())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

ArrayData::new_unchecked(
self.data_type().clone(),
self.len(),
Some(self.data.null_count()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense the nullness is the same before and after sorting. I assume it was faster to do this than to create the new values buffer directly from an iterator of Option<K::Native>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I benchmarked this separately, but it should be quite a bit faster. In the most common case of offset being zero this would be zero-copy, while creating a PrimitiveArray from an iterator has some overhead per element, checking whether the buffers need to grow and setting individual bits in the validity bitmap. If there is no validity, the iterator api would also construct one first.

I might try looking into the performance of the iterator based apis in a separate issue.

left.cmp(right)
})
// only compare by keys if both arrays actually point to the same value buffers
if left.is_ordered() && ArrayData::ptr_eq(left_values.data(), right_values.data()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we also need to check left.is_ordered() if we know the arrays actually point to the same underlying values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something, but I think this is the only place where we check whether to sort by keys. Could probably make this even stricter and check that left and right are the same pointer, then both would be guaranteed to have the same is_ordered value. The way it is now, in theory right could be marked as is_ordered but use the same dictionary values as another array which is not marked.

@@ -151,8 +152,16 @@ fn partition_validity(array: &ArrayRef) -> (Vec<u32>, Vec<u32>) {
// faster path
0 => ((0..(array.len() as u32)).collect(), vec![]),
_ => {
let indices = 0..(array.len() as u32);
indices.partition(|index| array.is_valid(*index as usize))
let validity = array.data().null_buffer().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold / @viirya can you help verify this change?

let mut array = data.into_iter().collect::<DictionaryArray<T>>();

if ordered {
array = array.as_ordered();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to call array.make_ordered()? I don't understand why the test would pass in a dictionary whose values appear to be unsorted and then mark them as_sorted

Sorry if I am missing something obvious

@jhorstmann
Copy link
Contributor Author

One open issue (that kind of existed already before) is that the is_ordered flag only exists on the the Array and does not round-trip through ArrayData. Usually all information needed to reconstruct an array is contained in ArrayData.

@alamb
Copy link
Contributor

alamb commented Jul 8, 2022

For what it is worth I may have time to help get this PR over the line in the next few weeks (as sorting dictionary arrays is currently the tall pole in certain parts of my project)

@alamb
Copy link
Contributor

alamb commented Nov 1, 2022

Marking this as draft as has bitrotted significantly -- @jhorstmann let us know what you want to do with this one

@alamb alamb marked this pull request as draft November 1, 2022 15:44
@jhorstmann
Copy link
Contributor Author

Closing this as I believe similar functionality is now available with the row format.

@jhorstmann jhorstmann closed this Nov 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add option to sort by dictionary keys in sort kernels
5 participants