Optimized sorting for ordered dictionaries #1048

jhorstmann · 2021-12-16T19:03:54Z

Which issue does this PR close?

Closes #980.

As opposed to the initial ticket description, we use the is_ordered flag of DictionaryArray and add a function make_ordered that sorts the dictionary values and remaps the keys, so that afterwards the dictionary can be sorted by keys only. There is also a function as_ordered which just sets the flag (assuming that the values are already ordered).

Rationale for this change

Sorting by comparing dictionary keys is faster than comparing strings. The benefit should be bigger for smaller numbers of distinct strings in the dictionary.

What changes are included in this PR?

Functions as_ordered and make_ordered on DictionaryArray (still open for suggestions for better names)
The sort kernel and lexicographical comparator make use of the is_ordered flag.
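
As a rough illustration of the "sort the values, remap the keys" idea, here is a minimal sketch on a toy type. `ToyDict`, its fields, and `sort_indices_by_key` are hypothetical stand-ins invented for this example; they are not the arrow `DictionaryArray` API and not the code in this PR.

```rust
/// Toy stand-in for a string dictionary array: `keys[i]` indexes into `values`,
/// `None` marks a null slot. This is NOT the arrow `DictionaryArray` type.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDict {
    pub keys: Vec<Option<usize>>,
    pub values: Vec<String>,
    pub is_ordered: bool,
}

impl ToyDict {
    /// Sort the dictionary values and remap the keys so that comparing keys
    /// gives the same result as comparing the strings they refer to.
    pub fn make_ordered(&self) -> ToyDict {
        // order[new_key] = old_key, sorted by the string each old key refers to
        let mut order: Vec<usize> = (0..self.values.len()).collect();
        order.sort_by(|&a, &b| self.values[a].cmp(&self.values[b]));

        // lookup[old_key] = new_key
        let mut lookup = vec![0usize; self.values.len()];
        for (new_key, &old_key) in order.iter().enumerate() {
            lookup[old_key] = new_key;
        }

        ToyDict {
            keys: self.keys.iter().map(|k| k.map(|old| lookup[old])).collect(),
            values: order.iter().map(|&old| self.values[old].clone()).collect(),
            is_ordered: true,
        }
    }

    /// Row indices sorted by key only (nulls first); no string comparisons.
    pub fn sort_indices_by_key(&self) -> Vec<usize> {
        let mut indices: Vec<usize> = (0..self.keys.len()).collect();
        // Option<usize> orders None before Some(_), i.e. nulls first.
        indices.sort_by_key(|&i| self.keys[i]);
        indices
    }
}
```

The string comparisons happen once per distinct value while sorting the dictionary; the per-row sort afterwards only compares integers, which is where the benefit for low cardinalities would come from.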

Are there any user-facing changes?

There should be no breaking API changes in this PR. The decision whether to convert dictionary arrays before sorting should happen on a higher level, perhaps based on the cardinality of the dictionary.

codecov-commenter · 2021-12-16T19:16:42Z

Codecov Report

Merging #1048 (07389a2) into master (6e6a9e1) will increase coverage by 0.02%.
The diff coverage is 90.79%.

@@            Coverage Diff             @@
##           master    #1048      +/-   ##
==========================================
+ Coverage   83.48%   83.50%   +0.02%     
==========================================
  Files         196      196              
  Lines       55923    56047     +124     
==========================================
+ Hits        46686    46802     +116     
- Misses       9237     9245       +8

| Impacted Files | Coverage Δ | |
|---|---|---|
| arrow/src/array/array_dictionary.rs | `91.78% <87.71%> (-0.14%)` | ⬇️ |
| arrow/src/compute/kernels/sort.rs | `95.83% <90.80%> (+0.15%)` | ⬆️ |
| arrow/src/array/ord.rs | `71.69% <100.00%> (+2.52%)` | ⬆️ |
| arrow/src/datatypes/datatype.rs | `65.42% <0.00%> (-0.38%)` | ⬇️ |
| parquet_derive/src/parquet_field.rs | `65.75% <0.00%> (-0.23%)` | ⬇️ |
| arrow/src/array/transform/mod.rs | `86.74% <0.00%> (-0.12%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6e6a9e1...07389a2.

jhorstmann · 2021-12-16T20:16:51Z

Updated benchmark results for single-array and lexicographical sorting

Single array, length 1_000_000, 50% nulls

Cardinality 1_000

dict string sort nulls
time:   [34.597 ms 34.639 ms 34.686 ms]

make_ordered dict string sort nulls
time:   [29.566 ms 29.596 ms 29.628 ms]

presorted dict string sort nulls
time:   [12.669 ms 12.686 ms 12.704 ms]

Cardinality 10_000

dict string sort nulls
time:   [44.222 ms 44.267 ms 44.314 ms]

make_ordered dict string sort nulls
time:   [33.521 ms 33.624 ms 33.732 ms]

presorted dict string sort nulls
time:   [15.844 ms 16.181 ms 16.574 ms]

Cardinality 100_000

dict string sort nulls
time:   [62.620 ms 62.763 ms 62.919 ms]

make_ordered dict string sort nulls
time:   [51.165 ms 51.245 ms 51.329 ms]

presorted dict string sort nulls
time:   [21.753 ms 21.780 ms 21.809 ms]

Cardinality 250_000

dict string sort nulls
time:   [66.361 ms 66.485 ms 66.620 ms]

make_ordered dict string sort nulls
time:   [67.663 ms 67.719 ms 67.780 ms]

presorted dict string sort nulls
time:   [22.449 ms 22.473 ms 22.499 ms]

Cardinality 500_000

dict string sort nulls
time:   [68.842 ms 68.939 ms 69.044 ms]

make_ordered dict string sort nulls
time:   [84.778 ms 85.016 ms 85.308 ms]

presorted dict string sort nulls
time:   [22.677 ms 22.740 ms 22.811 ms]

The threshold where converting to sorted dictionaries is no longer beneficial seems to be around a cardinality that is 25% of the array length.
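
A higher-level caller could turn this observation into a simple rule of thumb. The function below is a hypothetical sketch only: the name `should_make_ordered` and the fixed 25% cut-off are assumptions read off the single-array numbers above, not something the PR implements.

```rust
/// Hypothetical heuristic: convert to an ordered dictionary before a
/// single-array sort only when the cardinality is below 25% of the array length.
pub fn should_make_ordered(cardinality: usize, array_len: usize) -> bool {
    // cardinality / array_len < 1/4, written without floating point
    array_len > 0 && cardinality.saturating_mul(4) < array_len
}
```

With the benchmark sizes above this returns true for cardinalities up to 100_000 and false for 250_000 and 500_000, matching where make_ordered stopped paying off. The cut-off would need to be re-measured for other data and hardware.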

Two arrays, length 1_000_000, 50% nulls

Cardinalities 1_000 (both)

dict string sort nulls
time:   [164.84 ms 167.72 ms 170.79 ms]

make_ordered dict string sort nulls
time:   [136.31 ms 139.17 ms 142.40 ms]

presorted dict string sort nulls
time:   [115.07 ms 117.37 ms 120.10 ms]

Cardinalities 10_000 (both)

dict string sort nulls
time:   [198.13 ms 200.17 ms 202.24 ms]

make_ordered dict string sort nulls
time:   [142.33 ms 143.21 ms 144.18 ms]

presorted dict string sort nulls
time:   [122.91 ms 123.66 ms 124.50 ms]

Cardinalities 100_000 (both)

dict string sort nulls
time:   [277.07 ms 278.39 ms 279.87 ms]

make_ordered dict string sort nulls
time:   [208.17 ms 208.85 ms 209.56 ms]

presorted dict string sort nulls
time:   [162.91 ms 164.82 ms 166.81 ms]

Cardinalities 250_000 (both)

dict string sort nulls
time:   [290.72 ms 291.26 ms 291.85 ms]

make_ordered dict string sort nulls
time:   [217.11 ms 217.44 ms 217.78 ms]

presorted dict string sort nulls
time:   [138.00 ms 138.36 ms 138.75 ms]

Cardinalities 500_000 (both)

dict string sort nulls
time:   [336.32 ms 341.53 ms 347.01 ms]

make_ordered dict string sort nulls
time:   [277.58 ms 279.05 ms 280.55 ms]

presorted dict string sort nulls
time:   [150.96 ms 151.97 ms 153.07 ms]

Cardinalities 1_000_000 (both)

dict string sort nulls
time:   [314.28 ms 316.31 ms 318.52 ms]

make_ordered dict string sort nulls
time:   [295.75 ms 296.77 ms 297.79 ms]

presorted dict string sort nulls
time:   [146.40 ms 146.76 ms 147.14 ms]

Surprisingly, there does not seem to be a threshold for the lexicographical sort: even with nearly unique values, sorting the dictionary first seems beneficial.

alamb · 2021-12-20T17:58:12Z

The integration test failure https://github.com/apache/arrow-rs/runs/4552743883?check_suite_focus=true looked to be related to some infrastructure problem. I restarted the tests and hopefully they will pass this time.

alamb

This is really neat @jhorstmann -- thank you. I think it will be valuable for a lot of the community (including IOx).

I don't know if you feel the PR is ready for review, but I reviewed it anyways ;) I think my biggest suggestion is avoiding the change to SortOptions which I think will actually make the feature easier to use (and test)

alamb · 2021-12-20T19:41:02Z

arrow/src/compute/kernels/sort.rs

@@ -403,6 +439,10 @@ pub struct SortOptions {
    pub descending: bool,
    /// Whether to sort nulls first
    pub nulls_first: bool,
+    /// Whether dictionary arrays can be sorted by their keys instead of values.


There are already some hooks in the arrow codebase for is_ordered -- like DictionaryArray::is_ordered

What do you think about using those hooks rather than a new assume_sorted_dictionaries option on SortOptions -- that would make it harder to pick the wrong option

https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+is_ordered&patternType=literal

Perhaps we could add a function like

    impl DictionaryArray {
        fn sorted(self) -> Self {
            // check if dictionary is already sorted,
            // otherwise sort it
            Self {
                is_ordered: true,
                ..self
            }
        }
    }

With an unsafe variant to skip the validation

Avoiding a new field in SortOptions would also likely reduce the size of this PR in terms of number of lines changed as well as keep the change API compatible.
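
To make the "validated variant plus a variant that skips validation" idea concrete, here is a small sketch on a toy type. `ToyDict`, `try_as_ordered` and `as_ordered_unchecked` are hypothetical names for illustration; the sketch only validates and sets the flag (it does not sort), and the unchecked variant is left as a safe function here, which is an assumption rather than a decision taken in this thread.

```rust
/// Toy stand-in for a dictionary array; NOT the arrow `DictionaryArray` type.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDict {
    pub keys: Vec<Option<usize>>,
    pub values: Vec<String>,
    pub is_ordered: bool,
}

impl ToyDict {
    /// Checked variant: only sets the flag if the values really are sorted.
    /// On failure the unchanged array is handed back to the caller.
    pub fn try_as_ordered(self) -> Result<ToyDict, ToyDict> {
        let sorted = self.values.windows(2).all(|pair| pair[0] <= pair[1]);
        if sorted {
            Ok(ToyDict { is_ordered: true, ..self })
        } else {
            Err(self)
        }
    }

    /// Unchecked variant: trusts the caller and skips the validation pass.
    pub fn as_ordered_unchecked(self) -> ToyDict {
        ToyDict { is_ordered: true, ..self }
    }
}
```

The checked variant costs one linear pass over the dictionary values, which is small compared to sorting the rows when the cardinality is low.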

arrow/src/compute/kernels/sort.rs

arrow/src/array/ord.rs

alamb · 2021-12-20T20:02:34Z

arrow/benches/partition_kernels.rs

+{
+    let mut rng = seedable_rng();
+
+    let key_array: ArrayRef =


It would be cool if we could add an option to StringDictionaryBuilder to ensure the resulting dictionary was sorted

arrow/src/compute/kernels/sort.rs

jhorstmann · 2021-12-20T22:14:37Z

Thanks for the review @alamb, I will go through the individual comments later. I was still doing some profiling because I expected a bigger speedup. It seems the DynComparator and LexicographicalComparator take longer than the difference between comparing strings and integers. I was also experimenting with trying to inline is_valid calls on another branch to see if that is the bottleneck.

The reason for extending the SortOptions is that this functionality helps for two separate usecases. The name of the flag is a bit misleading and another reason why the PR is still in draft. First usecase is of course for faster sorting if the dictionaries are already sorted, for that the is_ordered flag on DictionaryArray would work fine. The other usecase is for partitioning, as used for example for window functions. In an expression like SUM(a) OVER (PARTITION BY b ORDER BY c) Datafusion will sort by b, c, with this change we could track that b is only used for partitioning in the logical plan and then set the new flag in SortOptions, regardless of whether the dictionary is actually sorted.

Now that I'm thinking about it, for that usecase we could also pretend that the dictionary is sorted by creating a new DictionaryArray, pointing to the same data, but with the is_ordered flag set, and sort by that array. And that is probably what you meant by the first comment.

Renaming the flag to sort_for_partitioning or sort_by_dictionary_keys could be an option to make the purpose clearer. And the is_ordered flag should also be taken into account, regardless of the SortOption flag.
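
The partitioning use case can be illustrated without any dictionary values at all: sorting by key groups equal values together even when the dictionary itself is unsorted. The sketch below uses plain slices; `sort_indices_by_key` and `partition_ranges` are hypothetical helpers for this example, not the arrow sort or partition kernels.

```rust
use std::ops::Range;

/// Sort row indices by dictionary key only. The keys may index an UNSORTED
/// dictionary, so the result is grouped by value but not ordered by value.
pub fn sort_indices_by_key(keys: &[u32]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    indices.sort_by_key(|&i| keys[i]);
    indices
}

/// Split the sorted indices into ranges of rows that share the same key.
pub fn partition_ranges(keys: &[u32], sorted: &[usize]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for end in 1..=sorted.len() {
        // close the current range at the end of input or when the key changes
        if end == sorted.len() || keys[sorted[end]] != keys[sorted[start]] {
            ranges.push(start..end);
            start = end;
        }
    }
    ranges
}
```

For a PARTITION BY column this is sufficient, because the window function only needs rows with equal values to be contiguous, not the partitions themselves to appear in value order.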

alamb · 2021-12-21T14:55:15Z

> The other usecase is for partitioning, as used for example for window functions. In an expression like SUM(a) OVER (PARTITION BY b ORDER BY c) Datafusion will sort by b, c, with this change we could track that b is only used for partitioning in the logical plan and then set the new flag in SortOptions, regardless of whether the dictionary is actually sorted.

👍 Makes sense

Makes sense. Maybe the sort options flag could be named something like "sort_dictionary_by_key_value" to make it clear that the request is to sort the data such that equal values are contiguous but not necessarily sorted by value.

alamb · 2021-12-21T14:55:47Z

then we could have a follow on PR that also sorts the dictionary by keys if the is_ordered flag is set.

jhorstmann · 2022-01-12T12:45:09Z

I changed to using the is_ordered flag as initially proposed and a fn as_ordered(&self, is_ordered: bool) -> Self that would return a DictionaryArray with the flag set (without actually sorting or checking the ordering). Now testing this in our engine I realized that instead of as_ordered for partitioning I could also just get the keys array directly and partition on that, as_ordered doesn't really simplify that part of the code.

So this code would only become useful when we also add a way to efficiently transform a DictionaryArray to being sorted.

jhorstmann · 2022-05-29T17:06:45Z

arrow/src/compute/kernels/sort.rs

@@ -151,8 +152,16 @@ fn partition_validity(array: &ArrayRef) -> (Vec<u32>, Vec<u32>) {
        // faster path
        0 => ((0..(array.len() as u32)).collect(), vec![]),
        _ => {
-            let indices = 0..(array.len() as u32);
-            indices.partition(|index| array.is_valid(*index as usize))
+            let validity = array.data().null_buffer().unwrap();


Noticed this as a hotspot while profiling. The is_valid function does not seem to get inlined and contains a branch, and we can also initialize both vectors with the right capacities.

@tustvold / @viirya can you help verify this change?

Looks correct to me. This is how is_valid implemented internally. Just not sure if it is good to expose the details here.

is_valid(i) is basically:

    let buffer = array.null_buffer().unwrap();
    bit_util::get_bit_raw(buffer.as_ptr(), offset + i)

Maybe we can add an #[inline] or #[inline(always)] annotation to the various locations?

There doesn't seem to be any such annotations yet
https://sourcegraph.com/github.com/apache/arrow-rs/-/blob/arrow/src/array/data.rs?L420:12&subtree=true
https://sourcegraph.com/github.com/apache/arrow-rs/-/blob/arrow/src/array/array.rs?L178:26&subtree=true

FWIW looping through the validity bitmap using get_bit_raw is likely significantly slower than using UnalignedBitChunkIterator as used by filter::IndexIterator. Perhaps we should expose this as a method on bitmap 🤔 - I'll create a ticket

#1864
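
For readers following the thread, here is a minimal sketch of the idea discussed above: read the validity bitmap directly and pre-size both output vectors. It works on a plain byte slice with Arrow-style bit order (least significant bit first, set bit means valid); the signature is an assumption for this example and it is not the code from the PR, which operates on `ArrayData` and also has to account for the array offset.

```rust
/// Split row indices into (valid, null) using a validity bitmap.
/// Bits are least-significant-bit first; a set bit means the slot is valid.
pub fn partition_validity(
    validity: &[u8],
    len: usize,
    null_count: usize,
) -> (Vec<u32>, Vec<u32>) {
    assert!(validity.len() * 8 >= len, "bitmap too short");
    assert!(null_count <= len, "null_count exceeds length");
    // Both output sizes are known up front, so neither vector has to grow.
    let mut valid = Vec::with_capacity(len - null_count);
    let mut nulls = Vec::with_capacity(null_count);
    for i in 0..len {
        let is_set = (validity[i / 8] & (1 << (i % 8))) != 0;
        if is_set {
            valid.push(i as u32);
        } else {
            nulls.push(i as u32);
        }
    }
    (valid, nulls)
}
```

This still tests one bit per iteration; the chunked iteration mentioned above would process many bits at a time and is likely faster for large arrays.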

alamb

I think this is looking quite cool @jhorstmann

I had a comment about the API and a comment about the tests, but otherwise this is looking quite sweet.

cc @tustvold @viirya do you have

alamb · 2022-06-03T14:42:23Z

arrow/src/array/array_dictionary.rs

+    /// Returns a DictionaryArray referencing the same data
+    /// with the [DictionaryArray::is_ordered] flag set to `true`.
+    /// Note that this does not actually reorder the values in the dictionary.
+    pub fn as_ordered(&self) -> Self {


I wonder if this API should be marked as unsafe as it relies on the user "doing the right thing"?

Perhaps we could call it something like pub unsafe fn as_ordered_unchecked() ?

Good point. I don't think it would be possible to trigger undefined behavior with this method, it would only sort differently. But it certainly makes an assumption that the compiler can not verify.

Maybe the method could also have a better name, my intention was something like assume_can_be_sorted_by_keys. Even if the dictionary is not actually sorted, setting the flag allows useful behavior, like sorting by keys and then using lexicographical_partition_ranges with the same key-based comparator.

alamb · 2022-06-03T14:42:38Z

arrow/src/array/array_dictionary.rs

+        } else {
+            // validate up front that we can do conversions from/to usize for the whole range of keys
+            // this allows using faster unchecked conversions below
+            K::Native::from_usize(values.len())


alamb · 2022-06-03T14:47:21Z

arrow/src/array/array_dictionary.rs

+                ArrayData::new_unchecked(
+                    self.data_type().clone(),
+                    self.len(),
+                    Some(self.data.null_count()),


It makes sense the nullness is the same before and after sorting. I assume it was faster to do this than to create the new values buffer directly from an iterator of Option<K::Native>?

I don't think I benchmarked this separately, but it should be quite a bit faster. In the most common case of offset being zero this would be zero-copy, while creating a PrimitiveArray from an iterator has some overhead per element, checking whether the buffers need to grow and setting individual bits in the validity bitmap. If there is no validity, the iterator api would also construct one first.

I might try looking into the performance of the iterator based apis in a separate issue.

alamb · 2022-06-03T14:49:14Z

arrow/src/array/ord.rs

-        left.cmp(right)
-    })
+    // only compare by keys if both arrays actually point to the same value buffers
+    if left.is_ordered() && ArrayData::ptr_eq(left_values.data(), right_values.data()) {


I don't understand why we also need to check left.is_ordered() if we know the arrays actually point to the same underlying values.

I might be missing something, but I think this is the only place where we check whether to sort by keys. Could probably make this even stricter and check that left and right are the same pointer, then both would be guaranteed to have the same is_ordered value. The way it is now, in theory right could be marked as is_ordered but use the same dictionary values as another array which is not marked.
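
The condition under discussion can be sketched with `Arc::ptr_eq` on a shared values buffer. Everything here is a toy model: `ToyDict` and `compare` are hypothetical, and the sketch checks the flag on both sides, which is the stricter variant mentioned above rather than what the diff does.

```rust
use std::cmp::Ordering;
use std::sync::Arc;

/// Toy dictionary column whose values buffer can be shared between arrays.
pub struct ToyDict {
    pub keys: Vec<usize>,
    pub values: Arc<Vec<String>>,
    pub is_ordered: bool,
}

/// Compare row `i` of `left` with row `j` of `right`.
pub fn compare(left: &ToyDict, i: usize, right: &ToyDict, j: usize) -> Ordering {
    // Keys are only comparable if both arrays index the same, ordered values.
    if left.is_ordered
        && right.is_ordered
        && Arc::ptr_eq(&left.values, &right.values)
    {
        left.keys[i].cmp(&right.keys[j])
    } else {
        // Fall back to comparing the strings the keys refer to.
        left.values[left.keys[i]].cmp(&right.values[right.keys[j]])
    }
}
```

Two arrays with equal but separately allocated dictionaries take the slow path in this sketch, because pointer equality is a cheap sufficient condition and not a full comparison of the values.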

alamb · 2022-06-03T14:52:11Z

arrow/src/compute/kernels/sort.rs

@@ -151,8 +152,16 @@ fn partition_validity(array: &ArrayRef) -> (Vec<u32>, Vec<u32>) {
        // faster path
        0 => ((0..(array.len() as u32)).collect(), vec![]),
        _ => {
-            let indices = 0..(array.len() as u32);
-            indices.partition(|index| array.is_valid(*index as usize))
+            let validity = array.data().null_buffer().unwrap();


@tustvold / @viirya can you help verify this change?

alamb · 2022-06-03T14:56:07Z

arrow/src/compute/kernels/sort.rs

+        let mut array = data.into_iter().collect::<DictionaryArray<T>>();
+
+        if ordered {
+            array = array.as_ordered();


Is this supposed to call array.make_ordered()? I don't understand why the test would pass in a dictionary whose values appear to be unsorted and then mark them as_sorted

Sorry if I am missing something obvious

jhorstmann · 2022-06-04T09:41:46Z

One open issue (that kind of existed already before) is that the is_ordered flag only exists on the Array and does not round-trip through ArrayData. Usually all information needed to reconstruct an array is contained in ArrayData.

alamb · 2022-07-08T16:01:42Z

For what it is worth I may have time to help get this PR over the line in the next few weeks (as sorting dictionary arrays is currently the tall pole in certain parts of my project)

alamb · 2022-11-01T15:44:40Z

Marking this as draft as it has bitrotted significantly -- @jhorstmann let us know what you want to do with this one

jhorstmann · 2022-11-14T22:19:50Z

Closing this as I believe similar functionality is now available with the row format.

jhorstmann added 2 commits December 16, 2021 19:40

Optionally sort dictionary arrays by key instead of value

63f1936

Handle comparing by dictionary keys in lexicographical sort and parti…

38dfd18

…tioning

github-actions bot added the arrow Changes to the arrow crate label Dec 16, 2021

jhorstmann added 2 commits December 16, 2021 21:06

Benchmarks for partition kernels with dictionary arrays

60eca77

Benchmarks with different cardinalities

cf2aa11

alamb reviewed Dec 20, 2021

jhorstmann added 3 commits January 7, 2022 19:48

Use the is_ordered flag on DictionaryArray instead of a separate flag…

a2ca03e

… on SortOptions

Comments on one test case

5c31db2

Rename method for setting the is_ordered flag

d84fbb3

jhorstmann added 14 commits January 14, 2022 18:32

Function to convert DictionaryArray to one with ordered values

25fcee6

Adjust error checking

d5a0575

Merge remote-tracking branch 'upstream/master' into sorting-and-parti…

789f6d1

…tioning-by-dictionary-keys

Actually mark the dictionary as ordered after sorting and avoid some …

6ad616e

…bounds checks

Reuse ArrayData::ptr_eq

743ea19

Add benchmarks

fc7189e

Remove leftover comment

6ccf7f1

Correct data for as_ordered keys

199aaeb

Remove parameter from as_ordered method

791e863

Revert enabling force_validate

4482b1b

Need a loopup vector for mapping to new keys, previous test was someh…

172752a

…ow an edge case

Optimize partition_validity

a013778

Optimize DictionaryArray::make_ordered

17c20e7

Merge remote-tracking branch 'upstream/master' into sorting-and-parti…

a540f08

…tioning-by-dictionary-keys

jhorstmann changed the title ~~Implement option to sort by dictionary keys in sort and partition kernels~~ Optimized sorting for ordered dictionaries May 29, 2022

jhorstmann commented May 29, 2022

jhorstmann added 2 commits May 29, 2022 19:17

Fix partition_kernels benchmark

10bc713

Formatting and safety comment

07389a2

jhorstmann marked this pull request as ready for review May 29, 2022 18:34

alamb reviewed Jun 3, 2022

jhorstmann mentioned this pull request Jun 13, 2022

Inline is_valid calls in PrimitiveIter/BooleanIter #1857

Open

alamb marked this pull request as draft November 1, 2022 15:44

jhorstmann closed this Nov 14, 2022
