Use partition for bool sort #448

jimexist · 2021-06-11T12:46:59Z

Which issue does this PR close?

Closes #447

Rationale for this change

When no limit is present, the best way to sort a boolean array is to partition it in place; the second-best way is an ordinary (allocating) partition.

We currently use a stable sort, which allocates O(n/2) additional memory. This change might therefore be faster: even when partition_in_place cannot be used (it is a nightly-only feature), the memory allocated by partition is also linear, so the memory overhead is similar while the sorting itself is quicker.

However, had the original sort been unstable (quicksort, in place), it is less obvious that this change would be faster, because the first pivot sweep of a quicksort would almost certainly separate the booleans already.
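To make the rationale concrete, here is a minimal sketch of sorting booleans with an allocating partition. The function name `partition_sort_bools` and the `(u32, bool)` index/value layout are assumptions made for illustration, not necessarily the code in this PR; only `Iterator::partition` and `Vec::extend` are real standard-library calls.

```rust
/// Hypothetical helper (not taken from the PR): orders (index, value) pairs
/// by their boolean value in a single pass instead of a comparison sort.
fn partition_sort_bools(pairs: Vec<(u32, bool)>, descending: bool) -> Vec<(u32, bool)> {
    // Iterator::partition sends items for which the predicate holds into the
    // first collection and all others into the second. For an ascending sort
    // the first group is the `false` values; for descending it is the `true` values.
    let (mut first, second): (Vec<(u32, bool)>, Vec<(u32, bool)>) = pairs
        .into_iter()
        .partition(|&(_, value)| value == descending);
    // Concatenating the two groups yields the sorted order. Each group keeps
    // its original relative order, so the result matches a stable sort.
    first.extend(second);
    first
}
```

The two output vectors together hold n elements, which is the linear allocation the description refers to.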

What changes are included in this PR?

Are there any user-facing changes?

alamb · 2021-06-13T11:35:59Z

I will try and review this tomorrow

alamb

It does indeed appear that sort_by allocates auxiliary memory according to the documentation: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.sort_by
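For comparison, a small sketch of the two standard-library sorts being discussed. The standard library documents `sort_by` as a stable sort and, at the time of this PR, described its implementation as allocating temporary storage, while `sort_unstable_by` is documented as in place and non-allocating. The wrapper names and the `(u32, bool)` pair layout below are assumptions for illustration only.

```rust
/// Hypothetical wrapper: stable sort by the boolean value.
/// Equal values keep their original relative order.
fn sort_pairs_stable(mut pairs: Vec<(u32, bool)>) -> Vec<(u32, bool)> {
    pairs.sort_by(|a, b| a.1.cmp(&b.1));
    pairs
}

/// Hypothetical wrapper: unstable sort by the boolean value.
/// Documented as in place, but the order among equal values is unspecified.
fn sort_pairs_unstable(mut pairs: Vec<(u32, bool)>) -> Vec<(u32, bool)> {
    pairs.sort_unstable_by(|a, b| a.1.cmp(&b.1));
    pairs
}
```

Both produce all `false` values before all `true` values; only the stable variant guarantees which index comes first within each group.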

I tested the performance of this change using the benchmark in #457

I ran the following tests:

(arrow_dev) alamb@ip-10-0-0-124:~/Software/arrow-rs$ cargo bench -p arrow --bench sort_kernel -- 'bool sort' --save-baseline master

(arrow_dev) alamb@ip-10-0-0-124:~/Software/arrow-rs$ cargo bench -p arrow --bench sort_kernel -- 'bool sort' --save-baseline bool-partition

Critcmp implies there isn't much difference with this implementation vs what is on master

critcmp master bool-partition
group                   bool-partition                         master
-----                   --------------                         ------
bool sort 2^12          1.02   422.0±16.63µs        ? ?/sec    1.00   414.9±15.86µs        ? ?/sec
bool sort nulls 2^12    1.01   416.4±16.02µs        ? ?/sec    1.00    410.6±9.14µs        ? ?/sec

So looks good to me 👍

jimexist · 2021-06-15T00:25:23Z

i guess 1024 is too short an array so it's dominated by the memory allocation rather than sorting

alamb · 2021-06-15T10:21:15Z

i guess 1024 is too short an array so it's dominated by the memory allocation rather than sorting

I also was under the impression that the partition sort was aiming to improve memory usage rather than speed

jimexist · 2021-06-15T11:09:02Z

i guess 1024 is too short an array so it's dominated by the memory allocation rather than sorting

I also was under the impression that the partition sort was aiming to improve memory usage rather than speed

Both, but the memory improvement is not so obvious unless partition_in_place is stabilized; it is currently available only on nightly.
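As an illustration of what an in-place partition buys on stable Rust, here is a hand-written single-pass sketch. It is not the nightly `Iterator::partition_in_place` API and not code from this PR; the function name and the `(u32, bool)` layout are hypothetical. Unlike the allocating partition, it does not preserve the relative order of the `true` values.

```rust
/// Hypothetical in-place partition: moves every pair whose value is `false`
/// to the front of the slice and returns how many such pairs there are.
/// No additional memory is allocated.
fn partition_bools_in_place(pairs: &mut [(u32, bool)]) -> usize {
    // `next_false` is the position where the next `false` pair belongs.
    let mut next_false = 0;
    for i in 0..pairs.len() {
        if !pairs[i].1 {
            // Swap the `false` pair into the front region and grow that region.
            pairs.swap(next_false, i);
            next_false += 1;
        }
    }
    next_false
}
```

The returned index is the boundary between the two groups, so `pairs[..k]` holds the `false` values and `pairs[k..]` holds the `true` values.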

alamb · 2021-06-16T11:01:39Z

Ok, I'll increase the size of the sort benchmark and see if I can see any difference

alamb · 2021-06-19T11:44:38Z

I finally got a chance to re-run the tests

Using 2^14 = 16384

critcmp master bool-partition
group                   bool-partition                         master
-----                   --------------                         ------
bool sort 2^14          1.00  1912.9±52.24µs        ? ?/sec    1.09      2.1±0.16ms        ? ?/sec
bool sort nulls 2^12    1.00   404.6±11.62µs        ? ?/sec    1.06   426.9±25.21µs        ? ?/sec

So looks like a nice change to me

Here is the test change in case anyone else wants to try:

diff --git a/arrow/benches/sort_kernel.rs b/arrow/benches/sort_kernel.rs
index f9f5f24c1..f517afede 100644
--- a/arrow/benches/sort_kernel.rs
+++ b/arrow/benches/sort_kernel.rs
@@ -80,9 +80,9 @@ fn add_benchmark(c: &mut Criterion) {
         b.iter(|| bench_sort(&arr_a, &arr_b, None))
     });
 
-    let arr_a = create_bool_array(2u64.pow(12) as usize, false);
-    let arr_b = create_bool_array(2u64.pow(12) as usize, false);
-    c.bench_function("bool sort 2^12", |b| {
+    let arr_a = create_bool_array(2u64.pow(14) as usize, false);
+    let arr_b = create_bool_array(2u64.pow(14) as usize, false);
+    c.bench_function("bool sort 2^14", |b| {
         b.iter(|| bench_sort(&arr_a, &arr_b, None))
     });

Sorry for the delay @jimexist -- it has been a busy few days for me

Jiayu Liu added 2 commits June 13, 2021 23:11

optimize boolean sort using parition

6c3e5d8

add docs

b3e3dcd

jimexist force-pushed the use-partition-for-bool-sort branch from 403fe20 to b3e3dcd Compare June 13, 2021 15:11

alamb mentioned this pull request Jun 14, 2021

Add sort boolean benchmark #457

Merged

alamb approved these changes Jun 14, 2021

alamb merged commit 1f1c637 into apache:master Jun 19, 2021

jimexist deleted the use-partition-for-bool-sort branch June 19, 2021 12:03

alamb pushed a commit that referenced this pull request Jun 22, 2021

Use partition for bool sort (#448)

c1878ce

* optimize boolean sort using parition * add docs

alamb added the cherry-picked label Jun 22, 2021

alamb mentioned this pull request Jun 22, 2021

Cherry-pick Use partition for bool sort (#448) #484

Closed

alamb removed the cherry-picked label Jun 22, 2021

alamb pushed a commit that referenced this pull request Jun 23, 2021

Use partition for bool sort (#448)

7d7c879

* optimize boolean sort using parition * add docs

alamb added the cherry-picked label Jun 23, 2021

alamb mentioned this pull request Jun 23, 2021

Cherry pick Use partition for bool sort to active_release #494

Merged

alamb added a commit that referenced this pull request Jun 23, 2021

Use partition for bool sort (#448) (#494)

c1f9083

* optimize boolean sort using parition * add docs Co-authored-by: Jiayu Liu <Jimexist@users.noreply.github.com>
