
ARROW-10540: [Rust] Extended filter kernel to all types and improved performance #8960

Closed · wants to merge 11 commits

Conversation

@jorgecarleitao (Member) commented Dec 18, 2020

This PR improves the filter kernel:

  • made the filter benchmarks more realistic
  • performance improved by 1.2-4x for all multi-filter operations
  • performance decreased by 30% for a single-filter operation with 50% taken (and 2x faster for dense)
  • filter now supports all types supported by MutableArrayData (in particular nested lists, struct, etc.)
  • removed 400 LOC

There are two novel ideas here:

  1. it minimizes the number of memcopies when building the filtered array, both for single filter and multi-filter operations.

  2. for single filter operations, it leverages an iterator to create the new array on the fly. For multi filter operations, it persists the iterator's result in a vector and iterates over it per array.

This PR also improves the performance of MutableArrayData by avoiding some bounds checks via unsafe (properly documented).
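As an illustration of the first idea, below is a minimal sketch (not the PR's code): it uses a plain `&[bool]` instead of the packed bit buffer that the kernel actually walks with `BitChunkIterator`, and groups consecutive selected slots into ranges so that each range becomes a single memcopy.

```rust
// Minimal sketch of the "contiguous ranges" idea; names and types are illustrative.

/// Yields `(start, end)` half-open ranges of consecutive `true` slots.
fn selected_ranges(filter: &[bool]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = None;
    for (i, &keep) in filter.iter().enumerate() {
        match (keep, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                ranges.push((s, i));
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        ranges.push((s, filter.len()));
    }
    ranges
}

fn main() {
    let values: Vec<u8> = (0..10).collect();
    let filter = [true, true, false, true, true, true, false, false, true, true];

    // Each range becomes a single `extend_from_slice` (one memcopy)
    // instead of one copy per selected slot.
    let mut filtered = Vec::with_capacity(values.len());
    for (start, end) in selected_ranges(&filter) {
        filtered.extend_from_slice(&values[start..end]);
    }
    assert_eq!(filtered, vec![0, 1, 3, 4, 5, 8, 9]);
}
```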

Summary of the benchmarks:

| benchmark | variation (%) |
| -------------- | -------------- |
| filter u8 | 29.5 |
| filter u8 low selectivity | 7.3 |
| filter context u8 w NULLs | -17.5 |
| filter context u8 w NULLs high selectivity | -21.9 |
| filter context f32 high selectivity | -22.0 |
| filter context f32 | -26.8 |
| filter context string high selectivity | -27.5 |
| filter context string | -31.4 |
| filter context u8 | -40.3 |
| filter u8 high selectivity | -47.3 |
| filter context string low selectivity | -55.3 |
| filter context u8 w NULLs low selectivity | -57.7 |
| filter context f32 low selectivity | -64.8 |
| filter context u8 low selectivity | -66.0 |
| filter context u8 high selectivity | -77.2 |

Code used to benchmark:

git checkout 54da4378d138df12bd4e09a68b0f4c80218834c7
cargo bench --bench filter_kernels
git checkout mutable_filter2
cargo bench --bench filter_kernels

Benchmark result:

   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 01s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/filter_kernels-5208f9a404de52c9
Gnuplot not found, using plotters backend
filter u8               time:   [512.54 us 513.43 us 514.37 us]                      
                        change: [+29.070% +29.548% +30.003%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

filter u8 high selectivity                                                                             
                        time:   [11.494 us 11.513 us 11.532 us]
                        change: [-47.846% -47.337% -46.755%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

filter u8 low selectivity                                                                             
                        time:   [7.0342 us 7.0520 us 7.0693 us]
                        change: [+6.5543% +7.3409% +8.1080%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

filter context u8       time:   [233.81 us 234.31 us 234.93 us]                              
                        change: [-40.715% -40.329% -39.886%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

filter context u8 high selectivity                                                                             
                        time:   [4.5943 us 4.6100 us 4.6276 us]
                        change: [-77.449% -77.231% -77.022%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  10 (10.00%) high mild
  8 (8.00%) high severe

filter context u8 low selectivity                                                                             
                        time:   [1.7582 us 1.7664 us 1.7742 us]
                        change: [-66.250% -65.989% -65.669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

filter context u8 w NULLs                                                                            
                        time:   [476.99 us 477.71 us 478.44 us]
                        change: [-17.852% -17.457% -17.000%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

filter context u8 w NULLs high selectivity                                                                            
                        time:   [296.46 us 297.03 us 297.67 us]
                        change: [-22.297% -21.871% -21.393%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

filter context u8 w NULLs low selectivity                                                                             
                        time:   [2.5988 us 2.6124 us 2.6268 us]
                        change: [-58.065% -57.668% -57.237%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

filter context f32      time:   [470.69 us 472.39 us 474.73 us]                               
                        change: [-29.574% -26.769% -24.242%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) high mild
  5 (5.00%) high severe

filter context f32 high selectivity                                                                            
                        time:   [307.16 us 307.58 us 308.03 us]
                        change: [-22.472% -22.039% -21.532%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

filter context f32 low selectivity                                                                             
                        time:   [2.4266 us 2.4323 us 2.4384 us]
                        change: [-65.024% -64.764% -64.517%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

filter context string   time:   [645.82 us 647.32 us 649.04 us]                                  
                        change: [-31.810% -31.427% -31.046%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

Benchmarking filter context string high selectivity: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
filter context string high selectivity                                                                             
                        time:   [999.11 us 1.0008 ms 1.0027 ms]
                        change: [-28.133% -27.524% -26.930%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

filter context string low selectivity                                                                             
                        time:   [3.6441 us 3.6623 us 3.6799 us]
                        change: [-55.650% -55.329% -55.013%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) low mild
  2 (2.00%) high severe

@codecov-io commented Dec 18, 2020

Codecov Report

Merging #8960 (f9ee235) into master (71e37e2) will increase coverage by 0.34%.
The diff coverage is 86.06%.


@@            Coverage Diff             @@
##           master    #8960      +/-   ##
==========================================
+ Coverage   83.26%   83.61%   +0.34%     
==========================================
  Files         195      196       +1     
  Lines       48066    47875     -191     
==========================================
+ Hits        40024    40032       +8     
+ Misses       8042     7843     -199     
| Impacted Files | Coverage Δ |
| --- | --- |
| rust/arrow/src/array/transform/boolean.rs | 76.92% <50.00%> (ø) |
| rust/arrow/src/array/transform/fixed_binary.rs | 78.94% <66.66%> (ø) |
| rust/arrow/src/array/transform/mod.rs | 88.21% <67.08%> (+1.03%) ⬆️ |
| rust/arrow/src/array/transform/list.rs | 83.87% <77.77%> (+2.05%) ⬆️ |
| rust/arrow/src/compute/kernels/filter.rs | 98.47% <96.66%> (+36.88%) ⬆️ |
| rust/arrow/src/array/array_boolean.rs | 86.71% <100.00%> (ø) |
| rust/arrow/src/array/builder.rs | 84.17% <100.00%> (ø) |
| rust/arrow/src/array/transform/primitive.rs | 100.00% <100.00%> (ø) |
| rust/arrow/src/array/transform/utils.rs | 100.00% <100.00%> (ø) |
| rust/arrow/src/array/transform/variable_size.rs | 100.00% <100.00%> (ø) |
| ... and 15 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@yordan-pavlov (Contributor)

@jorgecarleitao these are some great performance improvements when multiple arrays are filtered - this should pay off nicely when filtering a record batch containing many columns. I imagine this is explained by doing more work in advance, when building the filter, and less work when applying the filter to each array (compared to the previous implementation with the filter context).

The performance degradation in the filter u8 is interesting - do you have a hypothesis for what's causing this? I wonder if this could be explained again by this new implementation doing more work in advance, which works very well when filtering multiple columns but is a bit slower when filtering a single column.

Also I would expect the benchmarks with highly selective filters (mostly 0s in the filter array) to be faster (as there is more skipping and less copying), compared to the low selectivity filter (mostly 1s in the filter array) benchmarks (because of more copying and less skipping), but this relationship appears to be reversed in the results above.

I also wonder how repeatable the benchmarks are now that they use randomly generated arrays. What are your observations; are the benchmark results fairly stable across multiple runs?

I also like how the filter kernel is now implemented using the BitChunkIterator; overall great work!

self.on_region = false;
return Some(result);
}
} else if mask == 18446744073709551615u64 {
Contributor

I get that it might be more performant (although I suspect the compiler optimizes this) not to calculate !0u64 inside the loop, but wouldn't this be more readable and obvious if !0u64 were put in a suitably named constant instead of 18446744073709551615u64?
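For example (the constant name here is purely illustrative):

```rust
/// Hypothetical name; any descriptive identifier works.
const ALL_BITS_SET: u64 = !0u64; // == u64::MAX == 18446744073709551615

fn is_fully_set(mask: u64) -> bool {
    mask == ALL_BITS_SET
}

fn main() {
    assert!(is_fully_set(u64::MAX));
    assert!(!is_fully_set(0b1011));
}
```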

@jorgecarleitao (Member, Author) commented Dec 19, 2020

@yordan-pavlov

Thanks for the feedback. All great points.

The performance degradation in the filter u8 is interesting - do you have a hypothesis for what's causing this?

I am sorry, I was not clear in the PR description:

  • filter u8 is the case of "keep 50%" of the array
  • filter u8 high selectivity is the case "keep 90%" of the array.

filter u8 has a 30% degradation, filter u8 high selectivity is 2x faster. Both are single array filters.

I wonder if this could be explained again by this new implementation doing more work in advance, which works very well when filtering multiple columns but is a bit slower when filtering a single column.

The single filter in this PR uses a single pass via an iterator, while master was building the context on filter. So there is less initial work, and half the memory consumption for the operation, as there is no vector allocated on filter, only on multi filter.

This PR's implementation does perform more work per slot, to minimize the number of copies. I.e. while the total number of copied bytes is the same in both implementations, this implementation is guaranteed to call memcopy the minimum necessary number of times, by "packing" these calls together in a single call when the calls are contiguous in memory. This implementation is thus optimized for filters that take contiguous regions, which tends to happen when there are a lot of 1s, or when the data is distributed in the array in that way. AFAIK, in real data, this happens more often than by chance, so our benchmarks are being conservative here.

This behavior (of minimizing the number of calls) is crucial for non-primitive types because it minimizes the number of reallocations. The prime example here is the StringArray. There, we can't tell how much to reserve for the "data" buffer from the number of 1s in the filter (it depends on the actual number of bytes in each slot). Therefore, when building the new array, "copy slot [2]" followed by "copy slot [3]" is very different from "copy slots [2,3]": the former can cause up to 2 reallocations, while the latter causes at most 1. The more complex and larger the data structure, the worse this is.

By grouping these "extend" calls together, we reduce the number of reallocations when building the new array. This is not so relevant for primitive types because we know the buffer size from the number of 1s and the data type.

The implementation just performs this computation on the fly (via an Iterator), so that, in single-filter ops, it happens during the build of the array (and is cached for multi-filter ops).
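A simplified sketch of the reallocation argument, using plain `Vec`s rather than the arrow buffer APIs (the bytes and offsets are made up for illustration):

```rust
// For variable-length data, the bytes needed per slot are unknown up front, so
// each per-slot copy may trigger its own reallocation of the values buffer,
// while one grouped copy of a contiguous range reallocates at most once.
fn main() {
    // "String array" values stored as concatenated bytes plus offsets.
    let values: &[u8] = b"heywowarrowrustdata";
    let offsets: &[usize] = &[0, 3, 6, 11, 15, 19]; // 5 slots

    // Copy slots 2 and 3 one at a time: two extend calls, up to two reallocations.
    let mut out_a: Vec<u8> = Vec::new();
    for slot in 2..4 {
        out_a.extend_from_slice(&values[offsets[slot]..offsets[slot + 1]]);
    }

    // Copy slots [2, 4) as one contiguous range: one extend call, at most one reallocation.
    let mut out_b: Vec<u8> = Vec::new();
    out_b.extend_from_slice(&values[offsets[2]..offsets[4]]);

    assert_eq!(out_a, out_b); // both contain "arrowrust"
}
```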

highly selective filters (mostly 0s in the filter array)

This PR calls filters with mostly 1s "highly selective", as they "select" many items, but maybe the naming is incorrect? If so, could we come up with something other than "selective" altogether? IMO "highly selective" meaning "selects few items" is confusing.

I also wonder how repeatable the benchmarks are now that they use randomly generated arrays. What are your observations; are the benchmarks results fairly stable across multiple runs?

Good point. The randomness is there to avoid making assumptions about the distribution of the data (e.g. i % 2 == 0 is highly predictable). We do this in most benches. The results are reproducible because the seed is set to a constant (independent of the thread, machine or time).
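A sketch of the seeding approach (not the exact benchmark code; it assumes the `rand` crate and an arbitrary constant seed):

```rust
// Because the seed is a fixed constant, every run produces the same "random"
// filter, so results are comparable across runs and machines.
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

fn random_filter(len: usize, keep_probability: f64) -> Vec<bool> {
    let mut rng = StdRng::seed_from_u64(42); // fixed, machine-independent seed
    (0..len).map(|_| rng.gen_bool(keep_probability)).collect()
}

fn main() {
    // Two independent calls yield identical data, so a benchmark re-run sees
    // exactly the same distribution of 0s and 1s.
    assert_eq!(random_filter(1024, 0.5), random_filter(1024, 0.5));
}
```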

@Dandandan (Contributor)

I looked a bit at this PR; looks good, didn't find anything weird (but I also don't know the details of this part of the code).
Speed-ups look great! Any idea where the filter u8 regression comes from?

@jorgecarleitao (Member, Author)

Any idea where the filter u8 regression comes from?

I tried to explain it in the comment above (there was a race condition here). Does it make sense?

@Dandandan (Contributor)

Makes sense @jorgecarleitao thanks

@yordan-pavlov (Contributor)

@jorgecarleitao thanks for the detailed explanation - it's great to see you have thought about optimizing the filtering of both single and multiple columns as much as possible;

regarding the meaning of high vs low selectivity of a filter, I agree it can be confusing - a highly / very selective filter is one which discards most of the data; it's not easy to find a good explanation from a credible source; here is one:

Selectivity could be defined as “the percentage of matching rows compared to total rows, regarding a given query’s criteria.” A lower percentage indicates higher selectivity. This means that if there are very few rows that meet a query’s (or an index’s, in the case of a filtered index) WHERE criteria compared to the total number of rows, the index is considered very selective.

from here https://www.red-gate.com/simple-talk/sql/performance/introduction-to-sql-server-filtered-indexes/

you might be right though - it might be better to come up with more intuitive names for those benchmarks

@Dandandan (Contributor) commented Dec 21, 2020

Did a small test of this branch against master with DataFusion to see the impact (merged master into this branch). From profiling I know query 1 spends quite some time in filtering.

Master

Query 1 iteration 0 took 895.2 ms
Query 1 iteration 1 took 900.6 ms
Query 1 iteration 2 took 877.8 ms
Query 1 iteration 3 took 882.3 ms
Query 1 iteration 4 took 870.9 ms
Query 1 iteration 5 took 876.8 ms
Query 1 iteration 6 took 875.2 ms
Query 1 iteration 7 took 879.5 ms
Query 1 iteration 8 took 864.7 ms
Query 1 iteration 9 took 874.1 ms
Query 1 avg time: 879.70 ms

This PR

Query 1 iteration 0 took 688.4 ms
Query 1 iteration 1 took 688.1 ms
Query 1 iteration 2 took 679.2 ms
Query 1 iteration 3 took 688.1 ms
Query 1 iteration 4 took 687.9 ms
Query 1 iteration 5 took 694.7 ms
Query 1 iteration 6 took 701.7 ms
Query 1 iteration 7 took 708.3 ms
Query 1 iteration 8 took 704.2 ms
Query 1 iteration 9 took 702.1 ms
Query 1 avg time: 694.28 ms

Looks like a pretty decent speedup @jorgecarleitao

@alamb (Contributor) left a comment

Given the reported performance improvements, I think this PR sounds good from a feature-perspective.

I spent quite a while going through the code, and while I am not an expert in all these areas, it seems pretty good to me. A few more tests filtering larger sizes might be in order, but all in all really nice work @jorgecarleitao

I also ran the tests under valgrind to try to double-check the use of unsafe, and it did not report any errors.

So all in all, I think this is pretty much good to go. 👍 I think it would be good to have at least one more person carefully review it (if @yordan-pavlov already did a thorough review, that is good for me; it just wasn't 100% clear to me whether he had done so).

@@ -46,16 +46,21 @@ struct _MutableArrayData<'a> {
pub len: usize,
pub null_buffer: MutableBuffer,

pub buffers: Vec<MutableBuffer>,
pub buffer1: MutableBuffer,
Contributor

I recommend comments here explaining the use of buffer1 and buffer2 for future readers who may not have the context of this PR

// Soundness
// * offset buffer is always extended in slices of T and aligned accordingly.
// * Buffer[0] is initialized with one element, 0, and thus `mutable_offsets.len() - 1` is always valid.
let offsets = offset_buffer.data().align_to::<T>().1;
Contributor

I wonder if this would be a good place to use a debug_assert.

Something like the following to ensure the data was actually aligned as intended

Suggested change
let offsets = offset_buffer.data().align_to::<T>().1;
let (prefix, offsets, suffix) = offset_buffer.data().align_to::<T>();
debug_assert!(prefix.len() == 0 && suffix.len() == 0);

Contributor

Or maybe something more direct:

debug_assert!(*offsets.get_unchecked(offsets.len() - 1) == mutable_offsets[mutable_offsets.len() - 1]);

Contributor

We use the first option (prefix, offsets, suffix) in the parquet crate. I'd support either option that you're suggesting

@@ -298,79 +290,137 @@ impl<'a> MutableArrayData<'a> {
use_nulls = true;
};

let buffers = match &data_type {
let empty_buffer = MutableBuffer::new(0);
let buffers: [MutableBuffer; 2] = match &data_type {
Contributor

Very minor suggestion:

Suggested change
let buffers: [MutableBuffer; 2] = match &data_type {
let (buffer1, buffer2) = match &data_type {

And you can remove the destructuring below.

let values_buffer = &mut mutable.buffer2;

// this is safe due to how offset is built. See details on `get_last_offset`
let last_offset = unsafe { get_last_offset(offset_buffer) };
Contributor

I find the use of unsafe in order to call get_last_offset somewhat confusing. I suspect you are trying to follow the unsafe guidelines and ensure it is clear where unsafe is being used.

However, in this case the only thing the caller can do is trust that the MutableBuffer it was passed was created correctly. Forcing callers to write unsafe in order to call get_last_offset, even though they can do nothing to ensure or validate safety, seems unnecessarily confusing to me.

I would personally suggest making get_last_offset an associated function, such as MutableBuffer::get_last_offset

And then change calls such as this to

let last_offset = offset_buffer.get_last_offset();

Member Author

I get your point.

get_last_offset must only be used on offset buffers, i.e. buffers whose bytes represent an i32 or i64 and were specifically built from those types. Using it on other buffers is undefined behavior (even with the safeguards of using align_to). We can remove the unsafe mark from get_last_offset, though.

My proposal is that we refactor the src/transform code so that it has a struct specific for each array type (that implements some trait for dyn support). This will allow last_offset to be stored in the array-specific struct, thereby avoiding this problem altogether (of having to read bytes written to the MutableBuffer).
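A rough sketch of what that per-type design could look like (all names are hypothetical, the buffers are simplified to plain `Vec`s, and this is not code from this PR):

```rust
/// Hypothetical trait implemented by every per-type extender (object-safe for dyn use).
trait ArrayExtend {
    /// Copy `len` slots starting at `start` from the source into the destination.
    fn extend(&mut self, start: usize, len: usize);
}

/// Hypothetical state for variable-sized types (e.g. StringArray).
struct VariableSizeExtend {
    offsets: Vec<i32>,        // destination offsets
    values: Vec<u8>,          // destination values
    last_offset: i32,         // kept in the struct, never re-read from the buffer
    source_offsets: Vec<i32>, // source offsets
    source_values: Vec<u8>,   // source values
}

impl ArrayExtend for VariableSizeExtend {
    fn extend(&mut self, start: usize, len: usize) {
        for i in start..start + len {
            let (s, e) = (
                self.source_offsets[i] as usize,
                self.source_offsets[i + 1] as usize,
            );
            self.last_offset += (e - s) as i32;
            self.offsets.push(self.last_offset);
            self.values.extend_from_slice(&self.source_values[s..e]);
        }
    }
}

fn main() {
    let mut ext = VariableSizeExtend {
        offsets: vec![0],
        values: vec![],
        last_offset: 0,
        source_offsets: vec![0, 3, 6, 11], // "hey", "wow", "arrow"
        source_values: b"heywowarrow".to_vec(),
    };
    ext.extend(1, 2); // copy slots 1..3 ("wow" and "arrow")
    assert_eq!(ext.values, b"wowarrow".to_vec());
    assert_eq!(ext.offsets, vec![0, 3, 8]);
}
```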

Contributor

Sounds like a good plan to me

FilterContext::new(filter_array)?.filter_primitive_array(data_array)
}
/// Filters an [Array], returning elements matching the filter (i.e. where the values are true).
/// WARNING: the nulls of `filter` are ignored and the value on its slot is considered.
Contributor

I think this WARNING should also be included in the doc comments of build_filter

Contributor

Or instead of ignoring the nulls, we could perform a values_buffer & null_buffer if there are nulls. Given that the input filter will almost always be the result of some computation, I'd prefer that we incur the slight cost of the AND operation, so that we treat null slots as false.
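A sketch of the suggested semantics (not arrow code; it operates on 64-bit words of the two bitmaps):

```rust
// AND the filter's values bitmap with its validity bitmap so that null slots
// behave as `false`.
fn combine(values: &[u64], validity: &[u64]) -> Vec<u64> {
    values
        .iter()
        .zip(validity.iter())
        .map(|(v, n)| v & n)
        .collect()
}

fn main() {
    // Slot 1 is `true` but null; after the AND it is treated as not selected.
    let values = [0b0111u64];
    let validity = [0b0101u64];
    assert_eq!(combine(&values, &validity), vec![0b0101u64]);
}
```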

Member Author

This "WARNING" was more like a way of describing a (unknown to me) feature. master already does this, but I did not have the time to go about checking what other implementations do (whether the result is null or something else).

Note that this hits datafusion already: there is at least one test there where we ignore a null value from a filter (because the predicate is built using arrays with nulls).

Comment on lines 187 to 189
/// Returns a function used to filter arbitrary arrays.
/// This is faster (2x for primitive types) than using [filter] on multiple arrays, but slower
/// than [filter] when filtering a single array.
Contributor

Suggested change
/// Returns a function used to filter arbitrary arrays.
/// This is faster (2x for primitive types) than using [filter] on multiple arrays, but slower
/// than [filter] when filtering a single array.
/// Returns a prepared function which can be applied to filter any number of arbitrary arrays.
///
/// You should use [filter] when filtering a single array and `build_filter` when filtering multiple arrays.
///
/// Creating this function requires time, but the prepared function is faster than [filter] when the
/// same filtering must be applied to multiple arrays (e.g. a multi-column `RecordBatch`).


#[test]
fn test_slice_iterator_chunk_and_bits() {
let filter_values = (0..127).map(|i| i % 62 != 0).collect::<Vec<bool>>();
Contributor

Given the algorithm's use of 64-bit chunks, I recommend testing an array whose length is not a multiple of 64, ideally something of length 192 + 17, or something that would also test the transition State::Chunks --> State::Bits --> State::Chunks

I may have missed this in reviewing the tests
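A sketch of test data along those lines (the test name is hypothetical, and the actual assertion against the slice iterator is elided since it depends on the kernel's API):

```rust
#[test]
fn test_slice_iterator_length_not_multiple_of_64() {
    // 192 + 17 = 209 values: three full 64-bit chunks plus a 17-bit remainder,
    // with a run of 0s straddling the first chunk boundary so the iterator
    // switches between whole-chunk and bit-by-bit handling and back.
    let filter_values = (0..192 + 17)
        .map(|i| !(60..70).contains(&i))
        .collect::<Vec<bool>>();
    // ...build the filter from `filter_values` and assert that the produced
    // slices are [(0, 60), (70, 209)].
    assert_eq!(filter_values.len(), 209);
}
```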

@yordan-pavlov (Contributor)

indeed, great work @jorgecarleitao and impressive performance improvements; @alamb I did review the changes a couple of days ago and I think it all looks good overall; once the comments have been addressed I would be happy to have this merged and see arrow and datafusion become even faster

@nevi-me (Contributor) left a comment

I don't have much more to add to @alamb's review


@jorgecarleitao (Member, Author)

Thank you all for your comments and suggestions. Really helpful.

I addressed them all except the get_last_offset one, which IMO needs a better approach; I kept it as is and will work on it later.

This was also rebased against latest master.

@alamb alamb closed this in 8fa68b0 Dec 24, 2020
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
ARROW-10540: [Rust] Extended filter kernel to all types and improved performance

Closes apache#8960 from jorgecarleitao/mutable_filter2

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>