ARROW-10540: [Rust] Extended filter kernel to all types and improved performance #8960
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #8960      +/-   ##
==========================================
+ Coverage   83.26%   83.61%   +0.34%
==========================================
  Files         195      196       +1
  Lines       48066    47875     -191
==========================================
+ Hits        40024    40032       +8
+ Misses       8042     7843     -199
```

Continue to review the full report at Codecov.
@jorgecarleitao these are some great performance improvements when multiple arrays are filtered - this should give great performance when filtering a record batch containing many columns. I imagine this is explained by doing more work in advance, when building the filter, and less work when applying the filter to each array (compared to the previous implementation with the filter context). The performance degradation in the single-array `filter u8` benchmark is interesting, though.

Also, I would expect the benchmarks with highly selective filters (mostly 0s in the filter array) to be faster (as there is more skipping and less copying) than the low-selectivity benchmarks (mostly 1s in the filter array, hence more copying and less skipping), but this relationship appears to be reversed in the results above.

I also wonder how repeatable the benchmarks are now that they use randomly generated arrays. What are your observations; are the benchmark results fairly stable across multiple runs?

I also like how the filter kernel is now implemented on top of `MutableArrayData`.
```rust
        self.on_region = false;
        return Some(result);
    }
} else if mask == 18446744073709551615u64 {
```
I get that it might be more performant (although I suspect the compiler optimizes this) to not calculate `!0u64` inside the loop, but wouldn't this be more readable and obvious if `!0u64` were put in a suitably named constant instead of `18446744073709551615u64`?
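A minimal sketch of the suggested change (the constant name is hypothetical, not from the PR):

```rust
/// All 64 bits set: a chunk of the filter equal to this mask is fully
/// selected and can be copied as one contiguous slice.
const FULL_CHUNK: u64 = !0u64; // == 18446744073709551615

// the comparison in the loop would then read:
// } else if mask == FULL_CHUNK {
```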
Thanks for the feedback. All great points.
I am sorry, I was not clear in the PR description:
The single filter in this PR uses a single pass via an iterator, while master built the filter context up front. This PR's implementation does perform more work per slot, to minimize the number of copies. I.e. while the total number of copied bytes is the same in both implementations, this implementation is guaranteed to call memcopy the minimum necessary number of times, by "packing" these calls together into a single call when the regions are contiguous in memory.

This implementation is thus optimized for filters that take contiguous regions, which tends to happen when there are a lot of 1s, or when the data is distributed in the array in that way. AFAIK, in real data, this happens more often than by chance, so our benchmarks are being conservative here.

This behavior (of minimizing the number of calls) is crucial for non-primitive types because it minimizes the number of relocations. The prime example here is variable-sized types such as strings, whose values must be copied and re-offset when building the new array. By grouping these "extend" calls together, we reduce the number of relocations when building the new array. This is not so relevant for primitive types because we know the buffer size from the number of 1s and the data type.

The implementation is just performing this computation on the fly (via an Iterator), so that, in single filter ops, it happens during the build of the array (and is cached for multi-filter ops).
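A minimal sketch of the packing idea described above (not the PR's actual iterator): consecutive 1s in the filter are coalesced into `(start, end)` runs, so each run becomes a single `extend`/memcopy instead of one copy per slot.

```rust
/// Collects the contiguous runs of `true` values in `filter` as
/// half-open `(start, end)` index ranges.
fn contiguous_runs(filter: &[bool]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &keep) in filter.iter().enumerate() {
        match (keep, start) {
            (true, None) => start = Some(i), // a run begins
            (false, Some(s)) => {
                runs.push((s, i)); // a run ends
                start = None;
            }
            _ => {} // run continues, or we are still skipping
        }
    }
    if let Some(s) = start {
        runs.push((s, filter.len())); // trailing run
    }
    runs
}

// e.g. [true, true, false, true] yields [(0, 2), (3, 4)]:
// two memcopies instead of three.
```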
This PR is calling "highly selective" those filters that have mostly 1s, i.e. filters that keep most of the values - the reverse of the convention you describe - which would explain the reversed relationship in the results above.
Good point. The randomness is to not make assumptions about the distribution of the data (e.g. a periodic pattern that could favor one implementation over another).
I looked a bit at this PR; it looks good. I didn't find anything weird (but I also don't know the details of this part of the code).
I tried to explain it in the comment above (there was a race condition here). Does it make sense?
Makes sense @jorgecarleitao, thanks!
@jorgecarleitao thanks for the detailed explanation - it's great to see you have thought about optimizing the filtering of both single and multiple columns as much as possible. Regarding the meaning of high vs low selectivity of a filter, I agree it can be confusing: a highly / very selective filter is one which discards most of the data. It's not easy to find a good explanation from a credible source; here is one, from https://www.red-gate.com/simple-talk/sql/performance/introduction-to-sql-server-filtered-indexes/. You might be right though - it might be better to come up with more intuitive names for those benchmarks.
Did a small test of this branch against master on DataFusion to see the impact (merged master into this branch). From profiling I know query 1 spends quite some time in filtering.

Master:

This PR:

Looks like a pretty decent speedup @jorgecarleitao
Given the reported performance improvements, I think this PR sounds good from a feature perspective.

I spent quite a while going through the code, and while I am not an expert in all these areas, it seems pretty good to me. A few more tests filtering larger sizes might be in order, but all in all really nice work @jorgecarleitao

I also ran the tests under valgrind to try and double-validate the use of `unsafe`, and it did not report any errors.

So all in all, I think this is pretty much good to go. 👍 I think it would be good to have at least one more person carefully review it (if @yordan-pavlov already did a thorough review, that is good for me; it just wasn't 100% clear to me if he had done so).
```diff
@@ -46,16 +46,21 @@ struct _MutableArrayData<'a> {
     pub len: usize,
     pub null_buffer: MutableBuffer,

-    pub buffers: Vec<MutableBuffer>,
+    pub buffer1: MutableBuffer,
```
I recommend comments here explaining the use of `buffer1` and `buffer2` for future readers who may not have the context of this PR.
```rust
// Soundness
// * offset buffer is always extended in slices of T and aligned accordingly.
// * Buffer[0] is initialized with one element, 0, and thus `mutable_offsets.len() - 1` is always valid.
let offsets = offset_buffer.data().align_to::<T>().1;
```
I wonder if this would be a good place to use a `debug_assert`. Something like the following to ensure the data was actually aligned as intended:
```diff
-let offsets = offset_buffer.data().align_to::<T>().1;
+let (prefix, offsets, suffix) = offset_buffer.data().align_to::<T>();
+debug_assert!(prefix.len() == 0 && suffix.len() == 0);
```
Or maybe something more direct:

```rust
debug_assert!(*offsets.get_unchecked(offsets.len() - 1) == mutable_offsets[mutable_offsets.len() - 1]);
```
We use the first option (`prefix`, `offsets`, `suffix`) in the parquet crate. I'd support either option that you're suggesting.
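A self-contained sketch of what the combined check could look like (the function name and shape are hypothetical; `align_to` is the real `std` slice method, and it is `unsafe` because it reinterprets bytes):

```rust
/// Reinterprets a byte slice as a slice of `T`, asserting in debug builds
/// that no unaligned prefix/suffix bytes were left over.
///
/// Safety: the caller must guarantee the bytes were written as `T` values.
unsafe fn offsets_from_bytes<T>(bytes: &[u8]) -> &[T] {
    let (prefix, offsets, suffix) = bytes.align_to::<T>();
    debug_assert!(prefix.is_empty() && suffix.is_empty());
    offsets
}
```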
```diff
@@ -298,79 +290,137 @@ impl<'a> MutableArrayData<'a> {
         use_nulls = true;
     };

-    let buffers = match &data_type {
+    let empty_buffer = MutableBuffer::new(0);
+    let buffers: [MutableBuffer; 2] = match &data_type {
```
Very minor suggestion:
```diff
-let buffers: [MutableBuffer; 2] = match &data_type {
+let (buffer1, buffer2) = match &data_type {
```
And you can remove the destructuring below.
```rust
let values_buffer = &mut mutable.buffer2;

// this is safe due to how offset is built. See details on `get_last_offset`
let last_offset = unsafe { get_last_offset(offset_buffer) };
```
I find the use of `unsafe` in order to call `get_last_offset` on `offset_buffer` somewhat confusing. I suspect you are trying to follow the unsafe guidelines and ensure it is clear where unsafe is being used. However, in this case the only thing the caller can do is trust that the `MutableBuffer` it was passed was created correctly. Forcing callers to write `unsafe` in order to call `get_last_offset`, even though they can do nothing to ensure/validate whether things are safe, seems unnecessarily confusing to me.

I would personally suggest making `get_last_offset` an associated function, such as `MutableBuffer::get_last_offset`, and then changing calls such as this one to:

```rust
let last_offset = offset_buffer.get_last_offset();
```
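A self-contained sketch of the suggested associated function; `MutableBuffer` here is a minimal stand-in, not the arrow-rs type, and the invariant documented in the comment is the one described in this thread:

```rust
use std::convert::TryInto;

/// Minimal stand-in for the real buffer type.
struct MutableBuffer {
    data: Vec<u8>,
}

impl MutableBuffer {
    /// Returns the last `i32` offset written to this buffer.
    ///
    /// Relies on the construction invariant discussed above: the buffer is
    /// initialized with a single `0i32` and only ever extended with whole
    /// `i32` values, so it is non-empty and its length is a multiple of 4.
    fn get_last_offset(&self) -> i32 {
        let n = self.data.len();
        debug_assert!(n >= 4 && n % 4 == 0);
        i32::from_ne_bytes(self.data[n - 4..].try_into().unwrap())
    }
}
```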
I get your point. `get_last_offset` must only be used on offset buffers, i.e. buffers whose bytes represent an `i32` or `i64` and were specifically built from those types. Using it on other buffers is undefined behavior (even with the safeguards of using `align_to`). We can remove the `unsafe` mark from `get_last_offset`, though.

My proposal is that we refactor the `src/transform` code so that it has a struct specific to each array type (that implements some trait for `dyn` support). This would allow `last_offset` to be stored in the array-specific struct, thereby avoiding this problem altogether (of having to read bytes written to the `MutableBuffer`).
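A rough sketch of what that proposal could look like (all names are hypothetical, not the eventual implementation): per-type structs behind a trait object, with `last_offset` kept as ordinary struct state instead of being re-read from buffer bytes.

```rust
/// Hypothetical per-type extend state, usable as `Box<dyn ExtendState>`.
trait ExtendState {
    /// Extends the output with `len` slots starting at `start`.
    fn extend(&mut self, source_offsets: &[i32], start: usize, len: usize);
}

struct VariableSized {
    offsets: Vec<i32>,
    last_offset: i32, // cached here, so no bytes are ever read back
}

impl ExtendState for VariableSized {
    fn extend(&mut self, source_offsets: &[i32], start: usize, len: usize) {
        // `len` slots span `len + 1` offsets; shift each source delta onto
        // our current end rather than copying the source offsets verbatim.
        for w in source_offsets[start..=start + len].windows(2) {
            self.last_offset += w[1] - w[0];
            self.offsets.push(self.last_offset);
        }
    }
}
```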
Sounds like a good plan to me
```rust
    FilterContext::new(filter_array)?.filter_primitive_array(data_array)
}

/// Filters an [Array], returning elements matching the filter (i.e. where the values are true).
/// WARNING: the nulls of `filter` are ignored and the value on its slot is considered.
```
I think this WARNING should also be included in the doc comments of `build_filter`.
Or instead of ignoring the nulls, we could perform a `values_buffer & null_buffer` if there are nulls. Given that the input `filter` will almost always be the result of some computation, I'd prefer that we incur the slight cost of the `AND` operation, so that we treat null slots as `false`.
This "WARNING" was more like a way of describing a (unknown to me) feature. master already does this, but I did not have the time to go about checking what other implementations do (whether the result is null or something else).
Note that this hits datafusion already: there is at least one test there where we ignore a null value from a filter (because the predicate is built using arrays with nulls).
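A minimal sketch of the semantics being proposed, using plain `Vec<bool>` stand-ins for the packed bitmaps arrow actually uses: the effective mask is `values AND validity`, so a null slot never selects a row.

```rust
/// Combines a filter's values with its validity (null) mask; a null slot
/// (validity = false) is treated as `false`, i.e. "do not select".
fn effective_mask(values: &[bool], validity: Option<&[bool]>) -> Vec<bool> {
    match validity {
        Some(valid) => values
            .iter()
            .zip(valid.iter())
            .map(|(&v, &ok)| v && ok) // value AND not-null
            .collect(),
        None => values.to_vec(), // no nulls: the values are the mask
    }
}
```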
```rust
/// Returns a function used to filter arbitrary arrays.
/// This is faster (2x for primitive types) than using [filter] on multiple arrays, but slower
/// than [filter] when filtering a single array.
```
```diff
-/// Returns a function used to filter arbitrary arrays.
-/// This is faster (2x for primitive types) than using [filter] on multiple arrays, but slower
-/// than [filter] when filtering a single array.
+/// Returns a prepared function which can be applied to filter any number of arbitrary arrays.
+///
+/// You should use [filter] when filtering a single array and `build_filter` when filtering multiple arrays.
+///
+/// Creating this function requires time, but the prepared function is faster than [filter] when the
+/// same filtering must be applied to multiple arrays (e.g. a multi-column `RecordBatch`).
```
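A self-contained illustration of the prepare-once/apply-many pattern the suggested docs describe (plain slices as stand-ins; this is not the kernel's actual signature):

```rust
/// Analyzes the boolean mask once and returns a reusable filter function.
fn build_filter(mask: &[bool]) -> impl Fn(&[i32]) -> Vec<i32> {
    // the expensive part happens once: precompute the indices to keep
    let kept: Vec<usize> = mask
        .iter()
        .enumerate()
        .filter(|(_, &keep)| keep)
        .map(|(i, _)| i)
        .collect();
    // the returned closure is cheap to apply to any number of "columns"
    move |values| kept.iter().map(|&i| values[i]).collect()
}
```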
```rust
#[test]
fn test_slice_iterator_chunk_and_bits() {
    let filter_values = (0..127).map(|i| i % 62 != 0).collect::<Vec<bool>>();
```
Given the algorithm's use of 64-bit chunks, I recommend testing an array whose length is not a multiple of 64, ideally something of length 192 + 17, or something that would also test the transition `State::Chunks` --> `State::Bits` --> `State::Chunks`. I may have missed this in reviewing the tests.
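A sketch of the suggested test shape (using a stand-in naive filter rather than the kernel itself, just to show the mask construction): a length of 192 + 17 covers three full 64-bit chunks plus a 17-bit remainder, and the `i % 62` pattern breaks runs mid-chunk.

```rust
#[test]
fn filter_full_chunks_then_remainder_bits() {
    let len = 192 + 17; // 3 full u64 chunks + 17 trailing bits
    let mask: Vec<bool> = (0..len).map(|i| i % 62 != 0).collect();
    let values: Vec<u64> = (0..len as u64).collect();

    // reference result: element-by-element filtering
    let expected: Vec<u64> = values
        .iter()
        .zip(mask.iter())
        .filter(|(_, &m)| m)
        .map(|(&v, _)| v)
        .collect();

    // the kernel under test would go here; asserting against `expected`
    // exercises the Chunks -> Bits transition at the end of the array.
    assert_eq!(expected.len(), mask.iter().filter(|&&m| m).count());
}
```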
Indeed, great work @jorgecarleitao and impressive performance improvements. @alamb I did review the changes a couple of days ago and I think it all looks good overall; once the comments have been addressed I would be happy to have this merged and see arrow and datafusion become even faster.
I don't have much more to add to @alamb's review
Thank you all for your comments and suggestions. Really helpful. I addressed them all, except the one discussed above. This was also rebased against latest master.
ARROW-10540: [Rust] Extended filter kernel to all types and improved performance

This PR improves the filter kernel:

* made the filter benchmarks more realistic
* performance improved by 1.2-4x for all multi-filter operations
* performance decreased by 30% for a single-filter operation with 50% taken (and 2x faster for dense)
* filter now supports all types supported by `MutableArrayData` (in particular nested lists, `struct`, etc.)
* removed 400 LOC

There are two novel ideas here:

1. it minimizes the number of memcopies when building the filtered array, both for single filter and multi-filter operations.
2. for single filter operations, it leverages an iterator to create the new array on the fly. For multi filter operations, it persists the iterator's result in a vector and iterates over it per array.

This PR also improves the performance of `MutableArrayData` by avoiding some bound checks via `unsafe` (properly documented).

Summary of the benchmarks:

| benchmark | variation (%) |
|-----------|---------------|
| filter u8 | 29.5 |
| filter u8 low selectivity | 7.3 |
| filter context u8 w NULLs | -17.5 |
| filter context u8 w NULLs high selectivity | -21.9 |
| filter context f32 high selectivity | -22.0 |
| filter context f32 | -26.8 |
| filter context string high selectivity | -27.5 |
| filter context string | -31.4 |
| filter context u8 | -40.3 |
| filter u8 high selectivity | -47.3 |
| filter context string low selectivity | -55.3 |
| filter context u8 w NULLs low selectivity | -57.7 |
| filter context f32 low selectivity | -64.8 |
| filter context u8 low selectivity | -66.0 |
| filter context u8 high selectivity | -77.2 |

Code used to benchmark:

```bash
git checkout 54da437
cargo bench --bench filter_kernels
git checkout mutable_filter2
cargo bench --bench filter_kernels
```

Benchmark result:

```
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 01s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/filter_kernels-5208f9a404de52c9
Gnuplot not found, using plotters backend

filter u8               time:   [512.54 us 513.43 us 514.37 us]
                        change: [+29.070% +29.548% +30.003%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

filter u8 high selectivity
                        time:   [11.494 us 11.513 us 11.532 us]
                        change: [-47.846% -47.337% -46.755%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

filter u8 low selectivity
                        time:   [7.0342 us 7.0520 us 7.0693 us]
                        change: [+6.5543% +7.3409% +8.1080%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

filter context u8       time:   [233.81 us 234.31 us 234.93 us]
                        change: [-40.715% -40.329% -39.886%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

filter context u8 high selectivity
                        time:   [4.5943 us 4.6100 us 4.6276 us]
                        change: [-77.449% -77.231% -77.022%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  10 (10.00%) high mild
  8 (8.00%) high severe

filter context u8 low selectivity
                        time:   [1.7582 us 1.7664 us 1.7742 us]
                        change: [-66.250% -65.989% -65.669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

filter context u8 w NULLs
                        time:   [476.99 us 477.71 us 478.44 us]
                        change: [-17.852% -17.457% -17.000%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

filter context u8 w NULLs high selectivity
                        time:   [296.46 us 297.03 us 297.67 us]
                        change: [-22.297% -21.871% -21.393%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

filter context u8 w NULLs low selectivity
                        time:   [2.5988 us 2.6124 us 2.6268 us]
                        change: [-58.065% -57.668% -57.237%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

filter context f32      time:   [470.69 us 472.39 us 474.73 us]
                        change: [-29.574% -26.769% -24.242%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) high mild
  5 (5.00%) high severe

filter context f32 high selectivity
                        time:   [307.16 us 307.58 us 308.03 us]
                        change: [-22.472% -22.039% -21.532%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

filter context f32 low selectivity
                        time:   [2.4266 us 2.4323 us 2.4384 us]
                        change: [-65.024% -64.764% -64.517%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

filter context string   time:   [645.82 us 647.32 us 649.04 us]
                        change: [-31.810% -31.427% -31.046%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

Benchmarking filter context string high selectivity: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
filter context string high selectivity
                        time:   [999.11 us 1.0008 ms 1.0027 ms]
                        change: [-28.133% -27.524% -26.930%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

filter context string low selectivity
                        time:   [3.6441 us 3.6623 us 3.6799 us]
                        change: [-55.650% -55.329% -55.013%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) low mild
  2 (2.00%) high severe
```

Closes apache#8960 from jorgecarleitao/mutable_filter2

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>