ARROW-10540: [Rust] Extended filter kernel to all types and improved performance #8960
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #8960      +/-   ##
==========================================
+ Coverage   83.26%   83.61%   +0.34%
==========================================
  Files         195      196       +1
  Lines       48066    47875     -191
==========================================
+ Hits        40024    40032       +8
+ Misses       8042     7843     -199
```

Continue to review the full report at Codecov.
@jorgecarleitao these are some great performance improvements when multiple arrays are filtered - this should give great performance when filtering a record batch containing many columns. I imagine this is explained by doing more work in advance, when building the filter, and less work when applying the filter to each array (compared to the previous implementation with the filter context). The performance degradation in the single-array `filter u8` benchmark is interesting, though.

Also, I would expect the benchmarks with highly selective filters (mostly 0s in the filter array) to be faster (as there is more skipping and less copying) than the low-selectivity benchmarks (mostly 1s in the filter array, hence more copying and less skipping), but this relationship appears to be reversed in the results above.

I also wonder how repeatable the benchmarks are now that they use randomly generated arrays. What are your observations; are the benchmark results fairly stable across multiple runs?

I also like how the filter kernel is now implemented on top of `MutableArrayData`.
```rust
        self.on_region = false;
        return Some(result);
    }
} else if mask == 18446744073709551615u64 {
```
I get that it might be more performant (although I suspect the compiler optimizes this) to not calculate `!0u64` inside the loop, but wouldn't this be more readable and obvious if `!0u64` were put in a suitably named constant instead of `18446744073709551615u64`?
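A minimal sketch of the suggested change (the constant name is hypothetical, not from the PR):

```rust
/// All 64 bits set: a chunk of the filter equal to this mask is fully
/// selected and can be copied as one contiguous slice.
const FULL_CHUNK: u64 = !0u64; // == 18446744073709551615

// the comparison in the loop would then read:
// } else if mask == FULL_CHUNK {
```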
Thanks for the feedback. All great points.
I am sorry, I was not clear in the PR description:
The single filter in this PR uses a single pass via an iterator, while master built the filter context up front. This PR's implementation does perform more work per slot, to minimize the number of copies. I.e. while the total number of copied bytes is the same in both implementations, this implementation is guaranteed to call memcopy the minimum necessary number of times, by "packing" these calls together into a single call when the regions are contiguous in memory.

This implementation is thus optimized for filters that take contiguous regions, which tends to happen when there are a lot of 1s, or when the data is distributed in the array in that way. AFAIK, in real data, this happens more often than by chance, so our benchmarks are being conservative here.

This behavior (of minimizing the number of calls) is crucial for non-primitive types because it minimizes the number of relocations. The prime example here is variable-sized types such as strings, whose values must be copied and re-offset when building the new array. By grouping these "extend" calls together, we reduce the number of relocations when building the new array. This is not so relevant for primitive types because we know the buffer size from the number of 1s and the data type.

The implementation is just performing this computation on the fly (via an Iterator), so that, in single filter ops, it happens during the build of the array (and is cached for multi-filter ops).
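A minimal sketch of the packing idea described above (not the PR's actual iterator): consecutive 1s in the filter are coalesced into `(start, end)` runs, so each run becomes a single `extend`/memcopy instead of one copy per slot.

```rust
/// Collects the contiguous runs of `true` values in `filter` as
/// half-open `(start, end)` index ranges.
fn contiguous_runs(filter: &[bool]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &keep) in filter.iter().enumerate() {
        match (keep, start) {
            (true, None) => start = Some(i), // a run begins
            (false, Some(s)) => {
                runs.push((s, i)); // a run ends
                start = None;
            }
            _ => {} // run continues, or we are still skipping
        }
    }
    if let Some(s) = start {
        runs.push((s, filter.len())); // trailing run
    }
    runs
}

// e.g. [true, true, false, true] yields [(0, 2), (3, 4)]:
// two memcopies instead of three.
```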
This PR is calling "highly selective" those filters that have mostly 1s, i.e. filters that keep most of the values - the reverse of the convention you describe - which would explain the reversed relationship in the results above.
Good point. The randomness is to not make assumptions about the distribution of the data (e.g. a periodic pattern that could favor one implementation over another).
I looked a bit at this PR; it looks good. I didn't find anything weird (but I also don't know the details of this part of the code).
I tried to explain it in the comment above (there was a race condition here). Does it make sense?
Makes sense @jorgecarleitao, thanks!
@jorgecarleitao thanks for the detailed explanation - it's great to see you have thought about optimizing the filtering of both single and multiple columns as much as possible. Regarding the meaning of high vs low selectivity of a filter, I agree it can be confusing: a highly / very selective filter is one which discards most of the data. It's not easy to find a good explanation from a credible source; here is one, from https://www.red-gate.com/simple-talk/sql/performance/introduction-to-sql-server-filtered-indexes/. You might be right though - it might be better to come up with more intuitive names for those benchmarks.
Did a small test of this branch against master on DataFusion to see the impact (merged master into this branch). From profiling I know query 1 spends quite some time in filtering.

Master:

This PR:

Looks like a pretty decent speedup @jorgecarleitao
Given the reported performance improvements, I think this PR sounds good from a feature perspective.

I spent quite a while going through the code, and while I am not an expert in all these areas, it seems pretty good to me. A few more tests filtering larger sizes might be in order, but all in all really nice work @jorgecarleitao

I also ran the tests under valgrind to try and double-validate the use of `unsafe`, and it did not report any errors.

So all in all, I think this is pretty much good to go. 👍 I think it would be good to have at least one more person carefully review it (if @yordan-pavlov already did a thorough review, that is good for me; it just wasn't 100% clear to me if he had done so).
```diff
@@ -46,16 +46,21 @@ struct _MutableArrayData<'a> {
     pub len: usize,
     pub null_buffer: MutableBuffer,

-    pub buffers: Vec<MutableBuffer>,
+    pub buffer1: MutableBuffer,
```
I recommend comments here explaining the use of `buffer1` and `buffer2` for future readers who may not have the context of this PR.
```rust
// Soundness
// * offset buffer is always extended in slices of T and aligned accordingly.
// * Buffer[0] is initialized with one element, 0, and thus `mutable_offsets.len() - 1` is always valid.
let offsets = offset_buffer.data().align_to::<T>().1;
```
I wonder if this would be a good place to use a `debug_assert`. Something like the following to ensure the data was actually aligned as intended:
```diff
-let offsets = offset_buffer.data().align_to::<T>().1;
+let (prefix, offsets, suffix) = offset_buffer.data().align_to::<T>();
+debug_assert!(prefix.len() == 0 && suffix.len() == 0);
```
Or maybe something more direct:

```rust
debug_assert!(*offsets.get_unchecked(offsets.len() - 1) == mutable_offsets[mutable_offsets.len() - 1]);
```
We use the first option (`prefix`, `offsets`, `suffix`) in the parquet crate. I'd support either option that you're suggesting.
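A self-contained sketch of what the combined check could look like (the function name and shape are hypothetical; `align_to` is the real `std` slice method, and it is `unsafe` because it reinterprets bytes):

```rust
/// Reinterprets a byte slice as a slice of `T`, asserting in debug builds
/// that no unaligned prefix/suffix bytes were left over.
///
/// Safety: the caller must guarantee the bytes were written as `T` values.
unsafe fn offsets_from_bytes<T>(bytes: &[u8]) -> &[T] {
    let (prefix, offsets, suffix) = bytes.align_to::<T>();
    debug_assert!(prefix.is_empty() && suffix.is_empty());
    offsets
}
```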
```diff
@@ -298,79 +290,137 @@ impl<'a> MutableArrayData<'a> {
         use_nulls = true;
     };

-    let buffers = match &data_type {
+    let empty_buffer = MutableBuffer::new(0);
+    let buffers: [MutableBuffer; 2] = match &data_type {
```
Very minor suggestion:
```diff
-let buffers: [MutableBuffer; 2] = match &data_type {
+let (buffer1, buffer2) = match &data_type {
```
And you can remove the destructuring below.
```rust
let values_buffer = &mut mutable.buffer2;

// this is safe due to how offset is built. See details on `get_last_offset`
let last_offset = unsafe { get_last_offset(offset_buffer) };
```
I find the use of `unsafe` in order to call `get_last_offset` on `offset_buffer` somewhat confusing. I suspect you are trying to follow the unsafe guidelines and ensure it is clear where unsafe is being used. However, in this case the only thing the caller can do is trust that the `MutableBuffer` it was passed was created correctly. Forcing callers to write `unsafe` in order to call `get_last_offset`, even though they can do nothing to ensure/validate whether things are safe, seems unnecessarily confusing to me.

I would personally suggest making `get_last_offset` an associated function, such as `MutableBuffer::get_last_offset`, and then changing calls such as this one to:

```rust
let last_offset = offset_buffer.get_last_offset();
```
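A self-contained sketch of the suggested associated function; `MutableBuffer` here is a minimal stand-in, not the arrow-rs type, and the invariant documented in the comment is the one described in this thread:

```rust
use std::convert::TryInto;

/// Minimal stand-in for the real buffer type.
struct MutableBuffer {
    data: Vec<u8>,
}

impl MutableBuffer {
    /// Returns the last `i32` offset written to this buffer.
    ///
    /// Relies on the construction invariant discussed above: the buffer is
    /// initialized with a single `0i32` and only ever extended with whole
    /// `i32` values, so it is non-empty and its length is a multiple of 4.
    fn get_last_offset(&self) -> i32 {
        let n = self.data.len();
        debug_assert!(n >= 4 && n % 4 == 0);
        i32::from_ne_bytes(self.data[n - 4..].try_into().unwrap())
    }
}
```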
I get your point. `get_last_offset` must only be used on offset buffers, i.e. buffers whose bytes represent an `i32` or `i64` and were specifically built from those types. Using it on other buffers is undefined behavior (even with the safeguards of using `align_to`). We can remove the `unsafe` mark from `get_last_offset`, though.

My proposal is that we refactor the `src/transform` code so that it has a struct specific to each array type (that implements some trait for `dyn` support). This would allow `last_offset` to be stored in the array-specific struct, thereby avoiding this problem altogether (of having to read bytes written to the `MutableBuffer`).
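A rough sketch of what that proposal could look like (all names are hypothetical, not the eventual implementation): per-type structs behind a trait object, with `last_offset` kept as ordinary struct state instead of being re-read from buffer bytes.

```rust
/// Hypothetical per-type extend state, usable as `Box<dyn ExtendState>`.
trait ExtendState {
    /// Extends the output with `len` slots starting at `start`.
    fn extend(&mut self, source_offsets: &[i32], start: usize, len: usize);
}

struct VariableSized {
    offsets: Vec<i32>,
    last_offset: i32, // cached here, so no bytes are ever read back
}

impl ExtendState for VariableSized {
    fn extend(&mut self, source_offsets: &[i32], start: usize, len: usize) {
        // `len` slots span `len + 1` offsets; shift each source delta onto
        // our current end rather than copying the source offsets verbatim.
        for w in source_offsets[start..=start + len].windows(2) {
            self.last_offset += w[1] - w[0];
            self.offsets.push(self.last_offset);
        }
    }
}
```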
Sounds like a good plan to me
```rust
    FilterContext::new(filter_array)?.filter_primitive_array(data_array)
}

/// Filters an [Array], returning elements matching the filter (i.e. where the values are true).
/// WARNING: the nulls of `filter` are ignored and the value on its slot is considered.
```
I think this WARNING should also be included in the doc comments of `build_filter`.
Or instead of ignoring the nulls, we could perform a `values_buffer & null_buffer` if there are nulls. Given that the input `filter` will almost always be the result of some computation, I'd prefer that we incur the slight cost of the `AND` operation, so that we treat null slots as `false`.
This "WARNING" was more like a way of describing a (unknown to me) feature. master already does this, but I did not have the time to go about checking what other implementations do (whether the result is null or something else).
Note that this hits datafusion already: there is at least one test there where we ignore a null value from a filter (because the predicate is built using arrays with nulls).
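A minimal sketch of the semantics being proposed, using plain `Vec<bool>` stand-ins for the packed bitmaps arrow actually uses: the effective mask is `values AND validity`, so a null slot never selects a row.

```rust
/// Combines a filter's values with its validity (null) mask; a null slot
/// (validity = false) is treated as `false`, i.e. "do not select".
fn effective_mask(values: &[bool], validity: Option<&[bool]>) -> Vec<bool> {
    match validity {
        Some(valid) => values
            .iter()
            .zip(valid.iter())
            .map(|(&v, &ok)| v && ok) // value AND not-null
            .collect(),
        None => values.to_vec(), // no nulls: the values are the mask
    }
}
```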
```rust
/// Returns a function used to filter arbitrary arrays.
/// This is faster (2x for primitive types) than using [filter] on multiple arrays, but slower
/// than [filter] when filtering a single array.
```
```diff
-/// Returns a function used to filter arbitrary arrays.
-/// This is faster (2x for primitive types) than using [filter] on multiple arrays, but slower
-/// than [filter] when filtering a single array.
+/// Returns a prepared function which can be applied to filter any number of arbitrary arrays.
+///
+/// You should use [filter] when filtering a single array and `build_filter` when filtering multiple arrays.
+///
+/// Creating this function requires time, but the prepared function is faster than [filter] when the
+/// same filtering must be applied to multiple arrays (e.g. a multi-column `RecordBatch`).
```
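A self-contained illustration of the prepare-once/apply-many pattern the suggested docs describe (plain slices as stand-ins; this is not the kernel's actual signature):

```rust
/// Analyzes the boolean mask once and returns a reusable filter function.
fn build_filter(mask: &[bool]) -> impl Fn(&[i32]) -> Vec<i32> {
    // the expensive part happens once: precompute the indices to keep
    let kept: Vec<usize> = mask
        .iter()
        .enumerate()
        .filter(|(_, &keep)| keep)
        .map(|(i, _)| i)
        .collect();
    // the returned closure is cheap to apply to any number of "columns"
    move |values| kept.iter().map(|&i| values[i]).collect()
}
```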
```rust
#[test]
fn test_slice_iterator_chunk_and_bits() {
    let filter_values = (0..127).map(|i| i % 62 != 0).collect::<Vec<bool>>();
```
Given the algorithm's use of 64-bit chunks, I recommend testing an array whose length is not a multiple of 64, ideally something of length 192 + 17, or something that would also test the transition `State::Chunks` --> `State::Bits` --> `State::Chunks`. I may have missed this in reviewing the tests.
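A sketch of the suggested test shape (using a stand-in naive filter rather than the kernel itself, just to show the mask construction): a length of 192 + 17 covers three full 64-bit chunks plus a 17-bit remainder, and the `i % 62` pattern breaks runs mid-chunk.

```rust
#[test]
fn filter_full_chunks_then_remainder_bits() {
    let len = 192 + 17; // 3 full u64 chunks + 17 trailing bits
    let mask: Vec<bool> = (0..len).map(|i| i % 62 != 0).collect();
    let values: Vec<u64> = (0..len as u64).collect();

    // reference result: element-by-element filtering
    let expected: Vec<u64> = values
        .iter()
        .zip(mask.iter())
        .filter(|(_, &m)| m)
        .map(|(&v, _)| v)
        .collect();

    // the kernel under test would go here; asserting against `expected`
    // exercises the Chunks -> Bits transition at the end of the array.
    assert_eq!(expected.len(), mask.iter().filter(|&&m| m).count());
}
```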
Indeed, great work @jorgecarleitao and impressive performance improvements. @alamb I did review the changes a couple of days ago and I think it all looks good overall; once the comments have been addressed I would be happy to have this merged and see arrow and datafusion become even faster.
I don't have much more to add to @alamb's review
Thank you all for your comments and suggestions. Really helpful. I addressed them all, except the one discussed above. This was also rebased against latest master.
ARROW-10540: [Rust] Extended filter kernel to all types and improved performance

This PR improves the filter kernel:

* made the filter benchmarks more realistic
* performance improved by 1.2-4x for all multi-filter operations
* performance decreased by 30% for a single-filter operation with 50% taken (and 2x faster for dense)
* filter now supports all types supported by `MutableArrayData` (in particular nested lists, `struct`, etc.)
* removed 400 LOC

There are two novel ideas here:

1. it minimizes the number of memcopies when building the filtered array, both for single filter and multi-filter operations.
2. for single filter operations, it leverages an iterator to create the new array on the fly. For multi filter operations, it persists the iterator's result in a vector and iterates over it per array.

This PR also improves the performance of `MutableArrayData` by avoiding some bound checks via `unsafe` (properly documented).

Summary of the benchmarks:

| benchmark | variation (%) |
|-----------|---------------|
| filter u8 | 29.5 |
| filter u8 low selectivity | 7.3 |
| filter context u8 w NULLs | -17.5 |
| filter context u8 w NULLs high selectivity | -21.9 |
| filter context f32 high selectivity | -22.0 |
| filter context f32 | -26.8 |
| filter context string high selectivity | -27.5 |
| filter context string | -31.4 |
| filter context u8 | -40.3 |
| filter u8 high selectivity | -47.3 |
| filter context string low selectivity | -55.3 |
| filter context u8 w NULLs low selectivity | -57.7 |
| filter context f32 low selectivity | -64.8 |
| filter context u8 low selectivity | -66.0 |
| filter context u8 high selectivity | -77.2 |

Code used to benchmark:

```bash
git checkout 54da437
cargo bench --bench filter_kernels
git checkout mutable_filter2
cargo bench --bench filter_kernels
```

Benchmark result:

```
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 01s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/filter_kernels-5208f9a404de52c9
Gnuplot not found, using plotters backend

filter u8               time:   [512.54 us 513.43 us 514.37 us]
                        change: [+29.070% +29.548% +30.003%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

filter u8 high selectivity
                        time:   [11.494 us 11.513 us 11.532 us]
                        change: [-47.846% -47.337% -46.755%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

filter u8 low selectivity
                        time:   [7.0342 us 7.0520 us 7.0693 us]
                        change: [+6.5543% +7.3409% +8.1080%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

filter context u8       time:   [233.81 us 234.31 us 234.93 us]
                        change: [-40.715% -40.329% -39.886%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

filter context u8 high selectivity
                        time:   [4.5943 us 4.6100 us 4.6276 us]
                        change: [-77.449% -77.231% -77.022%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  10 (10.00%) high mild
  8 (8.00%) high severe

filter context u8 low selectivity
                        time:   [1.7582 us 1.7664 us 1.7742 us]
                        change: [-66.250% -65.989% -65.669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

filter context u8 w NULLs
                        time:   [476.99 us 477.71 us 478.44 us]
                        change: [-17.852% -17.457% -17.000%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

filter context u8 w NULLs high selectivity
                        time:   [296.46 us 297.03 us 297.67 us]
                        change: [-22.297% -21.871% -21.393%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

filter context u8 w NULLs low selectivity
                        time:   [2.5988 us 2.6124 us 2.6268 us]
                        change: [-58.065% -57.668% -57.237%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

filter context f32      time:   [470.69 us 472.39 us 474.73 us]
                        change: [-29.574% -26.769% -24.242%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) high mild
  5 (5.00%) high severe

filter context f32 high selectivity
                        time:   [307.16 us 307.58 us 308.03 us]
                        change: [-22.472% -22.039% -21.532%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

filter context f32 low selectivity
                        time:   [2.4266 us 2.4323 us 2.4384 us]
                        change: [-65.024% -64.764% -64.517%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

filter context string   time:   [645.82 us 647.32 us 649.04 us]
                        change: [-31.810% -31.427% -31.046%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

Benchmarking filter context string high selectivity: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
filter context string high selectivity
                        time:   [999.11 us 1.0008 ms 1.0027 ms]
                        change: [-28.133% -27.524% -26.930%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

filter context string low selectivity
                        time:   [3.6441 us 3.6623 us 3.6799 us]
                        change: [-55.650% -55.329% -55.013%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) low mild
  2 (2.00%) high severe
```

Closes apache#8960 from jorgecarleitao/mutable_filter2

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>