-
Notifications
You must be signed in to change notification settings - Fork 803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add specialized filter kernels in compute
module (up to 10x faster)
#1248
Conversation
arrow/src/compute/kernels/filter.rs
Outdated
|
||
let mut buffer = MutableBuffer::with_capacity(filter_count * T::get_byte_width()); | ||
|
||
let selectivity_frac = filter_count as f64 / filter.len() as f64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Credit to @sunchao for the inspiration behind this hybrid approach - #1191 (comment)
let mut mutable = | ||
MutableArrayData::new(vec![array.data_ref()], false, filter_count); | ||
// actually filter | ||
_ => match values.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block is copied straight from the take kernel - there is probably a way to avoid this duplication with some more layers of macros
ab231cf
to
8ed556e
Compare
Codecov Report
@@ Coverage Diff @@
## master #1248 +/- ##
==========================================
- Coverage 82.99% 82.99% -0.01%
==========================================
Files 180 180
Lines 52288 52587 +299
==========================================
+ Hits 43396 43644 +248
- Misses 8892 8943 +51
Continue to review full report at Codecov.
|
This is really cool and promising!. In my experience the filter kernels can be quite expensive in benchmarks. Great stuff 😃 |
I found some time this afternoon, so bashed out porting the filter context abstraction (caching the selection vector) and fixing up the null buffer construction. Here's where we stand now, across the board about 10x performance uplift other than for high selectivity with nulls where the bottleneck on copying ranges of null bitmasks is unchanged
What is interesting, at least to me, is the performance tax imposed by the packed bitmask representation of BooleanArray - even with non-trivial bit-twiddling shenanigans it still appears to be more performant to hydrate the filter to an array of indices or slices when filtering multiple arrays.
Perhaps the compiler just struggles to auto-vectorise bitmask loops or something, not sure 😅 |
arrow/src/compute/kernels/filter.rs
Outdated
} | ||
|
||
impl FilterBuilder { | ||
/// Create a new [`FilterBuilder`] that can be used construct [`FilterPredicate`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this probably should read: ... that can be used TO construct A FilterPredicate
this looks very promising @tustvold - excellent performance improvement; when I made the filter kernel up to 550 times faster in 457ea62 I was wondering if this was the limit of performance or it could be made even faster, and I am very happy to see that you have been able to reach a new level of performance for the filter kernel |
Damn, that's some improvement 😆 |
To get a good feeling for how close we are to the theoretical peak performance it could be useful to add throughput numbers to the criterion benchmark. When criterion know the number of bytes that are processed (size of the primitive array + size of filter bitmap), it can report the results in Gb/s, which can be compared to the max single-threaded memory bandwidth of the machine, which is usually between 10-15Gb/s. When the arrays fit into L1 cache it can be even higher. The filter benchmarks are also a bit focused on bytes, I don't know how common those are in DataFusion, but I would think that 32 or 64 bit numbers are much more common and should be the focus of benchmarking. |
On recent intel machines there would be a much faster way using the pext instruction. That basically implements a filter for 64 bits at a time. |
Yeah, the f32 benchmarks are encouraging that the performance uplift isn't just a quirk of the fact the benchmarks currently use u8 for integers.
Yeah, I do not doubt some artisanal use of intrinsics could likely eliminate much of the cost of bitmasks, perhaps even surpassing indices. I had I guess hoped that it would be possible to get further before resorting to the arcane 😆 |
437e120
to
c5912ef
Compare
So I've added string and dictionary support. Strings only see a ~2x performance bump, this can almost certainly be pushed further, but I figured it was good enough and I can come back to it in subsequent PRs. Dictionaries see the full 9-10x bump that the integer types see which is 👌 I'm marking this ready for review because I think it represents a non-trivial improvement that would be super-awesome to make it into the arrow 9 release. There is definitely further refinement that could take place, and more perf that could be squeezed out, but I'd prefer to do that in subsequent PRs unless people object. For reference here are the latest numbers, down a little bit since #1228 was merged, but still pretty exciting imo
|
|
||
/// A builder to construct [`FilterPredicate`] | ||
#[derive(Debug)] | ||
pub struct FilterBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea behind this is that we may wish to give the user more control over how the predicate is executed, by creating a builder we have an obvious extension point for doing this
/// used to build a new [`GenericStringArray`] by copying values from the source | ||
/// | ||
/// TODO(raphael): Could this be used for the take kernel as well? | ||
struct FilterString<'a, OffsetSize> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In subsequent PRs I intend to experiment with applying a similar construction to primitive arrays and null bitmasks, e.g. FilterPrimitive
and FilterNull
as I think it might compose nicely and potentially allow for code sharing with the take kernels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very impressive work @tustvold 👏
I went through this code pretty carefully. It makes sense to me, though I had some comments / requests for clarification
Before merging this I would really like another pair of eyes on this, perhaps @jhorstmann @yordan-pavlov , @ritchie46 or @nevi-me would be willing to review 🙏
arrow/src/compute/kernels/filter.rs
Outdated
@@ -119,17 +147,83 @@ impl<'a> Iterator for SlicesIterator<'a> { | |||
} | |||
} | |||
|
|||
/// An iterator of `usize` whose index in [`BooleanArray`] is true | |||
/// | |||
/// This provides the best performance on all but the most selective predicates, where the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// This provides the best performance on all but the most selective predicates, where the | |
/// This provides the best performance on all but the least selective predicates (which keep most / all rows), where the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "least selective" means "high selectivity", confusingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it also worth mentioning that it requires the filter to be entirely non-null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULLs within the filter itself are handled by prep_null_mask_filter
- i.e. they are converted to false prior to any of this code seeing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my understanding of the meaning of "highly selective filter" is that it filters out most of the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, highly selective means low selectivity. Selectivity is a measure of how many rows it selects, selective-ness is a measure of how few rows it selects. Or something like that... This terminology is still new to me 😄
/// `filter` implementation for string arrays | ||
/// | ||
/// Note: NULLs with a non-zero slot length in `array` will have the corresponding | ||
/// data copied across. This allows handling the null mask separately from the data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this maybe was handled by applying prep_null_mask_filter
to the filter first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prep_null_mask_filter
handles nulls in the filter, not nulls in the array being filtered.
@@ -696,7 +1273,8 @@ mod tests { | |||
.flat_map(|(idx, v)| v.then(|| idx)) | |||
.collect(); | |||
|
|||
assert_eq!(bits, expected_bits); | |||
assert_eq!(slice_bits, expected_bits); | |||
assert_eq!(index_bits, expected_bits); | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you feel good about the level of testing in this module (specifically with filters of selectivity 0.8 and higher / lower)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eeh... I'm fairly confident, but more tests couldn't hurt. I'll see what I can come up with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a fuzz test in d587a46
Thank you all for taking the time to review this, I think I've incorporated all feedback, but let me know if I've missed something 😅 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @tustvold, looks great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Very impressive improvements, can't wait to see this landed in DataFusion.
FYI @jorgecarleitao might be good to compare performance to arrow2 after those changes? |
arrow/src/compute/kernels/filter.rs
Outdated
|
||
/// Extends the in-progress array by the ranges in the provided iterator | ||
fn extend_slices(&mut self, iter: impl Iterator<Item = (usize, usize)>) { | ||
for slice in iter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for slice in iter { | |
for (start, end) in iter { |
Small style suggestion
Last time I benched arrow2 was 5x faster than arrow (results here), so this most likely beats it :) I haven't had the time to go through this in detail, but the approach is very well though. Kudos to @tustvold ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I finished going through this in detail. Thanks a lot @tustvold , quite powerful ideas here.
The IterationStrategy
design of switching between slices and indexes in particular is quite strong, imo.
I do wonder how the performance diff is so dramatic for primitives. My initial hypothesis is that the perf is primary due to the selectivity split between index vs slice iteration, since I can't see any major differences in the slicing code.
💯
EDIT: Ahhh, I now understand, it is the specialization of the mutableArrayData that offers the performance. Makes total sense.
// now we only have a boolean mask to deal with | ||
let predicate = prep_null_mask_filter(predicate); | ||
return filter(array, &predicate); | ||
pub fn filter(values: &dyn Array, predicate: &BooleanArray) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite related to this PR, but I'm wondering that in the case there are multiple predicates on the same input array, maybe it's beneficial to only materialize the final output array after all the predicates have been evaluated. It seems we don't have such API yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you had multiple predicates, I presume you'd evaluate them separately and then AND
the filters together to get the final filter to "materialize". I'm not sure if this is what you had in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's what I have in mind. I'm thinking that for a chain of ANDs maybe we don't need to evaluate them separately but rather evaluate on the result of the previous one, and skip materialization only until the last predicate (or short-circuit if all values are filtered out before that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely something to measure. I think it would be hard to beat a strategy that materializes very selective predicates and otherwise just evaluates predicates against already "eliminated" rows. A masked comparison kernel, I'd expect to have non-trivial branching overheads, but I honestly don't know 😄 I guess it also depends on how expensive the predicate is to evaluate, a complex cast might tip the scales... 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know either. For the chained ANDs case, maybe in future it is worth to evaluate the approach based on selection vector and only materialize the output array at the end.
_ => prep_null_mask_filter(filter), | ||
}; | ||
|
||
let count = filter_count(&filter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the special case where count
is equal to the length of original array (i.e., all values are selected), is there a faster path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout - will add 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 2aa46b0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the same for 0
(no selected values)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually being handled at a higher level in filter_array, I've cleaned this up 7b4a2f1
Great work @tustvold ! really looking forward to this improvement! |
Yeah, setting |
arrow/src/compute/kernels/filter.rs
Outdated
@@ -119,17 +155,83 @@ impl<'a> Iterator for SlicesIterator<'a> { | |||
} | |||
} | |||
|
|||
/// An iterator of `usize` whose index in [`BooleanArray`] is true | |||
/// | |||
/// This provides the best performance on all but the least selective predicates (which keep most |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm if I read it wrongly. But doesn't "the least selective predicates" means keep only a few rows? I think here it wants to say is, the highly selective predicates should favours SlicesIterator
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a filter with high selectivity is considered to be not very selective, i.e. it isn't very discerning about what it selects. However, this terminology seems to be a source of a lot of confusion, myself included, so I'm just going to reword it 😆
arrow/src/compute/kernels/filter.rs
Outdated
/// slots of a [BooleanArray] are true. Each interval corresponds to a contiguous region of memory | ||
/// to be "taken" from an array to be filtered. | ||
/// | ||
/// This is only performant for the least selective filters that copy across long contiguous runs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, I think this should be highly selective filters?
let iter = SlicesIterator::new(&predicate.filter); | ||
iter.for_each(|(start, end)| mutable.extend(0, start, end)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, why using SlicesIterator
covers all other cases, e.g., Indices
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In short because MutableArrayData
doesn't support other iteration strategies and I'm not sure that calling it with "ranges" of length 1 will outperform at least attempting to call it with larger ranges.
My 2 cents is that the eventual goal should be to remove this fallback, as MutableArrayData
is a poor fit for filtering where the ranges are typically not substantial enough to amortize its per-range overheads effectively
A great improvement! Looking forward to have this enhancement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me; I think it is ready to go
I'll plan to merge to master tomorrow unless there are any more comments
Thank you to everyone who contributed ideas and discussions
for i in 0..100 { | ||
let filter_percent = match i { | ||
0..=4 => 1., | ||
5..=10 => 0., |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL match
works on ranges!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you to everyone who contributed reviews and comments to this PR and thank you again to @tustvold for sticking with this! |
compute
module (up to 10x faster)
Which issue does this PR close?
Closes #1288
Rationale for this change
Make filter kernels faster. This relates to the observation underlying #1225 and #1229, that the filter kernels are surprisingly slow for what they are doing. In many cases the filter kernels on master are slower than converting the filter to a list of indices and using
take
...Currently this PR is ~10x faster than master, and I suspect this could be pushed further.
What changes are included in this PR?
This builds on #1228 and adds specialized filter implementations for primitive array types. I suspect specialized implementations for byte array types, and dictionaries would likely yield similarly significant benefits and I may include them in this PR or a follow up.
Aside from the performance benefits, having specialized impls, much like we do for the
take
kernels will also allow us to take advantage of SIMD intrinsics either through manual implementation, or just helping the compiler's auto-vectorization.Are there any user-facing changes?
No