
Add specialized filter kernels in compute module (up to 10x faster) #1248

Merged: @alamb merged 21 commits into apache:master on Feb 9, 2022

Conversation

@tustvold (Contributor) commented Jan 30, 2022

Which issue does this PR close?

Closes #1288

Rationale for this change

Make filter kernels faster. This relates to the observation underlying #1225 and #1229, that the filter kernels are surprisingly slow for what they are doing. In many cases the filter kernels on master are slower than converting the filter to a list of indices and using take...
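
As an illustration of that observation, the two formulations below are semantically equivalent; this is a hedged sketch against the public arrow API (the array contents and the comparison are illustrative, not the PR's benchmark code):

use arrow::array::{Array, BooleanArray, Int32Array, UInt32Array};
use arrow::compute::{filter, take};

fn main() -> arrow::error::Result<()> {
    let values = Int32Array::from(vec![1, 2, 3, 4]);
    let predicate = BooleanArray::from(vec![true, false, true, false]);

    // 1. the filter kernel
    let a = filter(&values, &predicate)?;

    // 2. "hydrating" the filter into a list of indices and using take
    let indices: UInt32Array = predicate
        .iter()
        .enumerate()
        .filter_map(|(i, keep)| (keep == Some(true)).then(|| i as u32))
        .collect();
    let b = take(&values, &indices, None)?;

    // both select rows 0 and 2
    assert_eq!(
        a.as_any().downcast_ref::<Int32Array>().unwrap().values(),
        b.as_any().downcast_ref::<Int32Array>().unwrap().values(),
    );
    Ok(())
}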

Currently this PR is ~10x faster than master, and I suspect this could be pushed further.

filter u8               time:   [47.525 us 47.538 us 47.554 us]                       
                        change: [-90.065% -90.056% -90.048%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [2.3351 us 2.3358 us 2.3365 us]
                        change: [-81.582% -81.570% -81.558%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [1.3204 us 1.3218 us 1.3233 us]
                        change: [-70.639% -70.572% -70.512%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

This builds on #1228 and adds specialized filter implementations for primitive array types. I suspect specialized implementations for byte array types and dictionaries would yield similarly significant benefits, and I may include them in this PR or a follow-up.

Aside from the performance benefits, having specialized impls, much like we do for the take kernels, will also allow us to take advantage of SIMD intrinsics, either through manual implementation or simply by helping the compiler's auto-vectorization.

Are there any user-facing changes?

No

@github-actions bot added the arrow (Changes to the arrow crate) and parquet (Changes to the parquet crate) labels on Jan 30, 2022

let mut buffer = MutableBuffer::with_capacity(filter_count * T::get_byte_width());

let selectivity_frac = filter_count as f64 / filter.len() as f64;
@tustvold (Contributor Author):

Credit to @sunchao for the inspiration behind this hybrid approach - #1191 (comment)
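
A sketch of the hybrid idea being credited here: pick an iteration strategy from the fraction of rows the filter selects. The threshold name comes from later in this PR; its value and the surrounding names are illustrative:

/// Illustrative only: choose between visiting set bits one at a time and
/// copying contiguous runs, based on the filter's selectivity.
enum Strategy {
    Indices, // few rows selected: iterate individual set bits
    Slices,  // most rows selected: copy contiguous ranges
}

const FILTER_SLICES_SELECTIVITY_THRESHOLD: f64 = 0.8; // assumed value

fn choose_strategy(filter_count: usize, filter_len: usize) -> Strategy {
    let selectivity_frac = filter_count as f64 / filter_len as f64;
    if selectivity_frac > FILTER_SLICES_SELECTIVITY_THRESHOLD {
        Strategy::Slices
    } else {
        Strategy::Indices
    }
}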

let mut mutable =
MutableArrayData::new(vec![array.data_ref()], false, filter_count);
// actually filter
_ => match values.data_type() {
@tustvold (Contributor Author):

This block is copied straight from the take kernel - there is probably a way to avoid this duplication with some more layers of macros

@codecov-commenter commented Jan 30, 2022

Codecov Report

Merging #1248 (9fc5fac) into master (1764399) will decrease coverage by 0.00%.
The diff coverage is 75.60%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1248      +/-   ##
==========================================
- Coverage   82.99%   82.99%   -0.01%     
==========================================
  Files         180      180              
  Lines       52288    52587     +299     
==========================================
+ Hits        43396    43644     +248     
- Misses       8892     8943      +51     
Impacted Files Coverage Δ
arrow/src/util/bench_util.rs 95.23% <ø> (ø)
arrow/src/compute/kernels/filter.rs 84.77% <75.60%> (-7.73%) ⬇️
arrow/src/datatypes/datatype.rs 66.38% <0.00%> (-0.43%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
arrow/src/ffi.rs 84.69% <0.00%> (ø)
arrow/src/csv/reader.rs 88.12% <0.00%> (ø)
parquet/src/data_type.rs 76.61% <0.00%> (ø)
arrow/src/array/array_list.rs 95.52% <0.00%> (ø)
arrow/src/array/array_union.rs 90.76% <0.00%> (ø)
arrow/src/array/raw_pointer.rs 100.00% <0.00%> (ø)
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1764399...9fc5fac. Read the comment docs.

@Dandandan (Contributor):

This is really cool and promising! In my experience the filter kernels can be quite expensive in benchmarks. Great stuff 😃

@tustvold (Contributor Author):

I found some time this afternoon, so I bashed out porting the filter context abstraction (caching the selection vector) and fixing up the null buffer construction.

Here's where we stand now: across the board, about a 10x performance uplift, other than for high selectivity with nulls, where the bottleneck of copying ranges of null bitmasks is unchanged.

filter u8               time:   [48.733 us 48.795 us 48.889 us]                       
                        change: [-90.138% -90.127% -90.116%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [2.4004 us 2.4048 us 2.4099 us]
                        change: [-81.016% -80.967% -80.915%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [1.4000 us 1.4015 us 1.4032 us]
                        change: [-67.378% -67.309% -67.246%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8       time:   [14.783 us 14.798 us 14.816 us]                               
                        change: [-95.157% -95.150% -95.143%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 high selectivity                                                                             
                        time:   [1.1631 us 1.1636 us 1.1642 us]
                        change: [-85.209% -85.201% -85.192%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 low selectivity                                                                            
                        time:   [150.47 ns 150.64 ns 150.83 ns]
                        change: [-84.434% -84.380% -84.295%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 w NULLs                                                                             
                        time:   [40.762 us 40.771 us 40.781 us]
                        change: [-89.275% -89.267% -89.259%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 w NULLs high selectivity                                                                             
                        time:   [7.0784 us 7.0828 us 7.0876 us]
                        change: [+2.4840% +2.5549% +2.6258%] (p = 0.00 < 0.05)
                        Performance has regressed.

filter context u8 w NULLs low selectivity                                                                            
                        time:   [267.79 ns 267.96 ns 268.12 ns]
                        change: [-72.091% -72.023% -71.946%] (p = 0.00 < 0.05)
                        Performance has improved.

filter f32              time:   [117.24 us 117.27 us 117.32 us]                       
                        change: [-79.651% -79.637% -79.623%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32      time:   [41.474 us 41.486 us 41.499 us]                                
                        change: [-88.917% -88.911% -88.905%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32 high selectivity                                                                             
                        time:   [11.289 us 11.292 us 11.294 us]
                        change: [+3.9848% +4.0461% +4.1142%] (p = 0.00 < 0.05)
                        Performance has regressed.

filter context f32 low selectivity                                                                            
                        time:   [295.75 ns 296.19 ns 296.65 ns]
                        change: [-69.321% -69.181% -69.008%] (p = 0.00 < 0.05)
                        Performance has improved.

filter single record batch                                                                            
                        time:   [69.716 us 69.749 us 69.783 us]
                        change: [-86.024% -86.009% -85.998%] (p = 0.00 < 0.05)
                        Performance has improved.

What is interesting, at least to me, is the performance tax imposed by the packed bitmask representation of BooleanArray - even with non-trivial bit-twiddling shenanigans it still appears to be more performant to hydrate the filter to an array of indices or slices when filtering multiple arrays.

filter optimize         time:   [45.865 us 45.990 us 46.096 us]                             

filter optimize high selectivity                                                                             
                        time:   [1.6900 us 1.6910 us 1.6920 us]

filter optimize low selectivity                                                                             
                        time:   [1.4087 us 1.4116 us 1.4149 us]

Perhaps the compiler just struggles to auto-vectorise bitmask loops or something, not sure 😅
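
For reference, "hydrating" the filter means materializing it into indices or slices up front. A simplified sketch of the slices form over a plain bool slice (the crate's SlicesIterator works on the packed bitmask directly):

// Collect each contiguous run of `true` slots as a half-open (start, end)
// range; each range can then be copied with a single bulk extend.
fn hydrate_to_slices(filter: &[bool]) -> Vec<(usize, usize)> {
    let mut slices = Vec::new();
    let mut run_start = None;
    for (i, &keep) in filter.iter().enumerate() {
        match (keep, run_start) {
            (true, None) => run_start = Some(i),
            (false, Some(start)) => {
                slices.push((start, i));
                run_start = None;
            }
            _ => {}
        }
    }
    if let Some(start) = run_start {
        slices.push((start, filter.len()));
    }
    slices
}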

}

impl FilterBuilder {
/// Create a new [`FilterBuilder`] that can be used construct [`FilterPredicate`]
Contributor:

this probably should read: ... that can be used TO construct A FilterPredicate

@yordan-pavlov (Contributor):

this looks very promising @tustvold - an excellent performance improvement. When I made the filter kernel up to 550 times faster in 457ea62, I wondered whether that was the limit of performance or whether it could be made even faster, and I am very happy to see that you have been able to reach a new level of performance for the filter kernel.

@tustvold (Contributor Author):

550 times faster

Damn, that's some improvement 😆

@jhorstmann (Contributor):

To get a good feeling for how close we are to the theoretical peak performance, it could be useful to add throughput numbers to the criterion benchmark. When criterion knows the number of bytes processed per iteration (size of the primitive array + size of the filter bitmap), it can report results in GB/s, which can be compared to the maximum single-threaded memory bandwidth of the machine, usually between 10 and 15 GB/s. When the arrays fit into L1 cache it can be even higher.

The filter benchmarks are also a bit focused on bytes; I don't know how common those are in DataFusion, but I would think that 32- or 64-bit numbers are much more common and should be the focus of benchmarking.
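
A sketch of what that could look like with criterion's Throughput support; it assumes arrow's test_utils feature for the bench_util helpers, and the size and densities are illustrative:

use arrow::compute::filter;
use arrow::datatypes::Float32Type;
use arrow::util::bench_util::{create_boolean_array, create_primitive_array};
use criterion::{criterion_group, criterion_main, Criterion, Throughput};

fn bench_filter_throughput(c: &mut Criterion) {
    let size = 65536;
    let values = create_primitive_array::<Float32Type>(size, 0.0);
    let predicate = create_boolean_array(size, 0.0, 0.5);

    let mut group = c.benchmark_group("filter f32");
    // bytes touched per iteration: the f32 values plus the filter bitmap
    group.throughput(Throughput::Bytes((size * 4 + size / 8) as u64));
    group.bench_function("filter", |b| b.iter(|| filter(&values, &predicate).unwrap()));
    group.finish();
}

criterion_group!(benches, bench_filter_throughput);
criterion_main!(benches);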

@jhorstmann (Contributor):

What is interesting, at least to me, is the performance tax imposed by the packed bitmask representation of BooleanArray - even with non-trivial bit-twiddling shenanigans it still appears to be more performant to hydrate the filter to an array of indices or slices when filtering multiple arrays.

On recent Intel machines there would be a much faster way using the pext instruction, which basically implements a filter for 64 bits at a time.
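
For the curious, a sketch of the idea: BMI2's _pext_u64 gathers the bits of one word selected by a mask into the low bits, i.e. it filters 64 boolean slots in a single instruction. The function name and gating here are illustrative:

#[cfg(all(target_arch = "x86_64", target_feature = "bmi2"))]
fn filter_word(values: u64, mask: u64) -> (u64, u32) {
    // Safety: gated on the bmi2 target feature above.
    let packed = unsafe { core::arch::x86_64::_pext_u64(values, mask) };
    (packed, mask.count_ones()) // compacted bits and how many are valid
}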

@tustvold (Contributor Author) commented Feb 1, 2022

The filter benchmarks are also a bit focused on bytes, I don't know how common those are in DataFusion, but I would think that 32 or 64 bit numbers are much more common and should be the focus of benchmarking.

Yeah, the f32 benchmarks are encouraging that the performance uplift isn't just a quirk of the fact that the benchmarks currently use u8 for integers.

On recent intel machines there would be a much faster way using the pext instruction. That basically implements a filter for 64 bits at a time.

Yeah, I do not doubt that some artisanal use of intrinsics could eliminate much of the cost of bitmasks, perhaps even surpassing indices. I guess I had hoped it would be possible to get further before resorting to the arcane 😆

@tustvold (Contributor Author) commented Feb 2, 2022

So I've added string and dictionary support. Strings only see a ~2x performance bump; this can almost certainly be pushed further, but I figured it was good enough and I can come back to it in subsequent PRs. Dictionaries see the full 9-10x bump that the integer types see, which is 👌

I'm marking this ready for review because I think it represents a non-trivial improvement that would be super-awesome to get into the arrow 9 release. There is definitely further refinement that could take place, and more perf that could be squeezed out, but I'd prefer to do that in subsequent PRs unless people object.

For reference, here are the latest numbers - down a little since #1228 was merged, but still pretty exciting imo:

filter u8               time:   [47.731 us 47.740 us 47.748 us]                       
                        change: [-85.383% -85.377% -85.371%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [2.4163 us 2.4185 us 2.4210 us]
                        change: [-73.969% -73.939% -73.909%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [1.3798 us 1.3814 us 1.3829 us]
                        change: [-41.966% -41.649% -41.441%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8       time:   [14.557 us 14.567 us 14.580 us]                               
                        change: [-94.178% -94.175% -94.171%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 high selectivity                                                                             
                        time:   [1.1673 us 1.1678 us 1.1684 us]
                        change: [-84.915% -84.905% -84.894%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 low selectivity                                                                            
                        time:   [150.77 ns 150.80 ns 150.83 ns]
                        change: [-82.779% -82.758% -82.735%] (p = 0.00 < 0.05)
                        Performance has improved.

filter i32              time:   [68.835 us 68.857 us 68.882 us]                       
                        change: [-78.627% -78.613% -78.598%] (p = 0.00 < 0.05)
                        Performance has improved.

filter i32 high selectivity                                                                             
                        time:   [6.1117 us 6.1135 us 6.1154 us]
                        change: [-54.814% -54.788% -54.759%] (p = 0.00 < 0.05)
                        Performance has improved.

filter i32 low selectivity                                                                             
                        time:   [1.3835 us 1.3858 us 1.3895 us]
                        change: [-41.772% -41.682% -41.570%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32      time:   [16.049 us 16.058 us 16.068 us]                                
                        change: [-93.552% -93.545% -93.535%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32 high selectivity                                                                             
                        time:   [4.7336 us 4.7417 us 4.7507 us]
                        change: [-59.764% -59.459% -58.934%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32 low selectivity                                                                            
                        time:   [141.69 ns 141.83 ns 141.98 ns]
                        change: [-83.970% -83.929% -83.895%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32 w NULLs                                                                             
                        time:   [42.514 us 42.529 us 42.548 us]
                        change: [-86.620% -86.613% -86.606%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32 w NULLs high selectivity                                                                             
                        time:   [10.785 us 10.788 us 10.790 us]
                        change: [+0.8230% +1.0819% +1.2557%] (p = 0.00 < 0.05)
                        Change within noise threshold.

filter context i32 w NULLs low selectivity                                                                            
                        time:   [279.94 ns 280.04 ns 280.13 ns]
                        change: [-68.389% -68.317% -68.242%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 w NULLs                                                                             
                        time:   [40.525 us 40.540 us 40.557 us]
                        change: [-87.204% -87.196% -87.187%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 w NULLs high selectivity                                                                             
                        time:   [7.0863 us 7.0895 us 7.0928 us]
                        change: [+3.4115% +3.4779% +3.5553%] (p = 0.00 < 0.05)
                        Performance has regressed.

filter context u8 w NULLs low selectivity                                                                            
                        time:   [270.69 ns 270.79 ns 270.89 ns]
                        change: [-67.961% -67.906% -67.849%] (p = 0.00 < 0.05)
                        Performance has improved.

filter f32              time:   [120.99 us 121.18 us 121.56 us]                       
                        change: [-68.315% -68.281% -68.241%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32      time:   [36.935 us 36.951 us 36.968 us]                                
                        change: [-88.072% -88.059% -88.045%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32 high selectivity                                                                             
                        time:   [10.339 us 10.342 us 10.346 us]
                        change: [-1.7648% -1.4694% -1.2565%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32 low selectivity                                                                            
                        time:   [287.59 ns 288.40 ns 289.16 ns]
                        change: [-67.322% -67.087% -66.921%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string   time:   [243.62 us 243.66 us 243.72 us]                                  
                        change: [-40.960% -40.903% -40.858%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string high selectivity                                                                            
                        time:   [75.645 us 75.664 us 75.682 us]
                        change: [-81.225% -81.211% -81.199%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string low selectivity                                                                             
                        time:   [666.91 ns 667.33 ns 667.80 ns]
                        change: [-42.717% -42.635% -42.556%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary                                                                             
                        time:   [16.348 us 16.354 us 16.359 us]
                        change: [-89.690% -89.684% -89.677%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary high selectivity                                                                              
                        time:   [4.8464 us 4.8530 us 4.8617 us]
                        change: [-83.162% -83.106% -83.056%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary low selectivity                                                                            
                        time:   [318.97 ns 319.14 ns 319.32 ns]
                        change: [-49.861% -49.819% -49.770%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary w NULLs                                                                             
                        time:   [42.654 us 42.726 us 42.806 us]
                        change: [-87.775% -87.728% -87.657%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary w NULLs high selectivity                                                                             
                        time:   [11.313 us 11.316 us 11.319 us]
                        change: [-68.181% -68.152% -68.130%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary w NULLs low selectivity                                                                            
                        time:   [480.59 ns 480.81 ns 481.04 ns]
                        change: [-54.683% -54.630% -54.579%] (p = 0.00 < 0.05)
                        Performance has improved.

filter single record batch                                                                            
                        time:   [62.108 us 62.210 us 62.292 us]
                        change: [-80.890% -80.862% -80.837%] (p = 0.00 < 0.05)
                        Performance has improved.

@tustvold marked this pull request as ready for review on February 2, 2022 23:33
@tustvold changed the title from "POC: Specialized filter kernels" to "Specialized filter kernels" on Feb 2, 2022

/// A builder to construct [`FilterPredicate`]
#[derive(Debug)]
pub struct FilterBuilder {
@tustvold (Contributor Author):

The idea behind this is that we may wish to give the user more control over how the predicate is executed; by creating a builder we have an obvious extension point for doing this.
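
For example, a usage sketch of that extension point as it stands in this PR (method names per the diff; treat the exact signatures as illustrative):

use arrow::array::{ArrayRef, BooleanArray};
use arrow::compute::FilterBuilder;

// Analyze/optimize the predicate once, then apply it to many columns,
// amortizing the cost of inspecting the filter.
fn filter_columns(
    columns: &[ArrayRef],
    predicate: &BooleanArray,
) -> arrow::error::Result<Vec<ArrayRef>> {
    let predicate = FilterBuilder::new(predicate).optimize().build();
    columns.iter().map(|c| predicate.filter(c)).collect()
}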

/// used to build a new [`GenericStringArray`] by copying values from the source
///
/// TODO(raphael): Could this be used for the take kernel as well?
struct FilterString<'a, OffsetSize> {
@tustvold (Contributor Author):

In subsequent PRs I intend to experiment with applying a similar construction to primitive arrays and null bitmasks, e.g. FilterPrimitive and FilterNull, as I think it might compose nicely and potentially allow for code sharing with the take kernels.

@alamb (Contributor) left a comment:

This is very impressive work @tustvold 👏

I went through this code pretty carefully. It makes sense to me, though I had some comments / requests for clarification

Before merging this I would really like another pair of eyes on this, perhaps @jhorstmann @yordan-pavlov , @ritchie46 or @nevi-me would be willing to review 🙏

@@ -119,17 +147,83 @@ impl<'a> Iterator for SlicesIterator<'a> {
}
}

/// An iterator of `usize` whose index in [`BooleanArray`] is true
///
/// This provides the best performance on all but the most selective predicates, where the
Contributor:

Suggested change
/// This provides the best performance on all but the most selective predicates, where the
/// This provides the best performance on all but the least selective predicates (which keep most / all rows), where the

Contributor:

I think "least selective" means "high selectivity", confusingly

Contributor:

Is it also worth mentioning that it requires the filter to be entirely non-null?

@tustvold (Contributor Author):

NULLs within the filter itself are handled by prep_null_mask_filter - i.e. they are converted to false before any of this code sees them
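
In other words, the kernel can assume a fully valid mask. A semantics-only sketch of that conversion (not the library's implementation, which works on the packed buffers):

use arrow::array::BooleanArray;

// NULL predicate slots select nothing, so they are rewritten as false.
fn nulls_to_false(filter: &BooleanArray) -> BooleanArray {
    filter.iter().map(|keep| Some(keep.unwrap_or(false))).collect()
}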

Contributor:

my understanding of the meaning of "highly selective filter" is that it filters out most of the data

@tustvold (Contributor Author):

Yeah, highly selective means low selectivity. Selectivity is a measure of how many rows a filter selects; selectiveness is a measure of how few rows it selects. Or something like that... This terminology is still new to me 😄

/// `filter` implementation for string arrays
///
/// Note: NULLs with a non-zero slot length in `array` will have the corresponding
/// data copied across. This allows handling the null mask separately from the data
Contributor:

I thought this maybe was handled by applying prep_null_mask_filter to the filter first?

@tustvold (Contributor Author):

prep_null_mask_filter handles nulls in the filter, not nulls in the array being filtered.

@@ -696,7 +1273,8 @@ mod tests {
.flat_map(|(idx, v)| v.then(|| idx))
.collect();

assert_eq!(bits, expected_bits);
assert_eq!(slice_bits, expected_bits);
assert_eq!(index_bits, expected_bits);
}

#[test]
Contributor:

Do you feel good about the level of testing in this module (specifically with filters of selectivity 0.8 and higher / lower)?

@tustvold (Contributor Author), Feb 3, 2022:

Eeh... I'm fairly confident, but more tests couldn't hurt. I'll see what I can come up with

@tustvold (Contributor Author):

Added a fuzz test in d587a46
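
The commit is the source of truth; the shape of such a test is roughly this hedged sketch, comparing the kernel against a naive reference on random inputs (the rand usage and densities are illustrative):

use arrow::array::{Array, BooleanArray, Int32Array};
use arrow::compute::filter;
use rand::{thread_rng, Rng};

fn fuzz_once(len: usize, true_density: f64) {
    let mut rng = thread_rng();
    let values: Vec<i32> = (0..len).map(|_| rng.gen()).collect();
    let mask: Vec<bool> = (0..len).map(|_| rng.gen_bool(true_density)).collect();

    // naive reference implementation
    let expected: Vec<i32> = values
        .iter()
        .zip(&mask)
        .filter_map(|(&v, &keep)| keep.then(|| v))
        .collect();

    let filtered = filter(&Int32Array::from(values), &BooleanArray::from(mask)).unwrap();
    let filtered = filtered.as_any().downcast_ref::<Int32Array>().unwrap();
    assert_eq!(filtered.values(), expected.as_slice());
}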

@tustvold (Contributor Author) commented Feb 4, 2022

Thank you all for taking the time to review this. I think I've incorporated all the feedback, but let me know if I've missed something 😅

@yordan-pavlov (Contributor) left a comment:

thanks @tustvold, looks great

@Dandandan (Contributor) left a comment:

Looks great! Very impressive improvements, can't wait to see this landed in DataFusion.

@Dandandan (Contributor) commented Feb 6, 2022

FYI @jorgecarleitao - might it be good to compare performance to arrow2 after these changes?


/// Extends the in-progress array by the ranges in the provided iterator
fn extend_slices(&mut self, iter: impl Iterator<Item = (usize, usize)>) {
for slice in iter {
Contributor:

Suggested change
for slice in iter {
for (start, end) in iter {

Small style suggestion

@jorgecarleitao (Member):

Last time I benched, arrow2 was 5x faster than arrow (results here), so this most likely beats it :)

I haven't had the time to go through this in detail, but the approach is very well thought out. Kudos to @tustvold!

@jorgecarleitao (Member) left a comment:

Ok, I finished going through this in detail. Thanks a lot @tustvold , quite powerful ideas here.

The IterationStrategy design of switching between slices and indexes in particular is quite strong, imo.

I do wonder how the performance diff is so dramatic for primitives. My initial hypothesis is that the perf is primarily due to the selectivity split between index vs slice iteration, since I can't see any major differences in the slicing code.

💯

EDIT: Ahhh, I now understand - it is the specialization of MutableArrayData that offers the performance. Makes total sense.

// now we only have a boolean mask to deal with
let predicate = prep_null_mask_filter(predicate);
return filter(array, &predicate);
pub fn filter(values: &dyn Array, predicate: &BooleanArray) -> Result<ArrayRef> {
Member:

Not quite related to this PR, but I'm wondering whether, in the case where there are multiple predicates on the same input array, it might be beneficial to only materialize the final output array after all the predicates have been evaluated. It seems we don't have such an API yet?

@tustvold (Contributor Author):

If you had multiple predicates, I presume you'd evaluate them separately and then AND the filters together to get the final filter to "materialize". I'm not sure if this is what you had in mind?
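
That combination can be sketched with the existing boolean kernels; a hedged sketch assuming a non-empty predicate list:

use arrow::array::{Array, ArrayRef, BooleanArray};
use arrow::compute::{and, filter};

// Evaluate each predicate to a BooleanArray elsewhere, AND them together,
// and materialize the output with a single filter pass at the end.
fn filter_by_all(
    array: &dyn Array,
    predicates: &[BooleanArray], // assumed non-empty
) -> arrow::error::Result<ArrayRef> {
    let mut combined = predicates[0].clone();
    for p in &predicates[1..] {
        combined = and(&combined, p)?;
    }
    filter(array, &combined)
}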

Member:

Yea, that's what I have in mind. I'm thinking that for a chain of ANDs maybe we don't need to evaluate them separately, but rather evaluate each on the result of the previous one, and skip materialization until the last predicate (or short-circuit if all values are filtered out before that).

@tustvold (Contributor Author):

Definitely something to measure. I think it would be hard to beat a strategy that materializes very selective predicates and otherwise just evaluates predicates even against already "eliminated" rows. I'd expect a masked comparison kernel to have non-trivial branching overheads, but I honestly don't know 😄 I guess it also depends on how expensive the predicate is to evaluate - a complex cast might tip the scales... 🤔

Member:

I don't know either. For the chained ANDs case, maybe in the future it is worth evaluating the approach based on a selection vector, only materializing the output array at the end.

_ => prep_null_mask_filter(filter),
};

let count = filter_count(&filter);
Member:

For the special case where count is equal to the length of the original array (i.e., all values are selected), is there a faster path?

@tustvold (Contributor Author):

Good shout - will add 👍

@tustvold (Contributor Author):

Added in 2aa46b0

Contributor:

What about the same for 0 (no selected values)?

@tustvold (Contributor Author):

This was actually being handled at a higher level in filter_array; I've cleaned this up in 7b4a2f1
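
Conceptually, the two fast paths discussed in this thread look something like the following sketch (the function and its placement are illustrative; the PR wires this into the kernel itself):

use arrow::array::{make_array, new_empty_array, Array, ArrayRef};

fn fast_path(array: &dyn Array, filter_count: usize, filter_len: usize) -> Option<ArrayRef> {
    if filter_count == 0 {
        // no values selected: an empty array of the same type
        return Some(new_empty_array(array.data_type()));
    }
    if filter_count == filter_len {
        // all values selected: return the input unchanged
        return Some(make_array(array.data().clone()));
    }
    None
}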

@sunchao (Member) commented Feb 6, 2022

Great work @tustvold! Really looking forward to this improvement!

@tustvold (Contributor Author) commented Feb 6, 2022

the specialization of the mutableArrayData that offers the performance

Yeah, setting FILTER_SLICES_SELECTIVITY_THRESHOLD to 0, i.e. forcing slice iteration, shows a roughly 2-3x performance improvement on low selectivity filters solely from moving away from MutableArrayData. This isn't all that surprising considering each range costs 2 dynamic dispatches 😅 The remaining 5x comes from the iteration strategy.

@@ -119,17 +155,83 @@ impl<'a> Iterator for SlicesIterator<'a> {
}
}

/// An iterator of `usize` whose index in [`BooleanArray`] is true
///
/// This provides the best performance on all but the least selective predicates (which keep most
Member:

nvm if I read it wrongly, but doesn't "the least selective predicates" mean keeping only a few rows? I think what it wants to say is that highly selective predicates should favour SlicesIterator, right?

@tustvold (Contributor Author):

I think a filter with high selectivity is considered to be not very selective, i.e. it isn't very discerning about what it selects. However, this terminology seems to be a source of a lot of confusion, myself included, so I'm just going to reword it 😆

/// slots of a [BooleanArray] are true. Each interval corresponds to a contiguous region of memory
/// to be "taken" from an array to be filtered.
///
/// This is only performant for the least selective filters that copy across long contiguous runs
Member:

ditto, I think this should be highly selective filters?

Comment on lines +569 to +570
let iter = SlicesIterator::new(&predicate.filter);
iter.for_each(|(start, end)| mutable.extend(0, start, end));
Member:

Hmm, why does using SlicesIterator cover all the other cases here, e.g. Indices?

@tustvold (Contributor Author), Feb 7, 2022:

In short, because MutableArrayData doesn't support other iteration strategies, and I'm not sure that calling it with "ranges" of length 1 will outperform at least attempting to call it with larger ranges.

My 2 cents is that the eventual goal should be to remove this fallback, as MutableArrayData is a poor fit for filtering, where the ranges are typically not substantial enough to effectively amortize its per-range overheads.

@viirya (Member) commented Feb 7, 2022

A great improvement! Looking forward to having this enhancement.

@alamb (Contributor) left a comment:

Looks great to me; I think it is ready to go

I'll plan to merge to master tomorrow unless there are any more comments

Thank you to everyone who contributed ideas and discussions

for i in 0..100 {
let filter_percent = match i {
0..=4 => 1.,
5..=10 => 0.,
Contributor:

TIL match works on ranges!

@sunchao (Member) left a comment:

LGTM

@alamb merged commit c064d53 into apache:master on Feb 9, 2022
@alamb (Contributor) commented Feb 9, 2022

Thank you to everyone who contributed reviews and comments to this PR and thank you again to @tustvold for sticking with this!

@alamb changed the title from "Specialized filter kernels" to "Add specialized filter kernels in compute module (up to 10x faster)" on Feb 16, 2022
Labels: arrow (Changes to the arrow crate), parquet (Changes to the parquet crate), performance

Successfully merging this pull request may close these issues:

Improve filter performance by special casing high and low selectivity predicates