Fix filter UB and add fast path #341
Codecov Report
@@ Coverage Diff @@
## master #341 +/- ##
==========================================
+ Coverage 82.56% 82.60% +0.04%
==========================================
Files 162 162
Lines 44063 44199 +136
==========================================
+ Hits 36379 36510 +131
- Misses 7684 7689 +5
arrow/src/compute/kernels/filter.rs
    Ok(make_array(data))
    if iter.filter_count == array.len() {
        let data = array.data().clone();
        Ok(make_array(data))
Couldn't this just return `array` or `array.clone()`?
`dyn Array` is a trait object and does not implement `Sized`.
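To illustrate the point about `Sized`, here is a minimal sketch using a hypothetical miniature `Array` trait (not the arrow API). It shows why a borrowed trait object cannot simply be returned by value, while an owned `Arc` handle can be cloned cheaply. The names `Int32Values` and `passthrough` are made up for this example.

```rust
use std::sync::Arc;

/// Hypothetical miniature of an array trait; not the arrow API.
trait Array {
    fn len(&self) -> usize;
}

/// Hypothetical concrete array type used only for this sketch.
struct Int32Values(Vec<i32>);

impl Array for Int32Values {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Owned, shareable handle to some array, analogous to an `Arc<dyn Array>`.
type ArrayRef = Arc<dyn Array>;

// This would not compile: `dyn Array` has no size known at compile time,
// so it cannot be returned by value or moved out of a borrow.
// fn passthrough(array: &dyn Array) -> dyn Array { *array }

/// With an owned handle available, passing the array through is a pointer clone.
fn passthrough(array: &ArrayRef) -> ArrayRef {
    Arc::clone(array)
}
```

When only a borrowed `&dyn Array` is available, as in the snippet under review, an owned handle has to be rebuilt from the shared underlying data, which is what the `make_array(data)` call does.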
The MIRI failure is unrelated to this PR: #345
Thanks @ritchie46 -- This code makes sense to me. @nevi-me @Dandandan or @jorgecarleitao any thoughts?
Note that this is not undefined behavior as defined in Rust; the warning is just that when nulls exist, the value of the null slot will be used for filtering regardless of its validity. I.e. it is semantically undefined behavior. Do we have some benchmarks available? It would be good to verify that the performance improves here.
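A small sketch of the semantic issue described above, using a hypothetical `Predicate` struct rather than the arrow types. It assumes that a null slot in the predicate should mean "do not select", so the raw value is AND-ed with the validity mask before filtering. Whether the kernel in this PR does exactly this is an assumption.

```rust
/// Hypothetical boolean predicate: raw values plus an optional validity mask.
struct Predicate {
    values: Vec<bool>,
    /// `false` marks a null slot; `None` means there are no nulls.
    validity: Option<Vec<bool>>,
}

/// Builds the mask that is safe to filter with: a slot is selected only
/// if its value is true AND the slot is valid (assumed null semantics).
fn effective_mask(p: &Predicate) -> Vec<bool> {
    match &p.validity {
        // No null mask: the raw values can be used directly.
        None => p.values.clone(),
        // Null mask present: ignore whatever value sits in a null slot.
        Some(valid) => p
            .values
            .iter()
            .zip(valid)
            .map(|(&v, &ok)| v && ok)
            .collect(),
    }
}
```

Without this step, a null slot whose underlying value happens to be `true` would still select its row.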
Counting Note that this benchmark is sensitive to scale, because the filter algorithm is
Yep, that I would expect. :) I was thinking about the case where there is a null mask and non-null values.
Looks great! This has great consequences. E.g. it removes the need to constant remove filters in DataFusion.
Thanks a lot @ritchie46 !
Curious, what do you mean by constant remove filters?
rename argument: 'filter' to 'predicate' to reduce name collisions.
I mean that we no longer need to try to prune expressions of the form
This was the same reason for me doing the PR. I paid a relatively heavy price for filtering null values on a
👍
The call not indeed. Only the fast paths.
Thanks @ritchie46 !
* fix ub in filter record_batch
* filter fast path
* add all false fast path
* use new_empty_array
* rename filter kernel argument

rename argument: 'filter' to 'predicate' to reduce name collisions.
This PR and benchmark got me thinking - it is faster to clone an array than to create the equivalent null array from scratch (as can be seen in micro benchmarks above). It can increase memory usage though, as the contents might be big (for string arrays, say) and could otherwise be deallocated once the reference count drops to 0. On the other hand - this is something we do too for
I think this makes a lot of sense personally (even if it might increase the peak memory usage of the system). Most filter masks are applied immediately (e.g. as part of some filtering operation), so I think the amount of additional time that a large string array would be held on to is likely minimal.
* fix ub in filter record_batch
* filter fast path
* add all false fast path
* use new_empty_array
* rename filter kernel argument

rename argument: 'filter' to 'predicate' to reduce name collisions.

Co-authored-by: Ritchie Vink <ritchie46@gmail.com>
This also fixes UB for filter on `RecordBatches`. Still issue #295.

Besides this, I also added a fast path. We already do a `popcount` in the `filter` operation; it seems to me a missed opportunity to not just `Arc` clone the data when all values are `true`.

EDIT: I also made `ArrayData::new_empty` public. If there is any objection to that I can make it private. IMO this should be public, as I think it should be easy to make an empty container of data structures.
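The fast path described above can be sketched as follows. This is a simplified illustration and not the kernel from this PR: `Values` is a hypothetical stand-in for an Arrow array, the predicate is a plain `&[bool]` instead of a bitmap, and `filter_with_fast_path` is a made-up name.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow array: shared, immutable values.
type Values = Arc<Vec<i32>>;

/// Filters `values` by `predicate`, taking a fast path when every slot
/// (or no slot) is selected. Both inputs must have the same length.
fn filter_with_fast_path(values: &Values, predicate: &[bool]) -> Values {
    assert_eq!(values.len(), predicate.len());

    // The count the filter needs anyway (a popcount on a real bitmap).
    let selected = predicate.iter().filter(|&&keep| keep).count();

    if selected == values.len() {
        // All true: bump the reference count instead of copying data.
        return Arc::clone(values);
    }
    if selected == 0 {
        // All false: return an empty container without touching the values.
        return Arc::new(Vec::new());
    }

    // General case: copy only the selected values.
    let kept: Vec<i32> = values
        .iter()
        .zip(predicate)
        .filter(|pair| *pair.1)
        .map(|(&v, _)| v)
        .collect();
    Arc::new(kept)
}
```

In the all-true case the returned handle points at the same allocation as the input, so the cost is independent of the array size. The trade-off is that the original buffer stays alive for as long as the result does.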