-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify InListExpr ~20-70% Faster #4057
Conversation
|
||
/// Size at which to use a Set rather than Vec for `IN` / `NOT IN` | ||
/// Value chosen by the benchmark at | ||
/// https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845198369 | ||
/// TODO: add switch codeGen in In_List | ||
static OPTIMIZER_INSET_THRESHOLD: usize = 30; | ||
static OPTIMIZER_INSET_THRESHOLD: usize = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to double-check whether we even need this anymore
for<'a> &'a T: ArrayAccessor, | ||
for<'a> <&'a T as ArrayAccessor>::Item: PartialEq + HashValue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is gross, not really sure how to avoid it though as the lifetime of the downcast & dyn Array
necessarily has a lifetime less than the lifetime of ArraySet... So we need to HRTB magic...
.map(|expr| { | ||
expr.evaluate(batch).and_then(|r| match r { | ||
ColumnarValue::Array(_) => Err(DataFusionError::Execution( | ||
"InList expression must evaluate to a scalar".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is consistent with the code before, which instead returned "InList does not yet support nested columns". I think this error message is more clear about what the cause actually is, it has nothing to do with nested columns at all.
See #3766
@@ -2068,50 +2069,6 @@ mod tests { | |||
lit(struct_literal) | |||
} | |||
|
|||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now always use a HashSet see below
/// Value chosen by the benchmark at | ||
/// https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845198369 | ||
/// TODO: add switch codeGen in In_List | ||
static OPTIMIZER_INSET_THRESHOLD: usize = 30; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the changes in this PR the hashset is only slower when list contains 1 or 2 non-null primitive values. I did not feel this warranted the additional complexity, especially as the optimizer should really rewrite an InList of two elements to a disjunctive boolean expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a future ticket, we can create a new issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably rewrite inlist exprs with few elements from x in (0, 1)
to x=1 or x=2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filed #4089 to track
@@ -1543,237 +932,12 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer need these duplicated tests as there isn't a separate execution path based on the length of the InList
$(impl HashValue for $t { | ||
fn hash_one(&self, state: &RandomState) -> u64 { | ||
state.hash_one(self.to_le_bytes()) | ||
state.hash_one(<$i>::from_ne_bytes(self.to_ne_bytes())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a drive by fix that improves hashing performance of floating points. This is because there are specialized methods for hashing i32
, i64
as opposed to [u8;N]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this now, are we correctly handling NaNs here?
Looking at the ordered-float implementation for Hash
, it first normalizes nans https://docs.rs/ordered-float/3.3.0/src/ordered_float/lib.rs.html#160
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #4051
TLDR OrderedFloat is a more strict ordering than we need
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is a pain, but i would recommend pulling this into its own PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't materially change the hashing? It was a bitwise hash before and after, all that changes is it hashes an integer comprised of the bytes instead of an array of bytew
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this PR quite carefully and it looks 👍
Thank you @tustvold
$(impl HashValue for $t { | ||
fn hash_one(&self, state: &RandomState) -> u64 { | ||
state.hash_one(self.to_le_bytes()) | ||
state.hash_one(<$i>::from_ne_bytes(self.to_ne_bytes())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is a pain, but i would recommend pulling this into its own PR
let in_array = &self.array; | ||
let has_nulls = in_data.null_count() != 0; | ||
|
||
ArrayIter::new(v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be more performant to use hash_array
to has the array all in one go rather than doing it an element at a time?
I was thinking that since this is the hot path the more vectorization we get the better
@@ -945,7 +323,6 @@ impl PartialEq<dyn Any> for InListExpr { | |||
self.expr.eq(&x.expr) | |||
&& expr_list_eq_any_order(&self.list, &x.list) | |||
&& self.negated == x.negated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&& self.negated == x.negated | |
&& self.negated == x.negated | |
// since self.static_filter is derived from self.list, not necessary to check here |
@@ -1809,58 +976,6 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[test] | |||
fn in_list_set_timestamp() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I verified there is still coverage of timestamp testing above
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
cc @Ted-Jiang and @liukun4515 as well |
Benchmark runs are scheduled for baseline = 8d6448e and contender = 1287529. 1287529 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
* Simplify InList expression * Simplify * Hash floats as integers * Fix tests * Format * Update datafusion-cli lockfile * Sort Cargo.toml * Update datafusion/physical-expr/src/expressions/in_list.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #2165
Closes #2831
Closes #2833
Rationale for this change
The existing logic is incredibly hard to follow. Additionally its use of ScalarValue results in additional allocations, and dynamic dispatch, which is likely severely hurting performance
Benchmarks in #4068, show a slight regression for single primitive value InList, but otherwise across the board a significant improvement. The regression is due to now always using a HashSet, but the additional complexity of two evaluation methods for this case did not seem worthwhile anymore, especially when the optimizer should really rewrite such expressions to a disjunctive boolean expression.
What changes are included in this PR?
Are there any user-facing changes?
No