-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
InList: fix bug for comparing with Null in the list using the set optimization #2809
Conversation
}, | ||
ColumnarValue::Array(_) => { | ||
unimplemented!("InList does not yet support nested columns.") | ||
} | ||
}) | ||
.collect::<Vec<_>>(); | ||
|
||
// TODO do we need to replace this below logic by `in_list_primitive`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to replace below logic by the macro collection_contains_check
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the logic below and it looked correct to me -- are you asking about the trying to refactor to reduce duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the logic below and it looked correct to me -- are you asking about the trying to refactor to reduce duplication?
Yes, I want to replace the logic in the in_list_primitive
by the macro collection_contains_check
.
I will replace the duplicated code in the in_list_primitive.
.collect::<BooleanArray>(), | ||
))); | ||
.map(|vop| { | ||
match vop.map(|v| !$SET_VALUES.contains(&v.try_into().unwrap())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The f32 is not implemented the eq
trait, we can't use the set::<f32>.contains
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just use the set::<Scalarvalue::Float32/Float64>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere in DataFusion (including in ScalarValue
) we use ordered_float
to compare floating point numbers
It might be possible to use set::<OrderedFloat<f32>>
, which would be more space efficient (fewer bytes than ScalarValue
) as well as faster (as the comparison doens't have to dispatch on the type each time)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #2809 +/- ##
==========================================
+ Coverage 85.18% 85.27% +0.09%
==========================================
Files 275 275
Lines 48564 48675 +111
==========================================
+ Hits 41367 41508 +141
+ Misses 7197 7167 -30
Continue to review full report at Codecov.
|
I don't get it clearly what this is targeted to fix. The issue it closes is "support decimal in (NULL)", but by a quick look, seems the change is more than that. Although there is a few description, but I cannot get it from it too. Could you rephrase the issue this is going to fix? |
sorry for the confused description. I have described the issue in #2817 and the description of this pull request. PTAL @viirya |
d7541f3
to
56c79ca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @liukun4515 -- I reviewed the tests carefully and I think this looks good.
The code "feels" a little more complicated than needed but I think it is doing what it needs to and the test coverage is good.
I wonder if it is worth just 1 test in sql_integration
somewhere that ensures this implementation is hooked up correctly rather than just relying on tests in in_list.rs
I think it would be good if @Ted-Jiang was also able to review this PR prior to merging it, but it isn't necessary.
lit(ScalarValue::Boolean(Some(true))), | ||
lit(ScalarValue::Boolean(None)), | ||
]; | ||
for _ in 1..(OPTIMIZER_INSET_THRESHOLD + 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the choice of bounds of 1
... OPTIMIZER_INSET_THRESHOLD + 1
-- why not 0..OPTIMIZER_INSET_THRESHOLD
? Not that this way is wrong, I just don't understand it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to 0..OPTIMIZER_INSET_THRESHOLD
let col_a = col("a", &schema)?; | ||
let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?; | ||
|
||
// expression: "a in (0,3,4....)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// expression: "a in (0,3,4....)" | |
// expression: "a in (0,Null,3,4....)" |
batch, | ||
list.clone(), | ||
&false, | ||
vec![Some(true), None, None], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
let col_a = col("a", &schema)?; | ||
let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?; | ||
|
||
// expression: "a in (0.0,3.0,4.0 ....)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// expression: "a in (0.0,3.0,4.0 ....)" | |
// expression: "a in (0.0,Null,3.0,4.0 ....)" |
@@ -1139,4 +1242,192 @@ mod tests { | |||
|
|||
Ok(()) | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the test coverage
.collect::<BooleanArray>(), | ||
))); | ||
.map(|vop| { | ||
match vop.map(|v| !$SET_VALUES.contains(&v.try_into().unwrap())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere in DataFusion (including in ScalarValue
) we use ordered_float
to compare floating point numbers
It might be possible to use set::<OrderedFloat<f32>>
, which would be more space efficient (fewer bytes than ScalarValue
) as well as faster (as the comparison doens't have to dispatch on the type each time)
}, | ||
ColumnarValue::Array(_) => { | ||
unimplemented!("InList does not yet support nested columns.") | ||
} | ||
}) | ||
.collect::<Vec<_>>(); | ||
|
||
// TODO do we need to replace this below logic by `in_list_primitive`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the logic below and it looked correct to me -- are you asking about the trying to refactor to reduce duplication?
#2832 issue to track it |
56c79ca
to
384277b
Compare
collection_contains_check!(array, native_set, negated, contains_null) | ||
} | ||
|
||
fn set_contains_utf<OffsetSize: OffsetSizeTrait>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn set_contains_utf<OffsetSize: OffsetSizeTrait>( | |
fn make_set_contains_utf8<OffsetSize: OffsetSizeTrait>( |
@@ -532,66 +521,131 @@ impl PhysicalExpr for InListExpr { | |||
match value_data_type { | |||
DataType::Boolean => { | |||
let array = array.as_any().downcast_ref::<BooleanArray>().unwrap(); | |||
set_contains_with_negated!(array, set, self.negated) | |||
Ok(set_contains_for_primitive!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering why set_contains_for_primitive
is macro, but set_contains_utf
and make_set_contains_decimal
are methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_contains_utf
and make_set_contains_decimal
are just apply to a specified data type case which are not compatible with other primitive case.
set_contains_for_primitive
will apply to many similar case and same logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. The logic handling null looks correct to me. I have a few comments about method naming and macro/method.
return Ok(ColumnarValue::Array(Arc::new( | ||
macro_rules! collection_contains_check { | ||
($ARRAY:expr, $VALUES:expr, $NEGATED:expr, $CONTAINS_NULL:expr) => {{ | ||
let bool_array = if $NEGATED { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two $NEGATED
cases look very similar, is it possible to combine them to save code?
e.g.
if $CONTAINS_NULL {
$ARRAY
.iter()
.map(|vop| match vop.map(|v| {
if $NEGATED { !$VALUES.contains(&v) } else { $VALUES.contains(&v) }
}) {
Some(true) if $NEGATED => None,
Some(false) if !$NEGATED => None,
x => x,
})
.collect::<BooleanArray>()
} else {
$ARRAY
.iter()
.map(|vop| vop.map(|v| {
if $NEGATED { !$VALUES.contains(&v) } else { $VALUES.contains(&v) }
})
.collect::<BooleanArray>()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya
these two branches can be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#2833 track this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have definitely seen cases where
if $NEGATED {
// loop
} else {
// loop
}
Was faster than
loop {
if $NEGATED {
// ..
} else {
// ..
}
}
But I think the only way to find out if that applies in this case would be with a benchmark
Which issue does this PR close?
close #2817
In this pr: #2156, @Ted-Jiang add the set to optimize the Inlist.
But the implementation does not consider the
NULL
case forIN
orNOT IN
expr.In the SQL system like Mysql or spark, NULL is not equal NULL and NULL can't be compared with other data type.
If
A
compare with NULL, the result must be NULL.Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?