-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support IN list on Dictionary #3975
Conversation
@@ -489,6 +496,196 @@ impl InListExpr { | |||
contains_null | |||
)) | |||
} | |||
|
|||
fn evaluate_non_dict( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a refactor to move all available on into this function so it can be reused with Dictionary data
@@ -714,188 +912,31 @@ impl PhysicalExpr for InListExpr { | |||
}; | |||
|
|||
match value_data_type { | |||
DataType::Float32 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the moved code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that the implementation as written will have extremely poor memory and CPU performance.
I think a more optimised implementation might do the following:
- Compute the set of used dictionary keys to a BooleanArray, e.g. using
BooleanBufferBuilder
- And null mask of the Dictionary values with the computed mask
- Call InList on that new array of values
.unwrap(); | ||
let mut dict_vals = Vec::with_capacity(dict_array.len()); | ||
for i in 0..dict_array.len() { | ||
let (values_array, values_index) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will perform a downcast for each element, it would be better to access dict_array.keys
, and dict_array.values
, and then iterate over the keys
.as_any() | ||
.downcast_ref::<DictionaryArray<Int32Type>>() | ||
.unwrap(); | ||
let mut dict_vals = Vec::with_capacity(dict_array.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This acts to hydrate the dictionary which is likely extremely inefficient
// Get values from the dictionary that include nulls for none values | ||
let dict_array = array | ||
.as_any() | ||
.downcast_ref::<DictionaryArray<Int32Type>>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will panic if key_type
is not DataType::Int32
. We should either add this to the match block, or use something like https://docs.rs/arrow/latest/arrow/macro.downcast_dictionary_array.html
to handle all cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I forgot to make it general. Working on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make this general after the null data work. I tried to move this inside a function with trait to make if <DictionaryArray<K>>
. It works for the downcast_ref
but it does not work with take
. I need to write more code for all cases if we use take
// Look up value from Index | ||
let value = match values_index { | ||
Some(values_index) => { | ||
ScalarValue::try_from_array(values_array, values_index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also performs a fairly expensive downcast operation for every dictionary element
@@ -436,6 +436,13 @@ impl InListExpr { | |||
ScalarValue::Utf8(None) => None, | |||
ScalarValue::LargeUtf8(Some(v)) => Some(v.as_str()), | |||
ScalarValue::LargeUtf8(None) => None, | |||
ScalarValue::Dictionary(_, v) => match v.as_ref() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this modification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It allows getting the underlying value out of a (typed) ScalarValue::Dictionary
@@ -428,9 +428,8 @@ async fn csv_in_set_test() -> Result<()> { | |||
} | |||
|
|||
#[tokio::test] | |||
#[ignore] | |||
// https://github.com/apache/arrow-datafusion/issues/3936 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// https://github.com/apache/arrow-datafusion/issues/3936 |
@@ -436,6 +436,13 @@ impl InListExpr { | |||
ScalarValue::Utf8(None) => None, | |||
ScalarValue::LargeUtf8(Some(v)) => Some(v.as_str()), | |||
ScalarValue::LargeUtf8(None) => None, | |||
ScalarValue::Dictionary(_, v) => match v.as_ref() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It allows getting the underlying value out of a (typed) ScalarValue::Dictionary
) | ||
DataType::Dictionary(_key_type, value_type) => { | ||
// Get values from the dictionary that include nulls for none values | ||
let dict_array = array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like you should be able to evaluate the IN list only on the dictionary values rather than continually looking up the same elements over and over again in the dictionary and use 'take' https://docs.rs/arrow/latest/arrow/compute/kernels/take/fn.take.html to form the final array
Something like thus pseudo code maybe
let values = dict_array.values();
// recursively evaluate IN <..> on the value array
let values_result = evaluate_set(values, list_values);
// Then form the final boolean array by calling take on the indices
compute::take(values_result, dict_array.keys())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) | ||
DataType::Dictionary(_key_type, value_type) => { | ||
// Get values from the dictionary that include nulls for none values | ||
let dict_array = array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Get values from the dictionary that include nulls for none values | ||
let dict_array = array | ||
.as_any() | ||
.downcast_ref::<DictionaryArray<Int32Type>>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make this general after the null data work. I tried to move this inside a function with trait to make if <DictionaryArray<K>>
. It works for the downcast_ref
but it does not work with take
. I need to write more code for all cases if we use take
} | ||
|
||
#[tokio::test] | ||
async fn in_list_string_dictionaries_with_null() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test do not pass with method // 1
.unwrap(); | ||
let keys = dict_array.keys(); | ||
|
||
let values_result = evaluate_set(&array, list_values).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let values_result = evaluate_set(&array, list_values).unwrap(); | |
let values_result = evaluate_set(dict_array.values().as_ref(), list_values).unwrap(); |
? I'm surprised as written this doesn't result in a stack overflow?
} | ||
|
||
// Return a boolean array indicating whether the value is in list_values | ||
fn evaluate_set( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a new function, I think @alamb 's suggestion was to recurse into InListExpr::evaluate (by pulling it into a free function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was suggesting that since InList was already implemented for non dictionary types, we re-use that implementation (though I think that will require some restructuring of how evaluate
is written
.collect::<Vec<_>>(); | ||
let list_array = ScalarValue::iter_to_array(scalars).unwrap(); | ||
|
||
let cmp = build_compare(&array, &list_array).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is known not to handle nulls correctly - apache/arrow-rs#2687
Chatted with @NGA-TRAN, I'm going to take a stab at this first thing tomorrow (NZ time) |
Which issue does this PR close?
Closes #3936
Rationale for this change
To support the predicate
where col IN (values_of_dictionary_col)