Improve performance of DictionaryArray::try_new() #1435
Codecov Report
@@ Coverage Diff @@
## master #1435 +/- ##
==========================================
- Coverage 82.68% 82.67% -0.01%
==========================================
Files 185 185
Lines 53822 53825 +3
==========================================
- Hits 44502 44501 -1
- Misses 9320 9324 +4
DataType::Int8 => self.check_bounds::<i8>(max_value),
DataType::Int16 => self.check_bounds::<i16>(max_value),
DataType::Int32 => self.check_bounds::<i32>(max_value),
DataType::Int64 => self.check_bounds::<i64>(max_value),
_ => unreachable!(),
I think a possible solution would be to extract the dictionary validation logic out of ArrayData::validate_full into a separate function. DictionaryArray::try_new could then use ArrayDataBuilder::build_unchecked and afterwards call the new function which only validates that the keys are in bounds.
I think "the dictionary validation logic" refers only to the logic inside the DataType::Dictionary pattern branch.
The validation logic for the other data types is not related to dictionary offsets.
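To make the proposed extraction concrete, here is a minimal standalone sketch. It does not use the arrow crate: `DictData`, `validate_keys_in_bounds`, and this `try_new` are hypothetical stand-ins for illustration, not the actual `ArrayData` API. The idea is that the bounds check lives in its own function, so the full validation can reuse it and the constructor can call only that one check.

```rust
/// Hypothetical stand-in for dictionary array data (not arrow's `ArrayData`).
#[allow(dead_code)]
struct DictData {
    /// Dictionary keys; `None` marks a null slot.
    keys: Vec<Option<i64>>,
    /// Number of entries in the dictionary values array.
    values_len: usize,
}

#[allow(dead_code)]
impl DictData {
    /// The extracted check: every non-null key must index into the values.
    fn validate_keys_in_bounds(&self) -> Result<(), String> {
        for (i, key) in self.keys.iter().enumerate() {
            if let Some(k) = key {
                if *k < 0 || *k >= self.values_len as i64 {
                    return Err(format!(
                        "key {} at position {} is out of bounds for {} values",
                        k, i, self.values_len
                    ));
                }
            }
        }
        Ok(())
    }

    /// Full validation would run its other (elided) checks first,
    /// then reuse the extracted function.
    fn validate_full(&self) -> Result<(), String> {
        self.validate_keys_in_bounds()
    }

    /// Constructor that skips full validation and runs only the targeted check.
    fn try_new(keys: Vec<Option<i64>>, values_len: usize) -> Result<Self, String> {
        let data = DictData { keys, values_len }; // analogous to an unchecked build
        data.validate_keys_in_bounds()?;
        Ok(data)
    }
}
```

With this shape, `DictData::try_new(vec![Some(0), Some(3)], 2)` returns an error, while null slots are skipped rather than checked.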
arrow/src/array/array_dictionary.rs
Outdated
let array = unsafe { data.build_unchecked() };

array.validate_dictionary_offest()?;
Actually, validate_full also contains the "cheap" validation (validate), such as buffer length, key type, etc. I think those checks are still necessary.
I think the idea is that since the input to this function is a valid array, the array data itself is also valid. Thus, we know that the dictionary keys are valid integers of type K, but we don't know that they are all within the range 0..dictionary_values.len(), so that validation is still required. I think this PR makes this change.
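As an illustration of the remaining check, the sketch below is generic over the key's native integer type, in the spirit of the `check_bounds::<i8>(max_value)` dispatch shown in the diff. This is a standalone approximation, not the arrow implementation: the slice-of-`Option` representation and the inclusive `max_value` semantics are assumptions made for this example.

```rust
/// Checks that every non-null key lies in `0..=max_value`.
/// `T` stands in for the native key type (i8, i16, i32, i64).
/// Assumption: `max_value` is inclusive, i.e. `values.len() - 1`.
fn check_bounds<T>(keys: &[Option<T>], max_value: i64) -> Result<(), String>
where
    T: Copy + Into<i64>,
{
    for (i, key) in keys.iter().enumerate() {
        // Null slots carry no key, so there is nothing to check.
        if let Some(k) = key {
            let k: i64 = (*k).into();
            if k < 0 || k > max_value {
                return Err(format!(
                    "key {} at position {} is outside 0..={}",
                    k, i, max_value
                ));
            }
        }
    }
    Ok(())
}
```

For a dictionary with two values, `check_bounds::<i8>(&[Some(0), None, Some(1)], 1)` passes, while a key of `3` or `-1` is rejected even though it is a perfectly valid integer of its type.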
Hmm, the only bound we have is K: ArrowPrimitiveType. But validate further checks whether K is a dictionary key type via DataType::is_dictionary_key_type, so I think that check is lost after this change?
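The concern can be sketched in isolation. The enum below is a simplified, hypothetical subset of data types, not arrow's `DataType`, and `validate_key_type` is an illustrative helper. It shows why a bound that admits any primitive type still needs a runtime check: float types satisfy the bound but must be rejected as keys.

```rust
/// Simplified, hypothetical subset of data types (not arrow's `DataType`).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum KeyType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Float32,
    Float64,
}

/// Returns true only for the integer types.
fn is_dictionary_key_type(t: KeyType) -> bool {
    match t {
        KeyType::Int8
        | KeyType::Int16
        | KeyType::Int32
        | KeyType::Int64
        | KeyType::UInt8
        | KeyType::UInt16
        | KeyType::UInt32
        | KeyType::UInt64 => true,
        KeyType::Float32 | KeyType::Float64 => false,
    }
}

/// The "cheap" check that a bounds-only validation would not perform.
fn validate_key_type(t: KeyType) -> Result<(), String> {
    if is_dictionary_key_type(t) {
        Ok(())
    } else {
        Err(format!("{:?} is not a valid dictionary key type", t))
    }
}
```

A bounds check alone never looks at the key type, so both checks are needed when the full validation is skipped.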
That is an excellent point @viirya -- perhaps we can add a test case that tries to do something crazy like use Float64 keys to verify.
Perhaps the test could look like this (untested):
#[test]
#[should_panic(expected = "Type is not valid dictionary type")]
fn test_try_new_index_too_large() {
    let values: StringArray = [Some("foo"), Some("bar")].into_iter().collect();
    // Float keys need float literals; Float32 is not a valid dictionary key type
    let keys: Float32Array = [Some(0.0), None, Some(3.0)].into_iter().collect();
    DictionaryArray::<Float32Type>::try_new(&keys, &values).unwrap();
}
@viirya are you suggesting we call array.validate() in addition to array.validate_dictionary_offest()? If so, that makes sense to me.
Yes, it looks like we also need validate in addition to validate_dictionary_offest.
Thank you @jackwener ! I think this looks good. 👍
Could you also perhaps update the comments in try_new to reflect what you have changed? For example, this is likely no longer relevant:
// Note: This does more work than necessary by rebuilding /
// revalidating all the data
I also double checked and there is already test coverage for this scenario:
👍
@@ -112,7 +113,12 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
  _ => data = data.null_count(0),
  }

- Ok(data.build()?.into())
+ let array = unsafe { data.build_unchecked() };
- let array = unsafe { data.build_unchecked() };
+ // Safety: `validate` ensures key type is correct, and
+ // `validate_dictionary_offset` ensures all offsets are within range
+ let array = unsafe { data.build_unchecked() };
Looks good -- thank you @jackwener and @viirya
lgtm
Which issue does this PR close?
Closes #1313.
What changes are included in this PR?
Extract the dictionary validation logic out of ArrayData::validate_full into a separate function. DictionaryArray::try_new now uses ArrayDataBuilder::build_unchecked and afterwards calls the new function, which only validates that the keys are in bounds.
Are there any user-facing changes?
None