Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of DictionaryArray::try_new()  #1435

Merged
merged 5 commits into from
Mar 22, 2022

Conversation

jackwener
Copy link
Member

Which issue does this PR close?

Closes #1313.

What changes are included in this PR?

extract the dictionary validation logic out of ArrayData::validate_full into a separate function.

DictionaryArray::try_new use the ArrayDataBuilder::build_unchecked and afterwards call the new function which only validates that the keys are in bounds.

Are there any user-facing changes?

None

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 13, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #1435 (b9b79cf) into master (d9099c4) will decrease coverage by 0.00%.
The diff coverage is 89.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1435      +/-   ##
==========================================
- Coverage   82.68%   82.67%   -0.01%     
==========================================
  Files         185      185              
  Lines       53822    53825       +3     
==========================================
- Hits        44502    44501       -1     
- Misses       9320     9324       +4     
Impacted Files Coverage Δ
arrow/src/array/data.rs 82.82% <88.00%> (-0.34%) ⬇️
arrow/src/array/array_dictionary.rs 91.73% <100.00%> (+0.07%) ⬆️
arrow/src/datatypes/field.rs 53.79% <0.00%> (-0.31%) ⬇️
arrow/src/array/transform/mod.rs 86.31% <0.00%> (-0.12%) ⬇️
arrow/src/array/array_list.rs 95.52% <0.00%> (ø)
parquet/src/encodings/encoding.rs 93.71% <0.00%> (+0.19%) ⬆️
parquet_derive/src/parquet_field.rs 66.43% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d9099c4...b9b79cf. Read the comment docs.

DataType::Int8 => self.check_bounds::<i8>(max_value),
DataType::Int16 => self.check_bounds::<i16>(max_value),
DataType::Int32 => self.check_bounds::<i32>(max_value),
DataType::Int64 => self.check_bounds::<i64>(max_value),
_ => unreachable!(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a possible solution would be to extract the dictionary validation logic out of ArrayData::validate_full into a separate function. DictionaryArray::try_new could then use ArrayDataBuilder::build_unchecked and afterwards call the new function which only validates that the keys are in bounds.

I think "the dictionary validation logic" is only for the logic inside DataType::Dictionary pattern branch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation logic for other data types are not for dictionary offset.

Comment on lines 115 to 117
let array = unsafe { data.build_unchecked() };

array.validate_dictionary_offest()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually validate_full also contains "cheap" validation (validate), like buffer length, key type, etc. I think they seems necessary still.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is that since the input to this function is a valid array, then the array data itself is also valid.

Thus, we know that the dictionary keys are valid integers of type K but we don't know that they are all within the range 0..dictionary_values.len() so that validation is still required. I think this PR makes this change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we only have a limit that K: ArrowPrimitiveType. But in validate, it will further check if K is dictionary key type by DataType::is_dictionary_key_type, so I think the check is missed after this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an excellent point @viirya -- perhaps we can add a test case that tries to do something crazy like use Float64 keys to verify.

Perhaps the test could look like this (untested):

    #[test]
    #[should_panic(
        expected = "Type is not valid dictionary type"
    )]
    fn test_try_new_index_too_large() {
        let values: StringArray = [Some("foo"), Some("bar")].into_iter().collect();
        let keys: Float32Array = [Some(0), None, Some(3)].into_iter().collect();
        DictionaryArray::<Float32Type>::try_new(&keys, &values).unwrap();
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya are you suggesting we call array.validate() in addition to array.validate_dictionary_offest()? If so that makes sense to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks like we also need validate in additional to validate_dictionary_offest.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jackwener ! I think this looks good. 👍

Could you also perhaps update the comments in try_new to reflect what you have changed? For example this is likely no longer relevant:

        // Note: This does more work than necessary by rebuilding /
        // revalidating all the data

arrow/src/array/data.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also double checked and there is already test coverage for this scenario:

https://github.com/apache/arrow-rs/blob/master/arrow/src/array/array_dictionary.rs#L557-L577

👍

Comment on lines 115 to 117
let array = unsafe { data.build_unchecked() };

array.validate_dictionary_offest()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is that since the input to this function is a valid array, then the array data itself is also valid.

Thus, we know that the dictionary keys are valid integers of type K but we don't know that they are all within the range 0..dictionary_values.len() so that validation is still required. I think this PR makes this change

@jackwener
Copy link
Member Author

jackwener commented Mar 22, 2022

I have read the dialogue above. thanks @viirya @alamb ❤😃.
I add the cheap validate() and a unit test for wrong dictionary key.

@@ -112,7 +113,12 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
_ => data = data.null_count(0),
}

Ok(data.build()?.into())
let array = unsafe { data.build_unchecked() };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let array = unsafe { data.build_unchecked() };
// Safety: `validate` ensures key type is correct, and
// `validate_dictionary_offset` ensures all offsets are within range
let array = unsafe { data.build_unchecked() };

@alamb
Copy link
Contributor

alamb commented Mar 22, 2022

Looks good -- thank you @jackwener and @viirya

@alamb alamb merged commit 3ee2e67 into apache:master Mar 22, 2022
@viirya
Copy link
Member

viirya commented Mar 22, 2022

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of DictionaryArray::try_new()
4 participants