
parquet: Optimized ByteArrayReader, Add UTF-8 Validation (#1040) #1082

Merged
18 commits merged Jan 18, 2022

Conversation

tustvold
Contributor

@tustvold tustvold commented Dec 21, 2021

Draft, as this builds on #1054

Which issue does this PR close?

Adds an optimized ByteArrayReader as part of proving out the generics added in #1041, and as a precursor to #171.

This also adds UTF-8 validation and support for DELTA_BYTE_ARRAY, neither of which are currently supported.

Closes #786

Rationale for this change

Depending on the benchmark, this can be anywhere from approximately the same to significantly (2x) faster than the ArrowArrayReader implementation added in #384. This is largely down to slightly more efficient null padding and avoiding dynamic dispatch. The dominating factor in the benchmarks is the string value copy, which makes me optimistic about the returns #171 will yield.

I didn't benchmark the results for DELTA_BYTE_ARRAY encoding, but the returns are likely to be even more significant, as the layout is better suited to decoding.

The major benefit over the ArrowArrayReader implementation, aside from the speed bump, is the ability to share the existing ColumnReaderImpl and RecordReader logic, and the ability to work with all types of variable length strings and byte arrays.

This logic also forms the basis for #1180

What changes are included in this PR?

Adds a new ByteArrayReader that implements ArrayReader for variable length byte arrays

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow (Changes to the arrow crate) and parquet (Changes to the parquet crate) labels Dec 21, 2021
@codecov-commenter

codecov-commenter commented Dec 21, 2021

Codecov Report

Merging #1082 (22c090e) into master (e45d118) will increase coverage by 0.01%.
The diff coverage is 88.88%.

❗ Current head 22c090e differs from pull request most recent head c941606. Consider uploading reports for the commit c941606 to get more accurate results

@@            Coverage Diff             @@
##           master    #1082      +/-   ##
==========================================
+ Coverage   82.64%   82.65%   +0.01%     
==========================================
  Files         173      175       +2     
  Lines       50865    51512     +647     
==========================================
+ Hits        42037    42578     +541     
- Misses       8828     8934     +106     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader.rs 76.93% <37.03%> (-0.23%) ⬇️
parquet/src/column/reader.rs 68.80% <41.17%> (-1.09%) ⬇️
parquet/src/arrow/record_reader.rs 94.02% <60.00%> (-0.73%) ⬇️
arrow/src/array/array_string.rs 97.61% <66.66%> (ø)
parquet/src/arrow/array_reader/byte_array.rs 87.06% <87.06%> (ø)
parquet/src/arrow/arrow_reader.rs 91.86% <90.00%> (+0.25%) ⬆️
arrow/src/array/array_binary.rs 93.13% <92.30%> (-0.42%) ⬇️
parquet/src/arrow/array_reader/offset_buffer.rs 93.10% <93.10%> (ø)
parquet/src/arrow/levels.rs 84.56% <95.45%> (+0.28%) ⬆️
arrow/src/compute/kernels/take.rs 95.35% <97.95%> (+0.18%) ⬆️
... and 16 more


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e45d118...c941606.

@yordan-pavlov
Contributor

@tustvold this sounds exciting, would you be able to share some performance benchmark results?

@tustvold
Contributor Author

would you be able to share some performance benchmark results?

They're very preliminary at this stage; I'm not totally confident this code is correct, nor have I spent any time trying to optimise it, but here you go. My primary focus has been proving out the interface from #1041, not polishing up the specific optimisations yet.

"Old" is the new ByteArrayReader, "new" is the StringArrayReader

arrow_array_reader/read StringArray, plain encoded, mandatory, no NULLs - old                                                                            
                        time:   [110.98 us 111.00 us 111.03 us]
arrow_array_reader/read StringArray, plain encoded, mandatory, no NULLs - new                                                                            
                        time:   [124.77 us 124.99 us 125.32 us]
arrow_array_reader/read StringArray, plain encoded, optional, no NULLs - old                                                                            
                        time:   [122.15 us 122.17 us 122.20 us]
arrow_array_reader/read StringArray, plain encoded, optional, no NULLs - new                                                                            
                        time:   [136.72 us 136.76 us 136.81 us]
arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - old                                                                            
                        time:   [117.26 us 117.35 us 117.43 us]
arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - new                                                                            
                        time:   [258.05 us 258.17 us 258.30 us]
arrow_array_reader/read StringArray, dictionary encoded, mandatory, no NULLs - old                                                                            
                        time:   [145.30 us 145.35 us 145.41 us]
arrow_array_reader/read StringArray, dictionary encoded, mandatory, no NULLs - new                                                                            
                        time:   [117.14 us 117.18 us 117.22 us]
arrow_array_reader/read StringArray, dictionary encoded, optional, no NULLs - old                                                                            
                        time:   [159.07 us 159.11 us 159.15 us]
arrow_array_reader/read StringArray, dictionary encoded, optional, no NULLs - new                                                                            
                        time:   [134.39 us 134.41 us 134.43 us]
arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - old                                                                            
                        time:   [108.28 us 108.30 us 108.33 us]
arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - new                                                                            
                        time:   [230.15 us 230.23 us 230.32 us]

Aside from dictionary encoded columns with no nulls, it performs better. This is probably just something suboptimal in the way I'm decoding RLE data, and should be rectifiable.

@yordan-pavlov
Contributor

yordan-pavlov commented Dec 22, 2021

@tustvold what about performance with primitive types (e.g. int32)? This is where I have been struggling to make the ArrowArrayReader faster (compared to the old array reader implementation) for dictionary-encoded primitive values.

@tustvold
Contributor Author

tustvold commented Dec 22, 2021

what about performance with primitive types (e.g. int32)

This PR builds on #1054 which yields a 2-6x speed up when using PrimitiveArrayReader on non-nested columns compared to current master. This is purely through better null handling, which this PR also benefits from.

I do have some reservations about drawing too much from these benchmarks, as I have found them to have strange interactions with my system's memory allocator (see here), but it's certainly not slower and is likely significantly faster.

compared to old array reader implementation

That's the key thing about #1041: it doesn't replace this array reader implementation, it just adds the ability to extend it. For primitive types the performance of #1041 is therefore unchanged; it just gives the ability to add optimisations such as #1054 and this PR.

@yordan-pavlov
Contributor

yordan-pavlov commented Dec 22, 2021

@tustvold you are probably aware of this, but just to make sure it's not missed: when I run this branch with datafusion against a parquet file I get an error `Parquet argument error: Parquet error: unsupported encoding for byte array: PLAIN_DICTIONARY`

Other than that, the performance benchmark results look impressive - I was able to run the benchmark and this branch is faster than the ArrowArrayReader, sometimes several times faster, in almost all cases (exceptions listed below). And the ArrowArrayReader was already several times faster in many cases than the old array reader implementation, making these performance results even more impressive.

A major reason why I only implemented ArrowArrayReader for string arrays is that I have been struggling to make it faster for dictionary-encoded primitive arrays, but it looks like this isn't going to be a problem with this new implementation.
So if we can make it faster in all benchmarks, I am happy to abandon the ArrowArrayReader in favor of this new implementation.

Where it is still a bit slower is in these two cases:

read StringArray, plain encoded, mandatory, no NULLs - old: time: [306.10 us 342.14 us 377.28 us]
read StringArray, plain encoded, mandatory, no NULLs - new: time: [310.84 us 337.49 us 368.74 us]

read StringArray, dictionary encoded, mandatory, no NULLs - old: time: [286.61 us 320.07 us 354.74 us]
read StringArray, dictionary encoded, mandatory, no NULLs - new: time: [222.87 us 240.56 us 260.93 us]

The reason why ArrowArrayReader is fast in those cases, I suspect, is that when there are no nulls / def levels, the def level buffers are not read or processed at all (see https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L566). This also means that the code which produces the null bitmap doesn't run (see https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L595), so the main path is not concerned with null values at all, which is why it's so fast when there are no null / def levels (see https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L592, and the string converter at https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L1164).
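
In other words, the shape of that fast path is roughly the following. This is a minimal sketch with hypothetical names (read_batch, def_levels), not the actual ArrowArrayReader code:

// Sketch of the "no nulls" fast path described above (hypothetical names).
// When a column has no definition levels, every slot holds a value, so level
// decoding and null-bitmap construction can be skipped entirely.
fn read_batch(def_levels: Option<&[i16]>, values: Vec<String>) -> Vec<Option<String>> {
    match def_levels {
        // Fast path: no def levels means no nulls; values map 1:1 to output slots.
        None => values.into_iter().map(Some).collect(),
        // Slow path: interleave values with nulls according to the levels.
        Some(levels) => {
            let mut values = values.into_iter();
            levels
                .iter()
                .map(|&level| if level > 0 { values.next() } else { None })
                .collect()
        }
    }
}

fn main() {
    let out = read_batch(Some(&[1, 0, 1]), vec!["a".into(), "b".into()]);
    assert_eq!(out, [Some("a".to_string()), None, Some("b".to_string())]);
}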

@tustvold
Contributor Author

tustvold commented Jan 12, 2022

I've added UTF-8 validation, including @jorgecarleitao 's very helpful test case, so this should fix #786 also 🎉

fn try_push(&mut self, data: &[u8]) -> Result<()> {
fn try_push(&mut self, data: &[u8], validate_utf8: bool) -> Result<()> {
if validate_utf8 {
if let Err(e) = std::str::from_utf8(data) {
Contributor

I wonder if something like https://github.com/rusticstuff/simdutf8 could be used for faster UTF8 validation
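
For reference, a drop-in swap might look roughly like this. This is a sketch only; it assumes simdutf8 is added as a dependency and uses its basic API, which does not report the byte offset of the failure:

// Hypothetical sketch: replacing std validation with simdutf8.
// simdutf8::basic::from_utf8 mirrors std::str::from_utf8's success/failure
// behaviour but is SIMD-accelerated and omits error position details.
fn validate_utf8(data: &[u8]) -> Result<(), String> {
    simdutf8::basic::from_utf8(data)
        .map(|_| ())
        .map_err(|_| "encountered non UTF-8 data".to_string())
}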

Contributor Author

Definitely something to look into. It would also be interesting to see if it is faster to validate the entire string buffer and do codepoint validation at the offsets separately, or to validate each individual string as is done here. I'm not honestly sure which will be faster

Contributor Author
@tustvold tustvold Jan 16, 2022

So I did some experimentation:

It is significantly faster to verify on push that the first byte is a valid UTF-8 start codepoint, and then do UTF-8 validation on the larger buffer in one go; this takes the performance hit on PLAIN encoded strings down to ~1.1x from ~2x. I have modified the code to do this.

With this optimisation applied, changing to simdutf8 made only a very minor ~6% improvement on PLAIN encoded strings, which reduced to no appreciable difference with RLE encoded strings. This may be my machine, or the lack of non-ASCII characters in the input, but I'm going to leave this out for now.
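
Reduced to a standalone sketch, the batched-validation approach looks roughly like this (plain Vecs instead of the crate's ScalarBuffer, and a plain String error for illustration):

// Cheap per-value boundary check on push, then a single UTF-8 validation
// pass over the whole concatenated values buffer.
#[derive(Default)]
struct Utf8OffsetBuffer {
    values: Vec<u8>,
    offsets: Vec<usize>,
}

impl Utf8OffsetBuffer {
    fn try_push(&mut self, data: &[u8]) -> Result<(), String> {
        if let Some(&b) = data.first() {
            // Reject values that start on a UTF-8 continuation byte
            // (0b10xxxxxx); full validation of the buffer happens later.
            if (b as i8) < -0x40 {
                return Err("string starts on a UTF-8 continuation byte".into());
            }
        }
        self.values.extend_from_slice(data);
        self.offsets.push(self.values.len());
        Ok(())
    }

    // One validation pass over the entire buffer, instead of validating each
    // value individually on push.
    fn check_valid_utf8(&self) -> Result<(), String> {
        std::str::from_utf8(&self.values)
            .map(|_| ())
            .map_err(|e| format!("encountered non UTF-8 data: {}", e))
    }
}

fn main() {
    let mut buf = Utf8OffsetBuffer::default();
    buf.try_push("hello".as_bytes()).unwrap();
    buf.try_push("wörld".as_bytes()).unwrap();
    buf.check_valid_utf8().unwrap();
}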

Contributor
@Dandandan Dandandan Jan 16, 2022

I think we tried to do something similar with parquet2 but concluded that the individual strings should be checked instead. simdutf8 is more impressive at checking non ASCII strings btw (e.g. try Chinese or emojis)
Checking the code points at the offsets seems an interesting approach!
Also FYI @jorgecarleitao

Contributor Author

I think this should be sufficient, but I'm not an expert on UTF-8. My reasoning is that when you slice a str, all it validates is that the start and end offsets pass std::str::is_char_boundary - here. Assuming the standard library is correct, and that the only invariant of str is that the bytes as a whole are valid UTF-8, I think this is no different?

@@ -273,7 +274,7 @@ fn build_dictionary_encoded_string_page_iterator(
InMemoryPageIterator::new(schema, column_desc, pages)
}

fn bench_array_reader(mut array_reader: impl ArrayReader) -> usize {
fn bench_array_reader(mut array_reader: Box<dyn ArrayReader>) -> usize {
Contributor Author

This change is necessary because byte_array_reader hides its implementing type; this is both to make the API more ergonomic for clients and to aid future crate evolution.

@@ -368,10 +366,10 @@ fn add_benches(c: &mut Criterion) {
mandatory_int32_column_desc.clone(),
);
count = bench_array_reader(array_reader);
})
});
assert_eq!(count, EXPECTED_VALUE_COUNT);
Contributor Author

This change allows for running a subset of the benchmarks; without it the assertion fails if the bench function is filtered out.

For example, this would run just the string array benchmarks

cargo criterion --bench arrow_array_reader --features test_common,experimental -- StringArray

arrow_type,
)?))
PhysicalType::BYTE_ARRAY => match arrow_type {
// TODO: Replace with optimised dictionary reader (#171)
Contributor Author

See #1180

/// - `num_levels` - the number of levels contained within the page, i.e. values including nulls
/// - `num_values` - the number of non-null values contained within the page (V2 page only)
///
/// Note: data encoded with [`Encoding::RLE`] may not know its exact length, as the final
Contributor Author

I wanted to be explicit about this to avoid a resurgence of this style of bug - #1111

This is a crate-private API, and the necessary null counting dance is performed by RecordReader, but I wanted to call it out for the avoidance of confusion.

@@ -968,4 +969,40 @@ mod tests {
assert_eq!(batch.num_rows(), 4);
assert_eq!(batch.column(0).data().null_count(), 2);
}

#[test]
fn test_invalid_utf8() {
Contributor Author

Test sourced from #786

@alamb alamb changed the title Optimized ByteArrayReader (#1040) parquet: Optimized ByteArrayReader (#1040) Jan 17, 2022
@alamb alamb changed the title parquet: Optimized ByteArrayReader (#1040) parquet: Optimized ByteArrayReader, UTF-8 Validation (#1040) Jan 17, 2022
@alamb alamb changed the title parquet: Optimized ByteArrayReader, UTF-8 Validation (#1040) parquet: Optimized ByteArrayReader, Add UTF-8 Validation (#1040) Jan 17, 2022
Contributor
@alamb alamb left a comment

I went through this PR pretty thoroughly -- and while I am not anywhere near as much of an expert, it is my judgement that this is ready to merge.

What is the plan for the ArrowArrayReader implementation added in #384? Should we plan to remove it from this crate? (If so I can file a ticket.)

Thank you very much @tustvold, and thank you @yordan-pavlov for the effort in reviewing.

Any remaining thoughts or people who want to comment prior to merging?


/// A buffer of variable-sized byte arrays that can be converted into
/// a corresponding [`ArrayRef`]
pub struct OffsetBuffer<I: ScalarValue> {
Contributor

I almost wonder if this is valuable itself to put into the arrow crate and use to create GenericStringArrays from iterators of &str etc. Not for this PR, I am just musing

Contributor Author

I thought similar; lifting this and ScalarBuffer into arrow-rs would likely remove a non-trivial amount of unsafe.

if let Some(&b) = data.first() {
// A valid code-point iff it does not start with 0b10xxxxxx
// Bit-magic taken from `std::str::is_char_boundary`
if (b as i8) < -0x40 {
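
For reference, the bit trick works because UTF-8 continuation bytes are exactly 0x80..=0xBF, which reinterpret as i8 to -128..=-65 (strictly less than -0x40), while ASCII bytes are non-negative and lead bytes 0xC0..=0xFF map to -64..=-1. A quick standalone check, not part of the diff:

// Verifies that `(b as i8) < -0x40` holds exactly for UTF-8 continuation
// bytes, i.e. bytes of the form 0b10xxxxxx.
fn main() {
    for b in 0u8..=u8::MAX {
        let is_continuation = (b & 0b1100_0000) == 0b1000_0000;
        assert_eq!((b as i8) < -0x40, is_continuation);
    }
}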

/// UTF-8. This should be done by calling [`Self::values_as_str`] after
/// all data has been written
pub fn try_push(&mut self, data: &[u8], validate_utf8: bool) -> Result<()> {
if validate_utf8 {
Contributor

For anyone else following along, I double checked the code and validate_utf8 is disabled for DataType::Binary, as one would expect. It is always enabled for DataType::Utf8.
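
A minimal sketch of that rule (a hypothetical helper; in the crate the decision is made where the reader is constructed from the Arrow type):

use arrow::datatypes::DataType;

// Hypothetical helper mirroring the behaviour noted above: UTF-8 validation
// applies to string types, never to binary types.
fn needs_utf8_validation(data_type: &DataType) -> bool {
    matches!(data_type, DataType::Utf8 | DataType::LargeUtf8)
}

fn main() {
    assert!(needs_utf8_validation(&DataType::Utf8));
    assert!(!needs_utf8_validation(&DataType::Binary));
}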

type Slice = Self;

fn split_off(&mut self, len: usize) -> Self::Output {
let remaining_offsets = self.offsets.len() - len - 1;
Contributor

recommend an assert here that self.offsets.len() > len for clarity, but I think that the offsets[len] would panic below if this were not the case, so I don't think it is a safety issue

Contributor Author

Added

let mut new_offsets = ScalarBuffer::new();
new_offsets.reserve(remaining_offsets + 1);
for v in &offsets[len..] {
new_offsets.push(*v - end_offset)
Contributor

nice
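
To make the offset rebasing concrete, here is a toy standalone version over plain Vecs (not the actual ScalarBuffer code, and the ownership handling is simplified): splitting after `len` values takes the first `end_offset` bytes and shifts the remaining offsets back to zero.

// Toy illustration of splitting an offset buffer after `len` values: the head
// keeps offsets[..=len] and values[..end_offset]; the tail keeps the remaining
// bytes with its offsets rebased to start at zero.
fn split_off(offsets: &mut Vec<i32>, values: &mut Vec<u8>, len: usize) -> (Vec<i32>, Vec<u8>) {
    assert!(offsets.len() > len);
    let end_offset = offsets[len];

    let head_offsets = offsets[..=len].to_vec();
    let head_values = values[..end_offset as usize].to_vec();

    // Rebase the remaining offsets so the tail is self-contained.
    let tail: Vec<i32> = offsets[len..].iter().map(|&o| o - end_offset).collect();
    *offsets = tail;
    values.drain(..end_offset as usize);

    (head_offsets, head_values)
}

fn main() {
    // Three values "ab", "c", "def" -> offsets [0, 2, 3, 6]
    let mut offsets = vec![0, 2, 3, 6];
    let mut values = b"abcdef".to_vec();

    let (head_offsets, head_values) = split_off(&mut offsets, &mut values, 2);
    assert_eq!(head_offsets, [0, 2, 3]); // "ab", "c"
    assert_eq!(head_values, b"abc");
    assert_eq!(offsets, [0, 3]); // tail holds "def", rebased
    assert_eq!(values, b"def");
}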


Self {
offsets: std::mem::replace(&mut self.offsets, new_offsets),
values: self.values.take(end_offset.to_usize().unwrap()),
Contributor

I found it a little confusing that values.take() does the same thing as split_off -- maybe it is worth renaming ScalarBuffer<T>::take() to ScalarBuffer<T>::split_off()?

Contributor Author

This is a backwards compatibility thing: ScalarBuffer::Output must be Buffer to avoid changing the API of ColumnReaderImpl. Perhaps this could be included in a future breaking change cleanup PR 🤔


let values_range = read_offset..read_offset + values_read;
for (value_pos, level_pos) in values_range.clone().rev().zip(rev_position_iter) {
assert!(level_pos >= value_pos);
Contributor

this is definitely a tricky bit of logic, looks reasonable to me

Contributor Author

There was much wailing and gnashing of teeth in its creation 😅
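
For readers trying to follow it, the underlying idea is back-to-front null expansion: the decoded values sit densely at the front of the buffer, and each is moved to its final (larger) index so nulls can be filled in without overwriting values that have not been moved yet. A toy standalone version (hypothetical, not the RecordReader/OffsetBuffer code):

// Toy illustration of in-place back-to-front null padding. `values[..values_read]`
// holds the densely packed non-null values; `valid` marks which output slots
// are non-null. Walking backwards lets each value be moved to its final slot
// without clobbering values that still need to be moved.
fn pad_nulls(values: &mut Vec<String>, values_read: usize, valid: &[bool]) {
    let levels_read = valid.len();
    values.resize(levels_read, String::new()); // "" stands in for null here

    let mut value_pos = values_read;
    for level_pos in (0..levels_read).rev() {
        if valid[level_pos] {
            value_pos -= 1;
            assert!(level_pos >= value_pos);
            values.swap(level_pos, value_pos);
        }
    }
}

fn main() {
    let mut values = vec!["a".to_string(), "b".to_string()];
    let valid = [true, false, true, false];
    pad_nulls(&mut values, 2, &valid);
    assert_eq!(values, ["a", "", "b", ""]);
}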

}

#[test]
fn test_byte_array_decoder() {
Contributor

Is NULL covered anywhere? If not I think that might be valuable to cover here too

Contributor Author
@tustvold tustvold Jan 17, 2022

The null padding is technically handled and tested as part of OffsetBuffer, but I'll add something here

Edit: added

@tustvold
Contributor Author

What is the plan for the ArrowArrayReader implementation

I don't think there is a particular reason for it to stay, but I defer the final decision to @yordan-pavlov

@yordan-pavlov
Contributor

What is the plan for the ArrowArrayReader implementation added in #384? Should we plan to remove it from this crate (if so I can file a ticket)

I am happy for ArrowArrayReader to be removed - I have run the benchmarks against the latest code and @tustvold's work is now several times faster in almost all cases, and in the one or two cases where it isn't, the difference is small - congratulations @tustvold! Plus, I think @tustvold's array reader could be made faster still.

@alamb
Contributor

alamb commented Jan 18, 2022

I also ran the tests from the latest master branch of datafusion against this branch and they all passed. Not that it is the most thorough coverage of the parquet format, but it adds some.

👍

@alamb
Contributor

alamb commented Jan 18, 2022

#1197 tracks ArrowArrayReader removal

Labels
arrow (Changes to the arrow crate), parquet (Changes to the parquet crate), performance, security

Successfully merging this pull request may close these issues.

Soundness: reading parquet with invalid utf8 results in UB