parquet: Optimized ByteArrayReader, Add UTF-8 Validation (#1040) #1082
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1082      +/-   ##
==========================================
+ Coverage   82.64%   82.65%   +0.01%
==========================================
  Files         173      175       +2
  Lines       50865    51512     +647
==========================================
+ Hits        42037    42578     +541
- Misses       8828     8934     +106
Continue to review full report at Codecov.
@tustvold this sounds exciting, would you be able to share some performance benchmark results?
They're very preliminary at this stage; I'm not totally confident this code is correct, nor have I spent any time trying to optimise it, but here you go. My primary focus has been proving out the interface from #1041, not polishing up the specific optimisations yet. "Old" is the new ByteArrayReader, "new" is the StringArrayReader.
Aside from dictionary encoded columns with no nulls, it performs better. This is probably just something suboptimal in the way I'm decoding RLE data, and should be rectifiable.
@tustvold what about performance with primitive types (e.g. int32)? This is where I have been struggling to make the ArrowArrayReader faster.
This PR builds on #1054, which yields a 2-6x speed up when using PrimitiveArrayReader on non-nested columns compared to current master. This is purely through better null handling, which this PR also benefits from. I do have some reservations about drawing too much from these benchmarks, as I have found them to have strange interactions with my system's memory allocator (see here), but it's certainly not slower and is likely significantly faster.
That's the key thing about #1041: it doesn't replace this array reader implementation, it just adds the ability to extend it. For primitive types the performance of #1041 is therefore unchanged; it just gives the ability to add optimisations such as #1054 and this PR.
@tustvold you are probably aware of this, but just to make sure it's not missed: when I run this branch with datafusion against a parquet file I get an error.
Other than that, the performance benchmark results look impressive. I was able to run the benchmark and this branch is faster than the ArrowArrayReader in most cases. Where it is still a bit slower is in these two cases:
read StringArray, plain encoded, mandatory, no NULLs - old: time: [306.10 us 342.14 us 377.28 us]
read StringArray, dictionary encoded, mandatory, no NULLs - old: time: [286.61 us 320.07 us 354.74 us]
56c3bdc to 578ca91
I've added UTF-8 validation, including @jorgecarleitao's very helpful test case, so this should fix #786 also 🎉
fn try_push(&mut self, data: &[u8]) -> Result<()> {
fn try_push(&mut self, data: &[u8], validate_utf8: bool) -> Result<()> {
    if validate_utf8 {
        if let Err(e) = std::str::from_utf8(data) {
I wonder if something like https://github.com/rusticstuff/simdutf8 could be used for faster UTF-8 validation
Definitely something to look into. It would also be interesting to see if it is faster to validate the entire string buffer and do codepoint validation at the offsets separately, or to validate each individual string as is done here. I'm honestly not sure which will be faster.
So I did some experimentation:
It is significantly faster to verify on push that the first byte is a valid start of a UTF-8 codepoint, and then do UTF-8 validation on the larger buffer in one go; this brings the performance hit on PLAIN encoded strings down to ~1.1x from ~2x. I have modified the code to do this.
With this optimisation applied, changing to simdutf8 made only a very minor ~6% improvement on PLAIN encoded strings, which reduced to no appreciable difference with RLE encoded strings. This may be my machine, or the lack of non-ASCII characters in the input, but I'm going to leave this out for now.
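For anyone following along, here is a minimal, hypothetical sketch of the two-step scheme described above (not the PR's actual code; names and buffer types are illustrative): each pushed value is only checked for not starting with a UTF-8 continuation byte, and full validation runs once over the concatenated buffer at the end.

```rust
/// Push a value, rejecting anything that starts with a UTF-8 continuation byte
/// (0b10xxxxxx). Full UTF-8 validation is deferred to `finish`.
fn push_checked(values: &mut Vec<u8>, offsets: &mut Vec<usize>, data: &[u8]) -> Result<(), String> {
    if let Some(&b) = data.first() {
        // Same bit trick as `std::str::is_char_boundary`: continuation bytes,
        // reinterpreted as i8, are all < -0x40.
        if (b as i8) < -0x40 {
            return Err("value starts inside a multi-byte character".to_string());
        }
    }
    values.extend_from_slice(data);
    offsets.push(values.len());
    Ok(())
}

/// Validate the whole concatenated buffer in a single pass once all values are written.
fn finish(values: &[u8]) -> Result<&str, String> {
    std::str::from_utf8(values).map_err(|e| e.to_string())
}
```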
I think we tried to do something similar with parquet2 but concluded that the individual strings should be checked instead. simdutf8 is more impressive at checking non-ASCII strings btw (e.g. try Chinese or emojis).
Checking the code points at the offsets seems an interesting approach!
Also FYI @jorgecarleitao
I think this should be sufficient, but I'm not an expert on UTF-8. My reasoning is that when you slice a str, all it validates is that the start and end offsets pass std::str::is_char_boundary (see here). Taking it that the standard library is correct, and that the only invariant of str is that the bytes are UTF-8 as a whole, I think this is no different?
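A tiny standalone example (not from the PR) of the reasoning above: once the concatenated buffer is known to be valid UTF-8, a per-value check reduces to verifying that each offset is a char boundary, which is exactly what slicing itself would verify.

```rust
fn main() {
    // Two values, "héllo" and "wörld", concatenated into one buffer.
    let buffer = "héllowörld";
    let offsets = [0usize, 6, buffer.len()]; // 'é' and 'ö' are two bytes each

    // Whole-buffer validation already happened (it is a str literal here);
    // per-value validation reduces to a boundary check at each offset.
    assert!(offsets.iter().all(|&o| buffer.is_char_boundary(o)));

    // Slicing at valid boundaries yields valid strings...
    assert_eq!(&buffer[offsets[0]..offsets[1]], "héllo");
    assert_eq!(&buffer[offsets[1]..offsets[2]], "wörld");

    // ...whereas byte 2 sits inside 'é' and is not a valid boundary.
    assert!(!buffer.is_char_boundary(2));
}
```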
UTF-8 Validation (apache#786)
8572713 to 585d0f5
Review feedback
@@ -273,7 +274,7 @@ fn build_dictionary_encoded_string_page_iterator(
    InMemoryPageIterator::new(schema, column_desc, pages)
}

fn bench_array_reader(mut array_reader: impl ArrayReader) -> usize {
fn bench_array_reader(mut array_reader: Box<dyn ArrayReader>) -> usize {
This change is necessary because byte_array_reader hides its implementing type; this is both to make the API more ergonomic for clients and to aid future crate evolution.
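In other words, something along these lines (a hypothetical sketch, not the crate's actual signatures): the constructor hands back a Box<dyn ArrayReader>, so the concrete reader stays private and can be changed or replaced without touching callers. This is also why the benchmark helper above now has to accept Box<dyn ArrayReader> rather than impl ArrayReader.

```rust
pub trait ArrayReader {
    fn next_batch(&mut self, batch_size: usize) -> usize;
}

// Private concrete type; callers never name it.
struct ByteArrayReader {
    remaining: usize,
}

impl ArrayReader for ByteArrayReader {
    fn next_batch(&mut self, batch_size: usize) -> usize {
        let read = batch_size.min(self.remaining);
        self.remaining -= read;
        read
    }
}

// The only public entry point returns the trait object.
pub fn make_byte_array_reader(total_values: usize) -> Box<dyn ArrayReader> {
    Box::new(ByteArrayReader { remaining: total_values })
}
```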
@@ -368,10 +366,10 @@ fn add_benches(c: &mut Criterion) {
        mandatory_int32_column_desc.clone(),
    );
    count = bench_array_reader(array_reader);
})
});
assert_eq!(count, EXPECTED_VALUE_COUNT);
This change allows running a subset of the benchmarks; without it, the assertion fails if the bench function is filtered out.
For example, this would run just the string array benchmarks:
cargo criterion --bench arrow_array_reader --features test_common,experimental -- StringArray
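A hedged illustration of why the assertion has to live inside the closure (a generic Criterion example, not the benchmark in this PR): when a name filter skips a benchmark, the closure passed to bench_function never runs, so an assertion placed outside it would see a never-updated counter and fail.

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_sum(c: &mut Criterion) {
    let mut count = 0usize;
    c.bench_function("sum_0_to_99", |b| {
        b.iter(|| {
            count = (0..100u32).sum::<u32>() as usize;
        });
        // Only executed when this benchmark is actually selected, so a filtered
        // run (e.g. `-- StringArray`) never trips over it.
        assert_eq!(count, 4950);
    });
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);
```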
arrow_type,
)?))
PhysicalType::BYTE_ARRAY => match arrow_type {
    // TODO: Replace with optimised dictionary reader (#171)
See #1180
/// - `num_levels` - the number of levels contained within the page, i.e. values including nulls
/// - `num_values` - the number of non-null values contained within the page (V2 page only)
///
/// Note: data encoded with [`Encoding::RLE`] may not know its exact length, as the final
I wanted to be explicit about this to avoid a resurgence of this style of bug - #1111
This is a crate-private API, and the necessary null counting dance is performed by RecordReader, but I wanted to call it out for the avoidance of confusion.
@@ -968,4 +969,40 @@ mod tests {
    assert_eq!(batch.num_rows(), 4);
    assert_eq!(batch.column(0).data().null_count(), 2);
}

#[test]
fn test_invalid_utf8() {
Test sourced from #786
I went through this PR pretty thoroughly, and while I am not anywhere near as much of an expert, it is my judgement that this is ready to merge.
What is the plan for the ArrowArrayReader implementation added in #384? Should we plan to remove it from this crate (if so I can file a ticket)?
Thank you very much @tustvold and for the effort in reviewing @yordan-pavlov.
Any remaining thoughts or people who want to comment prior to merging?
/// A buffer of variable-sized byte arrays that can be converted into
/// a corresponding [`ArrayRef`]
pub struct OffsetBuffer<I: ScalarValue> {
I almost wonder if this is valuable enough in itself to put into the arrow crate and use to create GenericStringArrays from iterators of &str etc. Not for this PR, I am just musing.
I thought similar; lifting this and ScalarBuffer into arrow-rs would likely remove a non-trivial amount of unsafe.
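For the sake of illustration, a stripped-down, hypothetical sketch of what such a buffer boils down to (plain Vecs, not the crate's ScalarBuffer): i32 offsets plus one contiguous value buffer, which is the layout a GenericStringArray needs, so building it safely from an iterator of &str is straightforward.

```rust
struct SimpleOffsetBuffer {
    offsets: Vec<i32>, // always starts at 0, one extra entry per value
    values: Vec<u8>,   // concatenated value bytes
}

impl SimpleOffsetBuffer {
    fn new() -> Self {
        Self { offsets: vec![0], values: Vec::new() }
    }

    fn push(&mut self, value: &str) {
        self.values.extend_from_slice(value.as_bytes());
        self.offsets.push(self.values.len() as i32);
    }
}

impl<'a> FromIterator<&'a str> for SimpleOffsetBuffer {
    fn from_iter<T: IntoIterator<Item = &'a str>>(iter: T) -> Self {
        let mut buffer = Self::new();
        for value in iter {
            buffer.push(value);
        }
        buffer
    }
}

// Usage: let buffer: SimpleOffsetBuffer = ["a", "bc", "def"].into_iter().collect();
```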
if let Some(&b) = data.first() {
    // A valid code-point iff it does not start with 0b10xxxxxx
    // Bit-magic taken from `std::str::is_char_boundary`
    if (b as i8) < -0x40 {
/// UTF-8. This should be done by calling [`Self::values_as_str`] after
/// all data has been written
pub fn try_push(&mut self, data: &[u8], validate_utf8: bool) -> Result<()> {
    if validate_utf8 {
For anyone else following along, I double checked the code and validate_utf8 is disabled for DataType::Binary as one would expect. It is always enabled for DataType::Utf8.
type Slice = Self;

fn split_off(&mut self, len: usize) -> Self::Output {
    let remaining_offsets = self.offsets.len() - len - 1;
Recommend an assert here that self.offsets.len() > len for clarity, but I think that the offsets[len] would panic below if this were not the case, so I don't think it is a safety issue.
Added
let mut new_offsets = ScalarBuffer::new();
new_offsets.reserve(remaining_offsets + 1);
for v in &offsets[len..] {
    new_offsets.push(*v - end_offset)
nice
Self {
    offsets: std::mem::replace(&mut self.offsets, new_offsets),
    values: self.values.take(end_offset.to_usize().unwrap()),
I found it a little confusing that values.take() does the same thing as split_off -- maybe it is worth renaming ScalarBuffer<T>::take() to ScalarBuffer<T>::split_off()?
This is a backwards compatibility thing: ScalarBuffer::Output must be Buffer to avoid changing the API of ColumnReaderImpl. Perhaps this could be included in a future breaking change cleanup PR 🤔
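To make the behaviour under discussion easier to follow, here is a simplified, hypothetical rendering of split_off using plain Vecs (not the PR's ScalarBuffer code), including the assert suggested above: the first len values are split off and returned, and the retained offsets are rebased so they start at 0 again.

```rust
struct SimpleOffsetBuffer {
    offsets: Vec<i32>, // offsets[0] == 0, one extra entry per value
    values: Vec<u8>,
}

impl SimpleOffsetBuffer {
    /// Split off the first `len` values, leaving the remainder in `self`.
    fn split_off(&mut self, len: usize) -> SimpleOffsetBuffer {
        assert!(self.offsets.len() > len, "cannot split off more values than are buffered");
        let end_offset = self.offsets[len];

        // Rebase the offsets that stay behind so the first retained value starts at 0.
        let mut new_offsets = Vec::with_capacity(self.offsets.len() - len);
        for v in &self.offsets[len..] {
            new_offsets.push(*v - end_offset);
        }

        // Keep the tail of the value bytes, hand the head to the returned buffer.
        let remaining_values = self.values.split_off(end_offset as usize);
        let mut head_offsets = std::mem::replace(&mut self.offsets, new_offsets);
        head_offsets.truncate(len + 1);

        SimpleOffsetBuffer {
            offsets: head_offsets,
            values: std::mem::replace(&mut self.values, remaining_values),
        }
    }
}
```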
let values_range = read_offset..read_offset + values_read;
for (value_pos, level_pos) in values_range.clone().rev().zip(rev_position_iter) {
    assert!(level_pos >= value_pos);
this is definitely a tricky bit of logic, looks reasonable to me
There was much wailing and gnashing of teeth in its creation 😅
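A much-simplified, hypothetical illustration of the idea behind that loop (Option-based, nothing like the production buffers): values are read contiguously into the front of the output, then walked backwards alongside the definition levels so each value can be moved to its final, null-padded slot without overwriting anything that has not been moved yet. The level_pos >= value_pos invariant from the assert holds because a value's final position can never be earlier than its packed position.

```rust
/// `slots` holds `values_read` packed values at the front and has the same length
/// as `def_levels`; entries at or past `values_read` are placeholders.
fn pad_nulls(slots: &mut [Option<i32>], values_read: usize, def_levels: &[i16], max_def: i16) {
    assert_eq!(slots.len(), def_levels.len());
    let mut value_pos = values_read;
    for level_pos in (0..def_levels.len()).rev() {
        if def_levels[level_pos] == max_def {
            // Non-null slot: move the next (from the back) packed value into place.
            value_pos -= 1;
            assert!(level_pos >= value_pos);
            slots[level_pos] = slots[value_pos].take();
        } else {
            slots[level_pos] = None;
        }
    }
}

// e.g. packed values [1, 2] with definition levels [1, 0, 1] and max_def 1
// become [Some(1), None, Some(2)].
```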
}

#[test]
fn test_byte_array_decoder() {
Is NULL covered anywhere? If not I think that might be valuable to cover here too
The null padding is technically handled and tested as part of OffsetBuffer, but I'll add something here
Edit: added
I don't think there is a particular reason for it to stay, but I defer the final decision to @yordan-pavlov
I am happy for the ArrowArrayReader to be removed.
I also ran the tests from the latest master branch of datafusion against this branch and they all passed. Not that it is the most thorough coverage of the parquet format, but it adds some. 👍
#1197 tracks ArrowArrayReader removal |
Draft, as this builds on #1054.

Which issue does this PR close?
Adds an optimized ByteArrayReader as part of proving out the generics added in #1041, and as a precursor to #171.
This also adds UTF-8 validation and support for DELTA_BYTE_ARRAY, neither of which is currently supported.

Closes #786
Rationale for this change
Depending on the benchmark, this can be anything from approximately the same to significantly (2x) faster than the ArrowArrayReader implementation added in #384. This is largely down to slightly more efficient null padding, and avoiding dynamic dispatch. The dominating factor in the benchmarks is the string value copy, which makes me optimistic about the returns #171 will yield.
I didn't benchmark the results for DELTA_BYTE_ARRAY encoding, but the returns are likely to be even more significant, as the layout is more optimal for decoding (see the sketch below).

The major benefit over the ArrowArrayReader implementation, aside from the speed bump, is the ability to share the existing ColumnReaderImpl and RecordReader logic, and the ability to work with all types of variable length strings and byte arrays.
This logic also forms the basis for #1180
What changes are included in this PR?
Adds a new ByteArrayReader that implements ArrayReader for variable length byte arrays.

Are there any user-facing changes?
No