PARQUET-642: Improve performance of ByteBuffer based read / write paths #347
While trying out the newest Parquet version, we noticed that the changes that introduced ByteBuffers, 6b605a4 and 6b24a1d (mostly Avro, plus a couple of ByteBuffer changes), slowed our jobs down:
- Read overhead: 4-6% (MB_Millis)
- Write overhead: 6-10% (MB_Millis)
This seems to be due to the encoding / decoding of Strings in the Binary class (see the sketch below):
- toStringUsingUTF8() - for reads
- encodeUTF8() - for writes
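To illustrate the direction of the change, here is a minimal sketch, assuming the optimization is to prefer String's built-in UTF-8 coder over the java.nio CharsetDecoder/CharsetEncoder machinery; the class and method bodies below are simplified illustrations, not the actual patch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Simplified sketch: String's built-in UTF-8 coder is typically faster
// than routing through a shared CharsetDecoder/CharsetEncoder.
public final class Utf8Sketch {

  // Read path: decode the buffer's bytes into a String.
  static String toStringUsingUTF8(ByteBuffer buffer) {
    if (buffer.hasArray()) {
      // Array-backed buffers can feed the String(byte[], ...) constructor
      // directly, which JVMs optimize heavily.
      return new String(buffer.array(),
          buffer.arrayOffset() + buffer.position(),
          buffer.remaining(),
          StandardCharsets.UTF_8);
    }
    // Direct buffers: copy the bytes out, then decode.
    byte[] bytes = new byte[buffer.remaining()];
    buffer.duplicate().get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Write path: encode a String into bytes.
  static byte[] encodeUTF8(String value) {
    // String.getBytes(Charset) avoids allocating a CharBuffer and an
    // explicit CharsetEncoder per call.
    return value.getBytes(StandardCharsets.UTF_8);
  }
}
```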
With these changes we see around a 5% improvement in MB_Millis when running the job on our Hadoop cluster.
I've added some microbenchmark details to the JIRA.
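The exact numbers live on the JIRA; as a rough idea of how such a comparison can be measured, a minimal JMH harness along these lines would work (the class name and test string are hypothetical, not the benchmark attached to the JIRA):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class Utf8DecodeBenchmark {
  private static final Charset UTF8 = StandardCharsets.UTF_8;
  private final byte[] bytes =
      "a reasonably long test string to make the decode cost visible"
          .getBytes(StandardCharsets.UTF_8);

  @Benchmark
  public String decodeViaCharset() {
    // Charset.decode allocates a fresh CharBuffer on every call.
    return UTF8.decode(ByteBuffer.wrap(bytes)).toString();
  }

  @Benchmark
  public String decodeViaStringConstructor() {
    // String's internal decoder skips the intermediate CharBuffer.
    return new String(bytes, UTF8);
  }
}
```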
Note that I've left the behavior the same for the Avro write path: it still uses CharSequence and the Charset-based encoders.
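For completeness, the Charset-based path looks roughly like this; a CharSequence has no getBytes(Charset), so encoding it naturally goes through a CharsetEncoder (the class and method names here are illustrative, not the actual Avro code):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public final class CharSequenceEncodeSketch {
  // CharBuffer.wrap accepts any CharSequence without first copying it
  // into a String, which is why this path suits the Avro writer.
  static ByteBuffer encode(CharSequence value) throws CharacterCodingException {
    return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
  }
}
```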