Conversation

@piyushnarang

While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers, 6b605a4 and 6b24a1d (mostly Avro, but with a couple of ByteBuffer changes), caused our jobs to slow down a bit.

Read overhead: 4-6% (in MB_Millis)
Write overhead: 6-10% (in MB_Millis)

This seems to be due to the encoding/decoding of Strings in the Binary class:
toStringUsingUTF8() - for reads
encodeUTF8() - for writes

With these changes we see around a 5% improvement in MB_Millis while running the job on our Hadoop cluster.

Added some microbenchmark details to the JIRA.

Note that I've left the behavior the same for the Avro write path - it still uses CharSequence and the Charset-based encoders.
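
For illustration, here's a minimal sketch of the two approaches - plain JDK calls with made-up helper names, not the actual parquet-mr Binary code:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    // Charset/CharsetDecoder-based decode, roughly the style the old
    // toStringUsingUTF8() path relied on.
    static String decodeWithCharsetDecoder(ByteBuffer bytes) throws CharacterCodingException {
        CharBuffer chars = StandardCharsets.UTF_8.newDecoder().decode(bytes.duplicate());
        return chars.toString();
    }

    // String-based decode: new String(byte[], Charset) goes through the
    // JDK's optimized internal paths and skips the intermediate CharBuffer.
    static String decodeWithString(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    // String-based encode for the write path.
    static byte[] encodeWithString(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = encodeWithString("parquet");
        System.out.println(decodeWithString(utf8, 0, utf8.length));
        System.out.println(decodeWithCharsetDecoder(ByteBuffer.wrap(utf8)));
    }
}
```

The String constructor and String.getBytes(Charset) hit the JDK's optimized encode/decode paths, which is roughly where the savings above come from.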

@piyushnarang
Author

@rdblue - please take a look when you get the time.

@julienledem
Member

+1 LGTM

@asfgit closed this in 7f8e952 on Jun 30, 2016
@piyushnarang
Author

Thanks @julienledem :-)

rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers, apache@6b605a4 and apache@6b24a1d (mostly Avro, but with a couple of ByteBuffer changes), caused our jobs to slow down a bit.

Read overhead: 4-6% (in MB_Millis)
Write overhead: 6-10% (in MB_Millis)

This seems to be due to the encoding/decoding of Strings in the [Binary class](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
[toStringUsingUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L388) - for reads
[encodeUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L236) - for writes

With these changes we see around a 5% improvement in MB_Millis while running the job on our Hadoop cluster.

Added some microbenchmark details to the JIRA.

Note that I've left the behavior the same for the Avro write path - it still uses CharSequence and the Charset-based encoders.

Author: Piyush Narang <pnarang@twitter.com>

Closes apache#347 from piyushnarang/bytebuffer-encoding-fix-pr and squashes the following commits:

43c5bdd [Piyush Narang] Keep avro on char sequence
2d50c8c [Piyush Narang] Update Binary approach
9e58237 [Piyush Narang] Proof of concept fixes

Conflicts:
    parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java
    parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
Resolution:
    Use String encoding/decoding where possible.
    Updated Avro to use fromCharSequence to avoid two copies
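
As a rough illustration of the "avoid two copies" note above (plain JDK calls with assumed helper names, not the actual parquet-mr or Avro code): converting a non-String CharSequence to a String before encoding materializes the characters twice, while encoding the CharSequence directly does not.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class CharSequenceEncoding {

    // Two copies: toString() builds an intermediate String, and getBytes()
    // then writes the characters out again as a fresh UTF-8 byte array.
    static byte[] encodeViaString(CharSequence value) {
        return value.toString().getBytes(StandardCharsets.UTF_8);
    }

    // One pass over the characters: wrap the CharSequence and encode it
    // directly, skipping the intermediate String.
    static ByteBuffer encodeDirect(CharSequence value) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
    }
}
```
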
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017