Conversation

@piyushnarang

While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers, 6b605a4 and 6b24a1d (mostly Avro, but with a couple of ByteBuffer changes), caused our jobs to slow down a bit.

Read overhead: 4-6% (in MB_Millis)
Write overhead: 6-10% (in MB_Millis)

This seems to be due to the encoding/decoding of Strings in the Binary class:
toStringUsingUTF8() - for reads
encodeUTF8() - for writes

With these changes we see around a 5% improvement in MB_Millis while running the job on our Hadoop cluster.

Added some microbenchmark details to the JIRA.

Note that I've left the behavior the same for the Avro write path - it still uses CharSequence and the Charset-based encoders.
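
For illustration, here's a minimal sketch of the two approaches - plain JDK calls with made-up helper names, not the actual parquet-mr Binary code:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    // Charset/CharsetDecoder-based decode, roughly the style the old
    // toStringUsingUTF8() path relied on.
    static String decodeWithCharsetDecoder(ByteBuffer bytes) throws CharacterCodingException {
        CharBuffer chars = StandardCharsets.UTF_8.newDecoder().decode(bytes.duplicate());
        return chars.toString();
    }

    // String-based decode: new String(byte[], Charset) goes through the
    // JDK's optimized internal paths and skips the intermediate CharBuffer.
    static String decodeWithString(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    // String-based encode for the write path.
    static byte[] encodeWithString(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = encodeWithString("parquet");
        System.out.println(decodeWithString(utf8, 0, utf8.length));
        System.out.println(decodeWithCharsetDecoder(ByteBuffer.wrap(utf8)));
    }
}
```

The String constructor and String.getBytes(Charset) hit the JDK's optimized encode/decode paths, which is roughly where the savings above come from.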

@piyushnarang
Author

@rdblue - please take a look when you get the time.

@julienledem
Member

+1 LGTM

@asfgit closed this in 7f8e952 on Jun 30, 2016
@piyushnarang
Author

Thanks @julienledem :-)

rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers, apache@6b605a4 and apache@6b24a1d (mostly Avro, but with a couple of ByteBuffer changes), caused our jobs to slow down a bit.

Read overhead: 4-6% (in MB_Millis)
Write overhead: 6-10% (in MB_Millis)

This seems to be due to the encoding/decoding of Strings in the [Binary class](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
[toStringUsingUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L388) - for reads
[encodeUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L236) - for writes

With these changes we see around a 5% improvement in MB_Millis while running the job on our Hadoop cluster.

Added some microbenchmark details to the JIRA.

Note that I've left the behavior the same for the Avro write path - it still uses CharSequence and the Charset-based encoders.

Author: Piyush Narang <pnarang@twitter.com>

Closes apache#347 from piyushnarang/bytebuffer-encoding-fix-pr and squashes the following commits:

43c5bdd [Piyush Narang] Keep avro on char sequence
2d50c8c [Piyush Narang] Update Binary approach
9e58237 [Piyush Narang] Proof of concept fixes

Conflicts:
    parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java
    parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
Resolution:
    Use String encoding/decoding where possible.
    Updated Avro to use fromCharSequence to avoid two copies
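
As a rough illustration of the "avoid two copies" note above (plain JDK calls with assumed helper names, not the actual parquet-mr or Avro code): converting a non-String CharSequence to a String before encoding materializes the characters twice, while encoding the CharSequence directly does not.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class CharSequenceEncoding {

    // Two copies: toString() builds an intermediate String, and getBytes()
    // then writes the characters out again as a fresh UTF-8 byte array.
    static byte[] encodeViaString(CharSequence value) {
        return value.toString().getBytes(StandardCharsets.UTF_8);
    }

    // One pass over the characters: wrap the CharSequence and encode it
    // directly, skipping the intermediate String.
    static ByteBuffer encodeDirect(CharSequence value) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
    }
}
```
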
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017