
Conversation

@wangyum (Member) commented Jul 24, 2018

Details can be found here: PARQUET-1355.
This improves write time from 50983 ms to 45423 ms, close to 44432 ms.
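
For illustration, here is a minimal sketch of the trade-off under discussion. The class and accessor names below are hypothetical stand-ins, not the exact parquet-mr API: the safe accessor pays for a defensive copy on every value written, while the fast one hands out the backing array directly.

```java
// Hypothetical sketch of the trade-off; names are illustrative, not the
// exact parquet-mr Binary API.
import java.util.Arrays;

final class HypotheticalBinary {
  private final byte[] value;

  HypotheticalBinary(byte[] value) {
    this.value = value;
  }

  // Safe accessor: defensive copy on every call. The per-value
  // allocation and copy is what shows up on the write hot path.
  byte[] getBytes() {
    return Arrays.copyOf(value, value.length);
  }

  // Fast accessor: returns the backing array directly. No copy, but the
  // caller can now mutate state Binary was meant to encapsulate, which
  // is the concern raised in the review below.
  byte[] getBytesUnsafe() {
    return value;
  }
}
```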

@gszadovszky (Contributor) left a comment


The whole purpose of Binary is to encapsulate the byte[] or the ByteBuffer. However, the current implementation seems unnecessarily complicated (or I don't understand the concept well enough), and I don't like the idea of exposing the internal byte[] without any control.
@rdblue, what do you think?

@rdblue (Contributor) commented Jul 24, 2018

I agree that we want to avoid exposing the internal buffer. If we did that, we would lose information about whether or not it is reused and we would also break the abstraction. Is there a way to get the performance gain without exposing the underlying byte array? I'm surprised that using a ByteBuffer, which should be backed by the same bytes, is so much slower.
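
For the sake of discussion, one encapsulation-preserving shape could push the bytes out from inside the class instead of handing the array to callers. This is only a sketch; writeTo below is a hypothetical method, not an existing part of the Binary API.

```java
// Sketch only: writeTo is hypothetical, not an existing Binary method.
// The backing array never escapes the class.
import java.io.IOException;
import java.io.OutputStream;

final class EncapsulatedBinary {
  private final byte[] value;

  EncapsulatedBinary(byte[] value) {
    this.value = value;
  }

  // Callers skip the defensive copy because the write happens inside
  // the class, so no reference to the internal array is handed out.
  void writeTo(OutputStream out) throws IOException {
    out.write(value, 0, value.length);
  }
}
```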

@wangyum (Member, Author) commented Aug 4, 2018

@rednaxelafx Do you have any ideas?

@scottcarey commented
In my experience, ByteBuffer has always been significantly slower when you are not reading or writing in large chunks (e.g. reading individual values like ints/longs, as opposed to copying large byte ranges in or out).

The JVM can optimize loops over byte[] much more easily; the ByteBuffer abstraction gets in the way. First, because at least two implementations (heap and direct) are loaded, it introduces virtual dispatch that is usually, but not always, optimized away. Second, it is harder for the JVM to see through the interface to elide redundant bounds checks and the like.
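
To make the comparison concrete, here is a minimal sketch of the two loop shapes. Measuring this properly would need JMH; the comments describe the usual JIT behavior, which can vary by JVM version and workload.

```java
// Illustrative only; real measurements need JMH. Shows where the
// virtual dispatch and bounds checks come from.
import java.nio.ByteBuffer;

public class LoopShapes {
  // Plain array loop: the JIT can hoist the bounds check out of the
  // loop and often vectorizes the body.
  static long sum(byte[] bytes) {
    long total = 0;
    for (int i = 0; i < bytes.length; i++) {
      total += bytes[i];
    }
    return total;
  }

  // ByteBuffer loop: buf.get(i) is a virtual call. Once both heap and
  // direct implementations have been seen at this call site, the JIT
  // keeps a type check in the loop and has a harder time eliding the
  // per-access limit checks.
  static long sum(ByteBuffer buf) {
    long total = 0;
    for (int i = 0; i < buf.limit(); i++) {
      total += buf.get(i);
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] data = new byte[1 << 20];
    System.out.println(sum(data));
    System.out.println(sum(ByteBuffer.wrap(data)));                   // heap
    System.out.println(sum(ByteBuffer.allocateDirect(data.length))); // direct
  }
}
```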

A ByteBuffer can match a byte array if you use certain sun.misc.Unsafe access methods. The Java 9+ VarHandle API may also be able to match it, but I have not tried it.
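
For reference, a minimal sketch of the VarHandle approach on Java 9+, using MethodHandles.byteArrayViewVarHandle; whether it actually matches Unsafe in a given workload would need benchmarking.

```java
// Sketch of the Java 9+ VarHandle alternative to sun.misc.Unsafe:
// read longs straight out of a byte[] with no ByteBuffer wrapper.
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class VarHandleRead {
  private static final VarHandle LONGS =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  // Reads 8 bytes at 'offset' as a little-endian long; bounds-checked,
  // but with no per-read wrapper object.
  static long readLong(byte[] bytes, int offset) {
    return (long) LONGS.get(bytes, offset);
  }

  public static void main(String[] args) {
    byte[] data = new byte[16];
    data[0] = 1; // little-endian: lowest byte first
    System.out.println(readLong(data, 0)); // prints 1
  }
}
```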

@gatorsmile (Member) commented Sep 12, 2018

@rdblue Does this mean a performance regression was introduced in the binary write path when we upgraded Parquet from 1.8 to 1.10 in Spark?

