
Conversation

@wangyum (Member) commented Jul 24, 2018

Details can be found here: PARQUET-1355.
This improves write time from 50983 ms to 45423 ms, close to 44432 ms.
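
For illustration, here is a minimal sketch of the trade-off under discussion. The class and accessor names below are hypothetical stand-ins, not the exact parquet-mr API: the safe accessor pays for a defensive copy on every value written, while the fast one hands out the backing array directly.

```java
// Hypothetical sketch of the trade-off; names are illustrative, not the
// exact parquet-mr Binary API.
import java.util.Arrays;

final class HypotheticalBinary {
  private final byte[] value;

  HypotheticalBinary(byte[] value) {
    this.value = value;
  }

  // Safe accessor: defensive copy on every call. The per-value
  // allocation and copy is what shows up on the write hot path.
  byte[] getBytes() {
    return Arrays.copyOf(value, value.length);
  }

  // Fast accessor: returns the backing array directly. No copy, but the
  // caller can now mutate state Binary was meant to encapsulate, which
  // is the concern raised in the review below.
  byte[] getBytesUnsafe() {
    return value;
  }
}
```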

@gszadovszky (Contributor) left a comment


The whole purpose of Binary is to encapsulate the byte[] or the ByteBuffer. However, the current implementation seems unnecessarily complicated (or I don't understand the concept well enough), and I don't like the idea of exposing the internal byte[] without any control.
@rdblue, what do you think?

@rdblue (Contributor) commented Jul 24, 2018

I agree that we want to avoid exposing the internal buffer. If we did that, we would lose information about whether or not it is reused and we would also break the abstraction. Is there a way to get the performance gain without exposing the underlying byte array? I'm surprised that using a ByteBuffer, which should be backed by the same bytes, is so much slower.
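
For the sake of discussion, one encapsulation-preserving shape could push the bytes out from inside the class instead of handing the array to callers. This is only a sketch; writeTo below is a hypothetical method, not an existing part of the Binary API.

```java
// Sketch only: writeTo is hypothetical, not an existing Binary method.
// The backing array never escapes the class.
import java.io.IOException;
import java.io.OutputStream;

final class EncapsulatedBinary {
  private final byte[] value;

  EncapsulatedBinary(byte[] value) {
    this.value = value;
  }

  // Callers skip the defensive copy because the write happens inside
  // the class, so no reference to the internal array is handed out.
  void writeTo(OutputStream out) throws IOException {
    out.write(value, 0, value.length);
  }
}
```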

@wangyum (Member, Author) commented Aug 4, 2018

@rednaxelafx Do you have any ideas?

@scottcarey commented
In my experience, ByteBuffer has always been significantly slower when you are not reading or writing in large chunks (e.g. reading individual values like ints/longs, as opposed to copying large byte ranges in or out).

The JVM can optimize loops over byte[] much more easily; the ByteBuffer abstraction gets in the way. First, because at least two implementations (heap and direct) are loaded, it introduces virtual dispatch that is usually, but not always, optimized away. Second, it is harder for the JVM to see through the interface to elide redundant bounds checks and the like.
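
To make the comparison concrete, here is a minimal sketch of the two loop shapes. Measuring this properly would need JMH; the comments describe the usual JIT behavior, which can vary by JVM version and workload.

```java
// Illustrative only; real measurements need JMH. Shows where the
// virtual dispatch and bounds checks come from.
import java.nio.ByteBuffer;

public class LoopShapes {
  // Plain array loop: the JIT can hoist the bounds check out of the
  // loop and often vectorizes the body.
  static long sum(byte[] bytes) {
    long total = 0;
    for (int i = 0; i < bytes.length; i++) {
      total += bytes[i];
    }
    return total;
  }

  // ByteBuffer loop: buf.get(i) is a virtual call. Once both heap and
  // direct implementations have been seen at this call site, the JIT
  // keeps a type check in the loop and has a harder time eliding the
  // per-access limit checks.
  static long sum(ByteBuffer buf) {
    long total = 0;
    for (int i = 0; i < buf.limit(); i++) {
      total += buf.get(i);
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] data = new byte[1 << 20];
    System.out.println(sum(data));
    System.out.println(sum(ByteBuffer.wrap(data)));                   // heap
    System.out.println(sum(ByteBuffer.allocateDirect(data.length))); // direct
  }
}
```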

A ByteBuffer can match a byte array if you use certain sun.misc.Unsafe access methods. The Java 9+ VarHandle API may also be able to match it, but I have not tried it.
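
For reference, a minimal sketch of the VarHandle approach on Java 9+, using MethodHandles.byteArrayViewVarHandle; whether it actually matches Unsafe in a given workload would need benchmarking.

```java
// Sketch of the Java 9+ VarHandle alternative to sun.misc.Unsafe:
// read longs straight out of a byte[] with no ByteBuffer wrapper.
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class VarHandleRead {
  private static final VarHandle LONGS =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  // Reads 8 bytes at 'offset' as a little-endian long; bounds-checked,
  // but with no per-read wrapper object.
  static long readLong(byte[] bytes, int offset) {
    return (long) LONGS.get(bytes, offset);
  }

  public static void main(String[] args) {
    byte[] data = new byte[16];
    data[0] = 1; // little-endian: lowest byte first
    System.out.println(readLong(data, 0)); // prints 1
  }
}
```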

@gatorsmile (Member) commented Sep 12, 2018

@rdblue Does this mean a performance regression was introduced in the binary write path when we upgraded Parquet from 1.8 to 1.10 in Spark?

