PARQUET-77: ByteBuffer use in read and write paths #267
jaltekruse wants to merge 102 commits into apache:master from
Conversation
…copy through read path.
Use reflection to call the new API to keep compatibility.
Fix bugs in Binary.
Add a compatible initFromPage method in ValueReaders. Add a toByteBuffer method in ByteBufferInputStream. Add a V21FileAPI class to encapsulate v21 APIs and make it a singleton. Add ByteBuffer-based equals and compareTo methods in Binary (see the sketch after the commit list below).
Add compatibility function to read directly into a byte buffer
… memory can be released before stats are written.
…tor to allocate the ByteBuffer.
Conflicts:
parquet-column/src/main/java/parquet/column/ColumnWriteStore.java
parquet-column/src/main/java/parquet/column/ColumnWriter.java
parquet-column/src/main/java/parquet/column/ParquetProperties.java
parquet-column/src/main/java/parquet/column/impl/ColumnWriteStoreV1.java
parquet-column/src/main/java/parquet/column/impl/ColumnWriterV1.java
parquet-column/src/main/java/parquet/column/values/dictionary/DictionaryValuesWriter.java
parquet-column/src/main/java/parquet/column/values/rle/RunLengthBitPackingHybridValuesWriter.java
parquet-column/src/test/java/parquet/column/values/dictionary/TestDictionary.java
parquet-column/src/test/java/parquet/io/TestColumnIO.java
parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java
parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordWriter.java
parquet-hadoop/src/main/java/parquet/hadoop/ParquetRecordWriter.java
parquet-hadoop/src/test/java/parquet/hadoop/TestParquetFileWriter.java
Conflicts: parquet-column/src/main/java/parquet/column/values/dictionary/DictionaryValuesWriter.java
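For reference, a minimal sketch of what a ByteBuffer-based equals/compareTo could look like, assuming a plain unsigned lexicographic comparison over the buffers' remaining bytes without copying into byte arrays; the class and method names here are illustrative, not the actual Binary implementation:

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: compares the remaining bytes of two buffers
// lexicographically (as unsigned bytes) without copying them into byte[] arrays.
final class ByteBufferCompare {
  static int compare(ByteBuffer a, ByteBuffer b) {
    int aPos = a.position(), bPos = b.position();
    int aRem = a.remaining(), bRem = b.remaining();
    int n = Math.min(aRem, bRem);
    for (int i = 0; i < n; i++) {
      // absolute-index reads leave the buffers' positions untouched
      int cmp = (a.get(aPos + i) & 0xFF) - (b.get(bPos + i) & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return aRem - bRem; // a shorter buffer sorts first when it is a prefix
  }

  static boolean equal(ByteBuffer a, ByteBuffer b) {
    return a.remaining() == b.remaining() && compare(a, b) == 0;
  }
}
```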
This will lose the original cause of the error.
I'm not sure what you mean here. We didn't get an error out of the method we called; throwing here will at least give a stack trace, but I don't see where we are going to get more information about why it failed.
Sorry, that was not very clear. I was referring to the catch block at line 98: if we catch an exception there, then res == 0 and we throw an exception without the cause. The other case is when line 104 returns 0, in which case maybe there is a better message than "Null ByteBuffer returned".
I don't think the condition checking if res == 0 is valid, considering what the docs for this method say about zero-length requests. In this case it is a zero-length result, possibly from a non-zero-length request, but it seems like we should not consider this return value erroneous. https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html#read%28java.nio.ByteBuffer%29
I have removed this check.
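For context, here is a minimal sketch of a fill loop that follows the documented contract of FSDataInputStream.read(ByteBuffer), where a return value of 0 is not an error and only -1 signals end of stream; the helper class and method names are illustrative, not the actual patch code:

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative sketch: keep reading until the buffer is full, treating a
// 0-byte result as "no bytes transferred this call" rather than a failure.
final class ByteBufferReads {
  static void readFully(FSDataInputStream in, ByteBuffer buf) throws IOException {
    while (buf.hasRemaining()) {
      int res = in.read(buf);
      if (res < 0) {
        throw new EOFException(
            "Reached end of stream with " + buf.remaining() + " bytes left to read");
      }
      // res == 0 simply means nothing was read on this call; try again
    }
  }
}
```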
…to Hadoop 2.0 compression APIs.
…od for getting a compressor.
…e got lost somewhere.
…red in the newer version.
Does it need to be public? Make it package-private if possible.
Drill creates one of these to compress and decompress page data itself. I could add a factory method on CodecFactory that takes an allocator, to allow creating one without exposing the whole class. We don't need to subclass this over in Drill or anything.
I made the change described above and pushed a new commit.
This looks fine to me.
… on the allocators used by a DirectCodecFactory. Moved the DirectCodecFactory class to package private access and added a factory method to create one on the CodecFactory class.
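A rough sketch of the access pattern described in this thread: the direct implementation stays package-private and is exposed only through a static factory method on the public CodecFactory class. The allocator interface and constructor signature below are assumptions for illustration, not the exact merged API.

```java
import org.apache.hadoop.conf.Configuration;

// Public entry point: external users (e.g. Drill) obtain the direct codec
// factory only through this static method, never via the class itself.
public class CodecFactory {

  /** Hypothetical allocator hook so callers control where buffers come from. */
  public interface ByteBufferAllocator {
    java.nio.ByteBuffer allocate(int size);
    void release(java.nio.ByteBuffer buffer);
  }

  public static CodecFactory createDirectCodecFactory(Configuration config,
                                                      ByteBufferAllocator allocator,
                                                      int pageSize) {
    return new DirectCodecFactory(config, allocator, pageSize);
  }
}

// Package-private: not part of the public API surface.
class DirectCodecFactory extends CodecFactory {
  DirectCodecFactory(Configuration config,
                     CodecFactory.ByteBufferAllocator allocator,
                     int pageSize) {
    // compressor/decompressor setup using the supplied allocator would go here
  }
}
```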
… and is no longer accessible in this class.
+1
I had been a little too aggressive hiding things from the outside world; we still need access to the codec factory itself in Drill. Most of the new code has been hidden from the public interface.
…m that does not implement the byte buffer based read method in the Hadoop 2.x API.
… name for the class that was being used to detect if the Hadoop 2.x API was available. Additionally, the check for an actual implementation of the read method was not functioning properly: the UnsupportedOperationException thrown from the method will actually be wrapped in an InvocationTargetException now that the method is invoked with reflection. The code to detect if a call fails has been moved back down to where the actual read method is called, because making it work properly in the static block was too much of a headache; creating an instance of an FSDataInputStream that fulfilled the correct interfaces would have required more reflection hacks. I do properly set the flag used to track availability of the API, to avoid the previous behavior of always relying on exception-based control flow for fallback; it just happens more lazily than was attempted with the earlier work to simplify this class.
…g is very wrong, so it shouldn't get wrapped in a ShouldNeverHappenException. This InvocationTargetException will wrap any kind of exception coming out of the method, including an IOException.
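To illustrate the lazy detection described above, here is a hedged sketch of invoking read(ByteBuffer) reflectively, unwrapping the InvocationTargetException so an underlying IOException keeps its original cause, and flipping a flag on the first UnsupportedOperationException so later reads fall back without exception-based control flow. Class, field, and helper names are hypothetical, not the actual patch code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

// Illustrative sketch of reflection-based compatibility with the Hadoop 2.x
// ByteBuffer read API, with a lazily set flag for whether it actually works.
final class ByteBufferReadCompat {
  private static final Method READ_BUF = lookupReadBuffer();
  // null = not yet probed; TRUE/FALSE = result of the first real invocation
  private volatile Boolean byteBufferReadSupported;

  private static Method lookupReadBuffer() {
    try {
      // FSDataInputStream.read(ByteBuffer) only exists in the Hadoop 2.x API
      return Class.forName("org.apache.hadoop.fs.FSDataInputStream")
          .getMethod("read", ByteBuffer.class);
    } catch (ClassNotFoundException | NoSuchMethodException e) {
      return null; // older Hadoop on the classpath: always fall back
    }
  }

  int read(InputStream in, ByteBuffer dst, byte[] scratch) throws IOException {
    if (READ_BUF != null
        && READ_BUF.getDeclaringClass().isInstance(in)
        && !Boolean.FALSE.equals(byteBufferReadSupported)) {
      try {
        int res = (Integer) READ_BUF.invoke(in, dst);
        byteBufferReadSupported = Boolean.TRUE;
        return res;
      } catch (IllegalAccessException e) {
        byteBufferReadSupported = Boolean.FALSE;
      } catch (InvocationTargetException e) {
        Throwable cause = e.getCause();
        if (cause instanceof UnsupportedOperationException) {
          // this stream does not implement the ByteBuffer read; remember that
          byteBufferReadSupported = Boolean.FALSE;
        } else if (cause instanceof IOException) {
          throw (IOException) cause; // preserve the original cause
        } else {
          throw new IOException(cause);
        }
      }
    }
    // fallback path: read into a scratch byte[] and copy into the ByteBuffer
    int n = in.read(scratch, 0, Math.min(scratch.length, dst.remaining()));
    if (n > 0) {
      dst.put(scratch, 0, n);
    }
    return n;
  }
}
```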
This work is based on the GSoC project from the summer of 2014. We have expanded on it to fix bugs and change the write path to use ByteBuffers as well. This PR replaces several earlier PRs.
closes #6, closes #49, closes #50, closes #267