Skip to content

DRILL-4028: Get off parquet fork#236

Closed
jaltekruse wants to merge 5 commits intoapache:masterfrom
jaltekruse:parquet-update-squash
Closed

DRILL-4028: Get off parquet fork#236
jaltekruse wants to merge 5 commits intoapache:masterfrom
jaltekruse:parquet-update-squash

Conversation

@jaltekruse
Copy link
Contributor

No description provided.

…kson @JsonCreator annotation from parquet, replace with proper fasterxml version.
…after rebase.

clean up imports in generated source template
…th after merging the Drill parquet fork back into mainline.

Fixed the issue with the writer, needed to flush the RecordConsumer in the ParquetRecordWriter.

Consolidate page reading code

Fix buffer sizes, uncompressed and compressed sizes were backwards

The issue was a mismatch in the usage of byte buffers. Even though the position of a buffer was being set, that seemed to be ignored in the setSafe method on the varbinary vector. I needed to pass in the offset as it seems to just read from the beginning of the buffer. I'm not sure this is how ByteBuffers are supposed to be used, but we seem to make use of this pattern commonly so I'm not sure it could be easily refactored.

Added some test to print out some additional context when an ordered comparison of two datasets fails in a test.

Removing usage of Drill classes from DirectCodecFactory, getting it ready to be moved into the parquet codebase.

Fix up parquet API usage in Hive Module.

Fix dictionary reading, the changes made I think may speed up reading dictionary encoded files by avoiding an extra copy.

Adding unit test to read a write all types in parquet, the decimal types and interval year have some issues.

Use direct codec factory from new package in the parquet library now that it has been moved.

Moving the test for Direct Codec Factory out of the Drill source as the class itself has been moved.

Small fix after consolidating two different ByteBuffer based implementations of BytesInput.

Small fixes to accommodate interface changes.

Small changes to remove direct references to DirectCodecFactory, this class is not accessible outside of parquet, but an instance with the same contract is now accessible with a new factory method on CodecFactory.

Fixed failing test using miniDFS when reading a larger parquet file.
@jaltekruse
Copy link
Contributor Author

I need to update the POM version number once the parquet changes are actually merged and we can publish a new version of the "fork" (which will just be us hosting the current tip of parquet with a hard maven version number rather than a snapshot, we will upgrade to an official release of parquet when one is released by the parquet community). I am doing a final run of the regression tests now and will push any changes needed to address any remaining failures. I believe this is close to the final version of this work so I wanted to get it available for starting the review process. You can see the work done to get the parquet fork merged into the mainline here: apache/parquet-java#267

@jacques-n
Copy link
Contributor

other than bytesinput and fixing parquet version number, this looks good to go. +1

The pom.xml points to our drill "release" of the current tip of parquet master.

I added back the ByteBufBytesInput class that was lost moving the Direct
CodecFactory over to parquet and used it from the PageReader where I had
previously refactored to use a java.nio.ByteBuffer, which requires
requesting one of these from the Netty ByteBuf we are managing in Drill,
which could be non-trivial.
Fix a bug in method for including nearby records when a result set comparison fails.
I had not run the full set of unit tests since I added this utility function,
it caused some of the tests of the test framework itself to fail.

Fix tests to check that the message contains the comparison the fails, rather than
checking strict string equality on the message.

Go back to using the BytesInput.from(ByteBuffer) factory method rather than using
the ByteBufBytesInput. Both of them were actually calling nioBuffer() on the
Netty ByteBufs anyway, and the change actually caused a number of tests to fail
because the ByteBuffers positions and offsets were no longer being managed
correctly to work with the parquet code.
jacques-n pushed a commit to jacques-n/drill that referenced this pull request Nov 5, 2015
- Remove references to the shaded version of a Jackson @JsonCreator annotation from parquet, replace with proper fasterxml version.
- Fixing imports using the wrong parquet packages after rebase.
- Fixing issues with Drill parquet read a write path after merging the Drill parquet fork back into mainline.
- Fixed the issue with the writer, needed to flush the RecordConsumer in the ParquetRecordWriter.
- Consolidate page reading code
- Added some test to print out some additional context when an ordered comparison of two datasets fails in a test.
- Fix up parquet API usage in Hive Module.
- Adding unit test to read a write all types in parquet, the decimal types and interval year have some issues.
- Use direct codec factory from new package in the parquet library now that it has been moved.
- Moving the test for Direct Codec Factory out of the Drill source as the class itself has been moved.
- Small fix after consolidating two different ByteBuffer based implementations of BytesInput.
- Small fixes to accommodate interface changes.
- Small changes to remove direct references to DirectCodecFactory, this class is not accessible outside of parquet, but an instance with the same contract is now accessible with a new factory method on CodecFactory.
- Fixed failing test using miniDFS when reading a larger parquet file.

This closes apache#236
StevenMPhillips pushed a commit to StevenMPhillips/drill that referenced this pull request Nov 5, 2015
- Remove references to the shaded version of a Jackson @JsonCreator annotation from parquet, replace with proper fasterxml version.
- Fixing imports using the wrong parquet packages after rebase.
- Fixing issues with Drill parquet read a write path after merging the Drill parquet fork back into mainline.
- Fixed the issue with the writer, needed to flush the RecordConsumer in the ParquetRecordWriter.
- Consolidate page reading code
- Added some test to print out some additional context when an ordered comparison of two datasets fails in a test.
- Fix up parquet API usage in Hive Module.
- Adding unit test to read a write all types in parquet, the decimal types and interval year have some issues.
- Use direct codec factory from new package in the parquet library now that it has been moved.
- Moving the test for Direct Codec Factory out of the Drill source as the class itself has been moved.
- Small fix after consolidating two different ByteBuffer based implementations of BytesInput.
- Small fixes to accommodate interface changes.
- Small changes to remove direct references to DirectCodecFactory, this class is not accessible outside of parquet, but an instance with the same contract is now accessible with a new factory method on CodecFactory.
- Fixed failing test using miniDFS when reading a larger parquet file.

This closes apache#236
@asfgit asfgit closed this in 39582bd Nov 5, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments