PARQUET-212: Add thrift support for list reading rules#144
Closed
rdblue wants to merge 8 commits intoapache:masterfrom
Closed
PARQUET-212: Add thrift support for list reading rules#144rdblue wants to merge 8 commits intoapache:masterfrom
rdblue wants to merge 8 commits intoapache:masterfrom
Conversation
00e4a87 to
92ed64c
Compare
Contributor
Author
|
Rebased and fixed all test failures. I think this is ready for review. Would you mind taking a look @julienledem or @isnotinvain? Thanks! |
92ed64c to
c95253d
Compare
Contributor
There was a problem hiding this comment.
rename to isListElementType ?
c95253d to
14ad00a
Compare
This adds convenience methods for writing to files using the RecordConsumer API directly. This is useful for mimicing files from other writers for compatibility tests.
Parquet-thrift can now read files not written by parquet-thrift if an appropriate Thrift class is supplied. This adds a check to derive the necessary StructType from a class. Previously, attempting to read a file without Thrift metadata in its properties would return a null ThriftMetaData and throw NPE. This also updates the logic in ThriftMetaData.fromExtraMetaData to avoid NPE when the class is present by the descriptor property is not.
This mirrors the support in parquet-avro and uses an isElementType check in the schema converter. Thrift classes are compiled and must always be present, so there is no ambiguous case because the expected structure is always known. This initial version suppresses nulls when list elements are optional. There is no way to pass a null list element to thrift when it constructs records. This does not change how parquet-thrift writes or converts schemas, so there are no projection changes needed.
This adds parquet.thrift.ignore-null-elements to suppress an exception thrown when reading a list with optional elements. This makes reading Hive lists an opt-in so that the caller must be aware that null values are ignored.
This should be a temporary change, until 1.6.0 is released. The Semver check is catching that the ThriftRecordConverter now implements Configurable. This is a binary-compatible change that should not require a major version update.
Previously, this added Configurable to ThriftRecordConverter, but that caused problems with semver and test failures because a separate initialize method had to be called. This is brittle because callers might not know to call initialize and the class is part of the API because subclassing is allowed. To avoid the issue, this replaces the Configurable interface and initialize method with a new constructor that takes a Configuration.
This fixes parquet-avro and parquet-thrift list handling for the case where the Parquet schema columns are projected and only one remains. This appears to not match the element schema because the element schema has more than one field. The solution is to match the element schema if the Parquet schema is a subset.
14ad00a to
d6e77ad
Compare
Contributor
Author
|
@tsdeng, could you have another look at this? I think it is ready to go. |
piyushnarang
pushed a commit
to piyushnarang/parquet-mr
that referenced
this pull request
Jun 15, 2016
This implements the read-side compatibility rules for 2-level and 3-level lists in Thrift. Thrift doesn't allow null elements inside lists, but 3-level lists may have optional elements. This PR adds a property, parquet.thrift.ignore-null-elements, that allows thrift to read lists with optional elements by ignoring nulls. This is off by default, but is provided as an opt-in for compatibility with data written by Hive. Thrift's schema conversion does not change because a Thrift class (or Scrooge etc.) must be set in a file's metadata or provided when constructing a reader. This replaces and closes apache#144. Author: Ryan Blue <blue@apache.org> Closes apache#300 from rdblue/PARQUET-212-fix-thrift-3-level-lists and squashes the following commits: ac7c405 [Ryan Blue] PARQUET-212: Add tests for list of list cases from PARQUET-364. 356fdb7 [Ryan Blue] PARQUET-212: Rename isElementType => isListElementType. 5d3b094 [Ryan Blue] PARQUET-212: Fix list handling with projection. b5f207f [Ryan Blue] PARQUET-212: Add Configuration to the ThriftRecordConverter ctor. b87eb65 [Ryan Blue] PARQUET-212: Add property to ignore nulls in lists. 3d1e92f [Ryan Blue] PARQUET-212: Update thrift reads for LIST compatibility rules. 0bf2b45 [Ryan Blue] PARQUET-212: Read non-thrift files if a Thrift class is supplied. 4e148dc [Ryan Blue] PARQUET-212: Add DirectWriterTest base class.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This updates parquet-thrift so that it implements the list reading rules specified by PARQUET-113. Null values are not allowed in Thrift, and the protocol cannot be amended to include them. Optional list elements are handled in this PR by throwing an exception (default) or by ignoring null values if
parquet.thrift.ignore-null-elementsis set to true in theConfiguration.