Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Mar 10, 2015

This updates parquet-thrift so that it implements the list reading rules specified by PARQUET-113. Null values are not allowed in Thrift, and the protocol cannot be amended to include them. Optional list elements are handled in this PR by throwing an exception (default) or by ignoring null values if parquet.thrift.ignore-null-elements is set to true in the Configuration.

@rdblue rdblue force-pushed the PARQUET-212-thrift-nested-types branch from 00e4a87 to 92ed64c Compare March 11, 2015 21:30
@rdblue
Copy link
Contributor Author

rdblue commented Mar 11, 2015

Rebased and fixed all test failures. I think this is ready for review. Would you mind taking a look @julienledem or @isnotinvain? Thanks!

@rdblue rdblue force-pushed the PARQUET-212-thrift-nested-types branch from 92ed64c to c95253d Compare March 13, 2015 01:18
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to isListElementType ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@rdblue rdblue force-pushed the PARQUET-212-thrift-nested-types branch from c95253d to 14ad00a Compare April 7, 2015 23:24
rdblue added 8 commits June 1, 2015 13:34
This adds convenience methods for writing to files using the
RecordConsumer API directly. This is useful for mimicing files from
other writers for compatibility tests.
Parquet-thrift can now read files not written by parquet-thrift if an
appropriate Thrift class is supplied. This adds a check to derive the
necessary StructType from a class. Previously, attempting to read a file
without Thrift metadata in its properties would return a null
ThriftMetaData and throw NPE.

This also updates the logic in ThriftMetaData.fromExtraMetaData to avoid
NPE when the class is present by the descriptor property is not.
This mirrors the support in parquet-avro and uses an isElementType check
in the schema converter. Thrift classes are compiled and must always be
present, so there is no ambiguous case because the expected structure is
always known.

This initial version suppresses nulls when list elements are optional.
There is no way to pass a null list element to thrift when it constructs
records.

This does not change how parquet-thrift writes or converts schemas, so
there are no projection changes needed.
This adds parquet.thrift.ignore-null-elements to suppress an exception
thrown when reading a list with optional elements. This makes reading
Hive lists an opt-in so that the caller must be aware that null values
are ignored.
This should be a temporary change, until 1.6.0 is released. The Semver
check is catching that the ThriftRecordConverter now implements
Configurable. This is a binary-compatible change that should not require
a major version update.
Previously, this added Configurable to ThriftRecordConverter, but that
caused problems with semver and test failures because a separate
initialize method had to be called. This is brittle because callers
might not know to call initialize and the class is part of the API
because subclassing is allowed. To avoid the issue, this replaces the
Configurable interface and initialize method with a new constructor that
takes a Configuration.
This fixes parquet-avro and parquet-thrift list handling for the case
where the Parquet schema columns are projected and only one remains.
This appears to not match the element schema because the element schema
has more than one field. The solution is to match the element schema if
the Parquet schema is a subset.
@rdblue rdblue force-pushed the PARQUET-212-thrift-nested-types branch from 14ad00a to d6e77ad Compare June 1, 2015 21:12
@rdblue
Copy link
Contributor Author

rdblue commented Jun 1, 2015

@tsdeng, could you have another look at this? I think it is ready to go.

@asfgit asfgit closed this in 37f72dc Jan 12, 2016
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016
This implements the read-side compatibility rules for 2-level and 3-level lists in Thrift.

Thrift doesn't allow null elements inside lists, but 3-level lists may have optional elements. This PR adds a property, parquet.thrift.ignore-null-elements, that allows thrift to read lists with optional elements by ignoring nulls. This is off by default, but is provided as an opt-in for compatibility with data written by Hive.

Thrift's schema conversion does not change because a Thrift class (or Scrooge etc.) must be set in a file's metadata or provided when constructing a reader.

This replaces and closes apache#144.

Author: Ryan Blue <blue@apache.org>

Closes apache#300 from rdblue/PARQUET-212-fix-thrift-3-level-lists and squashes the following commits:

ac7c405 [Ryan Blue] PARQUET-212: Add tests for list of list cases from PARQUET-364.
356fdb7 [Ryan Blue] PARQUET-212: Rename isElementType => isListElementType.
5d3b094 [Ryan Blue] PARQUET-212: Fix list handling with projection.
b5f207f [Ryan Blue] PARQUET-212: Add Configuration to the ThriftRecordConverter ctor.
b87eb65 [Ryan Blue] PARQUET-212: Add property to ignore nulls in lists.
3d1e92f [Ryan Blue] PARQUET-212: Update thrift reads for LIST compatibility rules.
0bf2b45 [Ryan Blue] PARQUET-212: Read non-thrift files if a Thrift class is supplied.
4e148dc [Ryan Blue] PARQUET-212: Add DirectWriterTest base class.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants