[SPARK-9340] [SQL] Fixes converting unannotated Parquet lists #8070

liancheng · 2015-08-10T17:14:09Z

This PR is inspired by #8063, authored by @dguy. In particular, the Parquet test files added here are all taken from that PR.

Committer who merges this PR should attribute it to Damian Guy <damian.guy@gmail.com>.

SPARK-6776 and SPARK-6777 followed parquet-avro to implement the backwards-compatibility rules defined in the parquet-format spec. However, both Spark SQL and parquet-avro neglected the following statement in parquet-format:

This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field.

One consequence is that Parquet files generated by parquet-protobuf that contain unannotated repeated fields are not correctly converted to Catalyst arrays.

This PR fixes this issue by:

1. Handling unannotated repeated fields in CatalystSchemaConverter.
2. Converting this special kind of repeated field to Catalyst arrays in CatalystRowConverter.
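To make the quoted rule concrete, here is a minimal sketch of the schema-side mapping. It uses a toy model rather than the real Parquet or Catalyst classes: every type and method name below (PField, CArray, convertField, and so on) is hypothetical, and LIST/MAP-annotated groups are deliberately left out. It only shows that a repeated field becomes a non-nullable array whose elements are non-nullable.

```scala
// Toy model only: these are NOT Spark's or Parquet's real classes.
object UnannotatedListSketch {
  sealed trait Repetition
  case object Required extends Repetition
  case object Optional extends Repetition
  case object Repeated extends Repetition

  // Parquet-like field: a primitive, or a group with child fields.
  sealed trait PField { def name: String; def repetition: Repetition }
  final case class PPrimitive(name: String, repetition: Repetition, typeName: String) extends PField
  final case class PGroup(name: String, repetition: Repetition, fields: List[PField]) extends PField

  // Catalyst-like types.
  sealed trait CType
  final case class CPrimitive(typeName: String) extends CType
  final case class CArray(elementType: CType, containsNull: Boolean) extends CType
  final case class CStruct(fields: List[CField]) extends CType
  final case class CField(name: String, dataType: CType, nullable: Boolean)

  // Type of one value of the field, ignoring its repetition.
  private def valueType(field: PField): CType = field match {
    case PPrimitive(_, _, typeName) => CPrimitive(typeName)
    case PGroup(_, _, children)     => CStruct(children.map(convertField))
  }

  def convertField(field: PField): CField = field.repetition match {
    // Unannotated repeated field: a required list of required elements,
    // where the element type is the type of the field itself.
    case Repeated => CField(field.name, CArray(valueType(field), containsNull = false), nullable = false)
    case Optional => CField(field.name, valueType(field), nullable = true)
    case Required => CField(field.name, valueType(field), nullable = false)
  }

  def main(args: Array[String]): Unit = {
    // Roughly: message Example { repeated int32 ids; repeated group points { required double x; } }
    println(convertField(PPrimitive("ids", Repeated, "int32")))
    println(convertField(PGroup("points", Repeated, List(PPrimitive("x", Required, "double")))))
  }
}
```

A repeated group follows the same rule: the array element is the struct built from the group's own fields, not a wrapper around a single child.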

Two special converters, RepeatedPrimitiveConverter and RepeatedGroupConverter, are added. They delegate the actual conversion work to a child elementConverter and accumulate elements in an ArrayBuffer.
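The delegate-and-accumulate pattern can be sketched as follows. This is an illustration under assumptions, not the code in this PR: IntConverter stands in for a Parquet primitive converter, and ElementConverter and RepeatedIntConverter are hypothetical names. The point is only that the repeated converter does no conversion itself; it forwards each value to the child and collects whatever the child produces.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative stand-ins only; none of these are real Spark or Parquet types.
object RepeatedConverterSketch {
  // One callback per value read from the column.
  trait IntConverter { def addInt(value: Int): Unit }

  // Child converter: converts a single raw value and hands the result to `sink`.
  class ElementConverter(sink: Any => Unit) extends IntConverter {
    // Pretend conversion: widen the raw Int to a Long.
    def addInt(value: Int): Unit = sink(value.toLong)
  }

  // Repeated converter: delegates conversion to the child and accumulates the results.
  class RepeatedIntConverter extends IntConverter {
    private val buffer = ArrayBuffer.empty[Any]
    private val elementConverter =
      new ElementConverter(converted => { buffer += converted; () })

    def addInt(value: Int): Unit = elementConverter.addInt(value)

    // Snapshot of the elements accumulated so far.
    def result: List[Any] = buffer.toList
  }

  def main(args: Array[String]): Unit = {
    val converter = new RepeatedIntConverter
    Seq(1, 2, 3).foreach(converter.addInt)
    println(converter.result)
  }
}
```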

Two extra methods, start() and end(), are added to ParentContainerUpdater so that they can be used to initialize new ArrayBuffers for unannotated repeated fields and to propagate converted array values upstream.
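A rough sketch of that lifecycle, assuming a simplified updater interface: Updater, ArrayUpdater, and readRecords are hypothetical names, and an immutable List stands in for the Catalyst array value. start() resets the buffer at the beginning of each record, set() appends one converted element, and end() hands the finished array to the parent.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified, hypothetical interface; not the ParentContainerUpdater from this PR.
object UpdaterLifecycleSketch {
  trait Updater {
    def start(): Unit = () // no-op by default
    def end(): Unit = ()   // no-op by default
    def set(value: Any): Unit
  }

  // Updater for a repeated field: collects elements between start() and end().
  class ArrayUpdater(parent: Updater) extends Updater {
    private var current: ArrayBuffer[Any] = ArrayBuffer.empty[Any]

    override def start(): Unit = { current = ArrayBuffer.empty[Any] }
    override def end(): Unit = parent.set(current.toList) // propagate the array upstream
    def set(value: Any): Unit = { current += value; () }
  }

  // Drives the updater over several records, each holding the raw values of one repeated field.
  def readRecords(records: Seq[Seq[Int]]): List[Any] = {
    val rows = ArrayBuffer.empty[Any]
    val rowUpdater = new Updater {
      def set(value: Any): Unit = { rows += value; () }
    }
    val arrayUpdater = new ArrayUpdater(rowUpdater)
    records.foreach { values =>
      arrayUpdater.start()
      values.foreach(v => arrayUpdater.set(v))
      arrayUpdater.end()
    }
    rows.toList
  }

  def main(args: Array[String]): Unit =
    println(readRecords(Seq(Seq(1, 2), Seq.empty[Int], Seq(3))))
}
```

Because the buffer is recreated in start(), a record with no values yields an empty array rather than leftovers from the previous record, which matches the "required list" reading of the spec.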

SparkQA · 2015-08-10T17:21:08Z

Test build #40299 has finished for PR 8070 at commit a11b7c0.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-10T17:44:31Z

Test build #40300 has finished for PR 8070 at commit 1c68b55.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

dguy · 2015-08-10T18:37:03Z

LGTM - working as expected.

rdblue · 2015-08-10T19:57:56Z

sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala

Why isn't this a CatalystPrimitiveConverter with RepeatedConverter?

liancheng · 2015-08-11

It's because CatalystPrimitiveConverter is defined as:

private[parquet] class CatalystPrimitiveConverter(val updater: ParentContainerUpdater) extends PrimitiveConverter with HasParentContainerUpdater { ... }

The val updater part has two effects:

1. updater is made a constructor argument, and
2. def updater in HasParentContainerUpdater is overridden, since updater is a read-only val.

The second fact prevents subclasses of CatalystPrimitiveConverter from overriding the updater field.
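A small, self-contained illustration of the Scala behavior being described, using hypothetical names (HasUpdater, FixedConverter, OpenConverter, WrappingConverter) and a String in place of a real updater. It shows one concrete restriction: once the member is a val, a subclass cannot replace it with a def, whereas a member that stays a def in the base class can be overridden that way.

```scala
// Hypothetical names; a String stands in for the real updater type.
object ValOverrideSketch {
  // Stand-in for a trait that declares the member as an abstract def.
  trait HasUpdater { def updater: String }

  // Same shape as `class C(val updater: ...)`: the constructor parameter
  // is also the member that implements `def updater` from the trait.
  class FixedConverter(val updater: String) extends HasUpdater

  // A subclass cannot replace that val with a def. The following is rejected
  // by the compiler, because a method cannot override a stable value:
  //
  //   class Rejected(u: String) extends FixedConverter(u) {
  //     override def updater: String = "array(" + u + ")"
  //   }

  // Keeping the member a def in the base class leaves room for such an override.
  class OpenConverter(base: String) extends HasUpdater {
    def updater: String = base
  }

  class WrappingConverter(element: String) extends OpenConverter(element) {
    override def updater: String = "array(" + element + ")"
  }

  def main(args: Array[String]): Unit = {
    println(new FixedConverter("row").updater)
    println(new WrappingConverter("row").updater)
  }
}
```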

SparkQA · 2015-08-10T20:09:58Z

Test build #40304 has finished for PR 8070 at commit 450b606.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2015-08-10T23:54:59Z

This looks good to me overall.

liancheng · 2015-08-11T00:41:45Z

@dguy @rdblue Thanks for the review! I just rebased this PR to resolve conflicts introduced by #8056. Will merge this pending Jenkins.

SparkQA · 2015-08-11T02:48:37Z

Test build #40355 has finished for PR 8070 at commit ace6df7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Author: Cheng Lian <lian@databricks.com>

Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits:

ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite
f1c7bfd [Cheng Lian] Updates .rat-excludes
420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists

(cherry picked from commit 071bbad)
Signed-off-by: Cheng Lian <lian@databricks.com>

liancheng · 2015-08-11T04:47:36Z

Merged to master and branch-1.5.


liancheng changed the title ~~[SPARK-9340] [SQ] Fixes converting unannotated Parquet lists~~ [SPARK-9340] [SQL] Fixes converting unannotated Parquet lists Aug 10, 2015

liancheng mentioned this pull request Aug 10, 2015

[SparkSPARK-9340] - make SparkSQL work with nested types in parquet-protobuf #8063

Closed

rdblue reviewed Aug 10, 2015

liancheng added 3 commits August 11, 2015 07:55

Fixes converting unannotated Parquet lists

420ad2b

Updates .rat-excludes

f1c7bfd

Moves ParquetProtobufCompatibilitySuite

ace6df7

liancheng force-pushed the spark-9340/unannotated-parquet-list branch from 450b606 to ace6df7 August 11, 2015 00:03

asfgit closed this in 071bbad Aug 11, 2015

liancheng deleted the spark-9340/unannotated-parquet-list branch August 11, 2015 04:49
