[SPARK-9340] [SQL] Fixes converting unannotated Parquet lists #8070
Test build #40299 has finished for PR 8070 at commit
Test build #40300 has finished for PR 8070 at commit
LGTM - working as expected.
Why isn't this a CatalystPrimitiveConverter with RepeatedConverter?
It's because `CatalystPrimitiveConverter` is defined as:

    private[parquet] class CatalystPrimitiveConverter(val updater: ParentContainerUpdater)
      extends PrimitiveConverter with HasParentContainerUpdater {
      ...
    }

The `val updater` part has two meanings:

1. `updater` is made a constructor argument, and
2. `def updater` in `HasParentContainerUpdater` is overridden since `updater` is a read-only `val`.

The 2nd fact prevents subclasses of `CatalystPrimitiveConverter` from overriding the `updater` field.
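
The two meanings of a `val` constructor parameter can be seen in a small standalone sketch. All names below (`Updater`, `HasUpdater`, `PrimitiveLike`, `RepeatedLike`) are hypothetical stand-ins, not Spark classes; the second class only illustrates one possible alternative, where the updater is derived in the class body instead of being fixed by the constructor.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark classes.
trait Updater { def name: String }
trait HasUpdater { def updater: Updater }

// `val` in the parameter list does two things at once: it declares a
// constructor argument and it implements `def updater` as a read-only value.
class PrimitiveLike(val updater: Updater) extends HasUpdater

// Alternative shape: accept the parent updater as a plain argument and
// derive `updater` from it in the class body.
class RepeatedLike(parentUpdater: Updater) extends HasUpdater {
  val updater: Updater = new Updater {
    def name: String = s"array(${parentUpdater.name})"
  }
}
```

With the first shape, the value of `updater` is always whatever the caller passed in; with the second, the class decides how to wrap it.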
Test build #40304 has finished for PR 8070 at commit
This looks good to me overall.
Test build #40355 has finished for PR 8070 at commit
This PR is inspired by #8063 authored by dguy. In particular, the testing Parquet files added here are all taken from that PR.

**Committer who merges this PR should attribute it to "Damian Guy <damian.guy@gmail.com>".**

----

SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement the backwards-compatibility rules defined in the `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`:

> This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field.

One consequence is that Parquet files generated by `parquet-protobuf` containing unannotated repeated fields are not correctly converted to Catalyst arrays.

This PR fixes this issue by

1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
2. Converting this kind of special repeated field to Catalyst arrays in `CatalystRowConverter`.

   Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate the actual conversion work to a child `elementConverter` and accumulate elements in an `ArrayBuffer`.

   Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater` so that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields and to propagate converted array values upstream.

Author: Cheng Lian <lian@databricks.com>

Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits:

ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite
f1c7bfd [Cheng Lian] Updates .rat-excludes
420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists

(cherry picked from commit 071bbad)
Signed-off-by: Cheng Lian <lian@databricks.com>
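
The `parquet-format` rule quoted in the commit message can be illustrated with a toy schema mapping. This is a sketch under stated assumptions: the types below (`ParquetField`, `CatalystField`, `ArrayOf`, and so on) are hypothetical simplifications rather than Parquet or Spark APIs, and the sketch only models primitive fields that are neither annotated nor nested in a `LIST`- or `MAP`-annotated group.

```scala
// Hypothetical, heavily simplified models; not Parquet or Spark classes.
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition
case object Repeated extends Repetition

final case class ParquetField(name: String, repetition: Repetition, typeName: String)

sealed trait CatalystType
final case class Atomic(name: String) extends CatalystType
final case class ArrayOf(element: CatalystType, containsNull: Boolean) extends CatalystType

final case class CatalystField(name: String, dataType: CatalystType, nullable: Boolean)

object UnannotatedListSketch {
  // Assumes `field` is unannotated and not inside a LIST/MAP-annotated group.
  def convert(field: ParquetField): CatalystField = field.repetition match {
    case Repeated =>
      // "A required list of required elements": the array itself is not
      // nullable and it contains no null elements.
      CatalystField(field.name, ArrayOf(Atomic(field.typeName), containsNull = false), nullable = false)
    case Optional =>
      CatalystField(field.name, Atomic(field.typeName), nullable = true)
    case Required =>
      CatalystField(field.name, Atomic(field.typeName), nullable = false)
  }
}
```

For example, a field modelled as `ParquetField("xs", Repeated, "int32")` maps to a non-nullable array of non-null `int32` elements in this sketch.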
Merged to master and branch-1.5.
This PR is inspired by #8063 authored by @dguy. In particular, the testing Parquet files added here are all taken from that PR.

Committer who merges this PR should attribute it to Damian Guy <damian.guy@gmail.com>.

SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement the backwards-compatibility rules defined in the `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`:

> This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field.

One consequence is that Parquet files generated by `parquet-protobuf` containing unannotated repeated fields are not correctly converted to Catalyst arrays.

This PR fixes this issue by

1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
2. Converting this kind of special repeated field to Catalyst arrays in `CatalystRowConverter`.

   Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate the actual conversion work to a child `elementConverter` and accumulate elements in an `ArrayBuffer`.

   Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater` so that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields and to propagate converted array values upstream.
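
The `start()`/`end()` mechanism described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code from this PR: the `ParentContainerUpdater` trait here is a simplified stand-in, and `ArrayUpdater` and `RepeatedFieldSketch` are hypothetical names. It assumes `start()` and `end()` are invoked once per record around the repeated values.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for the updater interface; only the parts needed
// for this sketch are modelled.
trait ParentContainerUpdater {
  def start(): Unit = ()
  def end(): Unit = ()
  def set(value: Any): Unit = ()
}

// Hypothetical updater for one unannotated repeated field: collects the
// elements of the current record and hands them to the parent as one value.
final class ArrayUpdater(parent: ParentContainerUpdater) extends ParentContainerUpdater {
  private var buffer = ArrayBuffer.empty[Any]

  // A fresh buffer per record keeps elements from leaking between records.
  override def start(): Unit = { buffer = ArrayBuffer.empty[Any] }

  // Called once for every occurrence of the repeated field.
  override def set(value: Any): Unit = { buffer += value }

  // Propagate the accumulated elements upstream as an immutable sequence.
  override def end(): Unit = { parent.set(buffer.toVector) }
}

object RepeatedFieldSketch {
  // Simulates converting one record whose repeated field holds `values`.
  def convert(values: Seq[Int]): Seq[Any] = {
    var result: Seq[Any] = Seq.empty
    val rowUpdater = new ParentContainerUpdater {
      override def set(value: Any): Unit = { result = value.asInstanceOf[Seq[Any]] }
    }
    val updater = new ArrayUpdater(rowUpdater)
    updater.start()
    values.foreach(v => updater.set(v))
    updater.end()
    result
  }
}
```

In this sketch, a record with no occurrences of the repeated field still yields an empty sequence rather than a missing value, which matches the "required list" reading of the spec.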