[SPARK-15804][SQL]Include metadata in the toStructType #13555

kevinyu98 · 2016-06-08T07:27:28Z

What changes were proposed in this pull request?

The helper function 'toStructType' in the AttributeSeq class does not include the metadata when it builds each StructField. This causes the problem reported in https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when Spark writes a DataFrame with metadata to the Parquet data source.

The code path is as follows: when Spark writes a DataFrame to the Parquet data source through InsertIntoHadoopFsRelationCommand, it builds the WriteRelation container and calls the helper function 'toStructType' to create a StructType made up of StructFields. The metadata should be included at that point; otherwise, the user-provided metadata is lost.
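The effect described above can be illustrated with a minimal sketch. It does not use the real Spark classes: Attr, Field, Schema, and both conversion functions are hypothetical stand-ins for Attribute, StructField, StructType, and 'toStructType', and metadata is modeled as a plain Map purely for illustration.

```scala
// Hypothetical, simplified stand-ins for Spark's Attribute, StructField and
// StructType. They only model the parts relevant to this bug.
final case class Attr(
    name: String,
    dataType: String,
    nullable: Boolean,
    metadata: Map[String, String] = Map.empty)

final case class Field(
    name: String,
    dataType: String,
    nullable: Boolean,
    metadata: Map[String, String] = Map.empty)

final case class Schema(fields: Seq[Field])

object ToStructTypeSketch {
  // Models the behavior before the fix: metadata is not passed along,
  // so every field silently falls back to the empty default.
  def toSchemaDroppingMetadata(attrs: Seq[Attr]): Schema =
    Schema(attrs.map(a => Field(a.name, a.dataType, a.nullable)))

  // Models the behavior after the fix: metadata is copied from each attribute.
  def toSchemaKeepingMetadata(attrs: Seq[Attr]): Schema =
    Schema(attrs.map(a => Field(a.name, a.dataType, a.nullable, a.metadata)))

  def main(args: Array[String]): Unit = {
    val attrs = Seq(
      Attr("a", "int", nullable = false),
      Attr("b", "string", nullable = true, metadata = Map("key" -> "value")))

    // Print the per-field metadata produced by each variant.
    println(toSchemaDroppingMetadata(attrs).fields.map(_.metadata))
    println(toSchemaKeepingMetadata(attrs).fields.map(_.metadata))
  }
}
```

In this model the only difference between the two variants is the fourth constructor argument, which is why the loss is easy to miss: the schema still has the right names, types, and nullability.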

How was this patch tested?

Added a test case in ParquetQuerySuite.scala.



jaceklaskowski · 2016-06-08T10:58:36Z

...re/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala

+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val data = (1, "abc") ::(2, "helloabcde") :: Nil
+    val df = spark.createDataFrame(data).toDF("a", "b")


Merge lines 630 and 631 into Seq((1, "abc"), (2, "hello")).toDF("a", "b") instead.

sure, I will do that.

viirya · 2016-06-08T18:02:22Z

Actually I already did this in #13371.

cloud-fan · 2016-06-08T21:44:07Z

ok to test

cloud-fan · 2016-06-08T21:46:28Z

...re/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala

+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}/data"


I think dir.getCanonicalPath is good here.

ok, I will make change

SparkQA · 2016-06-08T23:33:16Z

Test build #60200 has finished for PR 13555 at commit 200a923.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-09T02:35:03Z

Test build #60211 has finished for PR 13555 at commit a47dad4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-06-09T04:44:12Z

...re/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala

+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}"


nit: just use dir.getCanonicalPath; there is no need to wrap it in an interpolated string.

Done, Thanks very much.
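As a small illustration of the nit above: wrapping a single String expression in an s-interpolated string yields the same value as the expression itself. The sketch below is self-contained and uses only java.io.File; the object and method names are hypothetical and not part of the patch.

```scala
import java.io.File

object PathInterpolationSketch {
  // The form that was flagged in review: an interpolated string that
  // contains nothing except the expression.
  def wrapped(dir: File): String = s"${dir.getCanonicalPath}"

  // The suggested form: use the String returned by getCanonicalPath directly.
  def direct(dir: File): String = dir.getCanonicalPath

  def main(args: Array[String]): Unit = {
    // Use the JVM's temp directory as a directory that should exist.
    val dir = new File(System.getProperty("java.io.tmpdir"))
    assert(wrapped(dir) == direct(dir))
    println(direct(dir))
  }
}
```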

cloud-fan · 2016-06-09T05:07:14Z

LGTM, pending jenkins

SparkQA · 2016-06-09T06:49:04Z

Test build #60222 has finished for PR 13555 at commit 229ca27.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request?

The helper function 'toStructType' in the AttributeSeq class does not include the metadata when it builds each StructField. This causes the problem reported in https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when Spark writes a DataFrame with metadata to the Parquet data source.

The code path is as follows: when Spark writes a DataFrame to the Parquet data source through InsertIntoHadoopFsRelationCommand, it builds the WriteRelation container and calls the helper function 'toStructType' to create a StructType made up of StructFields. The metadata should be included at that point; otherwise, the user-provided metadata is lost.

## How was this patch tested?

Added a test case in ParquetQuerySuite.scala.

Author: Kevin Yu <qyu@us.ibm.com>

Closes #13555 from kevinyu98/spark-15804.

(cherry picked from commit 99386fe)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2016-06-09T16:51:12Z

thanks, merging to master and 2.0!


kevinyu98 added 27 commits April 20, 2016 11:06

adding testcase

3b44c59

Merge remote-tracking branch 'upstream/master'

18b4a31

Merge remote-tracking branch 'upstream/master'

4f4d1c8

get latest code from upstream

Merge remote-tracking branch 'upstream/master'

f5f0cbe

adding trim characters support

Merge remote-tracking branch 'upstream/master'

d8b2edb

get latest code for pr12646

Merge remote-tracking branch 'upstream/master'

196b6c6

merge latest code

Merge remote-tracking branch 'upstream/master'

f37a01e

merge upstream/master

Merge remote-tracking branch 'upstream/master'

bb5b01f

Merge remote-tracking branch 'upstream/master'

bde5820

Merge remote-tracking branch 'upstream/master'

5f7cd96

Merge remote-tracking branch 'upstream/master'

893a49a

Merge remote-tracking branch 'upstream/master'

4bbe1fd

Merge remote-tracking branch 'upstream/master'

b2dd795

Merge remote-tracking branch 'upstream/master'

8c3e5da

Merge remote-tracking branch 'upstream/master'

a0eaa40

Merge remote-tracking branch 'upstream/master'

d03c940

Merge remote-tracking branch 'upstream/master'

d728d5e

Merge remote-tracking branch 'upstream/master'

ea104dd

Merge remote-tracking branch 'upstream/master'

6ab1215

Merge remote-tracking branch 'upstream/master'

0c56653

Merge remote-tracking branch 'upstream/master'

d7a1874

Merge remote-tracking branch 'upstream/master'

85d3500

Merge remote-tracking branch 'upstream/master'

c056f91

Merge remote-tracking branch 'upstream/master'

0b8189d

Merge remote-tracking branch 'upstream/master'

c2ea31d

Merge remote-tracking branch 'upstream/master'

a2d3056

include metadata when writeRelation

1d451d2

jaceklaskowski reviewed Jun 8, 2016

address comments for the testcase

200a923

cloud-fan reviewed Jun 8, 2016

address test case comments

a47dad4

cloud-fan reviewed Jun 9, 2016

address comment

229ca27

asfgit closed this in 99386fe Jun 9, 2016
