Skip to content

Conversation

@kevinyu98
Copy link
Contributor

What changes were proposed in this pull request?

The help function 'toStructType' in the AttributeSeq class doesn't include the metadata when it builds the StructField, so it causes this reported problem https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when spark writes the the dataframe with the metadata to the parquet datasource.

The code path is when spark writes the dataframe to the parquet datasource through the InsertIntoHadoopFsRelationCommand, spark will build the WriteRelation container, and it will call the help function 'toStructType' to create StructType which contains StructField, it should include the metadata there, otherwise, we will lost the user provide metadata.

How was this patch tested?

added test case in ParquetQuerySuite.scala

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

get latest code from upstream
adding trim characters support

test("SPARK-15804: write out the metadata to parquet file") {
val data = (1, "abc") ::(2, "helloabcde") :: Nil
val df = spark.createDataFrame(data).toDF("a", "b")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge the lines 630 and 631 to Seq((1,"abc"),(2,"hello")).toDF("a", "b") instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I will do that.

@viirya
Copy link
Member

viirya commented Jun 8, 2016

Actually I already did this in #13371.

@cloud-fan
Copy link
Contributor

ok to test

val dfWithmeta = df.select('a, 'b.as("b", md))

withTempPath { dir =>
val path = s"${dir.getCanonicalPath}/data"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dir.getCanonicalPath is good here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will make change

@SparkQA
Copy link

SparkQA commented Jun 8, 2016

Test build #60200 has finished for PR 13555 at commit 200a923.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 9, 2016

Test build #60211 has finished for PR 13555 at commit a47dad4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val dfWithmeta = df.select('a, 'b.as("b", md))

withTempPath { dir =>
val path = s"${dir.getCanonicalPath}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just dir.getCanonicalPath, no need to wrap it inside ""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, Thanks very much.

@cloud-fan
Copy link
Contributor

LGTM, pending jenkins

@SparkQA
Copy link

SparkQA commented Jun 9, 2016

Test build #60222 has finished for PR 13555 at commit 229ca27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 9, 2016
## What changes were proposed in this pull request?
The help function 'toStructType' in the AttributeSeq class doesn't include the metadata when it builds the StructField, so it causes this reported problem https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when spark writes the the dataframe with the metadata to the parquet datasource.

The code path is when spark writes the dataframe to the parquet datasource through the InsertIntoHadoopFsRelationCommand, spark will build the WriteRelation container, and it will call the help function 'toStructType' to create StructType which contains StructField, it should include the metadata there, otherwise, we will lost the user provide metadata.

## How was this patch tested?

added test case in ParquetQuerySuite.scala

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Kevin Yu <qyu@us.ibm.com>

Closes #13555 from kevinyu98/spark-15804.

(cherry picked from commit 99386fe)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master and 2.0!

@asfgit asfgit closed this in 99386fe Jun 9, 2016
zjffdu pushed a commit to zjffdu/spark that referenced this pull request Jun 10, 2016
## What changes were proposed in this pull request?
The help function 'toStructType' in the AttributeSeq class doesn't include the metadata when it builds the StructField, so it causes this reported problem https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when spark writes the the dataframe with the metadata to the parquet datasource.

The code path is when spark writes the dataframe to the parquet datasource through the InsertIntoHadoopFsRelationCommand, spark will build the WriteRelation container, and it will call the help function 'toStructType' to create StructType which contains StructField, it should include the metadata there, otherwise, we will lost the user provide metadata.

## How was this patch tested?

added test case in ParquetQuerySuite.scala

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Kevin Yu <qyu@us.ibm.com>

Closes apache#13555 from kevinyu98/spark-15804.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants