
Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

Let's look at the following example:

```
val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet LOCATION '$location'")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()
```

The output column name in the schema will be `id` instead of `ID`, so the last query returns nothing from `tbl2`.
By enabling the debug messages we can see that the output name is changed from `ID` to `id`: the `outputColumns` in `InsertIntoHadoopFsRelationCommand` are rewritten by `RemoveRedundantAliases`.
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)

![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)

To guarantee correctness, we should change the output columns from `Seq[Attribute]` to `Seq[String]` so that the names cannot be replaced by the optimizer.
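
A minimal sketch of the idea (an illustrative trait, not the exact diff in this PR): the write command records the analysis-time names as plain strings and re-applies them to the optimized plan's output right before writing, so a rule like `RemoveRedundantAliases` cannot change what ends up in the file or table schema.

```
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical trait for illustration; the real commands extend DataWritingCommand.
trait WritesWithStableNames {
  def query: LogicalPlan              // input query plan; the optimizer may rewrite its attributes
  def outputColumnNames: Seq[String]  // names captured at analysis time, untouched by the optimizer

  // Attributes of the (possibly optimized) plan, renamed back to the original names.
  def outputColumns: Seq[Attribute] =
    query.output.zip(outputColumnNames).map { case (attr, name) => attr.withName(name) }
}
```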

I will fix project elimination related rules in #22311 after this one.

How was this patch tested?

Unit test.

@gengliangwang
Member Author

@wangyum @cloud-fan @maropu

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95606 has finished for PR 22320 at commit bbd572c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

* instead of `data.output`.
* @param outputColumnNames The original output column names of the input query plan. The
* optimizer may not preserve the output column's names' case, so we need
* this parameter instead of `data.output`.
Member

nit:

   * @param outputColumnNames The original output column names of the input query plan. The
   *                          optimizer may not preserve the output column's names' case, so we need
   *                          this parameter instead of `data.output`.

element.withName(names(index))
}
}

Member

If #22311 is merged, we won't need this function anymore, right? If so, IMHO it'd be better to fix this issue on the FileFormatWriter side as a workaround.

Contributor

+1

Contributor

or make it a util function

Member Author

It seems overkill to add a function here. But in FileFormatWriter we can't access the LogicalPlan to get the attributes.
Another option is to put this method in a util.
Do you have a better suggestion?

Member

I was thinking...

object FileFormatWriter {
  ...

  // workaround: a helper function...
  def outputWithNames(outputAttributes: Seq[Attribute], names: Seq[String]): Seq[Attribute] = {
     assert(outputAttributes.length == names.length,
       "The length of provided names doesn't match the length of output attributes.")
     outputAttributes.zipWithIndex.map { case (element, index) =>
       element.withName(names(index))
     }
   }

Then, at each call site, just say FileFormatWriter.outputWithNames(logicalPlan.output, names)?

Member Author

@maropu Thanks! I have created an object DataWritingCommand for this.

val resolved = cmd.copy(partitionColumns = resolvedPartCols, outputColumns = outputColumns)
val resolved = cmd.copy(
partitionColumns = resolvedPartCols,
outputColumnNames = outputColumns.map(_.name))
Contributor

why can't we use outputColumnNames directly here?

assert(outputAttributes.length == names.length,
"The length of provided names doesn't match the length of output attributes.")
outputAttributes.zipWithIndex.map { case (element, index) =>
element.withName(names(index))
Member

outputAttributes.zip(names).map { case (attr, outputName) => attr.withName(outputName) }?


@gengliangwang In what situations would outputAttributes.length != names.length? Could you give me an example?

query: LogicalPlan,
names: Seq[String]): Seq[Attribute] = {
// Save the output attributes to a variable to avoid duplicated function calls.
val outputAttributes = query.output
Member

query: LogicalPlan -> outputAttributes: Seq[Attribute] in the function argument, then drop the line above?

Member Author

@gengliangwang commented Sep 3, 2018

I think both are OK. The current way makes it easier to call this util function and easier to understand what the parameter should be, while the way you suggest makes the argument carry only the minimal information.
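
For reference, the two shapes under discussion look roughly like this (hypothetical standalone signatures, not the exact code in the PR):

```
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Current shape: callers hand over the whole plan and the helper pulls out the attributes.
def outputWithNames(query: LogicalPlan, names: Seq[String]): Seq[Attribute] =
  query.output.zip(names).map { case (attr, name) => attr.withName(name) }

// Suggested shape: callers pass only the attributes, so the argument carries minimal information.
def outputWithNamesFromAttributes(attrs: Seq[Attribute], names: Seq[String]): Seq[Attribute] =
  attrs.zip(names).map { case (attr, name) => attr.withName(name) }
```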

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95609 has finished for PR 22320 at commit bbd572c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95610 has finished for PR 22320 at commit bbd572c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}
}

Member

better to move these tests into DataFrameReaderWriterSuite?

overwrite: Boolean,
ifPartitionNotExists: Boolean,
outputColumns: Seq[Attribute]) extends SaveAsHiveFile {
outputColumnNames: Seq[String]) extends SaveAsHiveFile {
Member

For better test coverage, can you add tests for hive tables?

Member Author

No problem 👍

Member

thanks!


import org.apache.spark.{AccumulatorSuite, SparkException}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.catalyst.TableIdentifier
Contributor

unnecessary change

spark.sql("CREATE TABLE tbl2(COL1 long, COL2 int, COL3 int) USING parquet PARTITIONED " +
"BY (COL2) CLUSTERED BY (COL3) INTO 3 BUCKETS")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT COL1, COL2, COL3 " +
"FROM view1 CLUSTER BY COL3")
Contributor

is it legal to put CLUSTER BY in the INSERT statement?

test("Insert overwrite Hive table should output correct schema") {
withTable("tbl", "tbl2") {
withView("view1") {
spark.sql("CREATE TABLE tbl(id long)")
Contributor

please run this test within withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET -> false)

Member Author

I am not familiar with Hive. But looking at the debug message of this logical plan, the top level is InsertIntoHiveTable default.tbl2, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, true, false, [ID]. It should not be related to this configuration, right?
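
For reference, the wrapping suggested above would look roughly like this (assuming the test lives in a suite that mixes in SQLTestUtils so `withSQLConf` is available; whether the config actually matters here is the open question):

```
import org.apache.spark.sql.hive.HiveUtils

withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
  // ... existing "Insert overwrite Hive table should output correct schema" test body ...
}
```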

@cloud-fan
Contributor

LGTM except some minor comments

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95619 has finished for PR 22320 at commit 5bce8a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95620 has finished for PR 22320 at commit 16bb457.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95627 has finished for PR 22320 at commit 3c282ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("Insert overwrite table command should output correct schema: basic") {
withTable("tbl", "tbl2") {
withView("view1") {
val df = spark.range(10).toDF("id")
Contributor

Why is toDF("id") required? Why not spark.range(10) alone?

Member Author

This is trivial... As the column name id is case-sensitive and used below, I would like to show it explicitly.

Contributor

"case sensitive"? How is so since Spark SQL is case-insensitive by default?

Contributor

I think @gengliangwang meant case preserving, which is the behavior we are testing against.

spark.range(10).toDF("id") is the same as spark.range(10); it's just clearer to people who don't know that spark.range outputs a single column named "id".
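
For example, a quick check (not part of the PR) shows the default column name:

```
spark.range(10).printSchema()
// root
//  |-- id: long (nullable = false)
```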

spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql("CREATE TABLE tbl2(ID long) USING parquet")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
val identifier = TableIdentifier("tbl2", Some("default"))
Contributor

default is the default database name, isn't it? I'd remove it from the test or use spark.catalog.currentDatabase.

spark.sql("CREATE TABLE tbl2(COL1 long, COL2 int, COL3 int) USING parquet PARTITIONED " +
"BY (COL2) CLUSTERED BY (COL3) INTO 3 BUCKETS")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT COL1, COL2, COL3 FROM view1")
val identifier = TableIdentifier("tbl2", Some("default"))
Contributor

Same as above.

val identifier = TableIdentifier("tbl2", Some("default"))
val location = spark.sessionState.catalog.getTableMetadata(identifier).location.toString
val expectedSchema = StructType(Seq(
StructField("COL1", LongType, true),
Contributor

nullable is true by default.

Member Author

@gengliangwang commented Sep 4, 2018

Keeping it should be OK.

val location = spark.sessionState.catalog.getTableMetadata(identifier).location.toString
val expectedSchema = StructType(Seq(
StructField("COL1", LongType, true),
StructField("COL3", IntegerType, true),
Contributor

You could use a little magic here: $"COL1".int

scala> $"COL1".int
res1: org.apache.spark.sql.types.StructField = StructField(COL1,IntegerType,true)

overwrite = false,
ifPartitionNotExists = false,
outputColumns = outputColumns).run(sparkSession, child)
outputColumnNames = outputColumnNames).run(sparkSession, child)
Contributor

Can you remove one outputColumnNames?

withView("view1") {
withTempPath { path =>
spark.sql("CREATE TABLE tbl(id long)")
spark.sql("INSERT OVERWRITE TABLE tbl SELECT 4")
Contributor

s/SELECT/VALUES as it could be a bit more Spark-idiomatic?

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95633 has finished for PR 22320 at commit 98bf027.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql("CREATE TABLE tbl2(ID long)")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
checkAnswer(spark.table("tbl2"), Seq(Row(4)))
Member

Please add a schema assert too. We can read the data since SPARK-25132.

Member Author

Good point. I found that CreateHiveTableAsSelectCommand outputs a wrong schema after adding a new test case.
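
For illustration, a schema assertion in the spirit of the parquet tests above might look like this (the exact assertion added in this PR may differ, e.g. it could inspect the files under the table location instead):

```
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Check the schema recorded in the catalog for the written table, not just the data.
val identifier = TableIdentifier("tbl2", Some(spark.catalog.currentDatabase))
val tableSchema = spark.sessionState.catalog.getTableMetadata(identifier).schema
assert(tableSchema == StructType(Seq(StructField("ID", LongType))))
```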

@SparkQA

SparkQA commented Sep 4, 2018

Test build #95649 has finished for PR 22320 at commit 538fea9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 4, 2018

retest this please

assert(tableDesc.schema.isEmpty)
catalog.createTable(tableDesc.copy(schema = query.schema), ignoreIfExists = false)
val schema = DataWritingCommand.logicalPlanSchemaWithNames(query, outputColumnNames)
catalog.createTable(tableDesc.copy(schema = schema), ignoreIfExists = false)
Member Author

The schema naming needs to be consistent with outputColumnNames here.
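
For readers following along, a minimal sketch of what such a schema helper could look like (the actual `DataWritingCommand.logicalPlanSchemaWithNames` may differ in details):

```
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.types.{StructField, StructType}

// Rebuild the schema from the (possibly optimized) query output, but keep the
// analysis-time column names so the persisted metadata matches what the user wrote.
def logicalPlanSchemaWithNames(query: LogicalPlan, names: Seq[String]): StructType = {
  require(query.output.length == names.length,
    "The number of provided names doesn't match the number of output attributes.")
  StructType(query.output.zip(names).map { case (attr, name) =>
    StructField(name, attr.dataType, attr.nullable, attr.metadata)
  })
}
```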

@SparkQA

SparkQA commented Sep 4, 2018

Test build #95657 has finished for PR 22320 at commit 538fea9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 4, 2018

Test build #95663 has finished for PR 22320 at commit 3ca072d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

outputColumnNames: Seq[String])
extends DataWritingCommand {
import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils.escapePathName

Member

Line 66: query.schema should be DataWritingCommand.logicalPlanSchemaWithNames(query, outputColumnNames).

Member Author

Oh, then we can use this method instead.

def checkColumnNameDuplication(
      columnNames: Seq[String], colType: String, caseSensitiveAnalysis: Boolean): Unit
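
For illustration, a hedged example of how that utility might be invoked (the column names and error-context string below are made up; the actual call site in the PR may differ):

```
import org.apache.spark.sql.util.SchemaUtils

val outputColumnNames = Seq("ID", "id")
// Throws AnalysisException because the two names collide under case-insensitive analysis.
SchemaUtils.checkColumnNameDuplication(
  outputColumnNames,
  "in the output columns",            // colType: short description used in the error message
  caseSensitiveAnalysis = false)
```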

@SparkQA

SparkQA commented Sep 5, 2018

Test build #95692 has finished for PR 22320 at commit 4590c98.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

test("Insert overwrite table command should output correct schema: basic") {
withTable("tbl", "tbl2") {
withView("view1") {
val df = spark.range(10).toDF("id")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"case sensitive"? How is so since Spark SQL is case-insensitive by default?

overwrite = true,
ifPartitionNotExists = false,
outputColumns = outputColumns).run(sparkSession, child)
outputColumnNames = outputColumnNames).run(sparkSession, child)
Contributor

Why is this duplication needed here?

Contributor

what's the duplication?

Contributor

outputColumnNames themselves. Specifying outputColumnNames as the name of the parameter to set with outputColumnNames does nothing but introduce duplication. If you removed one outputColumnNames, comprehension would not suffer at all, would it?

Contributor

I feel it's better to specify parameters by name if the previous parameter is already specified by name, e.g. ifPartitionNotExists = false

withTable("tbl", "tbl2") {
withView("view1") {
spark.sql("CREATE TABLE tbl(id long)")
spark.sql("INSERT OVERWRITE TABLE tbl VALUES 4")
Contributor

I might be missing something, but why does this test use SQL statements not DataFrameWriter API, e.g. Seq(4).toDF("id").write.mode(SaveMode.Overwrite).saveAsTable("tbl")?

Contributor

We can, but it's important to keep the code style consistent with the existing code in the same file. In this test suite, it seems SQL statements are preferred.

@SparkQA

SparkQA commented Sep 5, 2018

Test build #95702 has finished for PR 22320 at commit 4590c98.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member

wangyum commented Sep 5, 2018

retest this please

@SparkQA

SparkQA commented Sep 5, 2018

Test build #95711 has finished for PR 22320 at commit 4590c98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit closed this in 3d6b68b Sep 6, 2018
@wangyum
Member

wangyum commented Sep 6, 2018

@gengliangwang We need to backport this PR to branch-2.3.

gengliangwang added a commit to gengliangwang/spark that referenced this pull request Sep 6, 2018

Closes apache#22320 from gengliangwang/fixOutputSchema.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 6, 2018
…put names

Port #22320 to branch-2.3

Closes #22346 from gengliangwang/portSchemaOutputName2.3.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>