
Conversation

@maropu
Member

@maropu maropu commented Dec 27, 2016

What changes were proposed in this pull request?

This PR keeps the column ordering when a schema is explicitly specified.
A concrete example is as follows:

scala> import org.apache.spark.sql.types._
scala> case class A(a: Long, b: Int)
scala> val as = Seq(A(1, 2))
scala> spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/")
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/")
scala> df.printSchema
root
 |-- a: integer (nullable = true)
 |-- b: integer (nullable = true)

scala> val schema = new StructType().add("a", LongType).add("b", IntegerType)
scala> val df = spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/")
scala> df.printSchema
root
 |-- b: integer (nullable = true)
 |-- a: long (nullable = true)

This fix removes the code in getOrInferFileFormatSchema that filters out the fields of the data schema that overlap with partition columns, and then respects the user-specified column ordering in HadoopFsRelation#schema.
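The intended ordering behaviour can be sketched in plain Scala (a minimal sketch only; `Field` and `mergeSchemas` are hypothetical stand-ins for Spark's `StructField`/schema-merging internals, not its actual API):

```scala
// Sketch: keep the user-specified field order when the data schema
// overlaps with inferred partition columns. Fields appear in the order
// given by the user schema; partition-only columns are appended last.
case class Field(name: String, dataType: String)

def mergeSchemas(userSchema: Seq[Field], partitionSchema: Seq[Field]): Seq[Field] = {
  // Partition columns already named in the user schema keep their
  // user-specified position (and type); the rest go to the end.
  val partitionOnly =
    partitionSchema.filterNot(p => userSchema.exists(_.name == p.name))
  userSchema ++ partitionOnly
}

val user      = Seq(Field("a", "long"), Field("b", "int"))
val partition = Seq(Field("a", "int"))
mergeSchemas(user, partition)
// returns Seq(Field("a", "long"), Field("b", "int")) -- ordering follows the user schema
```

This mirrors the example above: with `schema(new StructType().add("a", LongType).add("b", IntegerType))`, the resulting DataFrame would report `a` before `b` instead of pushing the partition column to the end.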

How was this patch tested?

Added tests in ParquetPartitionDiscoverySuite.
This PR comes from SPARK-18108 (#16030).

@SparkQA

SparkQA commented Dec 27, 2016

Test build #70613 has finished for PR 16410 at commit 9174e7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Dec 27, 2016

I'm looking into the failure...

@SparkQA

SparkQA commented Dec 27, 2016

Test build #70630 has finished for PR 16410 at commit 466c590.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Dec 27, 2016

This fix changes some existing behaviour in data sources.
For instance:

scala> sql("""CREATE TABLE testTable(a INT, b INT, c INT, d INT) USING PARQUET PARTITIONED BY (b, c)""")
scala> sql("""INSERT INTO TABLE testTable PARTITION (b=14, c) SELECT 13, 15, 16""").explain
scala> sql("""SELECT * FROM testTable""").show
// A new column ordering
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
| 13| 14| 15| 16|
+---+---+---+---+
// An old column ordering
+---+---+---+---+
|  a|  d|  b|  c|
+---+---+---+---+
| 13| 15| 14| 16|
+---+---+---+---+

I'm not sure this is acceptable, so any advice is welcome.
cc: @cloud-fan

@cloud-fan
Contributor

Given create table t(a int, b int) partitioned by (a), the schema of table t is <b int, a int>.

This behavior is intentional and already published, so we cannot change it. What we should do is find the other places that don't follow this rule and make them respect it, i.e. you are doing the opposite thing.
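The published rule described here can be illustrated with a small plain-Scala sketch (an illustration only, not Spark's internal code): data columns come first in their declared order, and partition columns are appended at the end.

```scala
// Sketch of the documented table-schema rule: for
//   CREATE TABLE t(a INT, b INT) PARTITIONED BY (a)
// the table schema is <b int, a int> -- non-partition columns first,
// then the partition columns in partition-spec order.
def tableSchema(cols: Seq[String], partitionCols: Seq[String]): Seq[String] =
  cols.filterNot(partitionCols.contains) ++ partitionCols

tableSchema(Seq("a", "b"), Seq("a"))
// returns Seq("b", "a")
```

Under this rule, the "old column ordering" shown above (`a, d, b, c` for a table partitioned by `b, c`) is the expected one, which is why the proposed change goes in the wrong direction.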

@maropu
Member Author

maropu commented Dec 28, 2016

Aha, okay, I'll fix it that way. Thanks!

@SparkQA

SparkQA commented Jan 12, 2017

Test build #71269 has finished for PR 16410 at commit a522ea3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71308 has finished for PR 16410 at commit 1d92a8f.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71309 has finished for PR 16410 at commit 41d257e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71311 has finished for PR 16410 at commit abdea3a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71312 has finished for PR 16410 at commit 4e09c0e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71326 has finished for PR 16410 at commit c8be9de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 13, 2017

Jenkins, retest this please.


@SparkQA

SparkQA commented Jan 13, 2017

Test build #71336 has finished for PR 16410 at commit c8be9de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2017

Test build #71367 has started for PR 16410 at commit ba7bc17.

@maropu
Member Author

maropu commented Jan 14, 2017

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 14, 2017

Test build #71369 has finished for PR 16410 at commit ba7bc17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 15, 2017

I looked around the code and I now think this is the expected behaviour, so I'll close this. Thanks!

@maropu maropu closed this Jan 15, 2017
@SparkQA

SparkQA commented Jan 15, 2017

Test build #71385 has finished for PR 16410 at commit e3e095a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu deleted the SPARK-19005 branch July 5, 2017 11:44