
Conversation

@maropu
Member

@maropu maropu commented Dec 27, 2016

What changes were proposed in this pull request?

This PR keeps the column ordering when a schema is explicitly specified.
A concrete example is as follows:

scala> import org.apache.spark.sql.types._
scala> case class A(a: Long, b: Int)
scala> val as = Seq(A(1, 2))
scala> spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/")
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/")
scala> df.printSchema
root
 |-- a: integer (nullable = true)
 |-- b: integer (nullable = true)

scala> val schema = new StructType().add("a", LongType).add("b", IntegerType)
scala> val df = spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/")
scala> df.printSchema
root
 |-- b: integer (nullable = true)
 |-- a: long (nullable = true)

This fix removes the code in getOrInferFileFormatSchema that filters out the fields of the data schema that overlap with partition columns, and then respects the user-specified column ordering in HadoopFsRelation#schema.
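The intended ordering behaviour can be sketched in plain Scala (a minimal sketch only; `Field` and `mergeSchemas` are hypothetical stand-ins for Spark's `StructField`/schema-merging internals, not its actual API):

```scala
// Sketch: keep the user-specified field order when the data schema
// overlaps with inferred partition columns. Fields appear in the order
// given by the user schema; partition-only columns are appended last.
case class Field(name: String, dataType: String)

def mergeSchemas(userSchema: Seq[Field], partitionSchema: Seq[Field]): Seq[Field] = {
  // Partition columns already named in the user schema keep their
  // user-specified position (and type); the rest go to the end.
  val partitionOnly =
    partitionSchema.filterNot(p => userSchema.exists(_.name == p.name))
  userSchema ++ partitionOnly
}

val user      = Seq(Field("a", "long"), Field("b", "int"))
val partition = Seq(Field("a", "int"))
mergeSchemas(user, partition)
// returns Seq(Field("a", "long"), Field("b", "int")) -- ordering follows the user schema
```

This mirrors the example above: with `schema(new StructType().add("a", LongType).add("b", IntegerType))`, the resulting DataFrame would report `a` before `b` instead of pushing the partition column to the end.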

How was this patch tested?

Added tests in ParquetPartitionDiscoverySuite.
This PR comes from SPARK-18108 (#16030).

@SparkQA

SparkQA commented Dec 27, 2016

Test build #70613 has finished for PR 16410 at commit 9174e7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Dec 27, 2016

I'm looking into the failure...

@SparkQA

SparkQA commented Dec 27, 2016

Test build #70630 has finished for PR 16410 at commit 466c590.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Dec 27, 2016

This fix changes some existing behaviour in data sources.
For instance:

scala> sql("""CREATE TABLE testTable(a INT, b INT, c INT, d INT) USING PARQUET PARTITIONED BY (b, c)""")
scala> sql("""INSERT INTO TABLE testTable PARTITION (b=14, c) SELECT 13, 15, 16""").explain
scala> sql("""SELECT * FROM testTable""").show
// A new column ordering
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
| 13| 14| 15| 16|
+---+---+---+---+
// An old column ordering
+---+---+---+---+
|  a|  d|  b|  c|
+---+---+---+---+
| 13| 15| 14| 16|
+---+---+---+---+

I'm not sure this is acceptable, so any advice is welcome.
cc: @cloud-fan

@cloud-fan
Contributor

Given create table t(a int, b int) partitioned by (a), the schema of table t is <b int, a int>.

This behavior is intentional and already published, so we cannot change it. What we should do is find the other places that don't follow this rule and make them respect it, i.e. you are doing the opposite thing.
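The published rule described here can be illustrated with a small plain-Scala sketch (an illustration only, not Spark's internal code): data columns come first in their declared order, and partition columns are appended at the end.

```scala
// Sketch of the documented table-schema rule: for
//   CREATE TABLE t(a INT, b INT) PARTITIONED BY (a)
// the table schema is <b int, a int> -- non-partition columns first,
// then the partition columns in partition-spec order.
def tableSchema(cols: Seq[String], partitionCols: Seq[String]): Seq[String] =
  cols.filterNot(partitionCols.contains) ++ partitionCols

tableSchema(Seq("a", "b"), Seq("a"))
// returns Seq("b", "a")
```

Under this rule, the "old column ordering" shown above (`a, d, b, c` for a table partitioned by `b, c`) is the expected one, which is why the proposed change goes in the wrong direction.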

@maropu
Member Author

maropu commented Dec 28, 2016

Aha, okay, I'll fix it that way. Thanks!

@SparkQA

SparkQA commented Jan 12, 2017

Test build #71269 has finished for PR 16410 at commit a522ea3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71308 has finished for PR 16410 at commit 1d92a8f.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71309 has finished for PR 16410 at commit 41d257e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71311 has finished for PR 16410 at commit abdea3a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71312 has finished for PR 16410 at commit 4e09c0e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71326 has finished for PR 16410 at commit c8be9de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 13, 2017

Jenkins, retest this please.


@SparkQA

SparkQA commented Jan 13, 2017

Test build #71336 has finished for PR 16410 at commit c8be9de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2017

Test build #71367 has started for PR 16410 at commit ba7bc17.

@maropu
Member Author

maropu commented Jan 14, 2017

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 14, 2017

Test build #71369 has finished for PR 16410 at commit ba7bc17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 15, 2017

I looked around the code and I now think this is the expected behaviour, so I'll close this. Thanks!

@maropu maropu closed this Jan 15, 2017
@SparkQA

SparkQA commented Jan 15, 2017

Test build #71385 has finished for PR 16410 at commit e3e095a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu deleted the SPARK-19005 branch July 5, 2017 11:44