[SPARK-18108][SQL] Fix a schema inconsistent bug that makes a parquet reader fail to read data #16030
Conversation
|
This query passed in the released Spark 2.0.2, so it seems this regression is related to SPARK-18510. |
|
Test build #69230 has finished for PR 16030 at commit
|
@maropu I think you're doing the wrong thing. You have the value a=1 both as part of your data schema and as your partition directory, which is not allowed. I would say the correct behavior here would be to fail the read, because the value is both in the partition schema and the data schema, and it's possible that the two are not equal.
For example:
val df = Seq((1L, 2.0)).toDF("a", "b")
df.write.parquet(s"$path/a=2")
// this should fail because `a` is both part of the partitioning and schema
checkAnswer(spark.read.parquet(s"$path"), Seq(Row(1L, 2.0))) |
@maropu I wouldn't say this is a regression. I would say that this working for 2.0.2 was a bug in 2.0.2. If you want the column |
|
Or the thing that we should fix here is to throw an exception if a partition column is also found as part of the dataSchema. |
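A rough sketch of the check being proposed here; the schemas are made up for illustration, and a generic exception stands in for whatever Spark would actually throw:
```
import org.apache.spark.sql.types._

// Hypothetical schemas: column `a` appears both in the data files and in the
// partition directory name, with conflicting types.
val dataSchema = new StructType().add("a", LongType).add("b", DoubleType)
val partitionSchema = new StructType().add("a", IntegerType)

// Fail fast if any partition column also appears in the data schema.
val overlapped = dataSchema.filter(f =>
  partitionSchema.exists(p => p.name.equalsIgnoreCase(f.name)))
if (overlapped.nonEmpty) {
  throw new IllegalArgumentException(
    s"Columns ${overlapped.map(_.name).mkString(", ")} appear in both the data schema and the partition schema")
}
```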
|
@brkyvz Thanks for your comment! Okay, I'll fix it that way. |
|
Test build #69307 has started for PR 16030 at commit |
|
Jenkins, retest this please. |
|
Test build #69312 has finished for PR 16030 at commit
|
|
I'm looking into the failures. |
|
Test build #69336 has finished for PR 16030 at commit
|
|
@brkyvz How about this fix? |
|
@brkyvz @maropu Actually, we do allow users to create partitioned tables whose data schema contains (part of) the partition columns, and there are test cases for this use case. This use case is mostly useful when you are trying to reorganize an existing dataset into a partitioned form. Say you have a JSON dataset containing all the tweets in 2016 and you'd like to partition it by date. By allowing the data schema to contain partitioned columns, you may simply put JSON files of the same date into the same directory. Otherwise, you'd have to run an ETL job to erase the date column from the dataset, which can be time-consuming. As for the query @maropu mentioned in the PR description, the query itself is problematic, because it lacks a user-specified schema to override the data type of the partitioned column.
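To make the reorganization use case above concrete, here is a rough sketch of reading such a dataset with a user-specified schema; the path, column names, and types are hypothetical:
```
import org.apache.spark.sql.types._

// Hypothetical layout: /data/tweets/date=2016-01-01/*.json, where each JSON
// record also still contains a `date` field.
val tweetSchema = new StructType()
  .add("id", LongType)
  .add("text", StringType)
  .add("date", StringType) // pin the overlapped column's type explicitly

// Supplying the schema up front means the reader does not have to reconcile an
// inferred partition-column type with the type found in the data files.
val tweets = spark.read.schema(tweetSchema).json("/data/tweets")
```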
In short:
|
|
@maropu I tried your snippet (with minor modifications). It works on 1.6.0 but not on 2.0.2:
case class A(a: Long, b: Int)
val as = Seq(A(1, 2))
val path = "/tmp/part"
sqlContext.createDataFrame(as).write.mode("overwrite").parquet(s"$path/a=1/")
val df = sqlContext.read.parquet(path)
df.printSchema()
df.collect()
For 2.0.2, it throws exactly the same NPE. |
|
I also made this query work on 2.1 branch by configuring |
|
My hunch is that we somehow passed a wrong requested schema containing the partition column down to the vectorized Parquet reader. IIRC, we prune partition columns from the data schema when generating the requested schema for the underlying reader; since partition values are directly available in the directory path, there's no need to read and decode them from the physical file. |
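A minimal illustration of the pruning described here, using made-up schemas rather than the actual ParquetFileFormat internals:
```
import org.apache.spark.sql.types._

// `a` is both a data column and a partition column.
val dataSchema = new StructType().add("a", LongType).add("b", IntegerType)
val partitionSchema = new StructType().add("a", IntegerType)

// The schema requested from the physical file should exclude partition columns,
// since their values are recovered from the directory path, not decoded from the file.
val requestedSchema = StructType(
  dataSchema.filterNot(f => partitionSchema.fieldNames.contains(f.name)))
// requestedSchema now contains only `b`
```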
|
This is not a bug in
Yea, I know this functionality is helpful for skilful users; on the other hand, newbies could easily break query results via this interface, as @brkyvz said. Therefore, if we allow the data schema to contain (part of) the partition columns, IMO it'd be better to alert users to that risk via
BTW, this is the original fix of this bug (sorry, I wrongly overrode and removed that commit): master...maropu:SPARK-18108-2. The fix in that commit is to fill in the correct data types when the data schema contains (part of) the partition columns. |
|
@maropu The reason I call it a bug with the VectorizedParquetReader is that all other data sources always replace the value read from the data with the partition value. The VectorizedReader also doesn't throw an exception if you try the following example:
case class A(a: Long, b: Int)
val as = Seq(A(1, 2))
val path = "/tmp/part"
sqlContext.createDataFrame(as).write.mode("overwrite").parquet(s"$path/a=1480617712537/")
val df = sqlContext.read.parquet(path)
df.printSchema()
df.collect()
returns |
|
@maropu I think I found the simplest fix! Change spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala Line 180 in 2ab8551
to something like:
val dataSchema = userSpecifiedSchema.orElse {
  format.inferSchema(
    sparkSession,
    caseInsensitiveOptions,
    tempFileIndex.allFiles())
}.getOrElse {
  throw new AnalysisException(
    s"Unable to infer schema for $format. It must be specified manually.")
}
val dataWithoutPartitions = dataSchema.filterNot { field =>
  partitionSchema.exists(p => equality(p.name, field.name))
}
What we were missing was removing the partition columns from the data schema when we infer the format |
|
@brkyvz Thanks! Does the latest fix apply your suggestion? |
|
Test build #69528 has finished for PR 16030 at commit
|
I would still keep this below the if (justPartitioning) area, because otherwise every time someone performs a df.write.mode("append").saveAsTable() we will perform schema inference.
reverted
this change is not necessary
reverted
I would keep the code here like I mentioned above, but keep your changes.
okay, I reverted this.
no need for this else if
yea, removed this.
since this test is broken only when the vectorizedReader is on, I would also keep the conf here as enabled, just in case anyone changes the default implementation later.
so I would also wrap this as:
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {
  withTempPath { dir =>
    ...
  }
}
fixed.
|
@maropu I would still keep the changes I proposed below L180 like I commented before. We don't need to use the inferred data type as the partition type |
|
@cloud-fan Does the latest fix satisfy what you suggested? |
|
@liancheng As for |
StructType(dataSchema ++ partitionSchema.filterNot { column =>
  dataSchemaColumnNames.contains(column.name.toLowerCase)
val equality = sparkSession.sessionState.conf.resolver
val overriddenDataSchema = dataSchema.map { dataField =>
how about:
val getColName: (StructField => String) = if (conf.caseSensitive) _.name else _.name.toLowerCase
val overlappedPartCols = mutable.Map.empty[String, StructField]
for {
  dataField <- dataSchema
  partitionField <- partitionSchema
  if getColName(dataField) == getColName(partitionField)
} overlappedPartCols += getColName(partitionField) -> partitionField
StructType(dataSchema.map(f => overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => overlappedPartCols.contains(getColName(f))))
Why didn't you use sparkSession.sessionState.conf.resolver? Any reason I missed?
I just wrote this code in the same style as DataSource#getOrInferFileFormatSchema; is this a bad idea? Anyway, since the two patterns have the same output, either is okay with me.
Because the current code iterates over dataSchema many times (depending on the number of partition columns), while my proposal iterates over it only twice.
I modified it a bit; how about this?
|
Test build #70177 has finished for PR 16030 at commit
|
|
Test build #70178 has finished for PR 16030 at commit
|
if (sparkSession.sessionState.conf.caseSensitiveAnalysis) _.name else _.name.toLowerCase
val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  dataSchema.find(getColName(_) == getColName(partitionField)).map { overlappedCol =>
a bit clearer:
if (dataSchema.find(getColName(_) == getColName(partitionField)).isDefined) {
  overlappedPartCols += getColName(partitionField) -> partitionField
}
Fixed and thanks!
|
LGTM, pending jenkins. Can you also update the PR title and description? thanks! |
|
okay |
|
Test build #70186 has finished for PR 16030 at commit
|
|
@cloud-fan okay, I updated the desc. |
|
Test build #70190 has finished for PR 16030 at commit
|
|
Can you also update the title? And the description has a mistake: the logical layer trusts the data schema to infer the type of the overlapped partition columns, and, on the other hand, the physical layer trusts the partition schema, which is inferred from the path string. |
|
oh, I wrongly wrote it the opposite way... okay, fixed. cc: @cloud-fan |
|
thanks, merging to master/2.1! |
… reader fail to read data
## What changes were proposed in this pull request?
A vectorized parquet reader fails to read column data if the data schema and partition schema overlap with each other and the inferred types in the partition schema differ from those in the data schema. Example code to reproduce this bug is as follows:
```
scala> case class A(a: Long, b: Int)
scala> val as = Seq(A(1, 2))
scala> spark.createDataFrame(as).write.parquet("/data/a=1/")
scala> val df = spark.read.parquet("/data/")
scala> df.printSchema
root
|-- a: long (nullable = true)
|-- b: integer (nullable = true)
scala> df.collect
java.lang.NullPointerException
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:283)
at org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.getLong(ColumnarBatch.java:191)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```
The root cause is that the logical layer (`HadoopFsRelation`) and the physical layer (`VectorizedParquetRecordReader`) make different assumptions about the partition schema: the logical layer trusts the data schema to infer the type of the overlapped partition columns, while the physical layer trusts the partition schema inferred from the path string. To fix this bug, this PR simply updates `HadoopFsRelation.schema` to respect the overlapped partition columns' position in the data schema and their type in the partition schema.
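A self-contained sketch of that schema merge, using the column names and types from the reproduction above; it mirrors the approach discussed in the review, not the exact `HadoopFsRelation` code:
```
import scala.collection.mutable
import org.apache.spark.sql.types._

// Data schema read from the Parquet footers vs. partition schema inferred from `a=1`.
val dataSchema = new StructType().add("a", LongType).add("b", IntegerType)
val partitionSchema = new StructType().add("a", IntegerType)

val caseSensitive = false // stands in for sparkSession.sessionState.conf.caseSensitiveAnalysis
val getColName: (StructField => String) =
  if (caseSensitive) _.name else _.name.toLowerCase

// Collect partition columns that also appear in the data schema.
val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(f => getColName(f) == getColName(partitionField))) {
    overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

// Keep each overlapped column's position from the data schema but take its type
// from the partition schema; append the remaining partition columns at the end.
val schema = StructType(
  dataSchema.map(f => overlappedPartCols.getOrElse(getColName(f), f)) ++
    partitionSchema.filterNot(f => overlappedPartCols.contains(getColName(f))))
// schema: a is IntegerType (from the partition schema), b is IntegerType
```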
## How was this patch tested?
Add tests in `ParquetPartitionDiscoverySuite`
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #16030 from maropu/SPARK-18108.
(cherry picked from commit dc2a4d4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
|
Thanks! I found another weird behaviour related to this issue. Is this expected, or should we fix it? |
|
This was the behavior change your PR proposed before. I think it makes sense; you can send a PR to fix it in Spark 2.2. |
|
okay! I'll file a JIRA later. Thanks! |