[SPARK-18108][SQL] Fix a schema inconsistent bug that makes a parquet reader fail to read data #16030
Conversation
|
This query passed in the released Spark 2.0.2, so it seems this regression is related to SPARK-18510. |
|
Test build #69230 has finished for PR 16030 at commit
|
@maropu I think you're doing the wrong thing. You have the value a=1 both as part of your data schema and as your partition directory, which is not allowed. I would say the correct behavior here would be to fail the read, because the value is both in the partition schema and the data schema, and it's possible that the two are not equal.
For example:
val df = Seq((1L, 2.0)).toDF("a", "b")
df.write.parquet(s"$path/a=2")
// this should fail because `a` is both part of the partitioning and schema
checkAnswer(spark.read.parquet(s"$path"), Seq(Row(1L, 2.0))) |
@maropu I wouldn't say this is a regression. I would say that this working for 2.0.2 was a bug in 2.0.2. If you want the column |
|
Or the thing that we should fix here is to throw an exception if a partition column is also found as part of the dataSchema. |
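A rough sketch of the check being proposed here; the schemas are made up for illustration, and a generic exception stands in for whatever Spark would actually throw:
```
import org.apache.spark.sql.types._

// Hypothetical schemas: column `a` appears both in the data files and in the
// partition directory name, with conflicting types.
val dataSchema = new StructType().add("a", LongType).add("b", DoubleType)
val partitionSchema = new StructType().add("a", IntegerType)

// Fail fast if any partition column also appears in the data schema.
val overlapped = dataSchema.filter(f =>
  partitionSchema.exists(p => p.name.equalsIgnoreCase(f.name)))
if (overlapped.nonEmpty) {
  throw new IllegalArgumentException(
    s"Columns ${overlapped.map(_.name).mkString(", ")} appear in both the data schema and the partition schema")
}
```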
|
@brkyvz Thanks for your comment! Okay, I'll fix it that way. |
|
Test build #69307 has started for PR 16030 at commit |
|
Jenkins, retest this please. |
|
Test build #69312 has finished for PR 16030 at commit
|
|
I'm looking into the failures. |
|
Test build #69336 has finished for PR 16030 at commit
|
|
@brkyvz How about this fix? |
|
@brkyvz @maropu Actually, we do allow users to create partitioned tables whose data schema contains (part of) the partition columns, and there are test cases for this use case. This use case is mostly useful when you are trying to reorganize an existing dataset into a partitioned form. Say you have a JSON dataset containing all the tweets in 2016 and you'd like to partition it by date. By allowing the data schema to contain partitioned columns, you may simply put JSON files of the same date into the same directory. Otherwise, you'd have to run an ETL job to erase the date column from the dataset, which can be time-consuming. As for the query @maropu mentioned in the PR description, the query itself is problematic, because it lacks a user-specified schema to override the data type of the partitioned column.
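To make the reorganization use case above concrete, here is a rough sketch of reading such a dataset with a user-specified schema; the path, column names, and types are hypothetical:
```
import org.apache.spark.sql.types._

// Hypothetical layout: /data/tweets/date=2016-01-01/*.json, where each JSON
// record also still contains a `date` field.
val tweetSchema = new StructType()
  .add("id", LongType)
  .add("text", StringType)
  .add("date", StringType) // pin the overlapped column's type explicitly

// Supplying the schema up front means the reader does not have to reconcile an
// inferred partition-column type with the type found in the data files.
val tweets = spark.read.schema(tweetSchema).json("/data/tweets")
```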
In short:
|
|
@maropu I tried your snippet (with minor modifications). It works on 1.6.0 but not on 2.0.2:
case class A(a: Long, b: Int)
val as = Seq(A(1, 2))
val path = "/tmp/part"
sqlContext.createDataFrame(as).write.mode("overwrite").parquet(s"$path/a=1/")
val df = sqlContext.read.parquet(path)
df.printSchema()
df.collect()
For 2.0.2, it throws exactly the same NPE. |
|
I also made this query work on 2.1 branch by configuring |
|
My hunch is that we somehow passed a wrong requested schema containing the partition column down to the vectorized Parquet reader. IIRC, we prune partition columns from the data schema when generating the requested schema for the underlying reader; since partition values are directly available in the directory path, there's no need to read and decode them from the physical file. |
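A minimal illustration of the pruning described here, using made-up schemas rather than the actual ParquetFileFormat internals:
```
import org.apache.spark.sql.types._

// `a` is both a data column and a partition column.
val dataSchema = new StructType().add("a", LongType).add("b", IntegerType)
val partitionSchema = new StructType().add("a", IntegerType)

// The schema requested from the physical file should exclude partition columns,
// since their values are recovered from the directory path, not decoded from the file.
val requestedSchema = StructType(
  dataSchema.filterNot(f => partitionSchema.fieldNames.contains(f.name)))
// requestedSchema now contains only `b`
```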
|
This is not a bug in
Yea, I know this functionality is helpful for skilful users; on the other hand, newbies could easily break query results via this interface, as @brkyvz said. Therefore, if we allow the data schema to contain (part of) the partition columns, IMO it'd be better to alert users to that risk via
BTW, this is the original fix of this bug (sorry, I wrongly overrode and removed that commit): master...maropu:SPARK-18108-2. The fix in that commit is to fill in the correct data types when the data schema contains (part of) the partition columns. |
|
@maropu The reason I call it a bug with the VectorizedParquetReader is that all other data sources always replace the value read from the data with the partition value. The VectorizedReader also doesn't throw an exception if you try the following example:
case class A(a: Long, b: Int)
val as = Seq(A(1, 2))
val path = "/tmp/part"
sqlContext.createDataFrame(as).write.mode("overwrite").parquet(s"$path/a=1480617712537/")
val df = sqlContext.read.parquet(path)
df.printSchema()
df.collect()
returns |
|
@maropu I think I found the simplest fix! Change spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala Line 180 in 2ab8551
to something like:
val dataSchema = userSpecifiedSchema.orElse {
  format.inferSchema(
    sparkSession,
    caseInsensitiveOptions,
    tempFileIndex.allFiles())
}.getOrElse {
  throw new AnalysisException(
    s"Unable to infer schema for $format. It must be specified manually.")
}
val dataWithoutPartitions = dataSchema.filterNot { field =>
  partitionSchema.exists(p => equality(p.name, field.name))
}
What we were missing was removing the partition columns from the data schema when we infer the format |
|
@brkyvz Thanks! Does the latest fix apply your suggestion? |
|
Test build #69528 has finished for PR 16030 at commit
|
I would still keep this below the if (justPartitioning) area, because otherwise every time someone performs a df.write.mode("append").saveAsTable() we will perform schema inference.
reverted
this change is not necessary
reverted
I would keep the code here like I mentioned above, but keep your changes.
okay, I reverted this.
no need for this else if
yea, removed this.
since this test is broken only when the vectorizedReader is on, I would also keep the conf here as enabled, just in case anyone changes the default implementation later.
so I would also wrap this as:
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {
  withTempPath { dir =>
    ...
  }
}
fixed.
|
@maropu I would still keep the changes I proposed below L180 like I commented before. We don't need to use the inferred data type as the partition type |
|
@cloud-fan Does the latest fix satisfy what you suggested? |
|
@liancheng As for |
StructType(dataSchema ++ partitionSchema.filterNot { column =>
  dataSchemaColumnNames.contains(column.name.toLowerCase)
val equality = sparkSession.sessionState.conf.resolver
val overriddenDataSchema = dataSchema.map { dataField =>
how about:
val getColName: (StructField => String) = if (conf.caseSensitive) _.name else _.name.toLowerCase
val overlappedPartCols = mutable.Map.empty[String, StructField]
for {
  dataField <- dataSchema
  partitionField <- partitionSchema
  if getColName(dataField) == getColName(partitionField)
} overlappedPartCols += getColName(partitionField) -> partitionField
StructType(dataSchema.map(f => overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => overlappedPartCols.contains(getColName(f))))
Why didn't you use sparkSession.sessionState.conf.resolver? Any reason I missed?
I just wrote this code in the same style as DataSource#getOrInferFileFormatSchema; is this a bad idea? Anyway, since the two patterns have the same output, either is okay with me.
Because the current code iterates over dataSchema many times (depending on the number of partition columns), while my proposal iterates over it only twice.
I modified it a bit; how about this?
|
Test build #70177 has finished for PR 16030 at commit
|
|
Test build #70178 has finished for PR 16030 at commit
|
if (sparkSession.sessionState.conf.caseSensitiveAnalysis) _.name else _.name.toLowerCase
val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  dataSchema.find(getColName(_) == getColName(partitionField)).map { overlappedCol =>
a bit clearer:
if (dataSchema.find(getColName(_) == getColName(partitionField)).isDefined) {
  overlappedPartCols += getColName(partitionField) -> partitionField
}
Fixed and thanks!
|
LGTM, pending jenkins. Can you also update the PR title and description? thanks! |
|
okay |
|
Test build #70186 has finished for PR 16030 at commit
|
|
@cloud-fan okay, I updated the desc. |
|
Test build #70190 has finished for PR 16030 at commit
|
|
Can you also update the title? And the description has a mistake: the logical layer trusts the data schema to infer the type of the overlapped partition columns, and, on the other hand, the physical layer trusts the partition schema, which is inferred from the path string. |
|
oh, I wrongly wrote it the opposite way... okay, fixed. cc: @cloud-fan |
|
thanks, merging to master/2.1! |
… reader fail to read data
## What changes were proposed in this pull request?
A vectorized parquet reader fails to read column data if the data schema and partition schema overlap with each other and the inferred types in the partition schema differ from those in the data schema. Example code to reproduce this bug is as follows:
```
scala> case class A(a: Long, b: Int)
scala> val as = Seq(A(1, 2))
scala> spark.createDataFrame(as).write.parquet("/data/a=1/")
scala> val df = spark.read.parquet("/data/")
scala> df.printSchema
root
|-- a: long (nullable = true)
|-- b: integer (nullable = true)
scala> df.collect
java.lang.NullPointerException
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:283)
at org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.getLong(ColumnarBatch.java:191)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```
The root cause is that the logical layer (`HadoopFsRelation`) and the physical layer (`VectorizedParquetRecordReader`) make different assumptions about the partition schema: the logical layer trusts the data schema to infer the type of the overlapped partition columns, while the physical layer trusts the partition schema inferred from the path string. To fix this bug, this PR simply updates `HadoopFsRelation.schema` to respect the overlapped partition columns' position in the data schema and their type in the partition schema.
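A self-contained sketch of that schema merge, using the column names and types from the reproduction above; it mirrors the approach discussed in the review, not the exact `HadoopFsRelation` code:
```
import scala.collection.mutable
import org.apache.spark.sql.types._

// Data schema read from the Parquet footers vs. partition schema inferred from `a=1`.
val dataSchema = new StructType().add("a", LongType).add("b", IntegerType)
val partitionSchema = new StructType().add("a", IntegerType)

val caseSensitive = false // stands in for sparkSession.sessionState.conf.caseSensitiveAnalysis
val getColName: (StructField => String) =
  if (caseSensitive) _.name else _.name.toLowerCase

// Collect partition columns that also appear in the data schema.
val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(f => getColName(f) == getColName(partitionField))) {
    overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

// Keep each overlapped column's position from the data schema but take its type
// from the partition schema; append the remaining partition columns at the end.
val schema = StructType(
  dataSchema.map(f => overlappedPartCols.getOrElse(getColName(f), f)) ++
    partitionSchema.filterNot(f => overlappedPartCols.contains(getColName(f))))
// schema: a is IntegerType (from the partition schema), b is IntegerType
```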
## How was this patch tested?
Add tests in `ParquetPartitionDiscoverySuite`
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #16030 from maropu/SPARK-18108.
(cherry picked from commit dc2a4d4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
|
Thanks! I found another weird behaviour related to this issue. Is this expected, or should we fix it? |
|
This was the behavior change your PR proposed before. I think it makes sense; you can send a PR to fix it in Spark 2.2. |
|
okay! I'll file a JIRA later. Thanks! |