[SPARK-18539][SQL]: tolerate pushed-down filter on non-existing parquet columns #16156
Conversation
@gatorsmile @liancheng Thanks!
// PARQUET-389 will resolve this issue in Parquet 1.9, which may be used
// by future Spark versions. This is a workaround for current Spark version.
// Also the assumption here is that the predicates will be applied later
blocks = footer.getBlocks();
Could you please add a TODO item in the comment so that we won't forget to remove this after upgrading to 1.9? Thanks!
// TODO Remove this hack after upgrading to parquet-mr 1.9.
Yes. Will do. Thanks!
What about creating the JIRA issue first and including the JIRA number here in the comment?
Thanks. We may just mention SPARK-18140 here. It's for upgrading parquet-mr to 1.9.
+1
}

test("SPARK-18539 - filtered by non-existing parquet column") {
  Seq("parquet").foreach { format =>
Since this is a Parquet-only test case, we can just set format to "parquet" directly.
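A minimal, hypothetical sketch of the suggested simplification (the elided lines stand for the existing test body):

test("SPARK-18539 - filtered by non-existing parquet column") {
  // Parquet-only test case, so no need to iterate over Seq("parquet")
  val format = "parquet"
  // ... existing test body unchanged ...
}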
Ok. Thanks!
checkAnswer(readDf2.filter("c is null and a < 2"),
  Row(1, "abc", null) :: Nil)

readDf2.filter("c is null and a < 2").explain(true)
Nit: remove it. Thanks!
Oh, yeah. Forgot to remove it. Will do. Thanks!
// Right now, there are 2 parquet files: one with the column "c",
// the other without it.
val readDf2 = spark.read.schema(schema).format(format).load(path.toString)
By default, we don't enable Parquet schema merging. Could you please reorganize this test case into something like this to have both code paths tested?
Seq(true, false).foreach { merge =>
  test(s"SPARK-18539 - filtered by non-existing parquet column - schema merging enabled: $merge") {
    // ...
    val readDf2 = spark.read
      .option("mergeSchema", merge.toString)
      .schema(schema)
      .format(format)
      .load(path.toString)
    // ...
  }
}
I will add it. Thanks!
In fact, we probably also want to test both the normal Parquet reader and the vectorized Parquet reader, using a similar trick by setting SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key.
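A hedged sketch of how both reader code paths could be covered, assuming the suite mixes in Spark's SQLTestUtils (which provides withSQLConf) and reuses the body of the test above:

import org.apache.spark.sql.internal.SQLConf

Seq(true, false).foreach { vectorized =>
  test(s"SPARK-18539 - filtered by non-existing parquet column - vectorized reader: $vectorized") {
    withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
      // ... same body as above: write two parquet files, read them back with the
      //     wider schema, then checkAnswer on "c is null and a < 2" ...
    }
  }
}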
Since the code change is in SpecificParquetRecordReaderBase.initialize, which is a common code path for both Parquet readers, I did not include it originally. But it is safe to test both. I will add it. Thanks!
Yeah, my concern is that we are using the vectorized Parquet reader by default now and may not have enough test coverage for the normal Parquet reader. It would be good to have test cases to prevent regressions in the normal Parquet reader code path.
Please add a test case. Thanks!
Actually, PR #9940 should have already fixed this issue. I'm checking why it doesn't work under 2.0.1 and 2.0.2.
BTW, I think this PR is a cleaner fix than #9940, which introduces a temporary metadata while merging two …
// the column(s) in the filter, we don't do filtering at this level
// PARQUET-389 will resolve this issue in Parquet 1.9, which may be used
// by future Spark versions. This is a workaround for current Spark version.
// Also the assumption here is that the predicates will be applied later
What does this mean? PushedFilter seems to be eaten silently here.
// Also the assumption here is that the predicates will be applied later
No matter whether filters are pushed down to the Parquet reader or not, Spark always applies all the filters again at a higher level to ensure that every filter is applied as expected. Spark treats data source filter push-down in a "best effort" manner.
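A hedged illustration (not part of the PR) of this "best effort" behavior, reusing schema, path, checkAnswer and Row from the test case discussed above: with the workaround in place, the answer is the same whether Parquet filter push-down is enabled or not, because the Filter operator above the scan re-evaluates every predicate.

Seq("true", "false").foreach { pushdown =>
  spark.conf.set("spark.sql.parquet.filterPushdown", pushdown)
  val df = spark.read.schema(schema).parquet(path.toString)
  // Push-down only prunes row groups early; the returned rows are identical either way.
  checkAnswer(df.filter("c is null and a < 2"), Row(1, "abc", null) :: Nil)
}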
I see! Thank you, @liancheng .
The filters that are pushed down to the Parquet reader and fail Parquet's schema validation are ignored at this level. But after Spark SQL's Parquet reader is created with all the blocks for this particular Parquet file, the parent physical plan Filter operator re-applies the filters anyway.
Example:
== Physical Plan ==
*Project [a#2805, b#2806, c#2807]
+- *Filter ((isnotnull(a#2805) && isnull(c#2807)) && (a#2805 < 2))
+- *FileScan parquet [a#2805,b#2806,c#2807] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/Users/xinwu/spark/target/tmp/spark-ed6f0c12-6494-4ac5-b485-5b986ef475cc], PartitionFilters: [], PushedFilters: [IsNotNull(a), IsNull(c), LessThan(a,2)], ReadSchema: struct<a:int,b:string,c:int>
This is why my test case still works.
Thank you, @xwu0226 !
@liancheng I just wonder if it is safe to rely on exception handling here, since reading multiple Parquet files with a merged schema might be a common case. To my understanding, we know the schema of the target Parquet file from its footer, and the columns referenced by the filter. Could we maybe check this explicitly and simply use an if-else condition?
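A hedged sketch of that if-else idea, assuming we have Spark's data source Filters and the file's Parquet MessageType from the footer; the helper names referencedColumns and filtersSupportedByFile are hypothetical and not part of this PR:

import org.apache.parquet.schema.MessageType
import org.apache.spark.sql.sources._

// Columns referenced by a source filter; None means an unrecognized filter shape
// that we conservatively refuse to push down (hypothetical helper).
def referencedColumns(f: Filter): Option[Seq[String]] = f match {
  case EqualTo(a, _)            => Some(Seq(a))
  case LessThan(a, _)           => Some(Seq(a))
  case LessThanOrEqual(a, _)    => Some(Seq(a))
  case GreaterThan(a, _)        => Some(Seq(a))
  case GreaterThanOrEqual(a, _) => Some(Seq(a))
  case IsNull(a)                => Some(Seq(a))
  case IsNotNull(a)             => Some(Seq(a))
  case And(l, r) => for (a <- referencedColumns(l); b <- referencedColumns(r)) yield a ++ b
  case Or(l, r)  => for (a <- referencedColumns(l); b <- referencedColumns(r)) yield a ++ b
  case Not(c)    => referencedColumns(c)
  case _         => None
}

// Keep only the filters whose columns all exist in this file's schema, instead of
// relying on parquet-mr to throw when a filter column is missing.
def filtersSupportedByFile(filters: Seq[Filter], fileSchema: MessageType): Seq[Filter] =
  filters.filter(f => referencedColumns(f).exists(_.forall(fileSchema.containsField)))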
@xwu0226 Just verified that this issue also affects the normal Parquet reader (by setting spark.sql.parquet.enableVectorizedReader to false):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DataTypes._
import org.apache.spark.sql.types.{StructField, StructType}

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val jsonRDD = spark.sparkContext.parallelize(Seq("""{"a":1}"""))

spark.read
  .schema(StructType(Seq(StructField("a", IntegerType))))
  .json(jsonRDD)
  .write
  .mode("overwrite")
  .parquet("/tmp/test")

spark.read
  .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType, nullable = true))))
  .load("/tmp/test")
  .createOrReplaceTempView("table")

spark.sql("select b from table where b is not null").show()
@liancheng I see. In the normal Parquet reader, ParquetFileFormat is using Hadoop's …
For the normal Parquet reader case, we have the following code:

} else {
  logDebug(s"Falling back to parquet-mr")
  // ParquetRecordReader returns UnsafeRow
  val reader = pushed match {
    case Some(filter) =>
      new ParquetRecordReader[UnsafeRow](
        new ParquetReadSupport,
        FilterCompat.get(filter, null))
    case _ =>
      new ParquetRecordReader[UnsafeRow](new ParquetReadSupport)
  }
  reader.initialize(split, hadoopAttemptContext)
  reader
}

I am wondering whether we could try-catch the …
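A hypothetical sketch of that try/catch idea for the parquet-mr branch above, reusing pushed, split and hadoopAttemptContext from the quoted snippet. This is not what the PR ultimately does, and the exception type is an assumption about how parquet-mr rejects filters on unknown columns:

} else {
  logDebug(s"Falling back to parquet-mr")
  // Try to initialize with the pushed filter first; if parquet-mr rejects it
  // (e.g. the filter references a column missing from this file), fall back to an
  // unfiltered reader and let Spark's Filter operator evaluate the predicate.
  val reader =
    try {
      val filtered = pushed match {
        case Some(filter) =>
          new ParquetRecordReader[UnsafeRow](new ParquetReadSupport, FilterCompat.get(filter, null))
        case _ =>
          new ParquetRecordReader[UnsafeRow](new ParquetReadSupport)
      }
      filtered.initialize(split, hadoopAttemptContext)
      filtered
    } catch {
      case _: IllegalArgumentException => // assumed failure mode of Parquet's schema validation
        val plain = new ParquetRecordReader[UnsafeRow](new ParquetReadSupport)
        plain.initialize(split, hadoopAttemptContext)
        plain
    }
  reader
}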
Hey @xwu0226 and @gatorsmile, did some investigation, and I don't think this is a bug now. Please refer to my JIRA comment for more details.
Would there be another way to avoid the try-catch? I think this is normal read-path logic, and it might not be safe to rely on exception handling.
@liancheng Ah, thank you. I should have tested this first.
Test build #69688 has finished for PR 16156 at commit
Test build #72253 has finished for PR 16156 at commit
https://issues.apache.org/jira/browse/SPARK-19409 has been resolved to upgrade to parquet-1.8.2, which fixes this issue.
What changes were proposed in this pull request?
When spark.sql.parquet.filterPushdown is on, Spark SQL's Parquet reader pushes filters down to the Parquet file when creating the reader, in order to start with filtered blocks. However, when the Parquet file does not have the predicate column(s), parquet-mr throws an exception complaining that the filter column does not exist. This issue will be fixed in parquet-mr 1.9, but Spark 2.1 is still on parquet 1.8.

This PR tolerates such exceptions thrown by parquet-mr and simply returns all the blocks of the current Parquet file to the created Spark SQL Parquet reader. The filters will be applied again anyway in a later physical plan operation (see the example physical plan shown earlier in this conversation).
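A hedged, Scala-flavored sketch of this fallback (the actual change lives in the Java class SpecificParquetRecordReaderBase.initialize; the variable names, the metadata filter, and the exception type here are assumptions, not the exact patch):

import org.apache.parquet.filter2.compat.RowGroupFilter
import org.apache.parquet.format.converter.ParquetMetadataConverter.NO_FILTER
import org.apache.parquet.hadoop.ParquetFileReader

// Assumed in scope, as in the reader's initialize():
//   configuration: Hadoop Configuration, file: Path, filter: pushed-down FilterCompat.Filter
val footer = ParquetFileReader.readFooter(configuration, file, NO_FILTER)
val fileSchema = footer.getFileMetaData.getSchema
val blocks =
  try {
    // Normal path: let parquet-mr drop row groups that cannot match the pushed filter.
    RowGroupFilter.filterRowGroups(filter, footer.getBlocks, fileSchema)
  } catch {
    case _: IllegalArgumentException =>
      // Assumed failure mode: the pushed filter references a column this file lacks
      // (addressed by PARQUET-389 in parquet-mr 1.9). Fall back to all row groups;
      // Spark re-applies the predicate in the Filter operator above the scan.
      footer.getBlocks
  }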
How was this patch tested?
A unit test case is added.