[SPARK-11153] [SQL] Disables Parquet filter push-down for string and binary columns #9152
Due to PARQUET-251, `BINARY` columns in existing Parquet files may be written with corrupted statistics information. This information is used by filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.

This affects all Spark SQL data types that can be mapped to Parquet `BINARY`, namely:

- `StringType`
- `BinaryType`
- `DecimalType` (but Spark SQL doesn't support pushing down filters involving `DecimalType` columns for now)

To avoid wrong query results, we should disable filter push-down for columns of `StringType` and `BinaryType` until we upgrade to parquet-mr 1.8.
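For illustration, a minimal sketch of how such a guard could look, independent of the actual change in `ParquetFilters`. The object and method names (`ParquetPushDownGuard`, `canPushDown`, `isBinaryBacked`) are hypothetical and not part of Spark's API; the sketch only shows the idea of rejecting push-down for filters that touch `BINARY`-backed columns:

```scala
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Hypothetical helper: decide per-filter whether push-down is safe,
// given the Spark SQL schema of the Parquet file being scanned.
object ParquetPushDownGuard {

  // Spark SQL types stored as Parquet BINARY, i.e. the ones whose
  // statistics may be corrupted by PARQUET-251.
  private def isBinaryBacked(dt: DataType): Boolean = dt match {
    case StringType | BinaryType => true
    case _                       => false
  }

  // A filter may be pushed down only if none of the columns it
  // references are BINARY-backed.
  def canPushDown(filter: Filter, schema: StructType): Boolean = filter match {
    case EqualTo(attr, _)     => !isBinaryBacked(schema(attr).dataType)
    case LessThan(attr, _)    => !isBinaryBacked(schema(attr).dataType)
    case GreaterThan(attr, _) => !isBinaryBacked(schema(attr).dataType)
    case And(left, right)     => canPushDown(left, schema) && canPushDown(right, schema)
    case Or(left, right)      => canPushDown(left, schema) && canPushDown(right, schema)
    case Not(child)           => canPushDown(child, schema)
    case _                    => false // be conservative for anything else
  }
}
```

With a guard like this, a predicate such as `EqualTo("name", "foo")` against a `StringType` column would simply be evaluated by Spark after the scan rather than handed to the Parquet reader, trading some I/O for correct results until parquet-mr 1.8.x is in place.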