[SPARK-31060][SQL] Handle column names containing dots in data source Filter (#27817)
```diff
@@ -652,10 +652,11 @@ object DataSourceStrategy {
    */
   object PushableColumn {
     def unapply(e: Expression): Option[String] = {
+      import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
       def helper(e: Expression) = e match {
         case a: Attribute => Some(a.name)
         case _ => None
       }
-      helper(e)
+      helper(e).map(quoteIfNeeded)
     }
   }
```
Contributor: OK, this is a behavior change.
Member (Author): Each element in
Contributor: Ah, sorry, I misread it with the v2 reference. So our plan is to use dots to separate nested fields in v1 references. Is this a 3.1 feature, or do we plan to make it into 3.0?
Member (Author): I would like to make it into 3.0, as this lays a good foundation for features such as pushdown on columns containing dots and nested predicate pushdown. Also, I feel it's a good time to have a small breaking change in 3.0 instead of 3.1. WDYT?
Contributor: Is this really needed to support column names containing dots? And a few thoughts on rolling it out more smoothly: if users use a v1 source with Spark 3.0 and Spark pushes down a filter with nested fields, the source would either fail or ignore the filter, since it doesn't recognize the JSON-array-encoded column name. However, if we use dots to separate field names, then a v1 source can be broken when a pushed-down filter points to a top-level column containing a dot.
Member (Author): We support column names containing `dot`. I feel special encoding makes the implementation really over-complicated. I believe that currently, only ORC supports column names containing `dot`.
Member: @dbtsai, I thought Parquet also supports dots in column names; see #27780. JSON array might not be an option either, because column names themselves can be JSON arrays:

```scala
scala> sql("""select 'a' as `["a", "'b"]`""")
res1: org.apache.spark.sql.DataFrame = [["a", "'b"]: string]
```

It's unlikely, but it is a possible breaking change. The only option I can think of now is to have a SQL conf that lists the data sources that support backquoting dots in column names (and don't backquote for nested column access).
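To illustrate why backtick quoting (rather than a JSON-array encoding) can still round-trip, here is a hedged sketch of a parser a source could use to split a quoted name back into parts. `parseParts` is a hypothetical helper for illustration, not a Spark API:

```scala
// Hypothetical sketch (not Spark's actual parser): split a quoted multi-part
// name into parts, treating backtick-quoted segments as a single part and
// a doubled backtick (``) inside quotes as an escaped backtick.
def parseParts(name: String): Seq[String] = {
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  val sb = new StringBuilder
  var i = 0
  var inQuote = false
  while (i < name.length) {
    val c = name(i)
    if (c == '`') {
      if (inQuote && i + 1 < name.length && name(i + 1) == '`') {
        sb += '`'   // escaped backtick inside a quoted part
        i += 1
      } else {
        inQuote = !inQuote
      }
    } else if (c == '.' && !inQuote) {
      parts += sb.toString  // unquoted dot separates nested-field parts
      sb.clear()
    } else {
      sb += c
    }
    i += 1
  }
  parts += sb.toString
  parts.toSeq
}
```

With such a parser, `` `a.b` `` yields one top-level column named `a.b`, while the bare `a.b` yields the two parts of a nested reference.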
Member (Author): @HyukjinKwon You are right. I'm just breaking #27780 into smaller PRs so it's easier to review. I chatted with @cloud-fan offline, and I'll close this one and work on the other, bigger one to avoid confusion. Let's discuss how we want to handle the API compatibility in the other PR.
Member: Thank you @dbtsai!
```diff
@@ -34,6 +34,7 @@ import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._

 import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils}
 import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quoteIfNeeded
 import org.apache.spark.sql.sources
 import org.apache.spark.unsafe.types.UTF8String

@@ -55,7 +56,7 @@ class ParquetFilters(
   // and it does not support to create filters for them.
   val primitiveFields =
     schema.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
-      f.getName -> ParquetField(f.getName,
+      quoteIfNeeded(f.getName) -> ParquetField(f.getName,
```
Member: Since Parquet is case-sensitive, could you check if we have a test case for this?

Member (Author): There is code handling this, and I will check if there is a test.
```diff
         ParquetSchemaType(f.getOriginalType,
           f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata))
     }
```
Do we have an assumption that these column names already match the original source names? Suddenly, I'm wondering if this is safe for all data sources, like JDBC? Some DBMSs, like PostgreSQL, are case-sensitive.
Parquet has some logic to handle what you described. I was wondering if it's safe for ORC as well.
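The case-sensitivity concern can be sketched as follows: a case-insensitive lookup has to drop field names that become ambiguous after case folding. This is an illustration of the idea only (an assumption, not `ParquetFilters`' actual code):

```scala
// Illustration: under a case-insensitive analyzer, physical field names are
// looked up through lower-cased keys, and names that clash after case
// folding must be treated as ambiguous and skipped for pushdown.
val fields = Seq("id", "ID", "value")
val byLowerCase = fields.groupBy(_.toLowerCase)
// Keep only keys that map back to exactly one original name.
val unambiguous = byLowerCase.collect { case (k, Seq(v)) => k -> v }
```

Here `"id"` and `"ID"` collide after folding, so only `"value"` survives as a safe pushdown target.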
Okay, I was confused about the reviewing order. This will be a breaking change against the `Stable` API; I noted this in #27780 (comment) too. There are non-DBMS sources too.
Is this a behavior change, or do we just clarify it in the comment?
When there is a `dot` in the column name, the existing `Filter` will not quote it. As a result, to extend it to support nested columns, we need to quote the `dot` in a name to distinguish it from the `dot` used as a separator between nested column parts. The behavior change only affects column names that contain a `dot`.
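The quoting described above changes what a v1 source sees in its attribute strings. A self-contained illustration, using a stand-in case class for `org.apache.spark.sql.sources.EqualTo` so the sketch runs without a Spark dependency (the exact quoted form is an assumption based on this diff):

```scala
// Stand-in for org.apache.spark.sql.sources.EqualTo; the real class lives
// in Spark, this is only for illustration.
case class EqualTo(attribute: String, value: Any)

// Before this PR, a filter on a top-level column literally named "a.b"
// arrived unquoted, indistinguishable from nested field b of struct a:
val before = EqualTo("a.b", 1)

// After, the dot inside a single name part is backtick-quoted, so a v1
// source can tell the two references apart:
val topLevel = EqualTo("`a.b`", 1)
val nested = EqualTo("a.b", 1) // dot still separates struct a from field b

assert(before.attribute == nested.attribute)
assert(topLevel.attribute != nested.attribute)
```

This is exactly the compatibility risk raised earlier in the thread: a v1 source that expected the old unquoted form would now see backticks for dotted top-level names.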