[SPARK-31060][SQL] Handle column names containing dots in data source Filter
#27817
Conversation
This depends on #27814, and I'll rebase after the other PR is merged.
Force-pushed from d8d8d75 to 83c2656
Sorry, I'm a bit confused. @dbtsai can you give a list of your PRs that people should review, in order?
Test build #119429 has finished for PR 27817 at commit
@cloud-fan Sorry for the confusion. I'm planning to break the PR into multiple smaller PRs so reviewers can review them more easily. My plan is the following, in order; I'll rebase accordingly once the PRs are merged. SPARK-31060 #27817 Thanks,
Test build #119433 has finished for PR 27817 at commit
|
|
Test build #119426 has finished for PR 27817 at commit
Retest this please.
    sealed abstract class Filter {
      /**
       * List of columns that are referenced by this filter.
       * Note that, if a column contains `dots` in name, it will be quoted to avoid confusion.
Do we have an assumption that these column names match with the original source names already?
Suddenly, I'm wondering if this is safe for all data sources, like JDBC. Some DBMSs, like PostgreSQL, are case-sensitive.
Parquet has some logic to handle what you described. I was wondering if it's safe for ORC as well.
Okay, I was confused about the reviewing order. This will be a breaking change against a Stable API; I noted this in #27780 (comment) too. There are non-DBMS sources too.
Is this a behavior change, or are we just clarifying it in the comment?
When there is a dot in the column name, the existing Filter does not quote it. As a result, to extend it to support nested columns, we need to quote names containing dots to distinguish a literal dot from a dot used as the separator of a nested column.
The behavior change only affects column names that contain dots.
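To illustrate the quoting rule being discussed, here is a minimal sketch in Python (the actual helper is Spark's Scala `quoteIfNeeded`; the backtick-doubling escape shown here is an assumption based on Spark's backquoted-identifier convention, not a quote of the real implementation):

```python
def quote_if_needed(name: str) -> str:
    """Quote a column name only when it contains a dot (or a backtick),
    so a literal dot cannot be confused with the nested-field separator."""
    if "." in name or "`" in name:
        # Double any embedded backticks, mirroring Spark's
        # escaping convention for backquoted identifiers.
        return "`" + name.replace("`", "``") + "`"
    return name
```

A plain name like `a` passes through unchanged, while `a.b` becomes `` `a.b` ``, which is exactly the behavior change under discussion.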
      val primitiveFields =
        schema.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    -     f.getName -> ParquetField(f.getName,
    +     quoteIfNeeded(f.getName) -> ParquetField(f.getName,
Since Parquet is case-sensitive, could you check if we have a test case for this?
There is code handling this, and I will check if there is a test.
Test build #119488 has finished for PR 27817 at commit
BTW, the failure is relevant to this PR.
Test build #119493 has finished for PR 27817 at commit
@dongjoon-hyun I fixed the test. I'll think about whether there is a better way to handle this.
Test build #119507 has finished for PR 27817 at commit
Retest this please.
Test build #119538 has finished for PR 27817 at commit
        case _ => None
      }
    - helper(e)
    + helper(e).map(quoteIfNeeded)
OK, this is a behavior change.
Filter#references is an Array[String], so there is no confusion if we don't quote. What's our plan to support nested fields?
Each element in Filter#references is a column that the Filter refers to, and there can be multiple of them. As a result, quoting column names that contain dots is still needed.
Ah, sorry, I misread it as the v2 reference.
So our plan is to use dots to separate nested fields in v1 references. Is this a 3.1 feature, or do we plan to make it into 3.0?
I would like to make it into 3.0, as this lays a good foundation for features such as pushdown on columns containing dots and nested predicate pushdown. Also, I feel it's a good time to have a small breaking change in 3.0 instead of 3.1. WDYT?
Is this really needed to support column names containing dots?
And a few thoughts to roll it out more smoothly:
- if we refer to a top-level column, don't quote it (so no breaking change)
- if we refer to a nested field, use a special encoding (like a JSON array?)
If users use a v1 source with Spark 3.0 and Spark pushes down a filter with nested fields, the source would either fail or ignore the filter, as it doesn't recognize the JSON-array-encoded column name.
However, if we use dots to separate field names, then a v1 source can be broken if a pushed-down filter points to a top-level column containing dots.
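The ambiguity being debated can be made concrete: with quoting, a consumer can split a dot-separated v1 reference unambiguously, because quoted dots are literal and unquoted dots are separators. This is a hypothetical Python sketch of such a parser, not Spark code:

```python
def parse_reference(ref: str):
    """Split a dot-separated reference into name parts, treating
    backquoted segments as literal (a doubled backtick is an escape)."""
    parts, cur, in_quote = [], "", False
    i = 0
    while i < len(ref):
        ch = ref[i]
        if ch == "`":
            if in_quote and i + 1 < len(ref) and ref[i + 1] == "`":
                cur += "`"  # escaped backtick inside a quoted segment
                i += 1
            else:
                in_quote = not in_quote
        elif ch == "." and not in_quote:
            parts.append(cur)  # unquoted dot: nested-field separator
            cur = ""
        else:
            cur += ch
        i += 1
    parts.append(cur)
    return parts
```

Under this scheme, `` `a.b` `` is a single top-level column named `a.b`, while an unquoted `a.b` is field `b` of struct `a`; without quoting, the two are indistinguishable.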
We support column names containing dots in ORC.
I feel a special encoding makes the implementation really over-complicated. I believe that currently only ORC supports column names containing dots, so the impact is not much.
@dbtsai, I thought Parquet also supports dots in column names, per #27780.
Also, the problem is that there are multiple external sources. Dots can be used to express namespace-like annotations, just like our SQL configurations.
A JSON array might not be an option either, because column names themselves can be JSON arrays:

    scala> sql("""select 'a' as `["a", "'b"]`""")
    res1: org.apache.spark.sql.DataFrame = [["a", "'b"]: string]

It's unlikely, but it's a possible breaking change.
The only option I can think of now is to have a SQL conf that lists data sources that support backquoting dots in column names (and don't backquote for nested column access).
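The JSON-array collision can be checked mechanically: a column whose literal name happens to be a JSON array is indistinguishable from the JSON-array encoding of a nested field path. A small Python illustration (the column name here is a made-up example):

```python
import json

# JSON-array encoding of the nested field b inside struct a.
encoded_nested = json.dumps(["a", "b"])

# A top-level column literally named '["a", "b"]', which Spark SQL
# permits via backquoted identifiers.
literal_column_name = '["a", "b"]'

# The two strings collide, so the encoding alone cannot
# distinguish a nested path from a top-level column.
ambiguous = (encoded_nested == literal_column_name)
```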
@HyukjinKwon You are right. I'm just breaking #27780 into smaller PRs so it's easier to review. I chatted with @cloud-fan offline, and I'll close this one and work on the other, bigger one to avoid confusion. Let's discuss how we want to handle the API compatibility in the other PR.
Thank you @dbtsai!
What changes were proposed in this pull request?
In data source Filter, currently, if a column name contains dots, it is not quoted. This causes a couple of issues:
- It is hard to extend Filter to support nested column predicate pushdown, as many data sources such as Parquet and ORC use dots as separators for nested columns. This can be addressed by properly quoting names that contain dots, as this PR does.
- Because of the above, we currently handle the quoting in data source implementations, before converting the predicates into the specific implementation for a particular data source. We should handle it in the data source Filter to make it consistent.
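As a sketch of the intended contract (hypothetical Python mirroring the Scala quoteIfNeeded helper used in this PR's diff; the escaping detail is an assumption), a filter on a top-level column named `a.b` and a filter on nested field `b` of struct `a` now surface different reference strings:

```python
def quote_if_needed(name: str) -> str:
    # Quote names containing dots (or backticks) so a data source can
    # distinguish a literal dot from the nested-column separator.
    if "." in name or "`" in name:
        return "`" + name.replace("`", "``") + "`"
    return name

# Top-level column literally named "a.b" -> quoted reference.
top_level = quote_if_needed("a.b")

# Nested field b of struct a -> plain dot-separated reference.
nested = ".".join(["a", "b"])
```

With quoting, the two cases no longer collide, which is the foundation this PR lays for nested predicate pushdown.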
Why are the changes needed?
To handle column names containing dots more consistently.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs and new UTs.