[SPARK-31060][SQL] Handle column names containing dots in data source Filter
#27817
Conversation
This depends on #27814, and I'll rebase after the other PR is merged.
Force-pushed from d8d8d75 to 83c2656
Sorry, I'm a bit confused. @dbtsai can you give a list of your PRs that people should review, in order?
Test build #119429 has finished for PR 27817 at commit
@cloud-fan Sorry for the confusion. I'm planning to break the PR into multiple smaller PRs so reviewers can review them more easily. My plan is the following, in order; I'll rebase accordingly once the PRs are merged. SPARK-31060 #27817 Thanks,
Test build #119433 has finished for PR 27817 at commit
|
|
Test build #119426 has finished for PR 27817 at commit
Retest this please.
    sealed abstract class Filter {
      /**
       * List of columns that are referenced by this filter.
       * Note that, if a column contains `dots` in name, it will be quoted to avoid confusion.
Do we have an assumption that these column names match with the original source names already?
Suddenly, I'm wondering if this is safe for all data sources, like JDBC. Some DBMSs, like PostgreSQL, are case-sensitive.
Parquet has some logic to handle what you described. I was wondering if it's safe for ORC as well.
Okay, I was confused about the reviewing order. This will be a breaking change against a Stable API; I noted this in #27780 (comment) too. There are non-DBMS sources too.
Is this a behavior change, or are we just clarifying it in the comment?
When there is a dot in the column name, the existing Filter does not quote it. As a result, to extend it to support nested columns, we need to quote names containing dots to distinguish a literal dot from a dot used as the separator of a nested column.
The behavior change only affects column names that contain dots.
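To illustrate the quoting rule being discussed, here is a minimal sketch in Python (the actual helper is Spark's Scala `quoteIfNeeded`; the backtick-doubling escape shown here is an assumption based on Spark's backquoted-identifier convention, not a quote of the real implementation):

```python
def quote_if_needed(name: str) -> str:
    """Quote a column name only when it contains a dot (or a backtick),
    so a literal dot cannot be confused with the nested-field separator."""
    if "." in name or "`" in name:
        # Double any embedded backticks, mirroring Spark's
        # escaping convention for backquoted identifiers.
        return "`" + name.replace("`", "``") + "`"
    return name
```

A plain name like `a` passes through unchanged, while `a.b` becomes `` `a.b` ``, which is exactly the behavior change under discussion.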
      val primitiveFields =
        schema.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    -     f.getName -> ParquetField(f.getName,
    +     quoteIfNeeded(f.getName) -> ParquetField(f.getName,
Since Parquet is case-sensitive, could you check if we have a test case for this?
There is code handling this, and I will check if there is a test.
Test build #119488 has finished for PR 27817 at commit
BTW, the failure is relevant to this PR.
Test build #119493 has finished for PR 27817 at commit
@dongjoon-hyun I fixed the test. I'll think about whether there is a better way to handle this.
Test build #119507 has finished for PR 27817 at commit
Retest this please.
Test build #119538 has finished for PR 27817 at commit
        case _ => None
      }
    - helper(e)
    + helper(e).map(quoteIfNeeded)
OK, this is a behavior change.
Filter#references is an Array[String], so there is no confusion if we don't quote. What's our plan to support nested fields?
Each element in Filter#references is a column that the Filter refers to, and there can be multiple of them. As a result, quoting column names that contain dots is still needed.
Ah, sorry, I misread it as the v2 reference.
So our plan is to use dots to separate nested fields in v1 references. Is this a 3.1 feature, or do we plan to make it into 3.0?
I would like to make it into 3.0, as this lays a good foundation for features such as pushdown on columns containing dots and nested predicate pushdown. Also, I feel it's a good time to have a small breaking change in 3.0 instead of 3.1. WDYT?
Is this really needed to support column names containing dots?
And a few thoughts to roll it out more smoothly:
- if we refer to a top-level column, don't quote it (so no breaking change)
- if we refer to a nested field, use a special encoding (like a JSON array?)
If users use a v1 source with Spark 3.0 and Spark pushes down a filter with nested fields, the source would either fail or ignore the filter, as it doesn't recognize the JSON-array-encoded column name.
However, if we use dots to separate field names, then a v1 source can be broken if a pushed-down filter points to a top-level column containing dots.
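The ambiguity being debated can be made concrete: with quoting, a consumer can split a dot-separated v1 reference unambiguously, because quoted dots are literal and unquoted dots are separators. This is a hypothetical Python sketch of such a parser, not Spark code:

```python
def parse_reference(ref: str):
    """Split a dot-separated reference into name parts, treating
    backquoted segments as literal (a doubled backtick is an escape)."""
    parts, cur, in_quote = [], "", False
    i = 0
    while i < len(ref):
        ch = ref[i]
        if ch == "`":
            if in_quote and i + 1 < len(ref) and ref[i + 1] == "`":
                cur += "`"  # escaped backtick inside a quoted segment
                i += 1
            else:
                in_quote = not in_quote
        elif ch == "." and not in_quote:
            parts.append(cur)  # unquoted dot: nested-field separator
            cur = ""
        else:
            cur += ch
        i += 1
    parts.append(cur)
    return parts
```

Under this scheme, `` `a.b` `` is a single top-level column named `a.b`, while an unquoted `a.b` is field `b` of struct `a`; without quoting, the two are indistinguishable.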
We support column names containing dots in ORC.
I feel a special encoding makes the implementation really over-complicated. I believe that currently only ORC supports column names containing dots, so the impact is not much.
@dbtsai, I thought Parquet also supports dots in column names, per #27780.
Also, the problem is that there are multiple external sources. Dots can be used to express namespace-like annotations, just like our SQL configurations.
A JSON array might not be an option either, because column names themselves can be JSON arrays:

    scala> sql("""select 'a' as `["a", "'b"]`""")
    res1: org.apache.spark.sql.DataFrame = [["a", "'b"]: string]

It's unlikely, but it's a possible breaking change.
The only option I can think of now is to have a SQL conf that lists data sources that support backquoting dots in column names (and don't backquote for nested column access).
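The JSON-array collision can be checked mechanically: a column whose literal name happens to be a JSON array is indistinguishable from the JSON-array encoding of a nested field path. A small Python illustration (the column name here is a made-up example):

```python
import json

# JSON-array encoding of the nested field b inside struct a.
encoded_nested = json.dumps(["a", "b"])

# A top-level column literally named '["a", "b"]', which Spark SQL
# permits via backquoted identifiers.
literal_column_name = '["a", "b"]'

# The two strings collide, so the encoding alone cannot
# distinguish a nested path from a top-level column.
ambiguous = (encoded_nested == literal_column_name)
```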
@HyukjinKwon You are right. I'm just breaking #27780 into smaller PRs so it's easier to review. I chatted with @cloud-fan offline, and I'll close this one and work on the other, bigger one to avoid confusion. Let's discuss how we want to handle the API compatibility in the other PR.
Thank you @dbtsai!
What changes were proposed in this pull request?
In data source Filter, currently, if a column name contains dots, it is not quoted. This causes a couple of issues:
- It is hard to extend Filter to support nested column predicate pushdown, as many data sources such as Parquet and ORC use dots as separators for nested columns. This can be addressed by properly quoting names that contain dots, as this PR does.
- Because of the above, we currently handle the quoting in data source implementations, before converting the predicates into the specific implementation for a particular data source. We should handle it in the data source Filter to make it consistent.
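As a sketch of the intended contract (hypothetical Python mirroring the Scala quoteIfNeeded helper used in this PR's diff; the escaping detail is an assumption), a filter on a top-level column named `a.b` and a filter on nested field `b` of struct `a` now surface different reference strings:

```python
def quote_if_needed(name: str) -> str:
    # Quote names containing dots (or backticks) so a data source can
    # distinguish a literal dot from the nested-column separator.
    if "." in name or "`" in name:
        return "`" + name.replace("`", "``") + "`"
    return name

# Top-level column literally named "a.b" -> quoted reference.
top_level = quote_if_needed("a.b")

# Nested field b of struct a -> plain dot-separated reference.
nested = ".".join(["a", "b"])
```

With quoting, the two cases no longer collide, which is the foundation this PR lays for nested predicate pushdown.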
Why are the changes needed?
To handle column names containing dots more consistently.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs and new UTs.