
Conversation

@dbtsai (Member) commented Mar 5, 2020

What changes were proposed in this pull request?

In the data source Filter, a column name containing dots is currently not quoted. This causes a couple of issues.

  1. It is hard to extend Filter to support nested column predicate pushdown, since many data sources such as Parquet and ORC use dots as separators for nested columns. This can be addressed by properly quoting names containing dots, as this PR does.

  2. Because of the above issue, quoting is currently handled in each data source implementation before the predicates are converted into that source's specific representation. We should handle it in the data source Filter itself so the behavior is consistent.
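To illustrate why unquoted dots are ambiguous, here is a minimal sketch (not code from this PR; the values are illustrative):

```scala
// Both a nested field b inside struct a and a single top-level column
// literally named "a.b" serialize to the same unquoted string, so a data
// source cannot tell them apart without quoting.
val nestedFieldPath = Seq("a", "b")   // field b inside struct a
val dottedColumn    = Seq("a.b")      // one top-level column named "a.b"

println(nestedFieldPath.mkString("."))  // a.b
println(dottedColumn.mkString("."))     // a.b
```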

Why are the changes needed?

To handle column names containing dots more consistently.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs and new UTs.

@dbtsai (Member, Author) commented Mar 5, 2020

This depends on #27814, and I'll rebase after the other PR is merged.

@dbtsai dbtsai changed the title [SPARK-31060] Handle column names containing dots in data source Filter [SPARK-31060] [SQL] [test-hive1.2] Handle column names containing dots in data source Filter Mar 5, 2020
@dbtsai dbtsai force-pushed the SPARK-31060 branch 2 times, most recently from d8d8d75 to 83c2656 Compare March 6, 2020 00:29
@cloud-fan (Contributor) commented
Sorry I'm a bit confused. @dbtsai can you give a list of your PRs that people should review in order?

@SparkQA commented Mar 6, 2020

Test build #119429 has finished for PR 27817 at commit d8d8d75.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member, Author) commented Mar 6, 2020

@cloud-fan Sorry for the confusion. I am thinking of breaking the PR into multiple smaller PRs so reviewers can review them more easily.

My plan is as follows, in order. I'll rebase accordingly once the PRs are merged.

SPARK-31060 #27817
SPARK-31026 #27780
SPARK-17636 #27728

Thanks,

@SparkQA commented Mar 6, 2020

Test build #119433 has finished for PR 27817 at commit 83c2656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 6, 2020

Test build #119426 has finished for PR 27817 at commit 0a96333.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31060] [SQL] [test-hive1.2] Handle column names containing dots in data source Filter [SPARK-31060][SQL] Handle column names containing dots in data source Filter Mar 6, 2020
@dongjoon-hyun (Member) commented

Retest this please.

sealed abstract class Filter {
  /**
   * List of columns that are referenced by this filter.
   * Note that, if a column name contains dots, it will be quoted to avoid confusion.
Member:

Do we have an assumption that these column names match with the original source names already?
Suddenly, I'm wondering if this is safe for all data sources like JDBC? Some DBMS like PostgreSQL is case-sensitive.

Member (Author):

Parquet has some logic to handle what you described. I was wondering whether it's safe for ORC as well.

@HyukjinKwon (Member) commented Mar 9, 2020:

Okay, I was confused about the reviewing order. This will be a breaking change against a Stable API; I noted this in #27780 (comment) too. There are non-DBMS sources too.

Contributor:

Is this a behavior change, or are we just clarifying it in the comment?

Member (Author):

When there is a dot in a column name, the existing Filter does not quote it. As a result, to extend it to support nested columns, we need to quote dots in names to distinguish them from the dot used as a separator for nested columns.

The behavior change only affects column names that contain dots.
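A minimal sketch of the quoting rule under discussion (the name `quoteIfNeeded` and the exact identifier pattern are assumptions for illustration, not necessarily Spark's implementation):

```scala
// Hypothetical sketch: leave simple identifiers alone, backquote anything
// else (escaping embedded backticks), so the column name "a.b" becomes
// `a.b` and is distinguishable from the nested-field path a.b.
def quoteIfNeeded(name: String): String =
  if (name.matches("[a-zA-Z0-9_]+")) name
  else "`" + name.replace("`", "``") + "`"

println(quoteIfNeeded("age"))   // age
println(quoteIfNeeded("a.b"))   // `a.b`
```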

  val primitiveFields =
    schema.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
-     f.getName -> ParquetField(f.getName,
+     quoteIfNeeded(f.getName) -> ParquetField(f.getName,
Member:

Since Parquet is case-sensitive, could you check if we have a test case for this?

Member (Author):

There is code handling this, and I will check if there is a test.

@SparkQA commented Mar 6, 2020

Test build #119488 has finished for PR 27817 at commit 65e2d24.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented

BTW, the failure is relevant to this PR.

OrcV2QuerySuite.SPARK-25579 ORC PPD should support column names with dot

@SparkQA commented Mar 7, 2020

Test build #119493 has finished for PR 27817 at commit 65e2d24.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member, Author) commented Mar 7, 2020

@dongjoon-hyun I fixed the test. I'll think about whether there is a better way to handle this.

@SparkQA commented Mar 7, 2020

Test build #119507 has finished for PR 27817 at commit c5fc25f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented

Retest this please.

@SparkQA commented Mar 8, 2020

Test build #119538 has finished for PR 27817 at commit c5fc25f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      case _ => None
    }
-   helper(e)
+   helper(e).map(quoteIfNeeded)
Contributor:

OK, this is a behavior change.

Filter#references is an Array[String], so there is no confusion if we don't quote. What's our plan to support nested fields?

Member (Author):

Each element in Filter#references is a column that this Filter refers to, and there can be multiple of them. As a result, quoting column names that contain dots is still needed.
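As a sketch of what quoted references would look like, assuming a quoteIfNeeded-style rule has been applied (the classes below are simplified stand-ins for the org.apache.spark.sql.sources filters, not the real API):

```scala
// Toy versions of the v1 source filters, just to show how the
// references array carries one entry per referenced column, with the
// dotted name in its quoted form.
case class EqualTo(attribute: String, value: Any) {
  def references: Array[String] = Array(attribute)
}
case class And(left: EqualTo, right: EqualTo) {
  def references: Array[String] = left.references ++ right.references
}

val f = And(EqualTo("age", 30), EqualTo("`a.b`", 1))
println(f.references.mkString(", "))  // age, `a.b`
```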

Contributor:

Ah, sorry, I misread it as the v2 reference.

So our plan is to use dots to separate nested fields in the v1 reference. Is this a 3.1 feature, or do we plan to get it into 3.0?

Member (Author):

I would like to get it into 3.0, as this lays a good foundation for features such as pushdown on columns containing dots and nested predicate pushdown. Also, I feel it's better to have a small breaking change in 3.0 than in 3.1. WDYT?

Contributor:

Is this really needed to support column names containing dots?

And a few thoughts on rolling it out more smoothly:

  1. If we refer to a top-level column, don't quote it (so there is no breaking change).
  2. If we refer to a nested field, use a special encoding (like a JSON array?).

If users use a v1 source with Spark 3.0 and Spark pushes down a filter with nested fields, the source would either fail or ignore the filter, since it doesn't recognize the JSON-array-encoded column name.

However, if we use dots to separate field names, a v1 source can break when a pushed-down filter points to a top-level column containing a dot.
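The two encodings being weighed can be sketched as follows (the helper names are hypothetical, for illustration only):

```scala
// Option A: dot-separated path, backquoting parts that contain special
// characters such as dots. Familiar to v1 sources, but a breaking change
// for top-level names containing dots.
def encodeDotSeparated(path: Seq[String]): String =
  path.map { p =>
    if (p.forall(c => c.isLetterOrDigit || c == '_')) p
    else "`" + p.replace("`", "``") + "`"
  }.mkString(".")

// Option B: JSON-array-style encoding. Never ambiguous with dots, but
// unrecognizable to existing v1 sources.
def encodeJsonArray(path: Seq[String]): String =
  path.map(p => "\"" + p.replace("\"", "\\\"") + "\"").mkString("[", ",", "]")

println(encodeDotSeparated(Seq("a", "b")))   // a.b
println(encodeDotSeparated(Seq("a.b")))      // `a.b`
println(encodeJsonArray(Seq("a", "b")))      // ["a","b"]
```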

Contributor:

Implementation-wise, I think it's easier to use dots to separate field names, both for Spark and for source developers. I don't know how common top-level column names containing dots are.

cc @brkyvz @rdblue @imback82

Member (Author):

We support column names containing dots in ORC.

I feel a special encoding makes the implementation overly complicated. I believe that currently only ORC supports column names containing dots, so the impact is small.

Member:

@dbtsai, I thought Parquet also supports dots in column names; see #27780.
Also, the problem is that there are multiple external sources. Dots can be used to express namespace-like annotations, just like our SQL configurations.

A JSON array might not be an option either, because column names themselves can be JSON arrays:

scala> sql("""select 'a' as `["a", "'b"]`""")
res1: org.apache.spark.sql.DataFrame = [["a", "'b"]: string]

It's unlikely, but it's a possible breaking change.

The only option I can think of now is a SQL conf listing the data sources that support backquoting dots in column names (and that don't backquote for nested column access).

Member (Author):

@HyukjinKwon You are right. I'm just breaking #27780 into smaller PRs so they're easier to review. I chatted with @cloud-fan offline; I'll close this one and work on the bigger one to avoid confusion. Let's discuss how we want to handle the API compatibility in the other PR.

Member:

Thank you @dbtsai!

@dbtsai dbtsai closed this Mar 10, 2020