
[SPARK-26744][SQL] Support schema validation in FileDataSourceV2 framework #23714

Closed
wants to merge 5 commits

Conversation

gengliangwang
Member

@gengliangwang gengliangwang commented Jan 31, 2019

What changes were proposed in this pull request?

The file source has a schema validation feature, which validates 2 schemas:

  1. the user-specified schema when reading.
  2. the schema of input data when writing.

If a file source doesn't support the schema, we can fail the query earlier.

This PR is to implement the same feature in the FileDataSourceV2 framework. Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is added in two places:

  1. Read path: the table schema is determined in TableProvider.getTable. The actual read schema can be a subset of the table schema. This PR proposes to validate the actual read schema in FileScan.
  2. Write path: validate the actual output schema in FileWriteBuilder.

How was this patch tested?

Unit test
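
Below is a rough, illustrative sketch of the shape of this check — not the exact code merged in this PR. The trait name is made up; placing it in Spark's sql package tree is only to show that such code can construct AnalysisException and be mixed into the internal FileScan/FileWriteBuilder classes.

```scala
// Illustrative sketch only; names and placement are assumptions, not the merged code.
package org.apache.spark.sql.execution.datasources.v2

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.{DataType, StructType}

trait FileSourceSchemaValidation {
  // Short format name for error messages, e.g. "CSV" or "ORC".
  def formatName: String

  // Per-format capability hook; by default every data type is accepted.
  def supportDataType(dataType: DataType): Boolean = true

  // Called with the actual read schema (read path) or the output schema (write path),
  // so an unsupported type fails the query at analysis time rather than inside tasks.
  def verifySchema(schema: StructType): Unit = {
    schema.foreach { field =>
      if (!supportDataType(field.dataType)) {
        throw new AnalysisException(
          s"$formatName data source does not support ${field.dataType.catalogString} data type.")
      }
    }
  }
}
```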

@@ -490,9 +490,6 @@ case class DataSource(
outputColumnNames: Seq[String],
physicalPlan: SparkPlan): BaseRelation = {
val outputColumns = DataWritingCommand.logicalPlanOutputWithNames(data, outputColumnNames)
if (outputColumns.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
Member Author
@gengliangwang gengliangwang Jan 31, 2019

@gatorsmile I think the validation here duplicates the supportDataType check, so I removed it. I can revert it if someone has a good reason for keeping it.

Contributor

is this method only called by file sources?

@gengliangwang
Member Author

* Returns whether this format supports the given [[DataType]] in write path.
* By default all data types are supported.
*/
def supportDataType(dataType: DataType): Boolean = true
Member

Hi, @gengliangwang and @cloud-fan.
In DSv2, I guess it would be more natural to have a Java interface for this validation API. What do you think about that?

cc @rdblue since this is DSv2.

Contributor

Per the discussion, this is an implementation in the file sources, not an API. It's internal, so we don't need Java here.

@rdblue
Contributor

rdblue commented Jan 31, 2019

@gengliangwang, can you be more clear about what you are proposing to add to DSv2?

I don't think that simply porting an API from v1 is sufficient justification to add it to v2, because v1 has so many problems. I'd like to see a description on the JIRA issue that states exactly what is added and how it changes behavior.

Until then, please consider this a -1.

@SparkQA

SparkQA commented Jan 31, 2019

Test build #101965 has finished for PR 23714 at commit 93ecb68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

@rdblue @dongjoon-hyun Sorry about the confusion. This PR is for an internal API in file source V2 only (the abstract classes FileWriteBuilder and FileScan).

I have updated the PR description.

@AsmaZgo

AsmaZgo commented Feb 8, 2019

Hello,
I'm sorry, my question is not directly related to this context.
In my use case, I need to access all the logical plans generated by the optimizer (not just the optimal one). How can I do that with Spark SQL?
Thank you very much.

@rdblue
Contributor

rdblue commented Feb 8, 2019

@gengliangwang, why are you proposing to add this API that applies only to internal sources? Why not design this to work with all sources?

I think you also need to be more clear about what you're trying to commit. What does this do? It sounds like it probably validates that a file format can store a type. For example, can ORC support DECIMAL(44, 6)? That is generally useful. Why should it be a side API for internal sources?

In short:

  • Please be clear in your description about what this commit does. What exactly does the validation do?
  • Please give a reason why it should apply only to internal sources given that a goal of the DSv2 API is to avoid special cases for internal sources.

@gengliangwang
Member Author

@AsmaZgo I think you can see how the plans change in the log by setting spark.sql.optimizer.planChangeLog.level. For further questions, please send an email to dev@spark.apache.org.
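
A minimal usage sketch, assuming a spark-shell session where `spark` is the active SparkSession and a Spark build where this optimizer logging conf is available; everything except the conf name is made up for illustration:

```scala
// Raise the plan-change log level so optimizer rewrites show up with default log settings.
spark.conf.set("spark.sql.optimizer.planChangeLog.level", "WARN")
spark.range(100).filter("id % 2 = 0").count()  // rule and batch plan changes are now logged
```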

@gengliangwang
Member Author

gengliangwang commented Feb 11, 2019

@rdblue Thanks for the suggestion. Overall this is a nice-to-have feature. It is simple to validate the schema without the API. It seems overkill to make it a DS V2 API.

@cloud-fan
Contributor

cloud-fan commented Feb 11, 2019

@rdblue @gengliangwang I don't think this needs an API change. This is just a schema validation feature, which can be done in any DS v2 source. Schema validation needs to know the user-specified schema when reading, or the schema of the input data when writing, both of which are available in the current DS v2 APIs. It looks to me that this PR just re-implements the schema validation feature in the file source v2 framework.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Feb 11, 2019

Test build #102188 has finished for PR 23714 at commit 93ecb68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 12, 2019

@cloud-fan, should this validation be done for all sources, or just for file sources for some reason?

I would also like to know exactly what validation is proposed in this PR. It hasn't been written up and I think that a summary of the changes that are proposed is required before we commit changes.

@cloud-fan
Contributor

Every source can do schema validation if needed; the DS v2 API already allows you to do so. To avoid duplicating code, this PR proposes to add the entry point of schema validation in the base class of the file sources.

@gengliangwang Can you post the details of the schema validation for these file sources?

@cloud-fan
Contributor

cloud-fan commented Feb 13, 2019

@rdblue you can treat this PR as implementing schema validation for file sources. We only do it for file sources because for now they are the only builtin DS v2 implementation in Spark.

@gengliangwang
Member Author

The supported data types of file sources:

  • text: StringType
  • json: AtomicType/StructType/ArrayType/MapType/UDT/NullType
  • CSV: AtomicType/UserDefinedType
  • ORC: AtomicType/StructType/ArrayType/MapType/UDT
  • ...

For details please read #21667

This is a very simple abstraction (see the sketch below).
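
An illustrative sketch of how a CSV-like format can implement this hook (see #21667 for the actual per-format implementations, which match on Spark-internal type classes; the object name here is hypothetical and the policy is only an approximation of the list above):

```scala
import org.apache.spark.sql.types._

object CsvLikeTypeSupport {
  // Accept atomic types and UDTs over supported types; reject nested and null types.
  def supportDataType(dataType: DataType): Boolean = dataType match {
    case _: ArrayType | _: MapType | _: StructType | NullType => false
    case udt: UserDefinedType[_] => supportDataType(udt.sqlType)
    case _ => true
  }
}

// e.g. CsvLikeTypeSupport.supportDataType(StringType)            == true
//      CsvLikeTypeSupport.supportDataType(ArrayType(StringType)) == false
```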

OrcDataSourceV2.supportDataType(dataType)
}

override def toString: String = "ORC"
Contributor

There is much more useful information in this class than just the file format name. This should use a formatName method instead so that toString can be used to show the object itself when debugging or logging.

Contributor

+1

* Returns whether this format supports the given [[DataType]] in write path.
* By default all data types are supported.
*/
def supportDataType(dataType: DataType): Boolean = true
Contributor

This name is awkward. Similar methods or traits use "supports" instead of "support". I think this should as well.

Member Author

I proposed changing the name to "supports..." in a previous PR: #23639

See the opposing comment here: #23639 (review)

Contributor

This one is different. We are in a new class, and there is no supportXXX method in this class that we need to follow.

schema.foreach { field =>
  if (!supportDataType(field.dataType)) {
    throw new AnalysisException(
      s"$this data source does not support ${field.dataType.catalogString} data type.")
Contributor

Should use formatName instead of $this (toString)

Contributor

+1


abstract class FileScan(
    sparkSession: SparkSession,
-   fileIndex: PartitioningAwareFileIndex) extends Scan with Batch {
+   fileIndex: PartitioningAwareFileIndex,
+   readSchema: StructType) extends Scan with Batch {
Contributor

@gengliangwang, why validate the read schema here in FileScan instead of in the scan builder?

Member Author

In the PR description:

the table schema is determined in TableProvider.getTable. The actual read schema can be a subset of the table schema. This PR proposes to validate the actual read schema in FileScan

@rdblue
Contributor

rdblue commented Feb 14, 2019

@cloud-fan, thanks for the clarifications, particularly the updated description.

I don't think we need to add type validation to v2 yet. This is something that could be done in that API, but I'm not sure that it is a good idea to standardize it because that would make assumptions about why types are not supported. For example, using a capability-based API for types like int or struct sounds reasonable, but doesn't work for a delimited format that can support some nesting, but not arbitrarily deep nesting.

@gengliangwang, I flagged a couple of review items to address. In addition, I would recommend taking more care when answering questions. It is concerning that I wasn't able to get a concise answer from you about what you're proposing to change. Of course I can go down a rabbit-hole of trying to find out what your intent is by reading code and other pull requests. But it is much easier for everyone if you clearly state what you're proposing and why.

@gengliangwang
Member Author

@rdblue For example, when Spark tries to write data containing an array-type column to the CSV source:

  1. without the validation: exceptions are thrown in the executing tasks.
  2. with the validation: an exception is thrown before any tasks are launched, and the error message is more user-friendly (see the illustration below).
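
A hedged illustration of that difference, assuming a spark-shell session where `spark` is the active SparkSession; the output path and data are made up:

```scala
import spark.implicits._

val df = Seq((1, Seq("a", "b"))).toDF("id", "tags")   // "tags" is array<string>

// With the early validation: the write fails at analysis/planning time with a message
// along the lines of "CSV data source does not support array<string> data type."
// Without it: the same write only fails later, inside the launched write tasks.
df.write.csv("/tmp/csv_out")
```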

@@ -527,10 +524,6 @@ case class DataSource(
* Returns a logical plan to write the given [[LogicalPlan]] out to this [[DataSource]].
*/
def planForWriting(mode: SaveMode, data: LogicalPlan): LogicalPlan = {
if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
Contributor

ditto

@SparkQA

SparkQA commented Feb 14, 2019

Test build #102348 has finished for PR 23714 at commit d8240b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2019

Test build #102350 has finished for PR 23714 at commit 5f3ec83.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2019

Test build #102349 has finished for PR 23714 at commit 5b7b258.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@rdblue
Contributor

rdblue commented Feb 14, 2019

+1

@SparkQA

SparkQA commented Feb 14, 2019

Test build #102357 has finished for PR 23714 at commit 5f3ec83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -530,7 +527,6 @@ case class DataSource(
if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
throw new AnalysisException("Cannot save interval data type into external storage.")
}

Contributor

unnecessary change

@SparkQA

SparkQA commented Feb 15, 2019

Test build #102376 has finished for PR 23714 at commit ee60027.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4dce45a Feb 16, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019