[SPARK-31133][SQL][DOC] fix sql ref doc for DML #27891

cloud-fan · 2020-03-12T14:02:44Z

What changes were proposed in this pull request?

`INSERT OVERWRITE DIRECTORY` can only use a file format (a class that implements `org.apache.spark.sql.execution.datasources.FileFormat`). This PR fixes the docs accordingly and makes other minor improvements.

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

cloud-fan · 2020-03-12T14:03:28Z

docs/sql-ref-syntax-dml-load.md


 ### Description
-`LOAD DATA` statement loads the data into a table from the user specified directory or file. If a directory is specified then all the files from the directory are loaded. If a file is specified then only the single file is loaded. Additionally the `LOAD DATA` statement takes an optional partition specification. When a partition is specified, the data files (when input source is a directory) or the single file (when input source is a file) are loaded into the partition of the target table.
+`LOAD DATA` statement loads the data into a Hive serde table from the user specified directory or file. If a directory is specified then all the files from the directory are loaded. If a file is specified then only the single file is loaded. Additionally the `LOAD DATA` statement takes an optional partition specification. When a partition is specified, the data files (when input source is a directory) or the single file (when input source is a file) are loaded into the partition of the target table.


`LOAD DATA` only works for Hive tables.

good catch!
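
To illustrate the point above, here is a minimal sketch of `LOAD DATA` against a Hive serde table. It assumes a Spark session with Hive support enabled; the input path `/tmp/students` is a hypothetical placeholder, not something taken from this PR.

```sql
-- The target of LOAD DATA must be a Hive serde table, hence USING HIVE.
CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT) USING HIVE;

-- '/tmp/students' is a hypothetical local directory; all files in it are loaded.
LOAD DATA LOCAL INPATH '/tmp/students' OVERWRITE INTO TABLE test_load;
```

If the table were created with a data source provider such as Parquet instead, the `LOAD DATA` statement would be expected to fail, which is why the doc example adds `USING HIVE`.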

cloud-fan · 2020-03-12T14:04:54Z

docs/sql-ref-syntax-dml-load.md

     + -------------- + ------------------------------ + -------------- +

- CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT);
+ CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT) USING HIVE;


This must be a Hive table.

Hi, @cloud-fan . This reminds me of the following.

SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

@marmbrus and @gatorsmile . Do we need to revert SPARK-30098 due to its silent behavior change?

Also, cc @rxin since he is the release manager for 3.0.0.

I filed SPARK-31136 to track the discussion and the final result.

Is the cost of the break mostly that users can't run `LOAD DATA`?

Please don't get me wrong. You know that I loved that patch and tried to minimize the impact while embracing it. However, the new policy is designed to ban this kind of behavior change (SPARK-30098). Technically,

The benefit is just saving a few words, such as `USING PARQUET`.

The downside is breaking the existing user pipelines.

I updated the JIRA description (SPARK-31136) with @cloud-fan 's example.

The benefit is just saving a few words, such as `USING PARQUET`.

This is a complete underestimate. Please see my comment in #27894 (comment)

cloud-fan · 2020-03-12T14:05:19Z

docs/sql-ref-syntax-dml-load.md


 -- Example with partition specification.
- CREATE TABLE test_partition (c1 INT, c2 INT, c3 INT) USING HIVE PARTITIONED BY (c2, c3);
+ CREATE TABLE test_partition (c1 INT, c2 INT, c3 INT) PARTITIONED BY (c2, c3);


This is used as the source, so it doesn't need to be a Hive table.

cloud-fan · 2020-03-12T14:05:41Z

cc @dongjoon-hyun @maropu @viirya

SparkQA · 2020-03-12T14:23:26Z

Test build #119717 has finished for PR 27891 at commit 4be92bb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-03-12T14:40:57Z

docs/sql-ref-syntax-dml-insert-overwrite-directory.md

 ---
 ### Description
-The `INSERT OVERWRITE DIRECTORY` statement overwrites the existing data in the directory with the new values using Spark native format. The inserted rows can be specified by value expressions or result from a query.
+The `INSERT OVERWRITE DIRECTORY` statement overwrites the existing data in the directory with the new values using Spark file format. The inserted rows can be specified by value expressions or result from a query.


nit: Spark file format -> a given Spark file format?

maropu

LGTM except for one minor comment.

dongjoon-hyun

Shall we wait for the decision on SPARK-30098?

dongjoon-hyun · 2020-03-12T16:49:42Z

docs/sql-ref-syntax-dml-load.md

     + -------------- + ------------------------------ + -------------- +

- CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT);
+ CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT) USING HIVE;


Hi, @cloud-fan . This reminds me of the following.

SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

@marmbrus and @gatorsmile . Do we need to revert SPARK-30098 due to its silent behavior change?

Also, cc @rxin since he is the release manager for 3.0.0.

viirya · 2020-03-13T04:04:36Z

docs/sql-ref-syntax-dml-insert-overwrite-directory.md

  <dt><code><em>file_format</em></code></dt>
  <dd>
-  Specifies the file format to use for the insert. Valid options are <code>TEXT</code>, <code>CSV</code>, <code>JSON</code>, <code>JDBC</code>, <code>PARQUET</code>, <code>ORC</code>, <code>HIVE</code>, <code>DELTA</code>, <code>LIBSVM</code>, or a fully qualified class name of a custom implementation of <code>org.apache.spark.sql.sources.DataSourceRegister</code>.
+  Specifies the file format to use for the insert. Valid options are <code>TEXT</code>, <code>CSV</code>, <code>JSON</code>, <code>JDBC</code>, <code>PARQUET</code>, <code>ORC</code>, <code>HIVE</code>, <code>DELTA</code>, <code>LIBSVM</code>, or a fully qualified class name of a custom implementation of <code>org.apache.spark.sql.execution.datasources.FileFormat</code>.


Isn't AVRO also valid? Btw, does DELTA implement `FileFormat`? If it means `DeltaDataSource`, it looks like it only implements `DataSourceRegister`.

Avro implements `FileFormat`. I don't know if DELTA is supported in `INSERT OVERWRITE [ LOCAL ] DIRECTORY`, so let me remove it, as it's an external source.
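
For reference, a minimal sketch of the statement under discussion, using a built-in file format. The directory `/tmp/destination` and the source table `test_table` are hypothetical names chosen for illustration.

```sql
-- file_format must resolve to a FileFormat implementation, e.g. parquet.
-- '/tmp/destination' and test_table are hypothetical placeholders.
INSERT OVERWRITE DIRECTORY '/tmp/destination'
    USING parquet
    SELECT * FROM test_table;
```

A source such as `JDBC` is not file-based, so it would not be expected to work in the `USING` clause here.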

viirya · 2020-03-13T04:06:10Z

docs/sql-ref-syntax-dml-load.md


 ### Description
-`LOAD DATA` statement loads the data into a table from the user specified directory or file. If a directory is specified then all the files from the directory are loaded. If a file is specified then only the single file is loaded. Additionally the `LOAD DATA` statement takes an optional partition specification. When a partition is specified, the data files (when input source is a directory) or the single file (when input source is a file) are loaded into the partition of the target table.
+`LOAD DATA` statement loads the data into a Hive serde table from the user specified directory or file. If a directory is specified then all the files from the directory are loaded. If a file is specified then only the single file is loaded. Additionally the `LOAD DATA` statement takes an optional partition specification. When a partition is specified, the data files (when input source is a directory) or the single file (when input source is a file) are loaded into the partition of the target table.


good catch!

dongjoon-hyun · 2020-03-13T04:21:24Z

Hi, All. I marked SPARK-31136 as a correctness issue and added the example of CHAR type.

cloud-fan · 2020-03-16T09:23:54Z

Shall we merge this PR first? It correctly describes the current behavior. If we want to change the behavior later, we should update the document accordingly at that point, instead of blocking this PR to wait for that decision.

SparkQA · 2020-03-16T09:39:42Z

Test build #119857 has finished for PR 27891 at commit f58c2d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-03-18T13:43:39Z

I'll merge it in a few days if there are no objections. The CHAR/VARCHAR discussion is still going on in the dev list.

cloud-fan · 2020-03-23T14:01:08Z

Merging to master/3.0, thanks for the review!

### What changes were proposed in this pull request?

`INSERT OVERWRITE DIRECTORY` can only use file format (class implements `org.apache.spark.sql.execution.datasources.FileFormat`). This PR fixes it and other minor improvement.

### Why are the changes needed?

### Does this PR introduce any user-facing change?

### How was this patch tested?

Closes #27891 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

fix sql ref doc for DML

4be92bb

cloud-fan commented Mar 12, 2020

maropu reviewed Mar 12, 2020

maropu approved these changes Mar 12, 2020

dongjoon-hyun requested changes Mar 12, 2020

viirya reviewed Mar 13, 2020

viirya approved these changes Mar 13, 2020

address comments

f58c2d3

cloud-fan closed this in d929c0d Mar 23, 2020
