[SPARK-31133][SQL][DOC] fix sql ref doc for DML #27891
|
|
||
| ### Description | ||
| `LOAD DATA` statement loads the data into a table from the user specified directory or file. If a directory is specified then all the files from the directory are loaded. If a file is specified then only the single file is loaded. Additionally the `LOAD DATA` statement takes an optional partition specification. When a partition is specified, the data files (when input source is a directory) or the single file (when input source is a file) are loaded into the partition of the target table. | ||
| `LOAD DATA` statement loads the data into a Hive serde table from the user specified directory or file. If a directory is specified then all the files from the directory are loaded. If a file is specified then only the single file is loaded. Additionally the `LOAD DATA` statement takes an optional partition specification. When a partition is specified, the data files (when input source is a directory) or the single file (when input source is a file) are loaded into the partition of the target table. |
`LOAD DATA` only works for Hive tables.
good catch!
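To make the point of this thread concrete, here is a minimal sketch of the difference. The table names and the `/tmp/students` path are hypothetical, and the sketch assumes a Spark session with Hive support enabled. The second `LOAD DATA` is left commented out because it is expected to be rejected for a data source table.

```sql
-- Hypothetical table names and input path, for illustration only.

-- A Hive serde table: a valid LOAD DATA target.
CREATE TABLE hive_target (name STRING, student_id INT) USING HIVE;
LOAD DATA LOCAL INPATH '/tmp/students' INTO TABLE hive_target;

-- A data source (Parquet) table: not a valid LOAD DATA target.
CREATE TABLE parquet_target (name STRING, student_id INT) USING PARQUET;
-- Expected to fail at analysis time, since the target is not a Hive serde table:
-- LOAD DATA LOCAL INPATH '/tmp/students' INTO TABLE parquet_target;
```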
| + -------------- + ------------------------------ + -------------- + | ||
|
|
||
| CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT); | ||
| CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT) USING HIVE; |
This must be a Hive table.
Hi, @cloud-fan. This reminds me of the following:
SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
@marmbrus and @gatorsmile, do we need to revert SPARK-30098 due to its silent behavior change?
Also, cc @rxin since he is the release manager for 3.0.0.
I filed SPARK-31136 to track the discussion and the final result.
Is the cost of the break mostly that users can't run `LOAD DATA`?
Please don't get me wrong. You know that I loved that patch and tried to minimize the impact while embracing it. However, the new policy is designed to ban this kind of behavior change (SPARK-30098). Technically,
- The benefit is just saving a few words, `USING PARQUET` or something.
- The downside is breaking the existing user pipelines.
I updated the JIRA description (SPARK-31136) with @cloud-fan's example.
> The benefit is just saving a few word USING PARQUET or something.

This is a complete underestimate. Please see my comment in #27894 (comment)
|
|
||
| -- Example with partition specification. | ||
| CREATE TABLE test_partition (c1 INT, c2 INT, c3 INT) USING HIVE PARTITIONED BY (c2, c3); | ||
| CREATE TABLE test_partition (c1 INT, c2 INT, c3 INT) PARTITIONED BY (c2, c3); |
This is used as the source so doesn't need to be a hive table.
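As a rough sketch of the distinction drawn here, only the table named after `INTO TABLE` has to be a Hive table. The `test_load_partition` name and the input path below are hypothetical. Because `LOAD DATA` places files into the target as-is, the sketch also assumes the files are already in a format the target table can read.

```sql
-- Source table: only its files on disk are used as input, so it does not
-- need to be a Hive table and can keep the default provider.
CREATE TABLE test_partition (c1 INT, c2 INT, c3 INT) PARTITIONED BY (c2, c3);

-- Target table (hypothetical name): LOAD DATA writes here, so it must be a Hive table.
CREATE TABLE test_load_partition (c1 INT, c2 INT, c3 INT) USING HIVE PARTITIONED BY (c2, c3);

-- Load one partition directory of the source into the matching target partition.
-- The path is a placeholder; the files are assumed to match the target's storage format.
LOAD DATA LOCAL INPATH '/path/to/test_partition/c2=2/c3=3'
  INTO TABLE test_load_partition PARTITION (c2 = 2, c3 = 3);
```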
|
Test build #119717 has finished for PR 27891 at commit
|
| --- | ||
| ### Description | ||
| The `INSERT OVERWRITE DIRECTORY` statement overwrites the existing data in the directory with the new values using Spark native format. The inserted rows can be specified by value expressions or result from a query. | ||
| The `INSERT OVERWRITE DIRECTORY` statement overwrites the existing data in the directory with the new values using Spark file format. The inserted rows can be specified by value expressions or result from a query. |
nit: Spark file format -> a given Spark file format?
maropu left a comment
LGTM except for one minor comment.
dongjoon-hyun left a comment
Shall we wait for the decision on SPARK-30098?
| <dt><code><em>file_format</em></code></dt> | ||
| <dd> | ||
| Specifies the file format to use for the insert. Valid options are <code>TEXT</code>, <code>CSV</code>, <code>JSON</code>, <code>JDBC</code>, <code>PARQUET</code>, <code>ORC</code>, <code>HIVE</code>, <code>DELTA</code>, <code>LIBSVM</code>, or a fully qualified class name of a custom implementation of <code>org.apache.spark.sql.sources.DataSourceRegister</code>. | ||
| Specifies the file format to use for the insert. Valid options are <code>TEXT</code>, <code>CSV</code>, <code>JSON</code>, <code>JDBC</code>, <code>PARQUET</code>, <code>ORC</code>, <code>HIVE</code>, <code>DELTA</code>, <code>LIBSVM</code>, or a fully qualified class name of a custom implementation of <code>org.apache.spark.sql.execution.datasources.FileFormat</code>. |
Isn't `AVRO` also valid? Btw, does `DELTA` implement `FileFormat`? If it means `DeltaDataSource`, it looks like it only implements `DataSourceRegister`.
Avro implements `FileFormat`. I don't know whether `DELTA` is supported in `INSERT OVERWRITE [ LOCAL ] DIRECTORY`, so let me remove it, as it's an external source.
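For reference, a minimal sketch of the statement under discussion, using a built-in file format after `USING`. The directory path and the `test_table` name are hypothetical.

```sql
-- Hypothetical destination directory and source table.
-- PARQUET is one of the built-in file formats listed in the doc text above.
INSERT OVERWRITE DIRECTORY '/tmp/destination'
  USING PARQUET
  SELECT * FROM test_table;
```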
|
Hi, All. I marked SPARK-31136 as a correctness issue and added the example of |
|
Shall we merge this PR first? It correctly describes the current behavior. If we change the behavior later, we should update the document accordingly, instead of blocking this PR to wait for that decision.
|
Test build #119857 has finished for PR 27891 at commit
|
|
I'll merge it in a few days if there are no objections. The CHAR/VARCHAR discussion is still going on in the dev list.
|
Merging to master/3.0, thanks for the review!
### What changes were proposed in this pull request?
`INSERT OVERWRITE DIRECTORY` can only use a file format (a class that implements `org.apache.spark.sql.execution.datasources.FileFormat`). This PR fixes that and makes other minor improvements.
### Why are the changes needed?
### Does this PR introduce any user-facing change?
### How was this patch tested?
Closes #27891 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>