Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jul 7, 2016

What changes were proposed in this pull request?

This PR adds a boolean option, truncate, for SaveMode.Overwrite of the JDBC DataFrameWriter. If this option is true, Spark tries to use TRUNCATE TABLE instead of DROP TABLE. It is a small option, but it provides great convenience for BI tool users who work with RDBMS tables generated by Spark.

Goal

  • Without the CREATE/DROP privilege, we can still save a DataFrame to the database. Sometimes these privileges are not granted for security reasons.
  • It preserves the existing table definition, so users can add and keep additional INDEXes and CONSTRAINTs on the table.
  • Sometimes TRUNCATE is faster than the combination of DROP and CREATE.

Supported DBMS
The following table shows truncate-option support per dialect. Because the behavior of TRUNCATE TABLE differs among DBMSs, it is not always safe to use. Spark ignores the truncate option for unknown dialects and for DBMSs whose TRUNCATE cascades by default. A newly added JDBCDialect should additionally implement the corresponding function to opt in to truncate support; a sketch of what that could look like follows the table below.

| Spark Dialect      | truncate option support |
|--------------------|-------------------------|
| MySQLDialect       | O                       |
| PostgresDialect    | X                       |
| DB2Dialect         | O                       |
| MsSqlServerDialect | O                       |
| DerbyDialect       | O                       |
| OracleDialect      | O                       |
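For reference, here is a minimal sketch of what such a dialect-side opt-in could look like. This is illustrative only: the isCascadingTruncateTable name and its Option[Boolean] return type are assumptions about the "corresponding function" mentioned above, not excerpts from this PR's diff.

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect that advertises whether its TRUNCATE TABLE cascades.
// Some(false) => TRUNCATE is safe to use; Some(true) or None => Spark should
// keep the DROP/CREATE behavior even when the truncate option is set.
object MyVendorDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:myvendor")

  // Assumed hook name; each dialect would answer this for its own TRUNCATE semantics.
  def isCascadingTruncateTable(): Option[Boolean] = Some(false)
}

// Registration goes through the existing JdbcDialects API.
JdbcDialects.registerDialect(MyVendorDialect)
```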

Before (TABLE with INDEX case): Spark shell and MySQL CLI sessions are interleaved intentionally.

scala> val (url, prop)=("jdbc:mysql://localhost:3306/temp?useSSL=false", new java.util.Properties)
scala> prop.setProperty("user","root")
scala> df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
mysql> CREATE UNIQUE INDEX idx_id ON table_with_index(id);
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+

After (TABLE with INDEX case)

scala> spark.range(10).write.mode("overwrite").option("truncate", true).jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+

Error Handling

  • In case of exceptions, Spark will not retry. Users should turn off the truncate option themselves (see the sketch after this list).
  • In case of a schema change:
    • If one of the column names changes, this raises an exception, as expected.
    • If there is only a type difference, this works like Append mode.
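As a user-side illustration of the first point, here is a small sketch of falling back to the plain overwrite when the truncate-based write fails. The df, url, and prop names follow the shell examples above; Spark itself performs no such retry, this is purely caller-side logic.

```scala
import scala.util.{Failure, Success, Try}

// If the truncate-based overwrite fails, Spark does not retry; the caller can
// choose to fall back to the plain overwrite (DROP/CREATE) explicitly.
Try(df.write.mode("overwrite").option("truncate", true)
      .jdbc(url, "table_with_index", prop)) match {
  case Success(_) => ()  // truncated the table and inserted the new rows
  case Failure(e) =>
    println(s"Truncate-based overwrite failed: ${e.getMessage}; retrying with DROP/CREATE")
    df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
}
```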

How was this patch tested?

Pass the Jenkins tests with an updated test case.

@SparkQA

SparkQA commented Jul 7, 2016

Test build #61907 has finished for PR 14086 at commit c1e4c41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16410][SQL] Support truncate option in Overwrite mode for JDBC DataFrameWriter [SPARK-16463][SQL] Support truncate option in Overwrite mode for JDBC DataFrameWriter Jul 9, 2016
@dongjoon-hyun
Member Author

Due to the difference in scope, I created another JIRA issue.

@dongjoon-hyun
Member Author

Could you review this PR, @srowen ?

@srowen
Member

srowen commented Jul 10, 2016

Hm, it seems like it should be another SaveMode if anything. Its semantics would be identical to Overwrite, I guess, for other non-JDBC sources. CC @yhuai ?

However... is it not possible to just TRUNCATE when the schema hasn't changed, and DROP/CREATE when it has? that seems like the best solution if it's feasible.

@dongjoon-hyun
Member Author

Thank you for the review, @srowen . Currently, the truncate option is ignored for other, non-JDBC sources.
IMO, most of them are file-based sources, so they do not need this.

Indeed, DROP/CREATE is the most robust solution, but, as you can see in the example, this is a minor optimization with a good benefit. We cannot achieve that benefit with DROP.

@srowen
Member

srowen commented Jul 11, 2016

Shouldn't truncate function like overwrite in instances where the distinction doesn't matter -- rather than be ignored?

I think the point of the issue was that DROP/CREATE isn't the best solution in all cases, specifically, where the table schema has not changed. It's at best a little slower and at worst destroys metadata.

I believe everyone sees there are use cases for both TRUNCATE and DROP/CREATE; I'm more specifically asking if we need to make the caller figure this out or whether it's easy and simple to use TRUNCATE when possible in Overwrite mode. Maybe it's not simple, or maybe we want to let people control this anyway, but it does have this cost of adding another setting to the API, and it's still possible to surprise yourself with Overwrite mode in this context if you're not aware of what Truncate does differently.

@dongjoon-hyun
Member Author

That's a good point of view. Right, I agree with you: DROP/CREATE isn't the best solution in all cases. We could automatically check schema compatibility in some way.
Let me dig into this issue more. It doesn't seem simple, but it looks more valuable.

@dongjoon-hyun
Member Author

Thank you for guiding me. I hope this PR becomes more valuable in Spark, too.

@dongjoon-hyun
Member Author

Hi, @srowen .
I've been looking at your advice further. Here is my summary so far.

SaveMode.TRUNCATE

  • Truncate the content only, not the structure of the existing data source.
  • I would prefer this mode to raise an exception if there is a problem.
    (I mean it is not safe for users to have SaveMode.Overwrite as the fallback operation.)
  • The schema compatibility check (SCC) is the key.

Behavior (how to handle and check schema compatibility)

  • JDBC: Use TRUNCATE TABLE if the target has the same dialect (same DB) and the same schema (name, type, nullability, order).
  • File-based source (like JSON): delete files only if the inferred schema is identical (directories represent partitions).

Does this make sense to you? If I missed some of your advice, please let me know.
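For illustration, here is a rough sketch of the JDBC-side schema compatibility check described above. This is not from the PR: it only compares column count, names, and nullability through JDBC metadata, and a complete check would also map JDBC types to Catalyst types.

```scala
import java.sql.{Connection, ResultSetMetaData}
import org.apache.spark.sql.DataFrame

// Returns true when the DataFrame's schema lines up with the existing table's
// columns by position: same count, matching names, no nullability conflict.
def hasCompatibleSchema(df: DataFrame, conn: Connection, table: String): Boolean = {
  val stmt = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0")
  try {
    val meta = stmt.getMetaData
    meta.getColumnCount == df.schema.length &&
      df.schema.fields.zipWithIndex.forall { case (field, i) =>
        val col = i + 1  // JDBC metadata columns are 1-based
        meta.getColumnName(col).equalsIgnoreCase(field.name) &&
          (meta.isNullable(col) != ResultSetMetaData.columnNoNulls || !field.nullable)
      }
  } finally {
    stmt.close()
  }
}
```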

@srowen
Member

srowen commented Jul 14, 2016

I suppose my key question is still: do we need to make the user choose at all? it seems like TRUNCATE is always the right choice except when schema changes. That means, ideally, no new SaveMode. Overwrite would just cause TRUNCATE where possible, otherwise DROP/CREATE.

I think this is specific to JDBC only. Other sources can leave their behavior unchanged.

Yes, I think the next question is, how hard is it to reliably detect that the schema hasn't changed? you know better than I, likely.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jul 14, 2016

TRUNCATE is not always the right choice for existing tables, because sometimes users really want to delete all information, including INDEXes/PARTITIONS, too.
SaveMode.Overwrite should mean SaveMode.Overwrite. We cannot optimize it without the user's permission, even for the same schema.

Recently, we added 'DELETE PURGE'. It's a similar situation. We can only provide an explicit option for users.

@srowen
Member

srowen commented Jul 14, 2016

It's possible. I think it's probably much more likely the metadata is supposed to be kept, but who knows. OK, this has decent logic. I'd prefer to get the opinion of someone else who has worked on this code, like @lw-lin or @zsxwing ?

@dongjoon-hyun
Member Author

Just rebased onto master in order to ensure it works.

@SparkQA

SparkQA commented Jul 15, 2016

Test build #62380 has finished for PR 14086 at commit 86dcc2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 15, 2016

Test build #62386 has finished for PR 14086 at commit 1679d86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jul 17, 2016

This seems pretty reasonable. Anyone else have an opinion on offering this option?

@dongjoon-hyun
Member Author

Thank you for keeping your attention on this PR, @srowen .

@gatorsmile
Member

Let me share my 2 cents here.

TRUNCATE TABLE is very risky. It is fast because the RDBMS does not log the individual row deletes. That means we are unable to roll it back in most RDBMSs. The behavior also differs among RDBMS vendors. For example,

  1. PostgreSQL truncates all descendant tables by default, unless we specify ONLY in the statement.
  2. DB2 for z/OS can roll it back, but the table is still left in a truncated state.
  3. Oracle 11 documents this caution at the very beginning: http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10007.htm (it cannot be rolled back). (Note: Oracle 12 added a new CASCADE clause; by default, descendant tables are not affected.)

Thus, I am not confident about adding such a mode or option to Spark SQL. This command is used very rarely in production systems. DBAs should do it manually, instead of using the Spark SQL interface.

If the community still thinks we should integrate it into Spark SQL, I am also fine with that. After a quick review, I think the implementation is missing some error handling, for example, when users try to use the truncate option with another SaveMode.

@srowen
Member

srowen commented Jul 17, 2016

@gatorsmile that all sounds reasonable, but right now a DROP/CREATE of the table happens. That's also not possible within a transaction and is a more drastic operation. Does this argument not apply even more to the existing behavior? The point, indeed, is to perform a more modest operation when possible.

@dongjoon-hyun
Member Author

Thank you for attention, @gatorsmile .

BTW, this option is for advanced users who know their DB and the limitations and powers of TRUNCATE.

As for the following comment, I really like this point of view. And I'm sure that DBAs will allow Spark architects/developers/users to use this option for the cases they could otherwise do manually.

DBAs should do it manually, instead of using the Spark SQL interface.

@dongjoon-hyun
Member Author

Yep. I totally agree with @srowen 's opinions, too.

@dongjoon-hyun
Member Author

Just rebased onto master.

@dongjoon-hyun
Member Author

This PR touches code originally written by @rxin .

Hi, @rxin .
Could you give us your opinion about supporting the truncate option with SaveMode.Overwrite for JDBC sources?

@SparkQA

SparkQA commented Jul 17, 2016

Test build #62437 has finished for PR 14086 at commit b3a39fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

  • DROP/CREATE/DELETE can be rolled back. In Oracle, we are unable to roll back TRUNCATE. I am not sure whether this is a big deal to Spark users.
  • If the data source is PostgreSQL, we also truncate all its descendant tables by default. This side effect does not look good to me. Not sure whether Spark users can tolerate it.
  • So far, Spark does not have role-based privileges; it completely depends on the underlying data sources. If DBAs have to grant higher privileges to the Spark app (for any reason), this change could be a potential issue.

I did not investigate all the RDBMS vendors, and different versions might have different behaviors. My suggestion is to do more investigation before adding this support.

@dongjoon-hyun
Member Author

The following are my opinions, in order of priority.

  • First of all, TRUNCATE requires a lower privilege than DROP/CREATE.
    • You know that the DROP/CREATE privileges can do all kinds of harm.
  • Second, we are assuming the original table is generated by DataFrameWriter, not some other general database table.
    • Why do you consider descendant tables? If you do, you should note that DROP would also delete them all, too.
  • Third, do we use rollback with DataFrameWriter?

I'm worried about losing focus. All the concerns are based on correct facts, but the scope of the arguments seems slightly too general for this PR. Please see the description of this PR. The context here is providing a truncate option for Spark JDBC tables generated by df.write.mode("overwrite").jdbc(url, "table_with_index", prop).

@gatorsmile
Member

gatorsmile commented Jul 18, 2016

TRUNCATE requires a lower privilege than DROP/CREATE.

This is not true. For example, in DB2 for z/OS, DROP and TRUNCATE require different privileges.
https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_truncate.html
https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_drop.html

  • DROP TABLE will not drop the descendant tables unless you explicitly specify CASCADE. The behavior of TRUNCATE depends on the vendor. TRUNCATE was not included in the ANSI SQL standard until 2008. That might be why different vendors behave so differently.
  • Rollback may not be an issue here, but truncation means the deletion records are not written into the log. Does it matter? It really depends on how applications use the output table of DataFrameWriter and the recovery log they generate. The product I previously worked on parses and utilizes the transaction recovery logs of different RDBMSs, so I know TRUNCATE is a big issue for most replication products.

I'm just sharing what I have learned here. Developing a general solution for different JDBC data sources is always very complicated. As more and more enterprise customers start using Spark, we will get more and more strange JIRAs.

@gatorsmile
Member

I think the above discussion is important when users actually use the truncate option. I am OK with letting users decide whether TRUNCATE is good for their scenario or not.

@dongjoon-hyun
Member Author

Hmm, it's strange. JDBCSuite passes locally. I'll rebase this.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62555 has finished for PR 14086 at commit dcf7b02.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62556 has finished for PR 14086 at commit 1713f8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jul 19, 2016

It's ready for review again. Could you review this PR when you have some time, @srowen and @gatorsmile ?

Member

If truncateTable fails due to a non-fatal exception, should we fall back to the previous way (i.e., drop and create)? This is a design decision. CC @srowen @rxin

Contributor

I'd say no, because the user has explicitly specified truncate. They can turn it off themselves.

Member

: ) Sure. Then, the next design question to @rxin and @srowen

Should we still truncate the table if the existing table schema does not match the schema of the new table?

Contributor

We should do whatever we do with drop.
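To make that decision concrete, here is a hedged sketch of the Overwrite path written with plain JDBC statements. The dialectCascades parameter stands in for the per-dialect "does TRUNCATE cascade?" answer discussed earlier; both the parameter and the helper shape are assumptions of this sketch, not the PR's actual code.

```scala
import java.sql.Connection

// Sketch only: truncate when the user asked for it and the dialect's TRUNCATE is
// known not to cascade; otherwise keep the existing DROP/CREATE behavior. Any
// failure during the subsequent insert propagates unchanged (no retry).
def overwriteTarget(
    conn: Connection,
    table: String,
    createTableDdl: String,
    useTruncate: Boolean,
    dialectCascades: Option[Boolean]): Unit = {
  val stmt = conn.createStatement()
  try {
    if (useTruncate && dialectCascades.contains(false)) {
      stmt.executeUpdate(s"TRUNCATE TABLE $table")  // indexes and constraints survive
    } else {
      stmt.executeUpdate(s"DROP TABLE $table")      // table metadata is lost
      stmt.executeUpdate(createTableDdl)
    }
  } finally {
    stmt.close()
  }
  // row insertion follows here; exceptions from it are not caught
}
```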

Member

I see. Then, the current implementation looks good to me.

@dongjoon-hyun Could you summarize the previous discussion and the design decisions we made? Document them in the PR description. Thanks!

Member Author

For my understanding, I will ask one question.

Literally, we should not do everything we do with drop, e.g., we should not drop the INDEX, right?

Member Author

Sure. I'll update the documentation and the PR description to be clearer.

Thank you for the guidance, @rxin and @gatorsmile .

Member

  • Drop, Create and Insert: Create and Insert could fail, but we still drop the table.
  • Truncate and Insert: Insert could fail, but we always truncate the table.

I think it is OK to raise an exception here, but please check whether the exception message is meaningful.

Member

Nope, dropping the index does not make sense here.

Member Author

@dongjoon-hyun dongjoon-hyun Jul 20, 2016


The current exception message is "Column xxx not found".

@dongjoon-hyun
Member Author

The PR description and the code documentation have been updated.

@SparkQA

SparkQA commented Jul 20, 2016

Test build #62582 has finished for PR 14086 at commit 98c81c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Please correct this.

Member Author

Oh, sure.

@gatorsmile
Member

LGTM except one minor comment.

@dongjoon-hyun
Member Author

Thank you, @gatorsmile .
I removed the repeated DerbyDialect and added OracleDialect.

@SparkQA

SparkQA commented Jul 20, 2016

Test build #62633 has finished for PR 14086 at commit 8fcca6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@dongjoon-hyun
Member Author

Hi, @srowen .
Could you review this JDBC PR again?

@dongjoon-hyun
Member Author

Rebased.

@SparkQA

SparkQA commented Jul 23, 2016

Test build #62741 has finished for PR 14086 at commit e989b3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(1 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).count())
assert(2 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).collect()(0).length)

val m = intercept[SparkException] {
Member

To check my understanding here, this overwrites the table with a different schema (new column seq). This shows the truncate fails because the schema has changed.

I guess it would be nice to at least test the case where the truncate works, though we can't really test whether it truncates vs. drops.

Could you, for example, just repeat the code on lines 163-166 here to verify that overwriting yields the same results?
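A sketch of what that repeated check could look like, for illustration only. It reuses url1, properties, and spark from the quoted test code; the actual DataFrame construction and table setup live in JDBCWriteSuite, so the df2 definition here is hypothetical.

```scala
// Hypothetical one-row, two-column DataFrame assumed to match the existing table's schema.
val df2 = spark.range(1).selectExpr("CAST(id AS STRING) AS name", "id")
df2.write.mode("overwrite").option("truncate", true)
  .jdbc(url1, "TEST.TRUNCATETEST", properties)
// Same shape as asserted above: still one row and two columns after the truncate path.
assert(1 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).count())
assert(2 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).collect()(0).length)
```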

Member Author

Sure, that would be better.

Member

Ah, never mind my last comment. You already test the truncate-succeeds path.
OK, the assertion here makes sense, though it highlights that if the overwrite can't succeed, it only fails after truncating, and the new DataFrame can't be written. I suppose that's reasonable semantics, since avoiding it would require doing something like a test insert.

@SparkQA

SparkQA commented Jul 23, 2016

Test build #62755 has finished for PR 14086 at commit 8b452cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jul 24, 2016

Merged to master

@asfgit asfgit closed this in cc1d2dc Jul 24, 2016
@dongjoon-hyun
Member Author

Thank you for review and merging, @srowen , @rxin , and @gatorsmile !

zzcclp added a commit to zzcclp/spark that referenced this pull request Jul 25, 2016
@dongjoon-hyun dongjoon-hyun deleted the SPARK-16410 branch August 14, 2016 09:44