Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jul 7, 2016

What changes were proposed in this pull request?

This PR adds a boolean option, truncate, for SaveMode.Overwrite of the JDBC DataFrameWriter. If this option is true, Spark tries to use TRUNCATE TABLE instead of DROP TABLE. It is a small option, but it provides great convenience for BI tool users who work with RDBMS tables generated by Spark.

Goal

  • Without the CREATE/DROP privilege, we can still save a DataFrame to the database. Sometimes these privileges are not granted for security reasons.
  • It preserves the existing table definition, so users can add and keep additional INDEXes and CONSTRAINTs on the table.
  • Sometimes TRUNCATE is faster than the combination of DROP and CREATE.

Supported DBMS
The following table shows truncate-option support per dialect. Because the behavior of TRUNCATE TABLE differs among DBMSs, it is not always safe to use. Spark ignores the truncate option for unknown dialects and for DBMSs whose TRUNCATE cascades by default. A newly added JDBCDialect should additionally implement the corresponding function to opt in to truncate support; a sketch of what that could look like follows the table below.

| Spark Dialect      | truncate option support |
|--------------------|-------------------------|
| MySQLDialect       | O                       |
| PostgresDialect    | X                       |
| DB2Dialect         | O                       |
| MsSqlServerDialect | O                       |
| DerbyDialect       | O                       |
| OracleDialect      | O                       |
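For reference, here is a minimal sketch of what such a dialect-side opt-in could look like. This is illustrative only: the isCascadingTruncateTable name and its Option[Boolean] return type are assumptions about the "corresponding function" mentioned above, not excerpts from this PR's diff.

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect that advertises whether its TRUNCATE TABLE cascades.
// Some(false) => TRUNCATE is safe to use; Some(true) or None => Spark should
// keep the DROP/CREATE behavior even when the truncate option is set.
object MyVendorDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:myvendor")

  // Assumed hook name; each dialect would answer this for its own TRUNCATE semantics.
  def isCascadingTruncateTable(): Option[Boolean] = Some(false)
}

// Registration goes through the existing JdbcDialects API.
JdbcDialects.registerDialect(MyVendorDialect)
```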

Before (TABLE with INDEX case): Spark shell and MySQL CLI sessions are interleaved intentionally.

scala> val (url, prop)=("jdbc:mysql://localhost:3306/temp?useSSL=false", new java.util.Properties)
scala> prop.setProperty("user","root")
scala> df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
mysql> CREATE UNIQUE INDEX idx_id ON table_with_index(id);
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+

After (TABLE with INDEX case)

scala> spark.range(10).write.mode("overwrite").option("truncate", true).jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+

Error Handling

  • In case of exceptions, Spark will not retry. Users should turn off the truncate option themselves (see the sketch after this list).
  • In case of a schema change:
    • If one of the column names changes, this raises an exception, as expected.
    • If there is only a type difference, this works like Append mode.
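As a user-side illustration of the first point, here is a small sketch of falling back to the plain overwrite when the truncate-based write fails. The df, url, and prop names follow the shell examples above; Spark itself performs no such retry, this is purely caller-side logic.

```scala
import scala.util.{Failure, Success, Try}

// If the truncate-based overwrite fails, Spark does not retry; the caller can
// choose to fall back to the plain overwrite (DROP/CREATE) explicitly.
Try(df.write.mode("overwrite").option("truncate", true)
      .jdbc(url, "table_with_index", prop)) match {
  case Success(_) => ()  // truncated the table and inserted the new rows
  case Failure(e) =>
    println(s"Truncate-based overwrite failed: ${e.getMessage}; retrying with DROP/CREATE")
    df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
}
```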

How was this patch tested?

Pass the Jenkins tests with an updated test case.

@SparkQA

SparkQA commented Jul 7, 2016

Test build #61907 has finished for PR 14086 at commit c1e4c41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16410][SQL] Support truncate option in Overwrite mode for JDBC DataFrameWriter [SPARK-16463][SQL] Support truncate option in Overwrite mode for JDBC DataFrameWriter Jul 9, 2016
@dongjoon-hyun
Member Author

Due to the difference in scope, I created another JIRA issue.

@dongjoon-hyun
Member Author

Could you review this PR, @srowen ?

@srowen
Member

srowen commented Jul 10, 2016

Hm, it seems like it should be another SaveMode if anything. Its semantics would be identical to Overwrite, I guess, for other non-JDBC sources. CC @yhuai ?

However... is it not possible to just TRUNCATE when the schema hasn't changed, and DROP/CREATE when it has? that seems like the best solution if it's feasible.

@dongjoon-hyun
Member Author

Thank you for the review, @srowen . Currently, the truncate option is ignored for other, non-JDBC sources.
IMO, most of them are file-based sources, so they do not need this.

Indeed, DROP/CREATE is the most robust solution, but, as you can see in the example, this is a minor optimization with a good benefit. We cannot achieve that benefit with DROP.

@srowen
Member

srowen commented Jul 11, 2016

Shouldn't truncate function like overwrite in instances where the distinction doesn't matter -- rather than be ignored?

I think the point of the issue was that DROP/CREATE isn't the best solution in all cases, specifically, where the table schema has not changed. It's at best a little slower and at worst destroys metadata.

I believe everyone sees there are use cases for both TRUNCATE and DROP/CREATE; I'm more specifically asking if we need to make the caller figure this out or whether it's easy and simple to use TRUNCATE when possible in Overwrite mode. Maybe it's not simple, or maybe we want to let people control this anyway, but it does have this cost of adding another setting to the API, and it's still possible to surprise yourself with Overwrite mode in this context if you're not aware of what Truncate does differently.

@dongjoon-hyun
Member Author

That's a good point of view. Right, I agree with you: DROP/CREATE isn't the best solution in all cases. We could automatically check schema compatibility in some way.
Let me dig into this issue more. It doesn't seem simple, but it looks more valuable.

@dongjoon-hyun
Member Author

Thank you for guiding me. I hope this PR becomes more valuable in Spark, too.

@dongjoon-hyun
Member Author

Hi, @srowen .
I've been looking at your advice further. Here is my summary so far.

SaveMode.TRUNCATE

  • Truncate the content only, not the structure of the existing data source.
  • I would prefer this mode to raise an exception if there is a problem.
    (I mean it is not safe for users to have SaveMode.Overwrite as the fallback operation.)
  • The schema compatibility check (SCC) is the key.

Behavior (how to handle and check schema compatibility)

  • JDBC: Use TRUNCATE TABLE if the target has the same dialect (same DB) and the same schema (name, type, nullability, order).
  • File-based source (like JSON): delete files only if the inferred schema is identical (directories represent partitions).

Does this make sense to you? If I missed some of your advice, please let me know.
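For illustration, here is a rough sketch of the JDBC-side schema compatibility check described above. This is not from the PR: it only compares column count, names, and nullability through JDBC metadata, and a complete check would also map JDBC types to Catalyst types.

```scala
import java.sql.{Connection, ResultSetMetaData}
import org.apache.spark.sql.DataFrame

// Returns true when the DataFrame's schema lines up with the existing table's
// columns by position: same count, matching names, no nullability conflict.
def hasCompatibleSchema(df: DataFrame, conn: Connection, table: String): Boolean = {
  val stmt = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0")
  try {
    val meta = stmt.getMetaData
    meta.getColumnCount == df.schema.length &&
      df.schema.fields.zipWithIndex.forall { case (field, i) =>
        val col = i + 1  // JDBC metadata columns are 1-based
        meta.getColumnName(col).equalsIgnoreCase(field.name) &&
          (meta.isNullable(col) != ResultSetMetaData.columnNoNulls || !field.nullable)
      }
  } finally {
    stmt.close()
  }
}
```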

@srowen
Member

srowen commented Jul 14, 2016

I suppose my key question is still: do we need to make the user choose at all? it seems like TRUNCATE is always the right choice except when schema changes. That means, ideally, no new SaveMode. Overwrite would just cause TRUNCATE where possible, otherwise DROP/CREATE.

I think this is specific to JDBC only. Other sources can leave their behavior unchanged.

Yes, I think the next question is, how hard is it to reliably detect that the schema hasn't changed? you know better than I, likely.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jul 14, 2016

TRUNCATE is not always the right choice for existing tables, because sometimes users really want to delete all information, including INDEXes/PARTITIONS, too.
SaveMode.Overwrite should mean SaveMode.Overwrite. We cannot optimize it without the user's permission, even for the same schema.

Recently, we added 'DELETE PURGE'. It's a similar situation. We can only provide an explicit option for users.

@srowen
Member

srowen commented Jul 14, 2016

It's possible. I think it's probably much more likely the metadata is supposed to be kept, but who knows. OK, this has decent logic. I'd prefer to get the opinion of someone else who has worked on this code, like @lw-lin or @zsxwing ?

@dongjoon-hyun
Member Author

Just rebased onto master in order to ensure it works.

@SparkQA

SparkQA commented Jul 15, 2016

Test build #62380 has finished for PR 14086 at commit 86dcc2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 15, 2016

Test build #62386 has finished for PR 14086 at commit 1679d86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jul 17, 2016

This seems pretty reasonable. Anyone else have an opinion on offering this option?

@dongjoon-hyun
Member Author

Thank you for keeping your attention on this PR, @srowen .

@gatorsmile
Member

Let me share my 2 cents here.

TRUNCATE TABLE is very risky. It is fast because the RDBMS does not log the individual row deletes. That means we are unable to roll it back in most RDBMSs. The behavior also differs among RDBMS vendors. For example,

  1. PostgreSQL truncates all descendant tables by default, unless we specify ONLY in the statement.
  2. DB2 for z/OS can roll it back, but the table is still left in a truncated state.
  3. Oracle 11 documents this caution at the very beginning: http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10007.htm (it cannot be rolled back). (Note: Oracle 12 added a new CASCADE clause; by default, descendant tables are not affected.)

Thus, I am not confident about adding such a mode or option to Spark SQL. This command is used very rarely in production systems. DBAs should do it manually, instead of using the Spark SQL interface.

If the community still thinks we should integrate it into Spark SQL, I am also fine with that. After a quick review, I think the implementation is missing some error handling, for example, when users try to use the truncate option with another SaveMode.

@srowen
Member

srowen commented Jul 17, 2016

@gatorsmile that all sounds reasonable, but right now a DROP/CREATE of the table happens. That's also not possible within a transaction and is a more drastic operation. Does this argument not apply even more to the existing behavior? The point, indeed, is to perform a more modest operation when possible.

@dongjoon-hyun
Member Author

Thank you for attention, @gatorsmile .

BTW, this option is for advanced users who know their DB and the limitations and powers of TRUNCATE.

As for the following comment, I really like this point of view. And I'm sure that DBAs will allow Spark architects/developers/users to use this option for the cases they could otherwise do manually.

DBAs should do it manually, instead of using the Spark SQL interface.

@dongjoon-hyun
Member Author

Yep. I totally agree with @srowen 's opinions, too.

@dongjoon-hyun
Member Author

Just rebased onto master.

@dongjoon-hyun
Member Author

This PR touches code originally written by @rxin .

Hi, @rxin .
Could you give us your opinion about supporting the truncate option with SaveMode.Overwrite for JDBC sources?

@SparkQA

SparkQA commented Jul 17, 2016

Test build #62437 has finished for PR 14086 at commit b3a39fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

  • DROP/CREATE/DELETE can be rolled back. In Oracle, we are unable to roll back TRUNCATE. I am not sure whether this is a big deal to Spark users.
  • If the data source is PostgreSQL, we also truncate all its descendant tables by default. This side effect does not look good to me. Not sure whether Spark users can tolerate it.
  • So far, Spark does not have role-based privileges; it completely depends on the underlying data sources. If DBAs have to grant higher privileges to the Spark app (for any reason), this change could be a potential issue.

I did not investigate all the RDBMS vendors, and different versions might have different behaviors. My suggestion is to do more investigation before adding this support.

@dongjoon-hyun
Member Author

The following are my opinions, in order of priority.

  • First of all, TRUNCATE requires a lower privilege than DROP/CREATE.
    • You know that the DROP/CREATE privileges can do all kinds of harm.
  • Second, we are assuming the original table is generated by DataFrameWriter, not some other general database table.
    • Why do you consider descendant tables? If you do, you should note that DROP would also delete them all, too.
  • Third, do we use rollback with DataFrameWriter?

I'm worried about losing focus. All the concerns are based on correct facts, but the scope of the arguments seems slightly too general for this PR. Please see the description of this PR. The context here is providing a truncate option for Spark JDBC tables generated by df.write.mode("overwrite").jdbc(url, "table_with_index", prop).

@gatorsmile
Member

gatorsmile commented Jul 18, 2016

TRUNCATE requires a lower privilege than DROP/CREATE.

This is not true. For example, in DB2 for z/OS, DROP and TRUNCATE require different privileges.
https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_truncate.html
https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_drop.html

  • DROP TABLE will not drop the descendant tables unless you explicitly specify CASCADE. The behavior of TRUNCATE depends on the vendor. TRUNCATE was not included in the ANSI SQL standard until 2008. That might be why different vendors behave so differently.
  • Rollback may not be an issue here, but truncation means the deletion records are not written into the log. Does it matter? It really depends on how applications use the output table of DataFrameWriter and the recovery log they generate. The product I previously worked on parses and utilizes the transaction recovery logs of different RDBMSs, so I know TRUNCATE is a big issue for most replication products.

I'm just sharing what I have learned here. Developing a general solution for different JDBC data sources is always very complicated. As more and more enterprise customers start using Spark, we will get more and more strange JIRAs.

@gatorsmile
Member

I think the above discussion is important when users actually use the truncate option. I am OK with letting users decide whether TRUNCATE is good for their scenario or not.

@dongjoon-hyun
Member Author

Hmm, it's strange. JDBCSuite passes locally. I'll rebase this.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62555 has finished for PR 14086 at commit dcf7b02.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62556 has finished for PR 14086 at commit 1713f8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jul 19, 2016

It's ready for review again. Could you review this PR when you have some time, @srowen and @gatorsmile ?

Member

If truncateTable fails due to a non-fatal exception, should we fall back to the previous way (i.e., drop and create)? This is a design decision. CC @srowen @rxin

Contributor

I'd say no, because the user has explicitly specified truncate. They can turn it off themselves.

Member

: ) Sure. Then, the next design question to @rxin and @srowen

Should we still truncate the table if the existing table schema does not match the schema of the new table?

Contributor

We should do whatever we do with drop.
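To make that decision concrete, here is a hedged sketch of the Overwrite path written with plain JDBC statements. The dialectCascades parameter stands in for the per-dialect "does TRUNCATE cascade?" answer discussed earlier; both the parameter and the helper shape are assumptions of this sketch, not the PR's actual code.

```scala
import java.sql.Connection

// Sketch only: truncate when the user asked for it and the dialect's TRUNCATE is
// known not to cascade; otherwise keep the existing DROP/CREATE behavior. Any
// failure during the subsequent insert propagates unchanged (no retry).
def overwriteTarget(
    conn: Connection,
    table: String,
    createTableDdl: String,
    useTruncate: Boolean,
    dialectCascades: Option[Boolean]): Unit = {
  val stmt = conn.createStatement()
  try {
    if (useTruncate && dialectCascades.contains(false)) {
      stmt.executeUpdate(s"TRUNCATE TABLE $table")  // indexes and constraints survive
    } else {
      stmt.executeUpdate(s"DROP TABLE $table")      // table metadata is lost
      stmt.executeUpdate(createTableDdl)
    }
  } finally {
    stmt.close()
  }
  // row insertion follows here; exceptions from it are not caught
}
```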

Member

I see. Then, the current implementation looks good to me.

@dongjoon-hyun Could you summarize the previous discussion and the design decisions we made? Document them in the PR description. Thanks!

Member Author

For my understanding, I will ask one question.

Literally, we should not do everything we do with drop, e.g., we should not drop the INDEX, right?

Member Author

Sure. I'll update the documentation and the PR description to be clearer.

Thank you for the guidance, @rxin and @gatorsmile .

Member

  • Drop, Create and Insert: Create and Insert could fail, but we still drop the table.
  • Truncate and Insert: Insert could fail, but we always truncate the table.

I think it is OK to raise an exception here, but please check whether the exception message is meaningful.

Member

Nope, dropping the index does not make sense here.

Member Author

@dongjoon-hyun dongjoon-hyun Jul 20, 2016


The current exception message is "Column xxx not found".

@dongjoon-hyun
Member Author

The PR description and the code documentation have been updated.

@SparkQA

SparkQA commented Jul 20, 2016

Test build #62582 has finished for PR 14086 at commit 98c81c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Please correct this.

Member Author

Oh, sure.

@gatorsmile
Member

LGTM except one minor comment.

@dongjoon-hyun
Member Author

Thank you, @gatorsmile .
I removed the repeated DerbyDialect and added OracleDialect.

@SparkQA

SparkQA commented Jul 20, 2016

Test build #62633 has finished for PR 14086 at commit 8fcca6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@dongjoon-hyun
Member Author

Hi, @srowen .
Could you review this JDBC PR again?

@dongjoon-hyun
Member Author

Rebased.

@SparkQA

SparkQA commented Jul 23, 2016

Test build #62741 has finished for PR 14086 at commit e989b3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(1 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).count())
assert(2 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).collect()(0).length)

val m = intercept[SparkException] {
Member

To check my understanding here, this overwrites the table with a different schema (new column seq). This shows the truncate fails because the schema has changed.

I guess it would be nice to at least test the case where the truncate works, though we can't really test whether it truncates vs. drops.

Could you, for example, just repeat the code on lines 163-166 here to verify that overwriting yields the same results?
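A sketch of what that repeated check could look like, for illustration only. It reuses url1, properties, and spark from the quoted test code; the actual DataFrame construction and table setup live in JDBCWriteSuite, so the df2 definition here is hypothetical.

```scala
// Hypothetical one-row, two-column DataFrame assumed to match the existing table's schema.
val df2 = spark.range(1).selectExpr("CAST(id AS STRING) AS name", "id")
df2.write.mode("overwrite").option("truncate", true)
  .jdbc(url1, "TEST.TRUNCATETEST", properties)
// Same shape as asserted above: still one row and two columns after the truncate path.
assert(1 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).count())
assert(2 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).collect()(0).length)
```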

Member Author

Sure, that would be better.

Member

Ah, never mind my last comment. You already test the truncate-succeeds path.
OK, the assertion here makes sense, though it highlights that if the overwrite can't succeed, it only fails after truncating, and the new DataFrame can't be written. I suppose that's reasonable semantics, since avoiding it would require doing something like a test insert.

@SparkQA

SparkQA commented Jul 23, 2016

Test build #62755 has finished for PR 14086 at commit 8b452cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jul 24, 2016

Merged to master

@asfgit asfgit closed this in cc1d2dc Jul 24, 2016
@dongjoon-hyun
Member Author

Thank you for review and merging, @srowen , @rxin , and @gatorsmile !

zzcclp added a commit to zzcclp/spark that referenced this pull request Jul 25, 2016
@dongjoon-hyun dongjoon-hyun deleted the SPARK-16410 branch August 14, 2016 09:44