[SPARK-16463][SQL] Support truncate option in Overwrite mode for JDBC DataFrameWriter
#14086
Conversation
|
Test build #61907 has finished for PR 14086 at commit
|
truncate option in Overwrite mode for JDBC DataFrameWriter
|
Due to the difference in scope, I created another JIRA issue. |
|
Could you review this PR, @srowen ? |
|
Hm, it seems like it should be another `SaveMode`, then. However... is it not possible to just TRUNCATE when the schema hasn't changed, and DROP/CREATE when it has? That seems like the best solution, if it's feasible. |
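A minimal sketch of the strategy floated here, with `truncateTable` and `dropAndCreateTable` as hypothetical helpers (not Spark APIs) passed in as parameters; the PR ultimately exposes an explicit option instead:

```scala
import java.sql.Connection
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Truncate when the existing table's schema matches the DataFrame's schema,
// otherwise fall back to DROP/CREATE. The helpers are illustrative stand-ins.
def overwriteTable(
    conn: Connection,
    df: DataFrame,
    table: String,
    existingSchema: StructType,
    truncateTable: (Connection, String) => Unit,
    dropAndCreateTable: (Connection, String, StructType) => Unit): Unit = {
  if (existingSchema == df.schema) {
    truncateTable(conn, table)                 // fast path: keeps indexes/constraints
  } else {
    dropAndCreateTable(conn, table, df.schema) // schema changed: rebuild
  }
  // ... then insert df's rows as in the normal JDBC write path
}
```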
|
Thank you for the review, @srowen . Indeed, DROP/CREATE is the most robust solution, but as you can see in the example, this is a minor optimization with a good benefit, and we cannot achieve that benefit with DROP/CREATE. |
|
Shouldn't truncate function like overwrite in instances where the distinction doesn't matter -- rather than be ignored? I think the point of the issue was that DROP/CREATE isn't the best solution in all cases, specifically, where the table schema has not changed. It's at best a little slower and at worst destroys metadata. I believe everyone sees there are use cases for both TRUNCATE and DROP/CREATE; I'm more specifically asking if we need to make the caller figure this out or whether it's easy and simple to use TRUNCATE when possible in Overwrite mode. Maybe it's not simple, or maybe we want to let people control this anyway, but it does have this cost of adding another setting to the API, and it's still possible to surprise yourself with Overwrite mode in this context if you're not aware of what Truncate does differently. |
|
That's a good point of view. Right, I agree with you. |
|
Thank you for guiding me. I hope this PR becomes more valuable in Spark, too. |
|
Hi, @srowen . The updated proposal:
- New mode: `SaveMode.TRUNCATE`
- Behavior (how to handle and check schema compatibility)
Does this make sense to you? If I missed some of your advice, please let me know. |
|
I suppose my key question is still: do we need to make the user choose at all? It seems like TRUNCATE is always the right choice except when the schema changes. That means, ideally, no new SaveMode; Overwrite would just cause TRUNCATE where possible, otherwise DROP/CREATE. I think this is specific to JDBC only; other sources can leave their behavior unchanged. Yes, I think the next question is: how hard is it to reliably detect that the schema hasn't changed? You know better than I, likely. |
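One way to probe that question (a sketch only, not what the PR implements, with `sameColumnNames` as a hypothetical helper) is to compare the column names the JDBC driver reports with the DataFrame's field names:

```scala
import java.sql.Connection
import org.apache.spark.sql.DataFrame

// Returns true when the existing table's column names (in ordinal order)
// match the DataFrame's field names. Types, case sensitivity, and
// constraints are ignored, which is exactly why reliable detection is
// harder than it looks.
def sameColumnNames(conn: Connection, table: String, df: DataFrame): Boolean = {
  val rs = conn.getMetaData.getColumns(null, null, table, null)
  val existing = Iterator.continually(rs)
    .takeWhile(_.next())
    .map(_.getString("COLUMN_NAME"))
    .toSeq
  existing == df.schema.fieldNames.toSeq
}
```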
|
Recently, we added 'DELETE PURGE'. It's a similar situation: we can only provide an explicit option for users. |
|
Just rebased onto master in order to ensure it works. |
|
Test build #62380 has finished for PR 14086 at commit
|
|
Test build #62386 has finished for PR 14086 at commit
|
|
This seems pretty reasonable. Anyone else have an opinion on offering this option? |
|
Thank you for keeping attention on this PR, @srowen . |
|
Let me share my 2 cents here.
I am not confident about adding such a mode or option into Spark SQL. This command is used very rarely in production systems; DBAs should do it manually instead of using the Spark SQL interface. If the community still thinks we should integrate it into Spark SQL, I am also fine with that. After a quick review, I think the implementation misses some error handling, for example, when users try to use the option in an unsupported case. |
|
@gatorsmile that all sounds reasonable, but right now a DROP/CREATE of the table happens. That's also not possible within a transaction and is a more drastic operation. Does this argument not apply more to the existing behavior? The point indeed is to perform a more modest operation where possible. |
|
Thank you for the attention, @gatorsmile . BTW, this option is for the advanced users who know their DB and the limitations and powers of TRUNCATE. About the following comment, I really like this point of view. And I'm sure that DBAs will allow Spark architects/developers/users to use this option in the cases they could do manually.
|
|
Yep. I totally agree with @srowen 's opinions, too. |
|
Just rebased onto master. |
|
Test build #62437 has finished for PR 14086 at commit
|
I did not investigate all the RDBMS vendors, and different versions might have different behaviors. My suggestion is to do more investigation before adding this support. |
|
The following are my opinions, in order of priority.
I'm worried about losing focus. All the concerns are based on correct facts, but the scope of the arguments seems slightly too general for this PR. Please see the description of this PR. The context of this PR is providing a truncate option for the Spark JDBC tables generated by df.write.mode("overwrite").jdbc(url, "table_with_index", prop); see the usage sketch below. |
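For concreteness, here is how the option is used once this PR is in, with `df` and `url` as in the snippet just quoted and placeholder credentials in `prop`:

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val prop = new Properties()
prop.put("user", "username")       // placeholder credentials
prop.put("password", "password")

// truncate=true makes Overwrite issue TRUNCATE TABLE (where the dialect
// allows it) instead of DROP/CREATE, so indexes and constraints survive.
df.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")
  .jdbc(url, "table_with_index", prop)
```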
This is not true; DB2 for z/OS, for example, behaves differently.
I just want to share what I learned here. Developing a general solution for different JDBC data sources is always very, very complicated. When more and more enterprise customers start using Spark, we will get more and more requests like this. |
|
I think the above discussions are important when users really use the truncate option. |
|
Ur, it's strange. |
|
Test build #62555 has finished for PR 14086 at commit
|
|
Test build #62556 has finished for PR 14086 at commit
|
|
It's ready for review again. Could you review this PR when you have some time, @srowen and @gatorsmile ? |
I'd say no, because the user has explicitly specified truncate. They can turn it off themselves.
We should do whatever we do with drop.
I see. Then, the current implementation looks good to me.
@dongjoon-hyun Could you summarize the previous discussion and the design decisions we made? Document them in the PR description. Thanks!
To check my understanding, I will ask one question.
Literally, we should not do everything we do with drop, e.g., we should not drop the INDEX, right?
Sure. I'll update the document and PR description more clearly.
Thank you for the guidance, @rxin and @gatorsmile .
- Drop, Create and Insert: Create and Insert could fail, but we still drop the table.
- Truncate and Insert: Insert could fail, but we always truncate the table.
I think it is OK to raise an exception here, but check whether the exception message is meaningful or not.
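A schematic of the two sequences above and their failure points, reduced to bare JDBC calls; `createTableDdl` and the `insertRows` thunk are illustrative stand-ins, not Spark's actual implementation:

```scala
import java.sql.Statement

// Drop, Create and Insert: once DROP succeeds there is no way back; a later
// failure leaves no table (CREATE failed) or an empty one (insert failed).
def overwriteWithDrop(stmt: Statement, table: String, createTableDdl: String)
                     (insertRows: => Unit): Unit = {
  stmt.executeUpdate(s"DROP TABLE $table")   // point of no return
  stmt.executeUpdate(createTableDdl)         // may fail: table is gone
  insertRows                                 // may fail: table is empty
}

// Truncate and Insert: TRUNCATE is the point of no return; if the insert
// fails (e.g. the schema changed), the table is empty but still intact,
// with its indexes and constraints preserved.
def overwriteWithTruncate(stmt: Statement, table: String)
                         (insertRows: => Unit): Unit = {
  stmt.executeUpdate(s"TRUNCATE TABLE $table")
  insertRows
}
```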
Nope, dropping index does not make sense here.
The current exception message is "Column xxx not found".
|
The PR description and code documentation have been updated. |
|
Test build #62582 has finished for PR 14086 at commit
|
Please correct this.
Oh, sure.
|
LGTM except one minor comment. |
|
Thank you, @gatorsmile . |
|
Test build #62633 has finished for PR 14086 at commit
|
|
LGTM |
|
Hi, @srowen . |
|
Rebased. |
|
Test build #62741 has finished for PR 14086 at commit
|
```scala
assert(1 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).count())
assert(2 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).collect()(0).length)

val m = intercept[SparkException] {
```
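A hedged guess at how the intercepted block continues, based on the discussion below: `df3` stands for a DataFrame whose schema adds a new column `seq`, with `url1` and `properties` from the surrounding suite, so the truncate-then-insert path fails on insert:

```scala
// Hypothetical completion of the excerpt above; df3 adds a `seq` column.
val m = intercept[SparkException] {
  df3.write.mode(SaveMode.Overwrite).option("truncate", "true")
    .jdbc(url1, "TEST.TRUNCATETEST", properties)
}.getMessage
assert(m.contains("not found"))  // H2 reports "Column ... not found"
```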
To check my understanding here: this overwrites the table with a different schema (new column `seq`), showing that the truncate fails because the schema has changed.
I guess it would be nice to test the case where the truncate works, at least, though we can't really test whether it truncates vs. drops.
Could you, for example, just repeat the code on lines 163-166 here to verify that overwriting again yields the same results?
Sure, that would be better.
Ah, never mind my last comment. You already test the truncate-succeeds path.
OK, the assertion here makes sense, though it highlights that if truncation can't succeed, it only fails after truncating, and the new dataframe can't be written. I suppose that's reasonable semantics, since avoiding it would otherwise require doing something like a test insert.
|
Test build #62755 has finished for PR 14086 at commit
|
|
Merged to master |
|
Thank you for review and merging, @srowen , @rxin , and @gatorsmile ! |
[SPARK-16463][SQL] Support truncate option in Overwrite mode for JDBC DataFrameWriter apache#14086
What changes were proposed in this pull request?
This PR adds a boolean option, `truncate`, for `SaveMode.Overwrite` of the JDBC DataFrameWriter. If this option is `true`, it tries to take advantage of `TRUNCATE TABLE` instead of `DROP TABLE`. This is a trivial option, but it will provide great convenience for BI tool users who work on RDBMS tables generated by Spark.

Goal

- Without the `CREATE/DROP` privilege, we can save a dataframe to a database. Sometimes these are not allowed for security reasons.
- It keeps the existing `INDEX` and `CONSTRAINT`s for the table.
- `TRUNCATE` is faster than the combination of `DROP/CREATE`.

Supported DBMS
The following is the `truncate`-option support table. Due to the different behavior of `TRUNCATE TABLE` among DBMSs, it's not always safe to use `TRUNCATE TABLE`. Spark will ignore the `truncate` option for unknown DBMSs and for DBMSs whose default behavior is CASCADING. A newly added JdbcDialect should additionally implement the corresponding function to support the `truncate` option; a sketch of that hook follows below.

`truncate` OPTION SUPPORT

Before (TABLE with INDEX case): Spark shell & MySQL CLI are interleaved intentionally.

After (TABLE with INDEX case)
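A sketch of that dialect hook, assuming the `isCascadingTruncateTable()` method added to `JdbcDialect` (`jdbc:mydb` is a toy URL prefix):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Returning Some(false) tells Spark that TRUNCATE TABLE does not cascade on
// this DBMS, so the truncate option is safe to honor; None means unknown,
// and the option is ignored.
object MyDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")
  override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
}

JdbcDialects.registerDialect(MyDialect)
```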
Error Handling
For unsupported DBMSs, Spark ignores the `truncate` option.

How was this patch tested?
Pass the Jenkins tests with an updated test case.