[SPARK-22880][SQL] Add cascadeTruncate option to JDBC datasource #20057
Conversation
@dongjoon-hyun this is the functionality we discussed in the PR for SPARK-22729; would be great to hear your opinion on this :)
Could you remove the default value here?
That would be safer, since it prevents dialects from accidentally inheriting an implementation that is wrong for them.
That would force all dialects to implement this method though, which would lead to unnecessary code duplication. Moreover, removing the default value here would change the public API, as it would force others who have written custom dialects to change calls to this method.
That is the main reason I requested the change. I don't think this implementation is correct for all dialects. Anyway, it's only my opinion.
I think the default for all dialects should be false, regardless of whether the cascade feature is even supported. In my opinion, a cascade should only take place if the user explicitly asks for it.
Thank you for your contribution! I am doubting the value of this feature. If you are interested in JDBC-related work, could you take https://issues.apache.org/jira/browse/SPARK-22731? Thanks!

@gatorsmile could you explain why you have doubts about the feature? Thanks!

@gatorsmile would be great to hear why you doubt the value of the feature :). I know that for us it would be extremely valuable (at the moment we have to do an extra step in our data pipeline because this feature is missing in Spark), but of course we're not the only ones using Spark.
docs/sql-programming-guide.md
Outdated
This conf only takes effect when SaveMode.Overwrite is enabled and truncate is set to true. My doubt is that the usage scenario of this extra conf does not benefit most users. Could we hold this PR? When other users have similar requests, we can revisit it.
I think it raises the question of how complete or incomplete the Spark JDBC API should be, and which use cases it should serve. For the simplest cases, in which no key constraints are set between tables, you won't need this option. However, as soon as foreign key constraints are introduced, it becomes very important. I agree that not every piece of functionality from SQL (dialects) should be included, but I personally feel this is quite fundamental functionality.
Moreover, since it's a configuration option, users who don't want it don't have to use it. I think we also discussed this functionality in a previous PR with @dongjoon-hyun here: SPARK-22729
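To make the behaviour under discussion concrete, here is a minimal Python sketch of cascade-aware truncate-query generation (the actual implementation is Scala inside JdbcDialects.scala; the function name and default here are illustrative, not the real Spark API):

```python
def get_truncate_query(table: str, cascade: bool = False) -> str:
    """Build a TRUNCATE statement, optionally cascading to tables that
    hold foreign keys referencing the target (illustrative sketch)."""
    suffix = " CASCADE" if cascade else ""
    return f"TRUNCATE TABLE {table}{suffix}"

print(get_truncate_query("parent_table"))                # TRUNCATE TABLE parent_table
print(get_truncate_query("parent_table", cascade=True))  # TRUNCATE TABLE parent_table CASCADE
```

The key point of the discussion is the second call: without CASCADE, truncating a table referenced by foreign keys fails on databases such as PostgreSQL.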
Hi, @danielvdende.
I think you can ignore my comment in the previous PR. There were many directional comments on that PR, and it was not the final word. Your previous PR was merged by @gatorsmile.
For my part, I still don't agree with the default value inside JdbcDialects.scala in this PR.
@dongjoon-hyun apologies if I misrepresented your comments/opinions from the previous PR; that wasn't my intention :-).
I've given it some more thought, and I can see your point about the default value. I'll make the change we discussed.
@dongjoon-hyun one thing I'm thinking of (just curious to hear your opinion): could we use the value of isCascadingTruncateTable for each dialect as the default value for the cascade boolean? In that way, there is only a single boolean per dialect specifying what the default behaviour is with regard to cascading during truncates.
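A sketch of that proposal in Python (hypothetical class names; in Spark this would live in the Scala JdbcDialect hierarchy): each dialect carries a single flag, and the truncate-query builder falls back to it when the caller makes no explicit choice.

```python
class Dialect:
    # Per-dialect default; None can mean "cascading not applicable".
    is_cascading_truncate_table = None

    def get_truncate_query(self, table, cascade=None):
        # Fall back to the dialect's own default when no explicit
        # choice is made, so a single flag per dialect governs both
        # the capability and the default behaviour.
        if cascade is None:
            cascade = self.is_cascading_truncate_table
        suffix = " CASCADE" if cascade else ""
        return f"TRUNCATE TABLE {table}{suffix}"

class PostgresLikeDialect(Dialect):
    is_cascading_truncate_table = False  # supported, but off unless requested
```

With this wiring, `PostgresLikeDialect().get_truncate_query("t")` stays non-cascading unless the user passes `cascade=True` explicitly.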
I've made some changes to the code; the hardcoded

@dongjoon-hyun @gatorsmile any further thoughts on this?

@dongjoon-hyun @gatorsmile any further update?
Stephan202
left a comment
We could use this feature 👍 .
docs/sql-programming-guide.md
Outdated
About the documentation: s/used with case/used with care/
For dialects not supporting cascading, should the documentation perhaps indicate that the cascade parameter is ignored?
@Stephan202 thanks for pointing out those docs issues, just pushed the changes :-).
Any idea when this will be merged into master? We could use this since we are ditching Sqoop 👍

@dongjoon-hyun @gatorsmile sorry to keep asking, but could you let me know when we can get this merged?

We're in the process of integrating Spark in Airflow, and support for the Would be great if we can get this merged asap so we can continue testing. Cheers

ok to test
@danielvdende @Fokko We definitely want to help the community replace Sqoop with Spark SQL. However, could you show me the key missing features that are available in Sqoop but not in the Spark SQL JDBC connectors?

Test build #87484 has finished for PR 20057 at commit

Tests are failing on a Spark Streaming test. I think it's probably because of the age of this PR; I will rebase to pick up the changes that were merged into master since I opened it.
This commit adds the cascadeTruncate option to the JDBC datasource API, for databases that support this functionality (PostgreSQL and Oracle at the moment). This allows for applying a cascading truncate that affects tables that have foreign key constraints on the table being truncated.
Test build #87493 has finished for PR 20057 at commit
Hi @gatorsmile, thanks for putting it to the tests. The main reason why I personally dislike Sqoop is:
This is also where Spark JDBC comes in; for example, in the future we would like to delete single partitions, but this is WIP. Maybe @danielvdende can elaborate a bit on their use case.
Hmm, now it fails the OrcQuerySuite. This PR doesn't touch any of the ORC implementation in Spark. Could this be a flaky test, @gatorsmile?
Hi guys, @Fokko @gatorsmile, completely agree with what @Fokko mentioned; our main reason for wanting to get away from Sqoop is also stability, and to get rid of MapReduce in preparation for our move to Kubernetes (or something similar). We've also seen Spark to be much faster than Sqoop. In terms of why we need the feature in this PR: we have some tables in PostgreSQL that are linked by foreign keys, and we have specified a schema for these tables. If we use the drop-and-recreate option, Spark will determine the schema itself, overriding our PostgreSQL schema. Obviously these should match up, but I personally don't like that Spark can do this (and that you can't explicitly tell it not to). Because of this behaviour, we currently need 2 tasks in Airflow (similar to what @Fokko mentioned) to ensure the tables are truncated but the schema stays in place. This PR would enable us to specify, in a single idempotent (Airflow) task, that we want to truncate the table before putting new data in it. The cascade enables us to avoid breaking foreign key relations and causing errors. To be clear, this therefore isn't emulating a Sqoop feature (as a Sqoop task isn't idempotent), but is in fact improving on what Sqoop offers.
Our overwrite semantics are confusing to most users. We need to correct them in the next release, i.e., Spark 2.4. Even if we try our best to keep the schema of the original table, the actual CREATE TABLE statement still takes a lot of vendor-specific info, and it is hard for us to generate a CREATE TABLE containing all of it. I can understand your use case for truncate. I am sorry this will not be part of the Spark 2.3 release; we will include it in the next release. You can still make the change in your forked Spark. Just feel free to let us know if you find anything that we should do in Spark SQL JDBC to match the corresponding features in Sqoop. Thanks!
This test is a flaky test. Your changes did not fail any test case. I will review your PR after the 2.3 release. Thanks again! cc @dongjoon-hyun Do you want to take a look at this?
Thank you for pinging me, @gatorsmile. Yep, I'll take a look at both
Thanks guys! @gatorsmile @dongjoon-hyun
docs/sql-programming-guide.md
Outdated
Maybe we need a default value description, like It defaults to <code>false</code>?
As mentioned in another comment, I think we should use the value of isCascadingTruncateTable as the default, rather than always false. Seems like the correct use of that variable. I can add a sentence to the docs specifying that.
@dongjoon-hyun Made the changes you pointed out, thanks! 👍

Test build #87591 has finished for PR 20057 at commit

@dongjoon-hyun I saw that Spark 2.3 was released a few days ago, congrats on the release! :-) Is there anything stopping us from merging this PR into master now?
dongjoon-hyun
left a comment
@gatorsmile . Could you review this PR?
Spark supported TRUNCATE before TeradataDialect was added. In this PR, it turns out TeradataDialect doesn't support the TRUNCATE SQL syntax at all. Although this PR introduces a DELETE statement in the TeradataDialect.getTruncateQuery function, I think this feature is helpful in general.
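Since Teradata lacks a TRUNCATE statement, the fallback is a full-table DELETE. A Python sketch of the idea (the real code is Scala in TeradataDialect; the exact syntax here is an assumption based on the PR discussion):

```python
def teradata_truncate_query(table: str) -> str:
    # Teradata has no TRUNCATE statement; an unconditional DELETE
    # achieves the same effect. The trailing "ALL" is Teradata's
    # fast full-table delete (assumption; verify against your DB).
    return f"DELETE FROM {table} ALL"

print(teradata_truncate_query("events"))  # DELETE FROM events ALL
```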
Let's remove this redundant empty line addition.
Let's remove line 114.
Also update the tests accordingly
Test build #88040 has finished for PR 20057 at commit

Retest this please.

Test build #88052 has finished for PR 20057 at commit

The failures are

Retest this please.

Test build #88057 has finished for PR 20057 at commit

Gentle ping @gatorsmile :-)

Gentle ping again @gatorsmile @dongjoon-hyun :-)

Gentle ping @gatorsmile @dongjoon-hyun 👍

@gatorsmile @dongjoon-hyun Guys, any update?

@dongjoon-hyun @gatorsmile Any update, guys?

Test build #92841 has finished for PR 20057 at commit

retest this please

Test build #92963 has finished for PR 20057 at commit
gatorsmile
left a comment
LGTM except two minor comments
    @Since("2.4.0")
    def getTruncateQuery(
        table: String,
        cascade: Option[Boolean] = isCascadingTruncateTable): String = {
Nit: indent.
        options.isCascadeTruncate))
    } else {
      statement.executeUpdate(dialect.getTruncateQuery(options.table))
    }
+1 to the above comment of @dongjoon-hyun
retest this please

Test build #93322 has finished for PR 20057 at commit

Just made the changes you mentioned @gatorsmile :-)

Test build #93327 has finished for PR 20057 at commit
dongjoon-hyun
left a comment
Thank you, @gatorsmile and @danielvdende .
LGTM, too.
Thanks! Merged to master.
What changes were proposed in this pull request?
Add a <code>cascadeTruncate</code> option to the JDBC datasource API. Allow this to affect the <code>TRUNCATE</code> query for databases that support this option.
How was this patch tested?
Existing tests for <code>truncateQuery</code> were updated. Also, an additional test was added to ensure that the correct syntax is applied, and that enabling the option for databases that do not support it does not result in invalid queries.
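As a usage sketch (placeholder connection details; in PySpark these options would be passed through DataFrameWriter before an overwrite save):

```python
# Options for a JDBC overwrite that truncates with CASCADE. The
# cascadeTruncate flag only takes effect when the save mode is
# "overwrite" and "truncate" is also enabled. The URL and table
# name below are placeholders, not real connection details.
jdbc_options = {
    "url": "jdbc:postgresql://localhost/mydb",
    "dbtable": "parent_table",
    "truncate": "true",
    "cascadeTruncate": "true",
}
# With a real SparkSession and DataFrame `df`, the write would be:
# df.write.mode("overwrite").options(**jdbc_options).format("jdbc").save()
```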