Conversation

@danielvdende
Contributor

This commit adds the cascadeTruncate option to the JDBC datasource
API, for databases that support this functionality (PostgreSQL and
Oracle at the moment). This allows for applying a cascading truncate
that affects tables that have foreign key constraints on the table
being truncated.

What changes were proposed in this pull request?

Add a cascadeTruncate option to the JDBC datasource API, and allow it to affect the
TRUNCATE query for databases that support this option.
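
A minimal usage sketch of the new option (assuming a DataFrame `df` and a `java.util.Properties` `connectionProperties` are in scope; the URL and table name are placeholders):

```scala
import org.apache.spark.sql.SaveMode

// Overwrite an existing JDBC table by truncating it, cascading to
// dependent tables on databases that support it, instead of dropping
// and recreating the table.
df.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")          // reuse the existing table
  .option("cascadeTruncate", "true")   // the option added in this PR
  .jdbc("jdbc:postgresql://host/db", "target_table", connectionProperties)
```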

How was this patch tested?

Existing tests for truncateQuery were updated. An additional test was added
to ensure that the correct syntax is applied, and that enabling the option for databases
that do not support it does not result in invalid queries.
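
A sketch of the kind of assertion involved (a sketch, not the exact suite code; the expected strings assume the PostgreSQL dialect appends CASCADE while dialects without cascade support fall back to a plain TRUNCATE):

```scala
import org.apache.spark.sql.jdbc.JdbcDialects

val postgres = JdbcDialects.get("jdbc:postgresql://localhost/db")
assert(postgres.getTruncateQuery("t", Some(true)) ==
  "TRUNCATE TABLE ONLY t CASCADE")

// A dialect without cascade support must still produce a valid query
// even when the flag is set.
val mysql = JdbcDialects.get("jdbc:mysql://localhost/db")
assert(mysql.getTruncateQuery("t", Some(true)) == "TRUNCATE TABLE t")
```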

@danielvdende
Contributor Author

danielvdende commented Dec 22, 2017

@dongjoon-hyun this is the functionality we discussed in the PR for SPARK-22729; would be great to hear your opinion on this :)

Member

Could you remove the default value here?
That would be safer, preventing accidental inheritance of an implementation that is wrong for some dialects.

Contributor Author

That would force all dialects to implement this method though, which would lead to unnecessary code duplication. Moreover, removing the default value here would change the public API, as it would force others who have written custom dialects to change calls to this method.

Member

That is the main reason I request the change. I don't think this implementation is correct for all dialects. Anyway, it's only my opinion.

Contributor Author

@danielvdende danielvdende Dec 22, 2017

I think the default for all dialects should be false, regardless of whether the cascade feature is even supported. Imho, a cascade should only take place if the user explicitly asks for it.

@gatorsmile
Member

Thank you for your contribution! I have doubts about the value of this feature, though.

If you are interested in the JDBC-related work, could you take https://issues.apache.org/jira/browse/SPARK-22731?

Thanks!

@danielvdende
Contributor Author

@gatorsmile could you explain why you have doubts about the feature? Thanks!

@danielvdende
Contributor Author

@gatorsmile would be great to hear why you doubt the value of the feature :). I know that for us it would be extremely valuable (at the moment we have to do an extra step in our data pipeline because this feature is missing in Spark), but of course we're not the only ones using Spark.

Member

This conf only takes effect when SaveMode.Overwrite is enabled and truncate is set to true. My doubt is that the usage scenario of this extra conf does not benefit most users. Could we hold this PR? If other users have similar requests, we can revisit it.

Contributor Author

I think it raises the question of how complete or incomplete the Spark JDBC API should be, and what use cases it should serve. For the simplest cases, in which no key constraints are set between tables, you won't need this option. However, as soon as foreign key constraints are introduced, it becomes very important. I agree that not every piece of functionality from SQL (dialects) should be included, but I personally feel this is quite fundamental functionality.

Moreover, as it's a configuration option, users who don't want it don't have to use it. I think we also discussed this functionality with @dongjoon-hyun in a previous PR: SPARK-22729

Member

Hi, @danielvdende.
I think you can ignore my comment in the previous PR. There were many directional comments on that PR, and mine was not the final one. Your previous PR was merged by @gatorsmile.

For what it's worth, I still don't agree with the default value inside JdbcDialects.scala in this PR.

Contributor Author

@dongjoon-hyun apologies if I misrepresented your comments/opinions from the previous PR; that wasn't my intention :-).

I've given it some more thought, and I can see your point about the default value. I'll make the change we discussed.

Contributor Author

@dongjoon-hyun one thing I'm thinking of (just curious to hear your opinion): could we use the value of isCascadingTruncateTable for each dialect as the default value for the cascade boolean? That way, there would be only a single boolean per dialect specifying the default behaviour with regard to cascading during truncates.
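
For reference, a sketch of the existing flag being proposed as the default (the base signature is as in JdbcDialects.scala; the PostgreSQL value is an assumption based on its TRUNCATE ... ONLY semantics):

```scala
// Base JdbcDialect: cascading behaviour is unknown by default.
def isCascadingTruncateTable(): Option[Boolean] = None

// PostgresDialect: a plain TRUNCATE TABLE ONLY does not cascade.
override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
```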


@danielvdende
Contributor Author

danielvdende commented Jan 3, 2018

I've made some changes to the code; the hardcoded false default value for cascade in JdbcDialects is now replaced by a default value of isCascadingTruncateTable. This is also the case for each individual dialect, with a pattern match in the dialects for Postgres and Oracle, as they support the cascading behaviour. I think this is quite nice, because it now re-uses the isCascadingTruncateTable function, which defines the default truncating behaviour of a dialect. Curious to hear your thoughts on this @dongjoon-hyun @gatorsmile
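
A sketch of the approach described above (the base signature matches the one quoted later in this review; the PostgreSQL query strings are illustrative assumptions):

```scala
// JdbcDialect: the cascade parameter now defaults to the dialect's own
// isCascadingTruncateTable value instead of a hardcoded false.
def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
  s"TRUNCATE TABLE $table"
}

// PostgresDialect: pattern-match on the flag to decide whether to append
// CASCADE; ONLY restricts the truncate to the named table itself.
override def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
  cascade match {
    case Some(true) => s"TRUNCATE TABLE ONLY $table CASCADE"
    case _ => s"TRUNCATE TABLE ONLY $table"
  }
}
```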

@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile any further thoughts on this?

@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile any further update?

@Stephan202 Stephan202 left a comment

We could use this feature 👍.


About the documentation: s/used with case/used with care/


For dialects not supporting cascading, should the documentation perhaps indicate that the cascade parameter is ignored?

@danielvdende
Contributor Author

@Stephan202 thanks for pointing out those docs issues, just pushed the changes :-).
@gatorsmile @dongjoon-hyun would you have a chance to take a look at this again?

@Fokko
Contributor

Fokko commented Feb 5, 2018

Any idea when this will be merged into master? We could use this since we are ditching Sqoop 👍

@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile sorry to keep asking, but could you let me know when we can get this merged?

@Fokko
Contributor

Fokko commented Feb 15, 2018

We're in the process of integrating Spark in Airflow, and support for cascadeTruncate is required to make this succeed. First steps are here: apache/airflow#3021.

Would be great if we can get this merged asap so we can continue testing. Cheers

@gatorsmile
Member

ok to test

@gatorsmile
Member

@danielvdende @Fokko We definitely want to help the community replace Sqoop with Spark SQL. However, truncate is only used when users use SaveMode.Overwrite to write to external JDBC tables. In this specific scenario, Spark will truncate an existing table instead of dropping and recreating it.

Could you show me the key missing features that are available in Sqoop but not in the Spark SQL JDBC connectors?

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87484 has finished for PR 20057 at commit 3a7dda4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

Tests are failing on a Spark streaming test. It's probably because of the age of this PR; I'll rebase to pull in the changes that were merged into master since I opened the PR.

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87493 has finished for PR 20057 at commit 6c0d3df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko
Contributor

Fokko commented Feb 16, 2018

Hi @gatorsmile, thanks for putting it to the test. The main reasons why I personally dislike Sqoop:

  • Legacy. The old MapReduce should be buried in the coming years. As a data engineering consultant, I see more people questioning the whole Hadoop stack. Using Sqoop you still need to run MapReduce tasks, and this isn't easy on other platforms like Kubernetes.
  • Stability. I see Sqoop jobs fail quite often, and there isn't a nice way of retrying them atomically. For example, with a Sqoop job on Airflow, we cannot simply retry the operation: when we import data from an RDBMS to HDFS, we first have to make sure that the target directory of the previous run has been deleted.

This is also where Spark JDBC comes in; for example, in the future we would like to delete single partitions, but this is WIP. Maybe @danielvdende can elaborate a bit on their use case.

@danielvdende
Contributor Author

Hmm, now it fails the OrcQuerySuite. This PR doesn't touch any of the ORC implementation in Spark. Could this be a flaky test @gatorsmile?
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 12 times over 10.158754687999998 seconds. Last failure message: There are 1 possibly leaked file streams..

@danielvdende
Contributor Author

danielvdende commented Feb 16, 2018

Hi guys, @Fokko @gatorsmile, I completely agree with what @Fokko mentioned; our main reason for wanting to get away from Sqoop is also stability, and getting rid of MapReduce in preparation for our move to Kubernetes (or something similar). We've also found Spark to be much faster than Sqoop. In terms of why we need the feature in this PR: we have some tables in PostgreSQL that are linked by foreign keys. We have also specified a schema for these tables. If we use the drop-and-recreate option, Spark will determine the schema, overriding our PostgreSQL schema. Obviously, these should match up, but I personally don't like that Spark can do this (and that you can't explicitly tell it not to).

Because of this behaviour, we currently need two tasks in Airflow (similar to what @Fokko mentioned) to ensure the tables are truncated but the schema stays in place. This PR would enable us to specify, in a single, idempotent (Airflow) task, that we want to truncate the table before putting new data in. The cascade lets us do this without breaking foreign key relations and causing errors.

To be clear, this therefore isn't emulating a Sqoop feature (a Sqoop task isn't idempotent), but is in fact improving on what Sqoop offers.

@gatorsmile
Member

gatorsmile commented Feb 16, 2018

Our overwrite semantics are confusing to most users. We need to correct them in the next release, i.e., Spark 2.4.

Even if we try our best to keep the schema of the original table, the actual CREATE TABLE statements still involve a lot of vendor-specific information. It is hard for us to generate a CREATE TABLE that captures all of it. I can understand your use case for truncate.

I am sorry this will not be part of the Spark 2.3 release. We will include it in the next release. You can still apply the change in your forked Spark.

Just feel free to let us know if you find anything we should do in Spark SQL JDBC to match the corresponding features in Sqoop. Thanks!

@gatorsmile
Member

This test is a flaky test. Your changes did not fail any test case. I will review your PR after the 2.3 release. Thanks again!

cc @dongjoon-hyun Do you want to take a look at this?

@dongjoon-hyun
Member

Thank you for pinging me, @gatorsmile. Yep, I'll take a look at both OrcQuerySuite and this PR this morning. Sorry for the late response, @danielvdende.

@danielvdende
Contributor Author

Thanks guys! @gatorsmile @dongjoon-hyun
Happy to help out expanding Spark SQL JDBC where necessary to match and improve on Sqoop :-)

Member

Maybe we need a default value description, like It defaults to <code>false</code>?

Contributor Author

As mentioned in another comment, I think we should use the value of isCascadingTruncateTable as the default, rather than always false. Seems like the correct use of that variable. I can add a sentence to the docs specifying that.

@danielvdende
Contributor Author

@dongjoon-hyun Made the changes you pointed out, thanks! 👍

@SparkQA

SparkQA commented Feb 21, 2018

Test build #87591 has finished for PR 20057 at commit ed452e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

@dongjoon-hyun I saw that Spark 2.3 was released a few days ago, congrats on the release! :-) Is there anything stopping us from merging this PR into master now?

Member

@dongjoon-hyun dongjoon-hyun left a comment

@gatorsmile. Could you review this PR?

Spark had supported TRUNCATE before TeradataDialect was added. In this PR, it turned out that Teradata doesn't support the TRUNCATE SQL syntax at all. Although this PR introduces a DELETE statement in the TeradataDialect.getTruncateQuery function, I think this feature is helpful in general.
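
A sketch of the Teradata fallback mentioned above (the exact query string is an assumption; Teradata's idiom for emptying a table is an unconditional DELETE):

```scala
// TeradataDialect: Teradata has no TRUNCATE, so fall back to DELETE.
// The cascade flag is accepted for API compatibility but not honoured.
override def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
  s"DELETE FROM $table ALL"
}
```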

Member

Let's remove this redundant empty line addition.

Member

Let's remove line 114.

Also update the tests accordingly.

@SparkQA

SparkQA commented Mar 7, 2018

Test build #88040 has finished for PR 20057 at commit 7e0ff07.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Mar 7, 2018

Test build #88052 has finished for PR 20057 at commit 7e0ff07.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

The failures are:

  • Last time, RateSourceV2Suite.
  • This time, HiveClientSuites.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Mar 8, 2018

Test build #88057 has finished for PR 20057 at commit 7e0ff07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

Gentle ping @gatorsmile :-)

@danielvdende
Contributor Author

Gentle ping again @gatorsmile @dongjoon-hyun :-)

@Fokko
Contributor

Fokko commented Apr 3, 2018

Gentle ping @gatorsmile @dongjoon-hyun 👍

@danielvdende
Contributor Author

@gatorsmile @dongjoon-hyun Guys, any update?

@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile Any update, guys?

@SparkQA

SparkQA commented Jul 11, 2018

Test build #92841 has finished for PR 20057 at commit bc75051.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jul 13, 2018

Test build #92963 has finished for PR 20057 at commit bc75051.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@gatorsmile gatorsmile left a comment

LGTM except for two minor comments.

```scala
@Since("2.4.0")
def getTruncateQuery(
table: String,
cascade: Option[Boolean] = isCascadingTruncateTable): String = {
```
Member

Nit: indent.
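
Presumably the requested fix, assuming Spark's usual four-space continuation indent for parameter lists:

```scala
@Since("2.4.0")
def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
```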

```scala
options.isCascadeTruncate))
} else {
statement.executeUpdate(dialect.getTruncateQuery(options.table))
}
```
Member

+1 to the above comment of @dongjoon-hyun

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jul 20, 2018

Test build #93322 has finished for PR 20057 at commit bc75051.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

Just made the changes you mentioned @gatorsmile :-)

@SparkQA

SparkQA commented Jul 20, 2018

Test build #93327 has finished for PR 20057 at commit a365f79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @gatorsmile and @danielvdende.
LGTM, too.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 2333a34 Jul 20, 2018