Conversation

@danielvdende
Contributor

This commit adds the cascadeTruncate option to the JDBC datasource
API, for databases that support this functionality (PostgreSQL and
Oracle at the moment). This allows for applying a cascading truncate
that affects tables that have foreign key constraints on the table
being truncated.

What changes were proposed in this pull request?

Add a cascadeTruncate option to the JDBC datasource API, and allow it to affect the
TRUNCATE query for databases that support this option.
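
A minimal usage sketch of the new option (assuming a DataFrame `df` and a `java.util.Properties` `connectionProperties` are in scope; the URL and table name are placeholders):

```scala
import org.apache.spark.sql.SaveMode

// Overwrite an existing JDBC table by truncating it, cascading to
// dependent tables on databases that support it, instead of dropping
// and recreating the table.
df.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")          // reuse the existing table
  .option("cascadeTruncate", "true")   // the option added in this PR
  .jdbc("jdbc:postgresql://host/db", "target_table", connectionProperties)
```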

How was this patch tested?

Existing tests for truncateQuery were updated. An additional test was added
to ensure that the correct syntax is applied, and that enabling the option for databases
that do not support it does not result in invalid queries.
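
A sketch of the kind of assertion involved (a sketch, not the exact suite code; the expected strings assume the PostgreSQL dialect appends CASCADE while dialects without cascade support fall back to a plain TRUNCATE):

```scala
import org.apache.spark.sql.jdbc.JdbcDialects

val postgres = JdbcDialects.get("jdbc:postgresql://localhost/db")
assert(postgres.getTruncateQuery("t", Some(true)) ==
  "TRUNCATE TABLE ONLY t CASCADE")

// A dialect without cascade support must still produce a valid query
// even when the flag is set.
val mysql = JdbcDialects.get("jdbc:mysql://localhost/db")
assert(mysql.getTruncateQuery("t", Some(true)) == "TRUNCATE TABLE t")
```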

@danielvdende
Contributor Author

danielvdende commented Dec 22, 2017

@dongjoon-hyun this is the functionality we discussed in the PR for SPARK-22729; would be great to hear your opinion on this :)

Member

Could you remove the default value here?
That would be safer, preventing accidental inheritance of an implementation that is wrong for some dialects.

Contributor Author

That would force all dialects to implement this method though, which would lead to unnecessary code duplication. Moreover, removing the default value here would change the public API, as it would force others who have written custom dialects to change calls to this method.

Member

That is the main reason I request the change. I don't think this implementation is correct for all dialects. Anyway, it's only my opinion.

Contributor Author

@danielvdende danielvdende Dec 22, 2017

I think the default for all dialects should be false, regardless of whether the cascade feature is even supported. Imho, a cascade should only take place if the user explicitly asks for it.

@gatorsmile
Member

Thank you for your contribution! I have doubts about the value of this feature, though.

If you are interested in the JDBC-related work, could you take https://issues.apache.org/jira/browse/SPARK-22731?

Thanks!

@danielvdende
Contributor Author

@gatorsmile could you explain why you have doubts about the feature? Thanks!

@danielvdende
Contributor Author

@gatorsmile would be great to hear why you doubt the value of the feature :). I know that for us it would be extremely valuable (at the moment we have to do an extra step in our data pipeline because this feature is missing in Spark), but of course we're not the only ones using Spark.

Member

This conf only takes effect when SaveMode.Overwrite is enabled and truncate is set to true. My doubt is that the usage scenario of this extra conf does not benefit most users. Could we hold this PR? If other users have similar requests, we can revisit it.

Contributor Author

I think it raises the question of how complete or incomplete the Spark JDBC API should be, and what use cases it should serve. For the simplest cases, in which no key constraints are set between tables, you won't need this option. However, as soon as foreign key constraints are introduced, it becomes very important. I agree that not every piece of functionality from SQL (dialects) should be included, but I personally feel this is quite fundamental functionality.

Moreover, as it's a configuration option, users who don't want it don't have to use it. I think we also discussed this functionality with @dongjoon-hyun in a previous PR: SPARK-22729

Member

Hi, @danielvdende.
I think you can ignore my comment in the previous PR. There were many directional comments on that PR, and mine was not the final one. Your previous PR was merged by @gatorsmile.

For what it's worth, I still don't agree with the default value inside JdbcDialects.scala in this PR.

Contributor Author

@dongjoon-hyun apologies if I misrepresented your comments/opinions from the previous PR; that wasn't my intention :-).

I've given it some more thought, and I can see your point about the default value. I'll make the change we discussed.

Contributor Author

@dongjoon-hyun one thing I'm thinking of (just curious to hear your opinion): could we use the value of isCascadingTruncateTable for each dialect as the default value for the cascade boolean? That way, there would be only a single boolean per dialect specifying the default behaviour with regard to cascading during truncates.
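
For reference, a sketch of the existing flag being proposed as the default (the base signature is as in JdbcDialects.scala; the PostgreSQL value is an assumption based on its TRUNCATE ... ONLY semantics):

```scala
// Base JdbcDialect: cascading behaviour is unknown by default.
def isCascadingTruncateTable(): Option[Boolean] = None

// PostgresDialect: a plain TRUNCATE TABLE ONLY does not cascade.
override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
```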


@danielvdende
Contributor Author

danielvdende commented Jan 3, 2018

I've made some changes to the code; the hardcoded false default value for cascade in JdbcDialects is now replaced by a default value of isCascadingTruncateTable. This is also the case for each individual dialect, with a pattern match in the dialects for Postgres and Oracle, as they support the cascading behaviour. I think this is quite nice, because it now re-uses the isCascadingTruncateTable function, which defines the default truncating behaviour of a dialect. Curious to hear your thoughts on this @dongjoon-hyun @gatorsmile
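
A sketch of the approach described above (the base signature matches the one quoted later in this review; the PostgreSQL query strings are illustrative assumptions):

```scala
// JdbcDialect: the cascade parameter now defaults to the dialect's own
// isCascadingTruncateTable value instead of a hardcoded false.
def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
  s"TRUNCATE TABLE $table"
}

// PostgresDialect: pattern-match on the flag to decide whether to append
// CASCADE; ONLY restricts the truncate to the named table itself.
override def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
  cascade match {
    case Some(true) => s"TRUNCATE TABLE ONLY $table CASCADE"
    case _ => s"TRUNCATE TABLE ONLY $table"
  }
}
```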

@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile any further thoughts on this?

@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile any further update?

@Stephan202 Stephan202 left a comment

We could use this feature 👍.


About the documentation: s/used with case/used with care/


For dialects not supporting cascading, should the documentation perhaps indicate that the cascade parameter is ignored?

@danielvdende
Contributor Author

@Stephan202 thanks for pointing out those docs issues, just pushed the changes :-).
@gatorsmile @dongjoon-hyun would you have a chance to take a look at this again?

@Fokko
Contributor

Fokko commented Feb 5, 2018

Any idea when this will be merged into master? We could use this since we are ditching Sqoop 👍

@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile sorry to keep asking, but could you let me know when we can get this merged?

@Fokko
Contributor

Fokko commented Feb 15, 2018

We're in the process of integrating Spark in Airflow, and support for cascadeTruncate is required to make this succeed. First steps are here: apache/airflow#3021.

Would be great if we can get this merged asap so we can continue testing. Cheers

@gatorsmile
Member

ok to test

@gatorsmile
Member

@danielvdende @Fokko We definitely want to help the community replace Sqoop with Spark SQL. However, truncate is only used when users use SaveMode.Overwrite to write to external JDBC tables. In this specific scenario, Spark will truncate an existing table instead of dropping and recreating it.

Could you show me the key missing features that are available in Sqoop but not in the Spark SQL JDBC connectors?

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87484 has finished for PR 20057 at commit 3a7dda4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

Tests are failing on a Spark streaming test. It's probably because of the age of this PR; I'll rebase to pull in the changes that were merged into master since I opened the PR.

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87493 has finished for PR 20057 at commit 6c0d3df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko
Contributor

Fokko commented Feb 16, 2018

Hi @gatorsmile, thanks for putting it to the test. The main reasons why I personally dislike Sqoop:

  • Legacy. The old MapReduce should be buried in the coming years. As a data engineering consultant, I see more people questioning the whole Hadoop stack. Using Sqoop you still need to run MapReduce tasks, and this isn't easy on other platforms like Kubernetes.
  • Stability. I see Sqoop jobs fail quite often, and there isn't a nice way of retrying them atomically. For example, with a Sqoop job on Airflow, we cannot simply retry the operation: when we import data from an RDBMS to HDFS, we first have to make sure that the target directory of the previous run has been deleted.

This is also where Spark JDBC comes in; for example, in the future we would like to delete single partitions, but this is WIP. Maybe @danielvdende can elaborate a bit on their use case.

@danielvdende
Contributor Author

Hmm, now it fails the OrcQuerySuite. This PR doesn't touch any of the ORC implementation in Spark. Could this be a flaky test @gatorsmile?
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 12 times over 10.158754687999998 seconds. Last failure message: There are 1 possibly leaked file streams..

@danielvdende
Contributor Author

danielvdende commented Feb 16, 2018

Hi guys, @Fokko @gatorsmile, I completely agree with what @Fokko mentioned; our main reason for wanting to get away from Sqoop is also stability, and getting rid of MapReduce in preparation for our move to Kubernetes (or something similar). We've also found Spark to be much faster than Sqoop. In terms of why we need the feature in this PR: we have some tables in PostgreSQL that are linked by foreign keys. We have also specified a schema for these tables. If we use the drop-and-recreate option, Spark will determine the schema, overriding our PostgreSQL schema. Obviously, these should match up, but I personally don't like that Spark can do this (and that you can't explicitly tell it not to).

Because of this behaviour, we currently need two tasks in Airflow (similar to what @Fokko mentioned) to ensure the tables are truncated but the schema stays in place. This PR would enable us to specify, in a single, idempotent (Airflow) task, that we want to truncate the table before putting new data in. The cascade lets us do this without breaking foreign key relations and causing errors.

To be clear, this therefore isn't emulating a Sqoop feature (a Sqoop task isn't idempotent), but is in fact improving on what Sqoop offers.

@gatorsmile
Member

gatorsmile commented Feb 16, 2018

Our overwrite semantics are confusing to most users. We need to correct them in the next release, i.e., Spark 2.4.

Even if we try our best to keep the schema of the original table, the actual CREATE TABLE statements still involve a lot of vendor-specific information. It is hard for us to generate a CREATE TABLE that captures all of it. I can understand your use case for truncate.

I am sorry this will not be part of the Spark 2.3 release. We will include it in the next release. You can still apply the change in your forked Spark.

Just feel free to let us know if you find anything we should do in Spark SQL JDBC to match the corresponding features in Sqoop. Thanks!

@gatorsmile
Member

This test is a flaky test. Your changes did not fail any test case. I will review your PR after the 2.3 release. Thanks again!

cc @dongjoon-hyun Do you want to take a look at this?

@dongjoon-hyun
Member

Thank you for pinging me, @gatorsmile. Yep, I'll take a look at both OrcQuerySuite and this PR this morning. Sorry for the late response, @danielvdende.

@danielvdende
Contributor Author

Thanks guys! @gatorsmile @dongjoon-hyun
Happy to help out expanding Spark SQL JDBC where necessary to match and improve on Sqoop :-)

Member

Maybe we need a default value description, like It defaults to <code>false</code>?

Contributor Author

As mentioned in another comment, I think we should use the value of isCascadingTruncateTable as the default, rather than always false. Seems like the correct use of that variable. I can add a sentence to the docs specifying that.

@danielvdende
Contributor Author

@dongjoon-hyun Made the changes you pointed out, thanks! 👍

@SparkQA

SparkQA commented Feb 21, 2018

Test build #87591 has finished for PR 20057 at commit ed452e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

@dongjoon-hyun I saw that Spark 2.3 was released a few days ago, congrats on the release! :-) Is there anything stopping us from merging this PR into master now?

Member

@dongjoon-hyun dongjoon-hyun left a comment

@gatorsmile. Could you review this PR?

Spark had supported TRUNCATE before TeradataDialect was added. In this PR, it turned out that Teradata doesn't support the TRUNCATE SQL syntax at all. Although this PR introduces a DELETE statement in the TeradataDialect.getTruncateQuery function, I think this feature is helpful in general.
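
A sketch of the Teradata fallback mentioned above (the exact query string is an assumption; Teradata's idiom for emptying a table is an unconditional DELETE):

```scala
// TeradataDialect: Teradata has no TRUNCATE, so fall back to DELETE.
// The cascade flag is accepted for API compatibility but not honoured.
override def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
  s"DELETE FROM $table ALL"
}
```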

Member

Let's remove this redundant empty line addition.

Member

Let's remove line 114.

Also update the tests accordingly.

@SparkQA

SparkQA commented Mar 7, 2018

Test build #88040 has finished for PR 20057 at commit 7e0ff07.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Mar 7, 2018

Test build #88052 has finished for PR 20057 at commit 7e0ff07.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

The failures are:

  • Last time, RateSourceV2Suite.
  • This time, HiveClientSuites.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Mar 8, 2018

Test build #88057 has finished for PR 20057 at commit 7e0ff07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

Gentle ping @gatorsmile :-)

@danielvdende
Contributor Author

Gentle ping again @gatorsmile @dongjoon-hyun :-)

@Fokko
Contributor

Fokko commented Apr 3, 2018

Gentle ping @gatorsmile @dongjoon-hyun 👍

@danielvdende
Contributor Author

@gatorsmile @dongjoon-hyun Guys, any update?

@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile Any update, guys?

@SparkQA

SparkQA commented Jul 11, 2018

Test build #92841 has finished for PR 20057 at commit bc75051.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jul 13, 2018

Test build #92963 has finished for PR 20057 at commit bc75051.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@gatorsmile gatorsmile left a comment

LGTM except for two minor comments.

```scala
@Since("2.4.0")
def getTruncateQuery(
table: String,
cascade: Option[Boolean] = isCascadingTruncateTable): String = {
```
Member

Nit: indent.
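
Presumably the requested fix, assuming Spark's usual four-space continuation indent for parameter lists:

```scala
@Since("2.4.0")
def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
```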

```scala
options.isCascadeTruncate))
} else {
statement.executeUpdate(dialect.getTruncateQuery(options.table))
}
```
Member

+1 to the above comment of @dongjoon-hyun

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jul 20, 2018

Test build #93322 has finished for PR 20057 at commit bc75051.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

Just made the changes you mentioned @gatorsmile :-)

@SparkQA

SparkQA commented Jul 20, 2018

Test build #93327 has finished for PR 20057 at commit a365f79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @gatorsmile and @danielvdende.
LGTM, too.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 2333a34 Jul 20, 2018