
Conversation

@danielvdende
Contributor

@danielvdende danielvdende commented Dec 6, 2017

In order to enable truncate for PostgreSQL databases in Spark JDBC, a change is needed to the query used for truncating a PostgreSQL table. By default, PostgreSQL will automatically truncate any descendant tables if a TRUNCATE query is executed. As this may result in (unwanted) side-effects, the query used for the truncate should be specified separately for PostgreSQL, specifying only to TRUNCATE a single table.

What changes were proposed in this pull request?

Add a getTruncateQuery function to JdbcDialect.scala with a default query. Override this function for PostgreSQL to truncate only a single table. Also set isCascadingTruncateTable to false, as this allows truncates for PostgreSQL.
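A minimal sketch of the shape of this change (simplified for illustration; the actual signatures live in JdbcDialect.scala and PostgresDialect.scala):

```scala
// Sketch only, not the exact Spark source: the default dialect emits a
// plain TRUNCATE TABLE statement, and the PostgreSQL dialect narrows it
// with ONLY so that descendant (inherited) tables are left untouched.
abstract class JdbcDialect {
  def getTruncateQuery(table: String): String = s"TRUNCATE TABLE $table"
  def isCascadingTruncateTable(): Option[Boolean] = None
}

case object PostgresDialect extends JdbcDialect {
  override def getTruncateQuery(table: String): String =
    s"TRUNCATE TABLE ONLY $table"
  override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
}
```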

How was this patch tested?

Existing tests all pass. Added a test for getTruncateQuery.

@danielvdende danielvdende changed the title [SPARK-21098] Correct cascade default for postgres [SPARK-22717] Correct cascade default for postgres Dec 6, 2017
@danielvdende danielvdende changed the title [SPARK-22717] Correct cascade default for postgres [SPARK-22717][SQL] Correct cascade default for postgres Dec 6, 2017
@srowen
Member

srowen commented Dec 6, 2017

I don't disbelieve you, but do you have a reference? Is it version-specific or anything? Just want to be sure. EDIT: I see the link in the JIRA. Yes, it does seem to imply CASCADE is not the default.

@danielvdende
Contributor Author

@srowen Yep, I can add the details from the JIRA to the PR description if you like (just to make this PR easier to read in the future if necessary).

Member

Actually, @gatorsmile made the comment in #14086 that Postgres cascades truncates by default, pointing to the exact same documentation as I do in the JIRA, but referring to the opposite behavior. I still stand by my JIRA and the included proof (see the JIRA). Also, Postgres supports rolling back a truncate.

Sorry, you are the author of that comment. Missed that :)

Member

Thank you for pinging me, @gatorsmile .

Member

@bolkedebruin .

Originally, Spark used DROP TABLE for Overwrite mode. #14086 introduced the TRUNCATE operation when JdbcUtils.isCascadingTruncateTable(url) == Some(false).

So, for PostgresDialect, Spark doesn't use TRUNCATE here.

Contributor Author

@dongjoon-hyun to be clear, I think there are 2 problems:

  1. The PostgresDialect indicates that CASCADE is enabled by default for Postgres. This isn't the case, as the Postgres docs show.
  2. As you correctly mention (this is what I said in my previous comment), Spark doesn't use CASCADE at all, which, especially considering the method this PR edits, is a bit odd I think. I plan to open a different JIRA ticket for this and add it there. This will be more work, and is outside the scope of the current JIRA.

@bolkedebruin bolkedebruin Dec 6, 2017

airflow=# CREATE TABLE products (
airflow(#     product_no integer PRIMARY KEY,
airflow(#     name text,
airflow(#     price numeric
airflow(# );
CREATE TABLE
airflow=#
airflow=# CREATE TABLE orders (
airflow(#     order_id integer PRIMARY KEY,
airflow(#     product_no integer REFERENCES products (product_no),
airflow(#     quantity integer
airflow(# );
CREATE TABLE

airflow=# insert into products VALUES (1, 1, 1);
INSERT 0 1
airflow=# insert into orders VALUES (1,1,1);
INSERT 0 1
airflow=# select * from products;
 product_no | name | price
------------+------+-------
          1 | 1    |     1
(1 row)

airflow=# select * from orders;
 order_id | product_no | quantity
----------+------------+----------
        1 |          1 |        1
(1 row)

airflow=# truncate orders;
TRUNCATE TABLE
airflow=# select * from products;
 product_no | name | price
------------+------+-------
          1 | 1    |     1
(1 row)

airflow=# select * from orders;
 order_id | product_no | quantity
----------+------------+----------
(0 rows)

airflow=# insert into orders VALUES (1,1,1);
INSERT 0 1
airflow=# truncate products;
2017-12-06 20:31:44.146 CET [3708] ERROR:  cannot truncate a table referenced in a foreign key constraint
2017-12-06 20:31:44.146 CET [3708] DETAIL:  Table "orders" references "products".
2017-12-06 20:31:44.146 CET [3708] HINT:  Truncate table "orders" at the same time, or use TRUNCATE ... CASCADE.
2017-12-06 20:31:44.146 CET [3708] STATEMENT:  truncate products;
ERROR:  cannot truncate a table referenced in a foreign key constraint
DETAIL:  Table "orders" references "products".
HINT:  Truncate table "orders" at the same time, or use TRUNCATE ... CASCADE.
airflow=#

Please note that drop/create is an expensive operation. In addition I don't think (imho) spark should ever do a drop/create as it changes the schema.

Member

Thanks for the extended example. The following was the original motivation of #14086. I don't disagree with you on this.

`drop/create` is an expensive operation

I'm not sure if you are agreeing or disagreeing with me now :-), but at least, as just shown, truncate is supported and does not cascade by default on Postgres. Can we conclude that this change seems right for Postgres, which is the only supported database affected, as the others default to a plain truncate?

@SparkQA

SparkQA commented Dec 6, 2017

Test build #4005 has finished for PR 19911 at commit 40bd8ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 6, 2017

@danielvdende, @bolkedebruin , @srowen , @gatorsmile .

PostgreSQL has inheritance between tables. Due to the following side effects, I'm not sure about this PR. Any thoughts?

postgres=# CREATE TABLE parent(a INT);
CREATE TABLE
postgres=# CREATE TABLE child(b INT) INHERITS (parent);
CREATE TABLE
postgres=# INSERT INTO parent VALUES(1);
INSERT 0 1
postgres=# SELECT * FROM parent;
 a
---
 1
(1 row)

postgres=# INSERT INTO child VALUES(2);
INSERT 0 1
postgres=# SELECT * FROM parent;
 a
---
 1
 2
(2 rows)

postgres=# SELECT * FROM child;
 a | b
---+---
 2 |
(1 row)

postgres=# TRUNCATE TABLE parent;
TRUNCATE TABLE
postgres=# SELECT * FROM parent;
 a
---
(0 rows)

postgres=# SELECT * FROM child;
 a | b
---+---
(0 rows)

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 6, 2017

In Spark, the TRUNCATE operation is an additional feature on top of the default DROP operation. PostgreSQL also doesn't allow the DROP operation on the parent table in the above example.

Before allowing the TRUNCATE operation as in this PR, I would prefer to handle the DROP operation first in general.
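For context, the DROP restriction on the inheritance example above looks roughly like this in psql (a sketch; the exact message text may vary by PostgreSQL version):

```sql
-- Sketch: dropping the parent of an inheritance hierarchy fails without
-- CASCADE, similar to the TRUNCATE case discussed above.
CREATE TABLE parent(a INT);
CREATE TABLE child(b INT) INHERITS (parent);

DROP TABLE parent;
-- ERROR:  cannot drop table parent because other objects depend on it
-- DETAIL:  table child depends on table parent
-- HINT:  Use DROP ... CASCADE to drop the dependent objects too.
```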

@bolkedebruin

bolkedebruin commented Dec 6, 2017

This is documented behavior for inherited tables. A select does the exact same thing. Basically, Spark is changing the expected behavior; i.e. when one connects to a Postgres database, this behavior is expected.

I don't think Spark should interfere with how the database should respond. If you use inherited tables, you get cascades, except for 'drop'. Think about it: are we limiting selects by using "ONLY" so as not to have results from inherited tables?
https://www.postgresql.org/docs/9.1/static/ddl-inherit.html

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 6, 2017

  1. Like the other DBMSs, Spark assumes that TRUNCATE can be used only when it doesn't make changes to other tables.

    I don’t think spark should interfere with how the database should respond.

  2. It's documented here, "the table and all its descendant tables (if any) are truncated". So, we marked override def isCascadingTruncateTable(): Option[Boolean] = Some(true) for PostgreSQL. We don't want any side effects in these Spark operations.

@bolkedebruin

I'm not sure I follow. Spark isn't the database here, Postgres is. This is an abstraction on top of JDBC to connect to and import/export from an (R)DBMS. It is the user's choice to use Postgres, and the user chooses it for all kinds of reasons, including behavior that differs from "other DBMSs".

So it isn't "a side effect" at all.

@dongjoon-hyun
Member

When Spark truncates a table A, table B is changed. Worse, Spark doesn't have any idea how many tables will be affected. This is a side effect, isn't it?

@bolkedebruin

No, it isn't, because it is documented to happen and the schema implies it. The same thing will happen from the command line; if I truncate from the command line, I don't get a report of how much was truncated either.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 6, 2017

Please see the previous discussion; I think we are reiterating it.

@bolkedebruin

Maybe we are :-). OK, coming over to your side of the discussion: why not issue TRUNCATE ONLY by default with Postgres? This does not have 'side effects' and will always do the right thing.
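For reference, here is what TRUNCATE ONLY does on the inheritance example above (a sketch, not a captured session):

```sql
-- Sketch: rerunning the inheritance example with ONLY. TRUNCATE ONLY
-- limits the statement to the named table, so the descendant table
-- keeps its rows.
CREATE TABLE parent(a INT);
CREATE TABLE child(b INT) INHERITS (parent);
INSERT INTO parent VALUES (1);
INSERT INTO child VALUES (2);

TRUNCATE ONLY parent;

SELECT * FROM parent;  -- parent's own row is gone; the child's row
                       -- (a = 2) is still visible through inheritance
SELECT * FROM child;   -- still returns its row (2, NULL)
```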

@dongjoon-hyun
Member

+1 for the idea.
Could you make a PR for that? We can discuss it there. Without side effects, why not? :)

@bolkedebruin

Haha sure :-). Or maybe @danielvdende if he is still following

@danielvdende
Contributor Author

Sure, I'll make the changes :). I'll use this PR, which seems to make sense, as it would fix truncate for Postgres.

@dongjoon-hyun
Member

Great! I'm looking forward to your new PR, @danielvdende. I can help you a little bit during review. :)

@danielvdende
Contributor Author

@dongjoon-hyun I made the changes as suggested by @bolkedebruin (so modifying the Postgres dialect to use TRUNCATE ONLY). Let me know what you think :) (and if I missed anything)

Member

@dongjoon-hyun dongjoon-hyun Dec 7, 2017

@danielvdende .
As we see here, this is a new feature. It seems we had better open a new JIRA issue as a New Feature, with a title like Add getTruncateQuery to JdbcDialect.

Contributor Author

So create a new "improvement" JIRA and change the ticket number in this PR? (just checking I understand you correctly)

Member

@dongjoon-hyun dongjoon-hyun Dec 7, 2017

Yep. And, we need to fix the following in the PR description, too.

The PostgresDialect indicates that cascade is true by default for Postgres.
This is not the case. This patch fixes this.

Contributor Author

Done

Member

Indentation? The following will check Scala style violations for you.

$ dev/scalastyle

Contributor Author

Fixed

Member

Could you add a function comment about why we use TRUNCATE ONLY here?

Member

Also, please add a similar explanation to JdbcDialect.scala, too.

Contributor Author

Yes, will do.

Contributor Author

Done

@danielvdende danielvdende force-pushed the SPARK-22717 branch 2 times, most recently from 6391a72 to 63b600d Compare December 7, 2017 13:05
@danielvdende danielvdende changed the title [SPARK-22717][SQL] Correct cascade default for postgres [SPARK-22729][SQL] Add getTruncateQuery to JdbcDialect Dec 7, 2017
@danielvdende
Contributor Author

danielvdende commented Dec 7, 2017

@dongjoon-hyun I was considering this, but wanted to hear your opinion on it first ;-). I'll make those changes. The isCascadingTruncateTable option may be re-introduced if we add a cascade option to Spark JDBC, but for now it isn't necessary indeed.

In order to enable truncate for PostgreSQL databases in Spark JDBC, a change
is needed to the query used for truncating a PostgreSQL table. By default,
PostgreSQL will automatically truncate any descendant tables if a TRUNCATE
query is executed. As this may result in (unwanted) side-effects, the query
used for the truncate should be specified separately for PostgreSQL, specifying
only to TRUNCATE a single table.

This change replaces the isCascadingTruncateTable by a getTruncateQuery method
for each dialect, that needs to be implemented in each dialect.
@danielvdende
Contributor Author

@dongjoon-hyun OK, I made the changes, and also replaced the test that was in place for isCascadingTruncateTable with one for the getTruncateQuery method. Right now, I've left the method in JdbcDialect unimplemented. This will cause a compile-time error if you extend JdbcDialect. Another option would be to do this at runtime, i.e. to implement the method with a throw new UnsupportedOperationException and let each dialect override it. Which of these options do you prefer?
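The two options can be sketched like this (hypothetical minimal forms for illustration, not the actual patch):

```scala
// Option 1: leave the method abstract, so forgetting to implement it in
// a new dialect is a compile-time error.
abstract class JdbcDialectCompileTime {
  def getTruncateQuery(table: String): String
}

// Option 2: provide a default that fails at runtime (the JVM class is
// UnsupportedOperationException; there is no UnimplementedException),
// and let each dialect override it.
abstract class JdbcDialectRunTime {
  def getTruncateQuery(table: String): String =
    throw new UnsupportedOperationException(
      "getTruncateQuery is not implemented for this dialect")
}
```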


override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
override def getTruncateQuery(table: String): String = {
  s"TRUNCATE $table"
}

Why override the default here?

(And all others that are not Postgres?)

Contributor Author

The JdbcDialect object now no longer provides a default implementation for getTruncateQuery (as per @dongjoon-hyun 's suggestion). The reason is that this explicitly forces dialects to implement a truncate operation without side effects.

One thing I don't like about this solution is the code duplication across dialects. But it does prevent dialects from inheriting default behaviour and (accidentally) truncating with side effects.

Gotcha. It is also backward incompatible, i.e. if someone implemented their own dialect, they need to update it (but removing isCasc... does this already), so I assume that is ok.

* Some[false] : TRUNCATE TABLE does not cause cascading.
* None: The behavior of TRUNCATE TABLE is unknown (default).
*/
def isCascadingTruncateTable(): Option[Boolean] = None
Member

We can't drop this, because this is a public API.

Contributor Author

As a middle ground, could it be marked deprecated? Seeing as we don't really need it anymore...

Contributor Author

Although, custom third party dialects may want it.

if (tableExists) {
  mode match {
    case SaveMode.Overwrite =>
      if (options.isTruncate && isCascadingTruncateTable(options.url) == Some(false)) {
Member

We still need to respect it, but we can enhance PostgresDialect to support the JDBC option truncate.

@danielvdende danielvdende force-pushed the SPARK-22717 branch 5 times, most recently from fb5d89d to 3d32e34 Compare December 9, 2017 17:00
@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile As @gatorsmile pointed out, isCascadingTruncateTable is a method in the public API, so we can't just drop it. I've made changes again; now the truncate query method is defined for all dialects by default, with only PostgresDialect overriding the definition (for the reasons we discussed before). I've also reinstated the test I removed before (to test the isCascadingTruncateTable function). Let me know what you think; once you're OK with the changes, I'll squash the commits into one.
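Roughly, the write path then delegates the statement text to the dialect; a simplified sketch (the stand-in trait and helper name are abridged for illustration, not the exact JdbcUtils source):

```scala
import java.sql.Connection

// Minimal stand-in for the dialect interface; the real one lives in
// JdbcDialect.scala.
trait JdbcDialect {
  def getTruncateQuery(table: String): String
}

object TruncateSketch {
  // Simplified sketch of how the Overwrite path can use the dialect
  // hook: the dialect decides the exact TRUNCATE statement, so the
  // Postgres dialect can return "TRUNCATE TABLE ONLY <table>" while
  // other dialects keep a plain "TRUNCATE TABLE <table>".
  def truncateTable(conn: Connection, dialect: JdbcDialect, table: String): Unit = {
    val statement = conn.createStatement()
    try {
      statement.executeUpdate(dialect.getTruncateQuery(table))
    } finally {
      statement.close()
    }
  }
}
```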

Thanks for helping out :), much appreciated.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 11, 2017

Test build #84702 has finished for PR 19911 at commit 3d32e34.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

danielvdende commented Dec 11, 2017

retest this please

(I probably don't have permissions to do this, so maybe @gatorsmile or @dongjoon-hyun could? :)) I fixed the errors in the tests; all passing locally now.

@dongjoon-hyun
Member

Retest this please.

@gatorsmile
Member

retest this please

@danielvdende
Contributor Author

Forgive my impatience, but is it supposed to take this long? @dongjoon-hyun @gatorsmile

@srowen
Member

srowen commented Dec 12, 2017

@danielvdende tests don't pass yet. I've retriggered them. I don't see a particularly long response cycle here.

@danielvdende
Contributor Author

@srowen Sorry, I should have been clearer; I meant the triggering of the tests (I see they're running now, thanks!).

@SparkQA

SparkQA commented Dec 12, 2017

Test build #4007 has finished for PR 19911 at commit 0b990e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@danielvdende Thanks for your contributions!

LGTM

Merged to master.

@asfgit asfgit closed this in e6dc5f2 Dec 12, 2017
@dongjoon-hyun
Member

Thank you all, and sorry for the distraction I caused.

