
Conversation

@danielvdende
Contributor

@danielvdende danielvdende commented Dec 6, 2017

In order to enable truncate for PostgreSQL databases in Spark JDBC, a change is needed to the query used for truncating a PostgreSQL table. By default, PostgreSQL will automatically truncate any descendant tables if a TRUNCATE query is executed. As this may result in (unwanted) side-effects, the query used for the truncate should be specified separately for PostgreSQL, specifying only to TRUNCATE a single table.

What changes were proposed in this pull request?

Add a getTruncateQuery function to JdbcDialect.scala with a default query. Override this function for PostgreSQL to truncate only a single table. Also set isCascadingTruncateTable to false, as this allows truncates for PostgreSQL.
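A minimal sketch of the shape of this change (simplified for illustration; the actual signatures live in JdbcDialect.scala and PostgresDialect.scala):

```scala
// Sketch only, not the exact Spark source: the default dialect emits a
// plain TRUNCATE TABLE statement, and the PostgreSQL dialect narrows it
// with ONLY so that descendant (inherited) tables are left untouched.
abstract class JdbcDialect {
  def getTruncateQuery(table: String): String = s"TRUNCATE TABLE $table"
  def isCascadingTruncateTable(): Option[Boolean] = None
}

case object PostgresDialect extends JdbcDialect {
  override def getTruncateQuery(table: String): String =
    s"TRUNCATE TABLE ONLY $table"
  override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
}
```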

How was this patch tested?

Existing tests all pass. Added a test for getTruncateQuery.

@danielvdende danielvdende changed the title [SPARK-21098] Correct cascade default for postgres [SPARK-22717] Correct cascade default for postgres Dec 6, 2017
@danielvdende danielvdende changed the title [SPARK-22717] Correct cascade default for postgres [SPARK-22717][SQL] Correct cascade default for postgres Dec 6, 2017
@srowen
Member

srowen commented Dec 6, 2017

I don't disbelieve you, but do you have a reference? Is it version-specific or anything? Just want to be sure. EDIT: I see the link in the JIRA. Yes, it does seem to imply CASCADE is not the default.

@danielvdende
Contributor Author

@srowen Yep, I can add the details from the JIRA to the PR description if you like (just to make this PR easier to read in the future if necessary).

Member

Actually, @gatorsmile made the comment in #14086 that Postgres cascades truncates by default, pointing to the exact same documentation as I do in the JIRA, but referring to the opposite behavior. I still stand by my JIRA and the included proof (see the JIRA). Also, Postgres supports rolling back a truncate.

Sorry, you are the author of that comment. Missed that :)

Member

Thank you for pinging me, @gatorsmile .

Member

@bolkedebruin .

Originally, Spark used DROP TABLE for Overwrite mode. #14086 introduced the TRUNCATE operation when JdbcUtils.isCascadingTruncateTable(url) == Some(false).

So, for PostgresDialect, Spark doesn't use TRUNCATE here.

Contributor Author

@dongjoon-hyun to be clear, I think there are 2 problems:

  1. The PostgresDialect indicates that CASCADE is enabled by default for Postgres. This isn't the case, as the Postgres docs show.
  2. As you correctly mention (this is what I said in my previous comment), Spark doesn't use CASCADE at all, which, especially considering the method this PR edits, is a bit odd I think. I plan to open a different JIRA ticket for this and add it there. This will be more work, and is outside the scope of the current JIRA.

@bolkedebruin bolkedebruin Dec 6, 2017

airflow=# CREATE TABLE products (
airflow(#     product_no integer PRIMARY KEY,
airflow(#     name text,
airflow(#     price numeric
airflow(# );
CREATE TABLE
airflow=#
airflow=# CREATE TABLE orders (
airflow(#     order_id integer PRIMARY KEY,
airflow(#     product_no integer REFERENCES products (product_no),
airflow(#     quantity integer
airflow(# );
CREATE TABLE

airflow=# insert into products VALUES (1, 1, 1);
INSERT 0 1
airflow=# insert into orders VALUES (1,1,1);
INSERT 0 1
airflow=# select * from products;
 product_no | name | price
------------+------+-------
          1 | 1    |     1
(1 row)

airflow=# select * from orders;
 order_id | product_no | quantity
----------+------------+----------
        1 |          1 |        1
(1 row)

airflow=# truncate orders;
TRUNCATE TABLE
airflow=# select * from products;
 product_no | name | price
------------+------+-------
          1 | 1    |     1
(1 row)

airflow=# select * from orders;
 order_id | product_no | quantity
----------+------------+----------
(0 rows)

airflow=# insert into orders VALUES (1,1,1);
INSERT 0 1
airflow=# truncate products;
2017-12-06 20:31:44.146 CET [3708] ERROR:  cannot truncate a table referenced in a foreign key constraint
2017-12-06 20:31:44.146 CET [3708] DETAIL:  Table "orders" references "products".
2017-12-06 20:31:44.146 CET [3708] HINT:  Truncate table "orders" at the same time, or use TRUNCATE ... CASCADE.
2017-12-06 20:31:44.146 CET [3708] STATEMENT:  truncate products;
ERROR:  cannot truncate a table referenced in a foreign key constraint
DETAIL:  Table "orders" references "products".
HINT:  Truncate table "orders" at the same time, or use TRUNCATE ... CASCADE.
airflow=#

Please note that drop/create is an expensive operation. In addition I don't think (imho) spark should ever do a drop/create as it changes the schema.

Member

Thanks for the extended example. The following was the original motivation of #14086. I don't disagree with you on this.

`drop/create` is an expensive operation

I'm not sure if you are agreeing or disagreeing with me now :-), but at least, as just shown, truncate is supported and does not cascade by default on Postgres. Can we conclude that this change seems right for Postgres, which is the only supported database affected, as the others default to a plain truncate?

@SparkQA

SparkQA commented Dec 6, 2017

Test build #4005 has finished for PR 19911 at commit 40bd8ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 6, 2017

@danielvdende, @bolkedebruin , @srowen , @gatorsmile .

PostgreSQL has inheritance between tables. Due to the following side effects, I'm not sure about this PR. Any thoughts?

postgres=# CREATE TABLE parent(a INT);
CREATE TABLE
postgres=# CREATE TABLE child(b INT) INHERITS (parent);
CREATE TABLE
postgres=# INSERT INTO parent VALUES(1);
INSERT 0 1
postgres=# SELECT * FROM parent;
 a
---
 1
(1 row)

postgres=# INSERT INTO child VALUES(2);
INSERT 0 1
postgres=# SELECT * FROM parent;
 a
---
 1
 2
(2 rows)

postgres=# SELECT * FROM child;
 a | b
---+---
 2 |
(1 row)

postgres=# TRUNCATE TABLE parent;
TRUNCATE TABLE
postgres=# SELECT * FROM parent;
 a
---
(0 rows)

postgres=# SELECT * FROM child;
 a | b
---+---
(0 rows)

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 6, 2017

In Spark, the TRUNCATE operation is an additional feature on top of the default DROP operation. PostgreSQL also doesn't allow the DROP operation on the parent table in the above example.

Before allowing the TRUNCATE operation as in this PR, I would prefer to handle the DROP operation first in general.
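For context, the DROP restriction on the inheritance example above looks roughly like this in psql (a sketch; the exact message text may vary by PostgreSQL version):

```sql
-- Sketch: dropping the parent of an inheritance hierarchy fails without
-- CASCADE, similar to the TRUNCATE case discussed above.
CREATE TABLE parent(a INT);
CREATE TABLE child(b INT) INHERITS (parent);

DROP TABLE parent;
-- ERROR:  cannot drop table parent because other objects depend on it
-- DETAIL:  table child depends on table parent
-- HINT:  Use DROP ... CASCADE to drop the dependent objects too.
```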

@bolkedebruin

bolkedebruin commented Dec 6, 2017

This is documented behavior for inherited tables. A select does the exact same thing. Basically, Spark is changing the expected behavior; i.e. when one connects to a Postgres database, this behavior is expected.

I don't think Spark should interfere with how the database should respond. If you use inherited tables, you get cascades, except for 'drop'. Think about it: are we limiting selects by using "ONLY" so as not to have results from inherited tables?
https://www.postgresql.org/docs/9.1/static/ddl-inherit.html

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 6, 2017

  1. Like the other DBMSs, Spark assumes that TRUNCATE can be used only when it doesn't make changes to other tables.

    I don’t think spark should interfere with how the database should respond.

  2. It's documented here, "the table and all its descendant tables (if any) are truncated". So, we marked override def isCascadingTruncateTable(): Option[Boolean] = Some(true) for PostgreSQL. We don't want any side effects in these Spark operations.

@bolkedebruin

I'm not sure I follow. Spark isn't the database here, Postgres is. This is an abstraction on top of JDBC to connect to and import/export from an (R)DBMS. It is the user's choice to use Postgres, and the user chooses it for all kinds of reasons, including behavior that differs from "other DBMSs".

So it isn't "a side effect" at all.

@dongjoon-hyun
Member

When Spark truncates a table A, table B is changed. Worse, Spark doesn't have any idea how many tables will be affected. This is a side effect, isn't it?

@bolkedebruin

No, it isn't, because it is documented to happen and the schema implies it. The same thing will happen from the command line; if I truncate from the command line, I don't get a report of how much was truncated either.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 6, 2017

Please see the previous discussion; I think we are reiterating it.

@bolkedebruin

Maybe we are :-). OK, coming over to your side of the discussion: why not issue TRUNCATE ONLY by default with Postgres? This does not have 'side effects' and will always do the right thing.
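For reference, here is what TRUNCATE ONLY does on the inheritance example above (a sketch, not a captured session):

```sql
-- Sketch: rerunning the inheritance example with ONLY. TRUNCATE ONLY
-- limits the statement to the named table, so the descendant table
-- keeps its rows.
CREATE TABLE parent(a INT);
CREATE TABLE child(b INT) INHERITS (parent);
INSERT INTO parent VALUES (1);
INSERT INTO child VALUES (2);

TRUNCATE ONLY parent;

SELECT * FROM parent;  -- parent's own row is gone; the child's row
                       -- (a = 2) is still visible through inheritance
SELECT * FROM child;   -- still returns its row (2, NULL)
```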

@dongjoon-hyun
Member

+1 for the idea.
Could you make a PR for that? We can discuss it there. Without side effects, why not? :)

@bolkedebruin

Haha sure :-). Or maybe @danielvdende if he is still following

@danielvdende
Contributor Author

Sure, I'll make the changes :). I'll use this PR, which seems to make sense, as it would fix truncate for Postgres.

@dongjoon-hyun
Member

Great! I'm looking forward to your new PR, @danielvdende. I can help you a little bit during review. :)

@danielvdende
Contributor Author

@dongjoon-hyun I made the changes as suggested by @bolkedebruin (so modifying the Postgres dialect to use TRUNCATE ONLY). Let me know what you think :) (and if I missed anything)

Member

@dongjoon-hyun dongjoon-hyun Dec 7, 2017

@danielvdende .
As we see here, this is a new feature. It seems we had better open a new JIRA issue as a New Feature, with a title like Add getTruncateQuery to JdbcDialect.

Contributor Author

So create a new "improvement" JIRA and change the ticket number in this PR? (just checking I understand you correctly)

Member

@dongjoon-hyun dongjoon-hyun Dec 7, 2017

Yep. And, we need to fix the following in the PR description, too.

The PostgresDialect indicates that cascade is true by default for Postgres.
This is not the case. This patch fixes this.

Contributor Author

Done

Member

Indentation? The following will check Scala style violations for you.

$ dev/scalastyle

Contributor Author

Fixed

Member

Could you add a function comment about why we use TRUNCATE ONLY here?

Member

Also, please add a similar explanation to JdbcDialect.scala, too.

Contributor Author

Yes, will do.

Contributor Author

Done

@danielvdende danielvdende force-pushed the SPARK-22717 branch 2 times, most recently from 6391a72 to 63b600d Compare December 7, 2017 13:05
@danielvdende danielvdende changed the title [SPARK-22717][SQL] Correct cascade default for postgres [SPARK-22729][SQL] Add getTruncateQuery to JdbcDialect Dec 7, 2017
@danielvdende
Contributor Author

danielvdende commented Dec 7, 2017

@dongjoon-hyun I was considering this, but wanted to hear your opinion on it first ;-). I'll make those changes. The isCascadingTruncateTable option may be re-introduced if we add a cascade option to Spark JDBC, but for now it isn't necessary indeed.

In order to enable truncate for PostgreSQL databases in Spark JDBC, a change
is needed to the query used for truncating a PostgreSQL table. By default,
PostgreSQL will automatically truncate any descendant tables if a TRUNCATE
query is executed. As this may result in (unwanted) side-effects, the query
used for the truncate should be specified separately for PostgreSQL, specifying
only to TRUNCATE a single table.

This change replaces the isCascadingTruncateTable by a getTruncateQuery method
for each dialect, that needs to be implemented in each dialect.
@danielvdende
Contributor Author

@dongjoon-hyun OK, I made the changes, and also replaced the test that was in place for isCascadingTruncateTable with one for the getTruncateQuery method. Right now, I've left the method in JdbcDialect unimplemented. This will cause a compile-time error if you extend JdbcDialect. Another option would be to do this at runtime, i.e. to implement the method with a throw new UnsupportedOperationException and let each dialect override it. Which of these options do you prefer?
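The two options can be sketched like this (hypothetical minimal forms for illustration, not the actual patch):

```scala
// Option 1: leave the method abstract, so forgetting to implement it in
// a new dialect is a compile-time error.
abstract class JdbcDialectCompileTime {
  def getTruncateQuery(table: String): String
}

// Option 2: provide a default that fails at runtime (the JVM class is
// UnsupportedOperationException; there is no UnimplementedException),
// and let each dialect override it.
abstract class JdbcDialectRunTime {
  def getTruncateQuery(table: String): String =
    throw new UnsupportedOperationException(
      "getTruncateQuery is not implemented for this dialect")
}
```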


override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
override def getTruncateQuery(table: String): String = {
  s"TRUNCATE $table"
}

Why override the default here?

(And all others that are not Postgres?)

Contributor Author

The JdbcDialect object now no longer provides a default implementation for getTruncateQuery (as per @dongjoon-hyun 's suggestion). The reason is that this explicitly forces dialects to implement a truncate operation without side effects.

One thing I don't like about this solution is the code duplication across dialects. But it does prevent dialects from inheriting default behaviour and (accidentally) truncating with side effects.

Gotcha. It is also backward incompatible, i.e. if someone implemented their own dialect, they need to update it (but removing isCasc... does this already), so I assume that is ok.

* Some[false] : TRUNCATE TABLE does not cause cascading.
* None: The behavior of TRUNCATE TABLE is unknown (default).
*/
def isCascadingTruncateTable(): Option[Boolean] = None
Member

We can't drop this, because this is a public API.

Contributor Author

As a middle ground, could it be marked deprecated? Seeing as we don't really need it anymore...

Contributor Author

Although, custom third party dialects may want it.

if (tableExists) {
  mode match {
    case SaveMode.Overwrite =>
      if (options.isTruncate && isCascadingTruncateTable(options.url) == Some(false)) {
Member

We still need to respect it, but we can enhance PostgresDialect to support the JDBC option truncate.

@danielvdende danielvdende force-pushed the SPARK-22717 branch 5 times, most recently from fb5d89d to 3d32e34 Compare December 9, 2017 17:00
@danielvdende
Contributor Author

@dongjoon-hyun @gatorsmile As @gatorsmile pointed out, isCascadingTruncateTable is a method in the public API, so we can't just drop it. I've made changes again; now the truncate query method is defined for all dialects by default, with only PostgresDialect overriding the definition (for the reasons we discussed before). I've also reinstated the test I removed before (to test the isCascadingTruncateTable function). Let me know what you think; once you're OK with the changes, I'll squash the commits into one.
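Roughly, the write path then delegates the statement text to the dialect; a simplified sketch (the stand-in trait and helper name are abridged for illustration, not the exact JdbcUtils source):

```scala
import java.sql.Connection

// Minimal stand-in for the dialect interface; the real one lives in
// JdbcDialect.scala.
trait JdbcDialect {
  def getTruncateQuery(table: String): String
}

object TruncateSketch {
  // Simplified sketch of how the Overwrite path can use the dialect
  // hook: the dialect decides the exact TRUNCATE statement, so the
  // Postgres dialect can return "TRUNCATE TABLE ONLY <table>" while
  // other dialects keep a plain "TRUNCATE TABLE <table>".
  def truncateTable(conn: Connection, dialect: JdbcDialect, table: String): Unit = {
    val statement = conn.createStatement()
    try {
      statement.executeUpdate(dialect.getTruncateQuery(table))
    } finally {
      statement.close()
    }
  }
}
```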

Thanks for helping out :), much appreciated.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 11, 2017

Test build #84702 has finished for PR 19911 at commit 3d32e34.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danielvdende
Contributor Author

danielvdende commented Dec 11, 2017

retest this please

(I probably don't have permissions to do this, so maybe @gatorsmile or @dongjoon-hyun could? :)) I fixed the errors in the tests; all passing locally now.

@dongjoon-hyun
Member

Retest this please.

@gatorsmile
Member

retest this please

@danielvdende
Contributor Author

Forgive my impatience, but is it supposed to take this long? @dongjoon-hyun @gatorsmile

@srowen
Member

srowen commented Dec 12, 2017

@danielvdende tests don't pass yet. I've retriggered them. I don't see a particularly long response cycle here.

@danielvdende
Contributor Author

@srowen Sorry, I should have been clearer; I meant the triggering of the tests (I see they're running now, thanks!).

@SparkQA

SparkQA commented Dec 12, 2017

Test build #4007 has finished for PR 19911 at commit 0b990e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@danielvdende Thanks for your contributions!

LGTM

Merged to master.

@asfgit asfgit closed this in e6dc5f2 Dec 12, 2017
@dongjoon-hyun
Member

Thank you all, and sorry for the distraction I caused.

