[SPARK-22729][SQL] Add getTruncateQuery to JdbcDialect #19911
76b63a7 to
7f3d935
Compare
7f3d935 to
40bd8ac
Compare
|
I don't disbelieve you, but do you have a reference? Is it version-specific or anything? Just want to be sure. EDIT: I see the link in the JIRA. Yes, it does seem to imply CASCADE is not the default. |
|
@srowen yep, I can add the details from the JIRA to the PR if you like (just to make it easier to read this PR in future if necessary) |
Actually @gatorsmile made the comment in #14086 that Postgres cascades truncates by default. Pointing to the exact same documentation as I do in the JIRA, but then referring to the opposite behavior. I still stand by my Jira and included proof (see Jira). Also Postgres supports a rollback of a truncate.
Sorry, you are the author of that comment. Missed that :)
Thank you for pinging me, @gatorsmile.
Originally, Spark used DROP TABLE for Overwrite mode. #14086 introduced the TRUNCATE operation, which is used only when JdbcUtils.isCascadingTruncateTable(url) == Some(false).
So, for PostgresDialect, Spark doesn't use TRUNCATE here.
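The rule described above can be sketched roughly as follows. This is a simplified illustration, not Spark's actual code: `OverwriteSketch`, `OverwriteAction`, and `chooseOverwriteAction` are hypothetical names, and the only behavior taken from the discussion is that TRUNCATE is chosen solely when the truncate option is set and the dialect reports `Some(false)`.

```scala
object OverwriteSketch {
  // Hypothetical result type: what an overwrite would do to the existing table.
  sealed trait OverwriteAction
  case object TruncateThenInsert extends OverwriteAction
  case object DropCreateThenInsert extends OverwriteAction

  // truncateOption: the user's JDBC "truncate" option.
  // cascading: what the dialect reports; None means the behavior is unknown.
  def chooseOverwriteAction(
      truncateOption: Boolean,
      cascading: Option[Boolean]): OverwriteAction =
    if (truncateOption && cascading.contains(false)) TruncateThenInsert
    else DropCreateThenInsert
}
```

Under this rule a dialect that reports `Some(true)` or `None` always falls back to drop/create, which is why a dialect reporting cascading truncates never reaches the TRUNCATE path.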
@dongjoon-hyun to be clear, I think there are 2 problems:
- The PostgresDialect indicates that CASCADE is enabled by default for Postgres. This isn't the case, as the Postgres docs show.
- As you correctly mention (this is what I meant in my previous comment), Spark doesn't use CASCADE at all, which, especially considering the method this PR edits, is a bit odd I think. I plan to open a different JIRA ticket for this, and add it. This will be more work, and is outside the scope of the current JIRA.
airflow=# CREATE TABLE products (
airflow(# product_no integer PRIMARY KEY,
airflow(# name text,
airflow(# price numeric
airflow(# );
CREATE TABLE
airflow=#
airflow=# CREATE TABLE orders (
airflow(# order_id integer PRIMARY KEY,
airflow(# product_no integer REFERENCES products (product_no),
airflow(# quantity integer
airflow(# );
CREATE TABLE
airflow=# insert into products VALUES (1, 1, 1);
INSERT 0 1
airflow=# insert into orders VALUES (1,1,1);
INSERT 0 1
airflow=# select * from products;
product_no | name | price
------------+------+-------
1 | 1 | 1
(1 row)
airflow=# select * from orders;
order_id | product_no | quantity
----------+------------+----------
1 | 1 | 1
(1 row)
airflow=# truncate orders;
TRUNCATE TABLE
airflow=# select * from products;
product_no | name | price
------------+------+-------
1 | 1 | 1
(1 row)
airflow=# select * from orders;
order_id | product_no | quantity
----------+------------+----------
(0 rows)
airflow=# insert into orders VALUES (1,1,1);
INSERT 0 1
airflow=# truncate products;
2017-12-06 20:31:44.146 CET [3708] ERROR: cannot truncate a table referenced in a foreign key constraint
2017-12-06 20:31:44.146 CET [3708] DETAIL: Table "orders" references "products".
2017-12-06 20:31:44.146 CET [3708] HINT: Truncate table "orders" at the same time, or use TRUNCATE ... CASCADE.
2017-12-06 20:31:44.146 CET [3708] STATEMENT: truncate products;
ERROR: cannot truncate a table referenced in a foreign key constraint
DETAIL: Table "orders" references "products".
HINT: Truncate table "orders" at the same time, or use TRUNCATE ... CASCADE.
airflow=#
Please note that drop/create is an expensive operation. In addition, I don't think (imho) Spark should ever do a drop/create, as it changes the schema.
Thanks for the extended example. The following was the original motivation of #14086. I don't disagree with you on this.
`drop/create` is an expensive operation
I'm not sure if you are agreeing or disagreeing with me now :-), but as just shown, truncate is supported and does not cascade by default on Postgres. Can we conclude that this change seems right for Postgres, which is the only database supported by Spark that is affected (the others already default to truncate)?
|
Test build #4005 has finished for PR 19911 at commit
|
|
@danielvdende, @bolkedebruin, @srowen, @gatorsmile. PostgreSQL has
postgres=# CREATE TABLE parent(a INT);
CREATE TABLE
postgres=# CREATE TABLE child(b INT) INHERITS (parent);
CREATE TABLE
postgres=# INSERT INTO parent VALUES(1);
INSERT 0 1
postgres=# SELECT * FROM parent;
a
---
1
(1 row)
postgres=# INSERT INTO child VALUES(2);
INSERT 0 1
postgres=# SELECT * FROM parent;
a
---
1
2
(2 rows)
postgres=# SELECT * FROM child;
a | b
---+---
2 |
(1 row)
postgres=# TRUNCATE TABLE parent;
TRUNCATE TABLE
postgres=# SELECT * FROM parent;
a
---
(0 rows)
postgres=# SELECT * FROM child;
a | b
---+---
(0 rows) |
|
In Spark, Before allowing TRUNCATE operation like this PR, I prefer to handle |
|
This is documented behavior for inherited tables. A select does the exact same thing. Basically Spark is changing the expected behavior, i.e. when one connects to a Postgres database, this behavior is expected. I don’t think Spark should interfere with how the database should respond. If you use inherited tables you get cascades, except for ‘drop’. Think about it: are we limiting selects by using “only” to not have results from inherited tables? |
|
|
I’m not sure I follow. Spark isn’t the database here, Postgres is. This is an abstraction on top of jdbc to connect and import/export from a (R)DBMS. It is the user’s choice to have Postgres and the user chooses it for all kinds of reasons including a different behavior from “other dbmss”. So it isn’t “a side” effect at all. |
|
When Spark truncates a table |
|
No, it isn’t, because it is documented to happen and the schema implies it. The same thing will happen from the command line. If I truncate from the command line I don’t get a report of how much was truncated either. |
|
Please see the previous discussion. I think we are reiterating the previous discussion. |
|
Maybe we are :-). Ok going to your side of the discussion, why not issue |
|
+1 for the idea. |
|
Haha sure :-). Or maybe @danielvdende if he is still following |
|
Sure, I'll make the changes :). I'll use this PR, seems to make sense, as it would fix truncate for postgres |
|
Great! I'm looking forward to your new PR, @danielvdende. I can help you a little bit during review. :) |
40bd8ac to
6df3734
Compare
|
@dongjoon-hyun I made the changes as suggested by @bolkedebruin (so modifying the Postgres dialect to use |
@danielvdende .
As we see here, this is a new feature. It seems that we had better create a new JIRA issue as a New Feature, with a title like "Add getTruncateQuery to JdbcDialect".
So create a new "improvement" JIRA and change the ticket number in this PR? (just checking I understand you correctly)
Yep. And, we need to fix the following in the PR description, too.
The PostgresDialect indicates that cascade is true by default for Postgres.
This is not the case. This patch fixes this.
Done
Indentation? The following will check Scala style violations for you.
$ dev/scalastyle
Fixed
Could you add a function comment about why we use TRUNCATE ONLY here?
Also, please add a similar explanation to JdbcDialect.scala, too.
yes will do
Done
6391a72 to
63b600d
Compare
|
@dongjoon-hyun I was considering this. Wanted to hear your opinion on it first ;-). I'll make those changes. the |
In order to enable truncate for PostgreSQL databases in Spark JDBC, a change is needed to the query used for truncating a PostgreSQL table. By default, PostgreSQL will automatically truncate any descendant tables if a TRUNCATE query is executed. As this may result in (unwanted) side-effects, the query used for the truncate should be specified separately for PostgreSQL, so that it truncates only a single table. This change replaces isCascadingTruncateTable with a getTruncateQuery method that each dialect must implement.
63b600d to
708a320
Compare
|
@dongjoon-hyun ok made the changes, also replaced the test that was in place for |
|
|
override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
override def getTruncateQuery(table: String): String = {
  s"TRUNCATE $table"
Why override the default here?
(And all others that are not Postgres?)
The JdbcDialect object now no longer provides a default implementation for getTruncateQuery (as per @dongjoon-hyun 's suggestion). The reason is that this explicitly forces dialects to implement a truncate operation without side effects.
One thing I don't like about this solution is the code duplication across dialects. But it does prevent dialects from inheriting default behaviour and (accidentally) truncating with side effects.
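To illustrate the trade-off described in this comment, here is a minimal sketch of a dialect trait with no default truncate query. `StrictDialect` and the two implementing objects are hypothetical names for illustration only; the `TRUNCATE $table` string is taken from the diff hunk under review.

```scala
object AbstractTruncateSketch {
  // Hypothetical trait: getTruncateQuery has no body, so any dialect
  // that does not implement it fails to compile.
  trait StrictDialect {
    def getTruncateQuery(table: String): String
  }

  // Each dialect repeats the same query (the duplication mentioned above).
  object FirstDialect extends StrictDialect {
    override def getTruncateQuery(table: String): String = s"TRUNCATE $table"
  }

  object SecondDialect extends StrictDialect {
    override def getTruncateQuery(table: String): String = s"TRUNCATE $table"
  }
}
```

The cost is visible in the sketch: two identical bodies, in exchange for a compile-time guarantee that no dialect silently inherits a query it never reviewed.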
Gotcha. It is also backward incompatible, i.e. if someone implemented their own dialect, they need to update it (but removing the isCasc... does this already, so I assume that is OK).
 * Some[false] : TRUNCATE TABLE does not cause cascading.
 * None: The behavior of TRUNCATE TABLE is unknown (default).
 */
def isCascadingTruncateTable(): Option[Boolean] = None
We can't drop this, because this is a public API.
As a middle ground, could it be marked deprecated? Seeing as we don't really need it anymore...
Although, custom third party dialects may want it.
if (tableExists) {
  mode match {
    case SaveMode.Overwrite =>
      if (options.isTruncate && isCascadingTruncateTable(options.url) == Some(false)) {
We still need to respect it, but we can enhance PostgresDialect to support the JDBC option truncate
fb5d89d to
3d32e34
Compare
|
@dongjoon-hyun @gatorsmile As @gatorsmile pointed out, the Thanks for helping out :), much appreciated. |
|
retest this please |
|
Test build #84702 has finished for PR 19911 at commit
|
3d32e34 to
0b990e1
Compare
|
retest this please (I probably don't have permissions to do this, maybe @gatorsmile or @dongjoon-hyun could? :). I fixed the errors in the tests, all passing locally now. |
|
Retest this please. |
|
retest this please |
|
Forgive my impatience, but is it supposed to take this long? @dongjoon-hyun @gatorsmile |
|
@danielvdende the tests don't pass yet. I've retriggered them. I don't see a particularly long response cycle here. |
|
@srowen Sorry, should have been clearer, I meant the triggering of the tests (I see they're running now, thanks! ) |
|
Test build #4007 has finished for PR 19911 at commit
|
|
@danielvdende Thanks for your contributions! LGTM Merged to master. |
|
Thank you all and sorry for the distraction due to me. |
In order to enable truncate for PostgreSQL databases in Spark JDBC, a change is needed to the query used for truncating a PostgreSQL table. By default, PostgreSQL will automatically truncate any descendant tables if a TRUNCATE query is executed. As this may result in (unwanted) side-effects, the query used for the truncate should be specified separately for PostgreSQL, specifying only to TRUNCATE a single table.
What changes were proposed in this pull request?
Add getTruncateQuery function to JdbcDialect.scala, with a default query. Override this function for PostgreSQL to only truncate a single table. Also set isCascadingTruncateTable to false, as this will allow truncates for PostgreSQL.
How was this patch tested?
Existing tests all pass. Added a test for getTruncateQuery.
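The shape of the change described above might look roughly like the sketch below. It is a standalone illustration rather than the merged Spark source: `SketchDialect`, `GenericSketchDialect`, and `PostgresSketchDialect` are hypothetical names, and the exact default query string is an assumption. The `ONLY` keyword is standard PostgreSQL syntax that restricts TRUNCATE to the named table and leaves descendant tables alone.

```scala
object TruncateQuerySketch {
  // Hypothetical stand-in for a JDBC dialect base type.
  trait SketchDialect {
    // None: the cascading behavior of TRUNCATE is unknown (default).
    def isCascadingTruncateTable(): Option[Boolean] = None

    // Assumed default query; dialects may override it.
    def getTruncateQuery(table: String): String = s"TRUNCATE TABLE $table"
  }

  // A dialect that keeps the defaults.
  object GenericSketchDialect extends SketchDialect

  // PostgreSQL-style dialect: no cascading, and only the named table is truncated.
  object PostgresSketchDialect extends SketchDialect {
    override def isCascadingTruncateTable(): Option[Boolean] = Some(false)

    // ONLY keeps the truncate from reaching tables that inherit from `table`.
    override def getTruncateQuery(table: String): String =
      s"TRUNCATE TABLE ONLY $table"
  }
}
```

With this layout third-party dialects keep working unchanged, while PostgreSQL opts in to truncation with a query that avoids the inherited-table side effect shown earlier in the thread.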