Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: CTAS on multiple statements #12188

Merged
merged 3 commits into from
Jan 4, 2021
Merged

Conversation

betodealmeida
Copy link
Member

@betodealmeida betodealmeida commented Dec 22, 2020

SUMMARY

The current behavior for running CTAS (create table as select) in SQL Lab is to check for each statement if it's a SELECT, and if so prepend the query with CREATE TABLE $table_name AS $query. For example, if we have this query:

SELECT 1 AS col;

And we run CTAS by passing the table name my_table we get this query:

CREATE TABLE my_table AS
SELECT 1 AS col;

The problem is that for a query with multiple statements we run CTAS for each one. For example:

SELECT 1 AS col;
SELECT 2 AS col;

When run with CTAS and my_table will produce these 2 queries:

CREATE TABLE my_table AS
SELECT 1 AS col;
CREATE TABLE my_table AS
SELECT 2 AS col;

The first query runs successfully, but the second fails because the table already exists.

This PR fixes the CTAS behavior, making it more consistent with how multiple statements work in SQL Lab. It changes SQL Lab so that:

  1. A CTAS query can have multiple statements, as long as the last one is a SELECT. This allows user to pre-process the data before loading it into the table. A simple example in MySQL would be:
SET @value = 42;
SELECT @value AS foo;

When run in a CTAS, this will create a table with the column foo with a row with the value 42, as expected.

In Hive, a common pattern is to write queries like this:

INSERT OVERWRITE staging_table ...;
SELECT * FROM staging_table WHERE ...;

Which should also work.

  1. A CVAS (create view as select) query, on the other hand, can have a single statement and it MUST be a SELECT. Before, there was no differentiation between CTAS and CVAS (other than the query generated).

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A

TEST PLAN

Tested manually, added unit tests.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@codecov-io
Copy link

codecov-io commented Dec 23, 2020

Codecov Report

Merging #12188 (06ebf9c) into master (a52031a) will decrease coverage by 3.10%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #12188      +/-   ##
==========================================
- Coverage   66.61%   63.51%   -3.11%     
==========================================
  Files         994      484     -510     
  Lines       49079    29771   -19308     
  Branches     4982        0    -4982     
==========================================
- Hits        32695    18908   -13787     
+ Misses      16254    10863    -5391     
+ Partials      130        0     -130     
Flag Coverage Δ
cypress ?
javascript ?
python 63.51% <100.00%> (-0.72%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/sql_lab.py 80.37% <100.00%> (+0.75%) ⬆️
superset/sql_parse.py 99.36% <100.00%> (+0.02%) ⬆️
superset/sql_validators/postgres.py 50.00% <0.00%> (-50.00%) ⬇️
superset/views/database/views.py 62.50% <0.00%> (-25.07%) ⬇️
superset/db_engine_specs/mysql.py 79.59% <0.00%> (-12.25%) ⬇️
superset/databases/commands/create.py 83.67% <0.00%> (-8.17%) ⬇️
superset/databases/commands/update.py 85.71% <0.00%> (-8.17%) ⬇️
superset/db_engine_specs/presto.py 74.67% <0.00%> (-8.01%) ⬇️
superset/db_engine_specs/base.py 79.38% <0.00%> (-6.71%) ⬇️
superset/sql_validators/base.py 93.33% <0.00%> (-6.67%) ⬇️
... and 525 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a52031a...06ebf9c. Read the comment docs.

@pull-request-size pull-request-size bot added size/L and removed size/M labels Dec 23, 2020
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for a verbose description, made this very easy to review 👍 This will surely be appreciated by advanced SQL Lab users. Minor non-blocking comment, LGTM!

Comment on lines +335 to +358
if (
query.select_as_cta
and query.ctas_method == CtasMethod.TABLE
and not parsed_query.is_valid_ctas()
):
raise SqlLabException(
_(
"CTAS (create table as select) can only be run with a query where "
"the last statement is a SELECT. Please make sure your query has "
"a SELECT as its last statement. Then, try running your query again."
)
)
if (
query.select_as_cta
and query.ctas_method == CtasMethod.VIEW
and not parsed_query.is_valid_cvas()
):
raise SqlLabException(
_(
"CVAS (create view as select) can only be run with a query with "
"a single SELECT statement. Please make sure your query has only "
"a SELECT statement. Then, try running your query again."
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the error messages imply a deep understanding of how is_valid_ctas() and is_valid_cvas() work, I wonder if these should be moved into ParsedQuery as something like assert_is_valid_ctas() and assert_is_valid_cvas() and have the error messages there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, great point, @villebro! I'll move the exception message closer to the checks.

@betodealmeida betodealmeida merged commit 164db3e into apache:master Jan 4, 2021
@amitmiran137
Copy link
Member

What would be an example use case for using CTAS ?

@mistercrunch
Copy link
Member

What would be an example use case for using CTAS ?

If you mean "why materialize a dataset?", the main reason is to create a summary and/or richer dataset to speed subsequent queries against this snapshot

If you mean "why using the Superset CTAS feature over simply writing your own CREATE TABLE {} AS" statement, it has to do with administrator/policy control. It's possible in Superset to NOT give DML access, but allow people to effectively do this in a controlled way, and even force a target location/naming scheme for the CTAS-generated table

@bkyryliuk
Copy link
Member

+1 to this comment, we are actively leveraging it

What would be an example use case for using CTAS ?

If you mean "why materialize a dataset?", the main reason is to create a summary and/or richer dataset to speed subsequent queries against this snapshot

If you mean "why using the Superset CTAS feature over simply writing your own CREATE TABLE {} AS" statement, it has to do with administrator/policy control. It's possible in Superset to NOT give DML access, but allow people to effectively do this in a controlled way, and even force a target location/naming scheme for the CTAS-generated table

henryyeh pushed a commit to preset-io/superset that referenced this pull request Jan 7, 2021
* WIP

* Add unit tests for sql_parse

* Add unit tests for sql_lab

(cherry picked from commit 164db3e)
@villebro villebro removed the v1.0 label Jan 12, 2021
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.0.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels preset-io size/L 🚢 1.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants