Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid query generated for test with store_failures if file_format not specified #236

Closed
ferdyh opened this issue Oct 19, 2021 · 5 comments
Labels
bug Something isn't working Stale

Comments

@ferdyh
Copy link

ferdyh commented Oct 19, 2021

Describe the bug

When add as test with a specific query and use 'store_failures', then the resulting sql ran against Databricks will result in a duplicate column name 'AS'. The query as defined in the test file:

  select 
    a.*
from {{ ref('table_to_test') }} a
where column1 is null

The resulting query run against Databricks:

CREATE TABLE `tests_schema`.`table_to_test` AS 
SELECT `source_schema`.`table_to_test`.`column1` AS , 
`source_schema`.`table_to_test`.`column1` AS 
 FROM `source_schema`.`table_to_test`
  WHERE (`source_schema`.`table_to_test`.`column1` IS NULL )

Steps To Reproduce

Queries provided

Expected behavior

That the columns get a valid alias, or 'AS' keyword is removed.

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

The output of dbt --version:

installed version: 0.20.0
   latest version: 0.21.0

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - spark: 0.20.0
  - postgres: 0.20.0
  - redshift: 0.20.0
  - bigquery: 0.20.0
  - snowflake: 0.20.0

Tried with 0.21 aswell.

The operating system you're using:

The output of python --version:
Python 3.8.10

Additional context

I have an almost similar query (only as an additional where statement) that works fine.

@ferdyh ferdyh added bug Something isn't working triage labels Oct 19, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Oct 19, 2021

@ferdyh How are you connecting to Databricks? Are you using a SQL endpoint?

The query that dbt-spark should actually be running here looks like:

CREATE TABLE `tests_schema`.`table_to_test` AS 
SELECT a.* 
 FROM `source_schema`.`table_to_test` a
  WHERE column1 is null

You can look in your debug-level logs (logs/dbt.log) to confirm.

I previously noticed (in #233) that either pyodbc or the Databricks endpoint seems to rewrite queries, in a way that sometimes results in invalid syntax (!). On a hunch, I managed to get it working by configuring tests to use the Delta file format:

# dbt_project.yml
tests:
  +file_format: delta
  +store_failures: true

When I did that, the queries weren't rewritten in a way that caused syntax errors.

@jtcohen6 jtcohen6 removed the triage label Oct 19, 2021
@ferdyh
Copy link
Author

ferdyh commented Oct 19, 2021

@jtcohen6 I checked the queries, and indeed that is the exact query logged, but that is not the query that databricks receives. It is however fixed if I add +file_format: delta (or parquet for that matter).

I'm not using SQL Endpoint, but cluster.

@jtcohen6
Copy link
Contributor

or parquet for that matter

Oh, interesting! It sounds like, unless the file format is specified, Databricks rewrites tests in a way that may yield syntax errors.

I think a reasonable fix would be to update the file_format_clause() macro to return parquet as the default format, i.e. if file_format is None. I think that would make a lot of sense, given that we already specify it as the default in SparkConfig.

Is that a change you'd be interested in contributing? Is there any reason we shouldn't include using parquet in all create table statements by default?

@jtcohen6 jtcohen6 changed the title Invalid query generated for test with store_failures Invalid query generated for test with store_failures if file_format not specified Oct 20, 2021
@ferdyh
Copy link
Author

ferdyh commented Oct 20, 2021

Only thing I can think of is that since Databricks 8.3 delta is the default format for tables.

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Stale
Projects
None yet
Development

No branches or pull requests

2 participants