
1.8.0 not able to detect existing tables anymore #691

Closed
thijs-nijhuis opened this issue May 30, 2024 · 3 comments · Fixed by #692
Labels
bug Something isn't working

Comments

@thijs-nijhuis
Contributor

Describe the bug

dbt-databricks 1.8.0 no longer seems able to determine whether a table already exists. I have noticed this with both seeds and incremental models.

Seeds
We have a seed in our project called 'country'. Running dbt build --select country succeeds if the country table does not exist yet. But the second time you run it, it fails with this error message:

Runtime Error in seed country (seeds\example\country.csv)
  [TABLE_OR_VIEW_ALREADY_EXISTS] Cannot create table or view `schema_name`.`country` because it already exists.
  Choose a different name, drop or replace the existing object, add the IF NOT EXISTS clause to tolerate pre-existing objects, or add the OR REFRESH clause to refresh the existing streaming table. SQLSTATE: 42P07

When I reinstall dbt-databricks 1.7.14, I can run the command over and over again and the incremental models do create merge statements.
I saw there was a fix for rerunning seeds in 1.8.1, but that doesn't solve it for me. Also, we don't have 'persist_docs' set, nor do we have a description for this seed.

Incremental
After upgrading to 1.8.0, our incremental models always run as a 'create or replace' statement instead of a merge. I see that locally in my target\run folder. I also see it in UC when running 'describe history catalog.schema.table_name', where all recent changes are 'create or replace' instead of 'merge'.

Steps To Reproduce

Add a seed to the project called country.csv. Use dbt_project.yml to set its target catalog and schema (not sure if that is required). Run dbt build --select country or simply dbt seed; this should work. Then run the same command again and it should fail.

For the incremental models, simply create a model, set it to incremental and add this line: {{ log("For model '"~model.name~"' is_incremental() is set to '"~is_incremental()~"'", True) }} . The second time the model is run, 'is_incremental()' should be 'True' but it is not.
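
For reference, a minimal sketch of such a model (the model name, source, and columns are hypothetical; only the log line is taken from the steps above):

```sql
-- models/example_incremental.sql (hypothetical model and source names)
{{ config(materialized='incremental', unique_key='id') }}

{{ log("For model '" ~ model.name ~ "' is_incremental() is set to '" ~ is_incremental() ~ "'", True) }}

select id, updated_at
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- Only reached when dbt detects the existing table; with the bug
  -- described here, this branch is never taken and the model is
  -- fully rebuilt on every run.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```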

Expected behavior

Automated detection if the table already exists.

Screenshots and log output

See error output above.

System information

The output of dbt --version:

Core:
  - installed: 1.8.1
  - latest:    1.8.1 - Up to date!

Plugins:
  - databricks: 1.8.1 - Up to date!
  - spark:      1.8.0 - Up to date!

The operating system you're using:
Windows 11 enterprise
The output of python --version:
Python 3.11.4

Additional context

I tried to debug the dbt-databricks seeds materialization locally. This line yields 'None' for me, where I would expect it to return the relation if the table exists. Oddly enough, the parameters used (database, schema and identifier) all have the correct values.
I see the same thing happening when running from a Databricks workflow. It uses a job cluster so it will get a fresh install of dbt-databricks 1.8.1 on each job run. No other packages installed.

@thijs-nijhuis thijs-nijhuis added the bug Something isn't working label May 30, 2024
@benc-db
Collaborator

benc-db commented May 30, 2024

@thijs-nijhuis-shell can you email me your artifacts? I have not seen any report like this, and I use incremental models with 1.8.x daily. ben.cassell@databricks.com

@thijs-nijhuis
Contributor Author

@benc-db , I think I found the problem.

If I look in dbt.log, I see this query near the top:

select
      table_name,
      if(table_type in ('EXTERNAL', 'MANAGED', 'MANAGED_SHALLOW_CLONE'), 'table', lower(table_type)) as table_type,
      lower(data_source_format) as file_format,
      table_owner
    from `catalog_name`.`information_schema`.`tables`
    where table_schema = 'schema_name'

I assume that query is used to fetch all the existing objects at once. But when I run it myself I get no results, and I think the reason is similar to this discussion: our catalog was renamed at some point.

When I change the query into this, I do see all the objects:

select
      table_name,
      if(table_type in ('EXTERNAL', 'MANAGED', 'MANAGED_SHALLOW_CLONE'), 'table', lower(table_type)) as table_type,
      lower(data_source_format) as file_format,
      table_owner
    from system.`information_schema`.`tables`
    where table_catalog = 'catalog_name' and  table_schema = 'schema_name'

@benc-db
Collaborator

benc-db commented May 31, 2024

Interesting. Would you mind filing a ticket through your company's Databricks contact about this information_schema issue, giving the context on how it's breaking dbt? I can code a fallback like I did for MVs, but it would be more performant if Databricks just fixed the information_schema behavior to work as expected.
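
The fallback mentioned above could look roughly like this Jinja macro sketch (the macro name and control flow are illustrative assumptions, not the actual dbt-databricks implementation): query the catalog-local information_schema first, and only fall back to the system catalog when that comes back empty.

```sql
{% macro list_tables_with_fallback(catalog, schema) %}
  {# Hypothetical sketch, not the real adapter code. #}
  {% set local_sql %}
    select table_name
    from `{{ catalog }}`.`information_schema`.`tables`
    where table_schema = '{{ schema }}'
  {% endset %}
  {% set results = run_query(local_sql) %}
  {% if results.rows | length == 0 %}
    {# A renamed catalog can be missing from its own information_schema;
       the system catalog still lists it under the current name. #}
    {% set fallback_sql %}
      select table_name
      from system.`information_schema`.`tables`
      where table_catalog = '{{ catalog }}'
        and table_schema = '{{ schema }}'
    {% endset %}
    {% set results = run_query(fallback_sql) %}
  {% endif %}
  {{ return(results) }}
{% endmacro %}
```

The tradeoff is that the system-catalog query filters across all catalogs rather than reading one catalog's own view, which is why fixing the information_schema behavior directly would be the more performant outcome.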
