Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to set search_path in Postgresql #1477

Closed
buremba opened this issue May 21, 2019 · 6 comments
Closed

Ability to set search_path in Postgresql #1477

buremba opened this issue May 21, 2019 · 6 comments

Comments

@buremba
Copy link
Contributor

buremba commented May 21, 2019

Issue

Issue description

The schema parameter in Postgresql credentials is not being used.

Results

I was expecting dbt to pass the schema to Postgresql similar to Snowflake and BigQuery. If that's not something that you want, I believe that we should remove that parameter from PostgresqlCredentials.
Also, please note that if this parameter is not set in credentials, we won't be overriding the schema and it will default to the user's search_path. If that's not set, the default is public.

System information

The output of dbt --version:

installed version: 0.13.1

The operating system you're running on: OS X
The python version you're using: Python 3.6.4

Steps to reproduce

We use both Snowflake and Postgresql and use the schema parameter in both adapters. When we execute a simple query such as SELECT count(*) my_table in Snowflake, dbt automatically execute the query in the schema that I defined since it passes schema to Snowflake.
However; that's not the case in Postgresql, SELECT count(*) my_table references public.my_table no matter what I use in schema in my credentials. I believe that's confusing to the analysts because it's not consistent across the databases.

@buremba
Copy link
Contributor Author

buremba commented May 21, 2019

We actually discussed this issue at #1476.
@drewbanin, could you please help me understand how this feature would break people's dbt projects on Postgres/Redshift?

P.S: I'm not trying to be sarcastic. Again, sorry for the misunderstanding. 😬

@drewbanin
Copy link
Contributor

Hi @buremba - thanks for making this issue. Let me outline some of the basics on how dbt works, and how the connection configs differ between Snowflake and Postgres/Redshift:

  1. The schema specified in a profiles.yml target

The schema config specified in a target represents the schema that dbt should build models into. When dbt builds a table or view from a model, it does so using this syntax:

create table "schema_name"."table_name" as (

  select ...

);

The "schema_name" qualifier is plucked directly out of the active target. This config is crucial to the dbt workflow: it allows users to build their models into a local development schema without impacting production runs of dbt. Here's an example Postgres profile:

default:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: drew
      schema: dbt_drew # In development, build models into the dbt_drew schema
      ...
    prod:
      type: postgres
      host: localhost
      user: prod_user
      schema: analytics # In production, build models into the dbt_drew schema
      ...

This notion of repeatable builds in configurable schemas is super core to the dbt workflow. We've written about managing environments in the dbt documentation.

So, the schema configuration in a target is absolutely being used in dbt. It's hard for me to point to a specific place in the codebase, since it's really used pretty much everywhere we create or select from models! You can see where it's defined in the model compilation context and an example of where we refer to it in the table materialization definition.

  1. Search paths

Postgres and Redshift both support search paths. These search paths are used by the database to resolve a schemaless relation name. You provided this example above:

select count(*) from my_table

You're right to identify that on Postgres/Redshift, the database would go looking for a table called public.my_table. You're also right to acknowledge that this differs from the behavior on Snowflake. I'm happy to reconcile this difference by providing a search_path config to the target contract for Postgres and Redshift. I do not think that this config should be provided by default, but I'm ok with allowing users to override it in their profiles.

  1. Why setting the search_path to the value of schema would break some dbt projects on Postgres/Redshift

If a dbt user on Redshift has a model with the following code:

select * from orders

Then this code will be interpreted by Redshift as:

select * from public.orders -- default to the public schema (via search_path)

If we changed the search_path on Redshift to implicitly be the value of the schema, then this code would instead be interpreted by Redshift as:

select * from dbt_dbanin.orders -- assuming "schema: dbt_dbanin"

This breaks the project.

  1. Why are your users selecting from tables without specifying a schema on Snowflake?

dbt is intended to be used with different (dev/prod) environments. Moreover, it's assumed that source data lives in different schemas than the dbt models are built into. If a user is selecting from a relation in a model without specifying a schema, then that means they're selecting from other dbt models, right? These users should absolutely be using the ref function here if they're selecting from dbt models inside of other dbt models.

Do I have the right idea here? This is the most confusing thing to me about the challenges that you're describing.

  1. A PR that I would consider merging

I would merge a PR that did the following things:

  1. Add search_path as a config to the Postgres/Redshift target contract
  2. Only set the search_path for the connection if one is explicitly provided in the target config

@buremba
Copy link
Contributor Author

buremba commented May 21, 2019

Thanks for taking the time to clarify the things @drewbanin!

My understanding was that the credentials in the profile section are for the source part which means that I will be generating my models from the schema that is in the profile. If I don't specify the schema in config dbt will generate the model in the base schema. However; it looks like it's the other way around as you explained and actually the documentation states it clearly. (IMO, it would be better if you can mention earlier in this document.)

It may be because I started using dbt with Snowflake and used the schema of the fact tables for simplicity because I wanted to reference the tables easily in my SQL queries. After a few shots, I found out that I needed to create the models in a different schema. That's how I found out that I can actually use schema in config. We use dbt for the event data and all the raw data is in the same schema. Each user has its own use-case and when they want to model the raw event data, they have their own schema similar to the use-case described here. However; your main use-case looks like raw data is in multiple schemas and dbt users create most of their models in the same schema. Therefore, the credentials defined in the profile is for the target, not source which is kinda different than the BI tools. As a person who comes from the BI world, when I read the about the dbt profiles, I thought that "OK, I'm giving my credentials to dbt because when I run the queries, it will connect to this database & schema". I admit that I was a bit lazy and wanted to figure out the things on my own rather than reading the documentation first. 🙄

When I switched to Postgresql from Snowflake, I followed the same procedure and expected the same outcome. I still believe that dbt should be consistent across the databases but it looks like the main cause of the difference is that both PG and Redshift has a default public schema while Snowflake & BigQuery require you to create the schemas. Therefore, the behavior is different and it's OK in dbt's case because the credentials are for the target, not source.

Do you want me to create a PR that uses search_path in this case? BTW, in my opinion, something like source_schema might be easier to understand.

@drewbanin
Copy link
Contributor

Thanks for the update @buremba - this all makes a lot of sense!

Feel free to re-open that PR (or I can do it for you if you're unable) to contribute this functionality to dbt. My opinion is that the config should be called search_path, as the Postgres connection will be created with:

-c search_path={search path goes here}

I think calling it source_schema might be confusing, as dbt already has a notion of sources, and really, this config will be setting the search_path!

@buremba
Copy link
Contributor Author

buremba commented May 21, 2019

That sounds right, thanks for being so responsive!

@drewbanin drewbanin added this to the Wilt Chamberlain milestone May 28, 2019
@drewbanin drewbanin changed the title Ability to set schema in Postgresql Ability to set search_path in Postgresql May 29, 2019
@drewbanin
Copy link
Contributor

Fixed in #1476

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants