-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CT-3012] [Bug] dbt docs generate takes a lot of time #8452
Comments
Thanks for reaching out @SamuelBohumel ! It is a known case that catalog generation time is outsized in comparison to the actual number of dbt models in cases like these:
One option for you
There is! I haven't tried it myself, but you can do something like the following:
Filter on dbt objects onlyHere are the key additions to include that filter within 2a3,25
> {# Here is where we build our list of relations of interest #}
> {%- set relations_in_project = [] -%}
>
> {%- for node in graph.nodes.values() -%}
> {%- if node.resource_type == 'model' -%}
> {%- do relations_in_project.append(node.alias) -%}
> {%- endif -%}
> {%- endfor -%}
> {%- for source in graph.sources.values() -%}
> {%- do relations_in_project.append(source.name) -%}
> {%- endfor -%}
>
> {# Make the list unique #}
> {%- set relations_in_project = set(relations_in_project) | list -%}
35a59,64
> {# Here is where we filter the relations so we dont simply fetch all of them #}
> {%- if relations_in_project | length > 0 %}
> and coalesce(REGEXP_EXTRACT(table_id, '^(.+)_{1}[0-9]{8}$'), table_id) in (
> {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
> )
> {% endif -%} Note that Postgres/Greenplum, you don't need to worry about date-shared tables, so I don't think you would need the Here is the full diff2a3,25
> {#
> /*
> This macro changes the catalog query to filter on dbt objects only and not all objects in a dataset.
> The regex is very sensitive to the date sharding pattern of your tables so you may want to
> change the regex of '^.+[0-9]{8}$' to '^.+[0-9]{6}$' if say your tables are shareded by YYYYMM
> instead of YYYYMMDD.
> */
> #}
>
> {# Here is where we build our list of relations of interest #}
> {%- set relations_in_project = [] -%}
>
> {%- for node in graph.nodes.values() -%}
> {%- if node.resource_type == 'model' -%}
> {%- do relations_in_project.append(node.alias) -%}
> {%- endif -%}
> {%- endfor -%}
> {%- for source in graph.sources.values() -%}
> {%- do relations_in_project.append(source.name) -%}
> {%- endfor -%}
>
> {# Make the list unique #}
> {%- set relations_in_project = set(relations_in_project) | list -%}
26c49
< REGEXP_CONTAINS(table_id, '^.+[0-9]{8}$') and coalesce(type, 0) = 1 as is_date_shard,
---
> REGEXP_CONTAINS(table_id, '^.+[0-9]{8}$') and coalesce(type, 0) in (1, 2) as is_date_shard,
35a59,64
> {# Here is where we filter the relations so we dont simply fetch all of them #}
> {%- if relations_in_project | length > 0 %}
> and coalesce(REGEXP_EXTRACT(table_id, '^(.+)_{1}[0-9]{8}$'), table_id) in (
> {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
> )
> {% endif -%}
207c236,240
< {{ return(run_query(query)) }}
---
> {%- do log(query) -%}
> {%- set results = run_query(query) -%}
> {%- do log(schemas ~ ' - rows returned: ' ~ results | length, True) -%}
>
> {{ return(results) }} Another optionIf you have any control over which schemas the dbt dbt sources and models live in, then isolating them into dedicated schemas should also help. |
Thanks for your quick response. We have seen these solutions when looking for an issue with the performance of the catalog. Unfortunately, this does not solve our use case, as we are trying to enable users to use all filtering functionality (tags, lineage, etc.). In our environment, we have been given little control over schema. We are working on one of the largest Greenplum database implementations, so there are thousands of users, and schemas are dedicated for security management rather than project separation. You are correct, separating by schema would work as catalog uses this filter condition, but we would propose something like adding a flag to We believe this could be useful in other implementations too, whenever there are more complex environments, and would also improve scalability. |
Thanks for that additional info @SamuelBohumel
After discussing this with @graciegoheen, we think what you're asking for will be covered by #4997 (which we've started working on) combined with #6014, so I'm going to close this issue in favor of those two. If you have any further insights, we'd appreciate your comments on either of both of those. Either way, you may want to subscribe to those so you can get notified of any updates. Workaround in the meantimeIn the meantime, you might be able to use the See below for a quick proof-of-concept (that I'll leave to you optimize as-desired). At least two things to note in this implementation that could be improved:
Proof-of-conceptFull customized version for postgres / greenplum: Click to expand{% macro default__get_catalog(information_schema, schemas) -%}
{# Here is where we build our list of relations of interest #}
{%- set relations_in_project = [] -%}
{%- for node in graph.nodes.values() -%}
{%- if node.resource_type == 'model' -%}
{%- if node.unique_id in selected_resources -%}
{%- do relations_in_project.append(node.alias) -%}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- for source in graph.sources.values() -%}
{%- do relations_in_project.append(source.name) -%}
{%- endfor -%}
{# Make the list unique #}
{%- set relations_in_project = set(relations_in_project) | list -%}
{%- call statement('catalog', fetch_result=True) -%}
{#
If the user has multiple databases set and the first one is wrong, this will fail.
But we will not fail in the case where there are multiple quoting-difference-only dbs, which is better.
#}
{% set database = information_schema.database %}
{{ adapter.verify_database(database) }}
select
'{{ database }}' as table_database,
sch.nspname as table_schema,
tbl.relname as table_name,
case tbl.relkind
when 'v' then 'VIEW'
else 'BASE TABLE'
end as table_type,
tbl_desc.description as table_comment,
col.attname as column_name,
col.attnum as column_index,
pg_catalog.format_type(col.atttypid, col.atttypmod) as column_type,
col_desc.description as column_comment,
pg_get_userbyid(tbl.relowner) as table_owner
from pg_catalog.pg_namespace sch
join pg_catalog.pg_class tbl on tbl.relnamespace = sch.oid
join pg_catalog.pg_attribute col on col.attrelid = tbl.oid
left outer join pg_catalog.pg_description tbl_desc on (tbl_desc.objoid = tbl.oid and tbl_desc.objsubid = 0)
left outer join pg_catalog.pg_description col_desc on (col_desc.objoid = tbl.oid and col_desc.objsubid = col.attnum)
where (
{%- for schema in schemas -%}
upper(sch.nspname) = upper('{{ schema }}'){%- if not loop.last %} or {% endif -%}
{%- endfor -%}
)
{# Here is where we filter the relations so we do not simply fetch all of them #}
{%- if relations_in_project | length > 0 %}
and tbl.relname in (
{%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
)
{%- endif %}
and not pg_is_other_temp_schema(sch.oid) -- not a temporary schema belonging to another session
and tbl.relpersistence in ('p', 'u') -- [p]ermanent table or [u]nlogged table. Exclude [t]emporary tables
and tbl.relkind in ('r', 'v', 'f', 'p') -- o[r]dinary table, [v]iew, [f]oreign table, [p]artitioned table. Other values are [i]ndex, [S]equence, [c]omposite type, [t]OAST table, [m]aterialized view
and col.attnum > 0 -- negative numbers are used for system columns such as oid
and not col.attisdropped -- column as not been dropped
order by
sch.nspname,
tbl.relname,
col.attnum
{%- endcall -%}
{{ return(load_result('catalog').table) }}
{%- endmacro %} Diff of customized version vs. the default implementation: Click to expand3a4,20
> {# Here is where we build our list of relations of interest #}
> {%- set relations_in_project = [] -%}
>
> {%- for node in graph.nodes.values() -%}
> {%- if node.resource_type == 'model' -%}
> {%- if node.unique_id in selected_resources -%}
> {%- do relations_in_project.append(node.alias) -%}
> {%- endif -%}
> {%- endif -%}
> {%- endfor -%}
> {%- for source in graph.sources.values() -%}
> {%- do relations_in_project.append(source.name) -%}
> {%- endfor -%}
>
> {# Make the list unique #}
> {%- set relations_in_project = set(relations_in_project) | list -%}
>
37a55,62
>
> {# Here is where we filter the relations so we do not simply fetch all of them #}
> {%- if relations_in_project | length > 0 %}
> and tbl.relname in (
> {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
> )
> {%- endif %}
> Let us know it goes if you try it out! |
Hi, |
See below for a version of It also includes a stand-alone Overall, the implementation still has this remaining issue though:
|
For anyone following here, #8772 allows for the usage of |
Is this a new bug in dbt-core?
Current Behavior
Hi there,
I have seen issue #1576, but I think it still needs attention.
I am using dbt to transform data in Greenplum database. When generating catalog with command
dbt docs generate
, the process sometimes takes around 10 min.After analysis of a query that is executed against database, I found out that there is no condition specifying tables that are needed and although I am working with ~20 tables, resulting set had 3.3 million records.
Is there a way to optimize the query that gets the metadata to be more efficient, or modify WHERE clause to decrease the size of result?.
Thanks.
Expected Behavior
I would expect the
dbt docs generate
command to finish generating the catalog within a minute as it is planned for production usage and a lot of pipelines.Steps To Reproduce
standard dbt project with 20 tables.
run
dbt docs generate
.Have around 50 000 tables in database.
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
No response
Additional Context
No response
The text was updated successfully, but these errors were encountered: