-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generating docs take and long time #1576
Comments
Thanks for the report @whittid4 - we'll check it out! My initial thinking:
|
I have run into a similar problem. I also have DBT setup to read from a BQ database with a lot of date sharded tables. Running In the dbt logs the following is repeated over 50,000 times:
|
Thanks for the additional info @EricLeer! It's pretty clear to me that we need dbt to be smarter about catalog generation with date sharded tables on BQ. I don't think it makes sense for dbt to know about every single individual date shard -- these are probably source tables, and the BQ interface is probably suited for exploring these shards than dbt docs is, at least currently. Presently, dbt is using an API method to fetch all of the tables/views in every dataset that dbt touches. This means that dbt will fundamentally need to pull down every single one of these tables, then maybe distill the date-sharded tables down to a single source table in-memory. Maybe an alternative approach is to use BQ's new-ish information schema? I think with some clever SQL, we can push a lot of this filtering into the BQ layer. This should make I just played around with some code here, what do you guys think of something like this? with base as (
select *,
REGEXP_CONTAINS(table_name, '^.+[0-9]{8}$') as is_date_shard,
REGEXP_EXTRACT(table_name, '^(.+)[0-9]{8}$') as base_name,
REGEXP_EXTRACT(table_name, '^.+([0-9]{8})$') as shard_name
FROM dbt_dbanin.INFORMATION_SCHEMA.TABLES
),
extracted as (
select *,
coalesce(base_name, table_name) as root_table_name
from base
),
unsharded as (
select
table_catalog,
table_schema,
root_table_name as table_name,
row_number() over (partition by root_table_name order by shard_name desc) as shard_index,
min(shard_name) over (partition by root_table_name) as first_shard,
max(shard_name) over (partition by root_table_name) as last_shard,
count(*) over (partition by root_table_name) as num_shards
from extracted
)
select *
from unsharded
where shard_index = 1
order by table_name This query will find all of the tables/views in a given dataset, then squash down date shards into a single "root" record. We can pluck out the individual shard names if necessary, maybe aggregating them into an array in SQL? That would be pretty slick. Curious to hear what you think! cc @whittid4 |
I think this would work. For one of my datasets this would reduce the amount of tables searched from 40000 to 400. On the other hand I think you would still run into the same problem if the tables are sharded in a different way then yyyymmdd date. In principle with querying a wildcard table any pattern is possible and I think it would be best if this behaviour is also supported with this solution. Maybe the correct information of on what to colapse the shards can be gathered from the schema.yml file where the source is defined? For instance I have defined my source as:
and thus I would expect the shards to colapse on Finally what is the reason that dbt fetches all tables/views in a dataset it touches? Wouldn't it make more sense to only fetch the tables/views that are actually defined as a source? |
That's a really good point - I didn't realize that you could shard on strings other than an 8-char date suffix. I agree - generating some sort of glob query from the specified dbt tries to find all of the tables in a given BQ dataset that match up with models and sources defined in the active dbt project. By querying for all of the relations in a dataset at once, dbt can make one query per dataset referenced in the dbt project. If dbt instead queried for each source/model individually, then dbt would need to execute orders of magnitude more queries like this in the general case. Imagine you had 30 source tables defined in a single dataset -- all things considered, I'd rather make one query against that dataset than 30! |
Sorry for the delay in getting back to you. @drewbanin, the example SQL you shared above will work for our case as we are only using the standard data sharding method, but as @EricLeer points out it would be nice to make this solution fit most use cases |
I tried to point to a specific table during doc generation with dbt 0.14.2 but it didn't help. sources:
- name: dataset_1
database: main_database
schema: dataset_1_{{ var("tenant_id") }}
tables:
- name: table_1
identifier: table_1_{{ var("tenant_id") }}_{{ "*" if not var("doc_gen", False) else "20190901" }}
- name: dataset_2
database: main_database
schema: dataset_2
tables:
- name: table_2
identifier: table_2
columns:
- name: table_id |
Hey @mr2dark - I spent some time yesterday pulling together a more completed "catalog" query for BigQuery which leverages the Are you able/interested to help test out a branch of dbt which introduces this logic? |
@drewbanin I'll be glad to help. How can I test that? BTW it looks like my estimation is wrong. It can take 30 minutes and the cases where I didn't receive results for hours were probably caused by power/network state issues when my laptop went to sleep and/or was disconnected from network for a while. I'll try to reproduce that when I'll have some spare time. |
Cool! I'll follow up with a branch name and some more info when I have something to show for myself :) |
I've left a comment in #1795 |
(#1576) use the information schema on BigQuery
Hello there👋 I am facing the same issue when using v1.0.0 I see there was a PR #1795 mentioned just above so I thought it would have been fixed, but it seems not I declared ga_sessions_* in source and we have over 10k tables. It takes around 30min to generate the catalog. |
When running
dbt docs generate
is take over 10 mins to complete and uses lots of CPU resources producing the catalog.When running with debug turned on the following gets written out:
along with:
which is a date sharded table containing hundreds of shards.
The interesting thing is the
cloudaudit_googleapis_com_data_access
is the only one in the dataset that is not used or referenced by DBT. In the same dataset there is 2 other date sharded tables containing over a thousand shards, which are referenced as a source in DBT, but they are not printed outThe text was updated successfully, but these errors were encountered: