[CT-171] [CT-60] [Bug] It takes too long to generate docs (BigQuery) #115
@chaos87 Sorry for the delay here! I'm going to transfer this to the dbt-bigquery repository. Given the change we made way back in dbt-labs/dbt-core#1795 — which seems like it should have fixed this, but maybe we missed a spot? — and Drew's note there about the syntax he tapped into, there might also be some interesting overlap with #113...
Thanks! I'll take a look and see if I can come up with a fix for this on the macro in catalog.sql.
We also observe this; our catalog on a BigQuery-only project takes between 10 and 20 minutes to generate.
@chaos87 @muscovitebob Ok! Could I ask both of you to try running the compiled catalog queries manually?
I'd be interested to know which parts are slowest, based on what you can find in the query planner / your experimentation. If I had to guess, it might be these aggregates to get summary stats for entire tables (aggregating across all shards): see dbt/include/bigquery/macros/catalog.sql, lines 59 to 66 at commit 87095c4.
Thanks in advance for your help!
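A minimal sketch of that experiment with the google-cloud-bigquery Python client (catalog_sql is a placeholder for a compiled catalog query copied from logs/dbt.log, and application-default credentials are assumed):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials
catalog_sql = "select 1  -- paste a compiled catalog query from logs/dbt.log here"

job = client.query(catalog_sql)
job.result()  # block until the query finishes

# Job-level stats, then per-stage plan entries, to see where BigQuery
# itself spends its time.
print("bytes processed:", job.total_bytes_processed)
print("slot ms:", job.slot_millis)
for stage in job.query_plan:
    print(stage.name, "records read:", stage.records_read)
```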
The …
Ah, ok - that's really helpful! Any chance you could also estimate:
- how many rows the catalog query returns, and the total size of that result set?
- how much memory the dbt process uses while generating the catalog?
This might be a …
In my case the query returns 22,144 rows with a total result size of 12.18 MB. The subprocess spawned by the generate command grows to about 3 GB of memory consumption and stays there for the whole run on my machine.
Also, it's worth mentioning that even when I use a single thread for dbt, the log still shows multiple compiled SQL catalog statements one after another before going silent.
My slowdown occurs somewhere in this line, because if I put exception statements before it, dbt crashes quite quickly. I'm finding it harder to modify the insides of dbt-bigquery, however, so I haven't been able to narrow it down further.
I realised that …
Yeah, alright. I did not realise that it launches multiple different catalog queries, one per dataset. I was under the impression all the queries in my log were the same, but this is clearly not the case. One of the queries, for example, creates a result set of 524,236 rows, which is 256 MB. Now, this executes very quickly in the web interface, but what the web interface doesn't do is download the whole result set to you. I know from experience the BigQuery API can be very slow to return lots of rows, so I suspect that is where all the latency comes from: the BigQuery API takes a lot of time to return the results.
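One way to test that hypothesis (a sketch, not dbt's own code; catalog_sql is again a placeholder for a compiled catalog query) is to time query execution and row download separately:

```python
import time

from google.cloud import bigquery

client = bigquery.Client()
catalog_sql = "select 1  -- paste one of the compiled catalog queries here"

t0 = time.perf_counter()
rows = client.query(catalog_sql).result()  # waits for query execution
t1 = time.perf_counter()
n = sum(1 for _ in rows)  # forces the client to page every row over the API
t2 = time.perf_counter()

print(f"execute: {t1 - t0:.1f}s, download {n} rows: {t2 - t1:.1f}s")
```

If the second number dominates, the bottleneck is the result download rather than the query itself.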
@chaos87 would you be able to run each individual catalog query in the BigQuery console?
Hey, thanks for the pointers. I have found the queries in the logs. The queries take around 15 sec to run in the BQ console and produce ~440K rows each (20 MB each).
The bytes size of your result is smaller than mine, but as I mentioned, the BigQuery API seems to mostly suffer from large numbers of rows regardless of how large in bytes the actual result is. It seems there are two possible solutions to this:
- modify the code to pass the relevant table names into the catalog queries, so they return far fewer rows
- use the BigQuery Storage API, which is much faster at returning large result sets
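For the second of those, a sketch of what the Storage API path looks like with the Python client (assumes the optional google-cloud-bigquery-storage and pandas packages are installed; this is not what dbt itself does today):

```python
from google.cloud import bigquery

client = bigquery.Client()
catalog_sql = "select 1  -- placeholder for a compiled catalog query"

# With create_bqstorage_client=True, to_dataframe() downloads result pages
# in parallel over the Storage Read API instead of paging through the
# slower row-based REST endpoint.
df = client.query(catalog_sql).result().to_dataframe(create_bqstorage_client=True)
print(len(df), "rows downloaded")
```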
Thanks both for weighing in! That's a sizeable chunk of (meta)data. I'm not surprised to hear that dbt's processes are slower at that scale, though it's good to know that the bottleneck here may be the BigQuery client's API. The current query is filtered to just the datasets that dbt is interested in, but I take it that you might have either/both of:
- source datasets with many thousands of tables, of which a small subset are actually referenced by dbt models
- …
If you have any control over those, such as by isolating dbt sources and models into dedicated datasets, that would help significantly. Otherwise, we could try filtering further in the query, down to just the objects (tables/views) that map to sources/models/etc. in your dbt project. To do that, we'd need to rework the catalog macro.
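A sketch of what that filtering could look like at the SQL level (the relation list and the project/dataset names are hypothetical; in dbt itself the identifiers would come from the project manifest rather than a hard-coded list):

```python
from google.cloud import bigquery

# Hypothetical list of relations dbt actually tracks; a real implementation
# would pull these from the manifest.
relations = ["orders", "customers", "payments"]
in_list = ", ".join(f"'{name}'" for name in relations)

catalog_sql = f"""
    select table_catalog, table_schema, table_name, table_type
    from `my-project`.`my_dataset`.INFORMATION_SCHEMA.TABLES
    where table_name in ({in_list})
"""

client = bigquery.Client()
for row in client.query(catalog_sql).result():
    print(row.table_name, row.table_type)
```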
The datasets I ingest have many tables, and most of them are indeed untracked by dbt (they are Fivetran connector datasets that ingest everything in the databases they track), so you're right on the money. The filtering proposal sounds good; we can fall back to a full download if it exceeds whatever max number of identifiers we settle on. I'll take a look at how to do this.
"source datasets with many thousands of tables, of which a small subset are actually referenced by dbt models" yes this is the situation I'm in |
I haven't made much progress with this because it seems I am not able to debug across the dbt-bigquery / dbt-core package boundary - namely, I cannot step from dbt-bigquery into code in dbt-core.
@jtcohen6 I am wondering if we actually need to do the code modifications to pass the table names, instead of going with the Storage API? I noticed that in the API docs it looks like the Storage API is now plug-and-play, and it even states: …
I see the …
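One quick thing worth checking in the environment where dbt runs (a sketch; the Python client only takes the plug-and-play Storage API fast path when the optional package is importable):

```python
import importlib.util

# google-cloud-bigquery falls back to the slower REST download when the
# optional google-cloud-bigquery-storage package is missing.
spec = importlib.util.find_spec("google.cloud.bigquery_storage")
print("Storage Read API available:", spec is not None)
```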
Hey, sorry for the long silence. I have found a workaround to solve the slow-to-build catalog on my side: if we make … (i.e. …), then the output of the … But with that said, none of the views were declared as sources or models, so the true problem is still that dbt scans objects that are outside the realm of the dbt project, as @jtcohen6 mentioned.
@chaos87 @muscovitebob @jtcohen6 thanks for your detailed investigation! This has been a helpful resource as we troubleshoot a similar issue on our end (an error when generating BigQuery docs in dbt Cloud). We've been hitting a memory limit error even after the kind folks at dbt Labs helped to increase our dbt Cloud memory. Here are some findings in case they can help draw attention to this open issue. We still do not have docs generating at the moment...
Hope these takeaways can help... and open to any ideas for continuing to improve the BigQuery doc generation process. Thank you!
UPDATE: was able to get docs to generate and open properly on dbt Cloud (without making any changes to the …)
Hey @kevinhoe, it would be good context if you put the total number of tables/views in the sources that you're using in your dbt project, as you told me. Ftr, when you pop in your own custom …
Thanks for the context @jeremyyeo! Yes, your description is spot on for why docs are not serving properly online. I was unaware of the config in Profile Settings where one selects the job for the UI to point to in order to serve docs: https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/artifacts

I tested a few different scenarios, but alas, I do not think we are able to utilize the built-in feature for the UI to serve docs in dbt Cloud. More specifically, I tried checking the box for "Generate Docs" in addition to including command steps for … So for the time being, we are left using the hack-y workaround of replacing the default catalog macro.
@kevinhoe: added a full macro override, https://gist.github.com/jeremyyeo/f83ca852510956ba3f2f96aa079c43d5, that limits the catalog query to only the models/sources that are relevant to our project.
Wow! Thank you @jeremyyeo! Not sure what's more impressive - your jinja expertise or the extremely thorough documentation 😄 I'm excited to try this out and will report back on how it may improve our docs generation, thanks! Also, I'm sharing a link you sent me in case others might find it helpful too: an issue that someone else opened regarding long compilation times for dbt-bigquery, #205. On the surface, that might seem like a separate issue, but my interpretation of the official dbt documentation is that the dbt docs generate command compiles the project in addition to querying the warehouse for catalog metadata. So to summarize, when we run the generic command dbt docs generate, …
Very much appreciate your time and expertise here, Jeremy!
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Is there an existing issue for this?
Current Behavior
Hi there,
I have seen a similar bug here (dbt-labs/dbt-core#1576) and that issue is closed, but it still takes a long time for me to generate docs, because (I believe) of the sharded tables in BQ (the ga_sessions_* from GA360).
Right now it takes around 30 min, and we have a little more than 10K tables.
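For context on that scale, a sketch for counting how many physical shards collapse into each logical table (the project and dataset names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
# __TABLES__ lists one row per physical table; the regex collapses
# date-sharded names like ga_sessions_20220101 into one logical name.
sql = """
    select
      coalesce(regexp_extract(table_id, r'^(.+)_[0-9]{8}$'), table_id) as logical_name,
      count(*) as n_shards
    from `my-project.my_dataset.__TABLES__`
    group by 1
    order by n_shards desc
"""
for row in client.query(sql).result():
    print(row.logical_name, row.n_shards)
```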
Expected Behavior
I would expect the dbt docs generate command to finish generating the catalog within a minute.
Steps To Reproduce
No response
Relevant log output
No response
Environment
What database are you using dbt with?
bigquery
Additional Context
No response