[CT-171] [CT-60] [Bug] It takes too long to generate docs (BigQuery) #115
@chaos87 Sorry for the delay here! I'm going to transfer this to the dbt-bigquery repository. Given the change we made way back in dbt-labs/dbt-core#1795 — which seems like it should have fixed this, but maybe we missed a spot? — and Drew's note there about the syntax he tapped into, there might also be some interesting overlap with #113...
Thanks! I'll take a look and see if I can come up with a fix for this on the macro in catalog.sql.
We also observe this; our catalog on a BigQuery-only project takes between 10 and 20 minutes to generate.
@chaos87 @muscovitebob Ok! Could I ask both of you to try running the compiled catalog queries manually?
I'd be interested to know which parts are slowest, based on what you can find in the query planner / your experimentation. If I had to guess, it might be these aggregates to get summary stats for entire tables (aggregating across all shards): see dbt/include/bigquery/macros/catalog.sql, lines 59 to 66 at commit 87095c4.
Thanks in advance for your help!
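A minimal sketch of that experiment with the google-cloud-bigquery Python client (catalog_sql is a placeholder for a compiled catalog query copied from logs/dbt.log, and application-default credentials are assumed):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials
catalog_sql = "select 1  -- paste a compiled catalog query from logs/dbt.log here"

job = client.query(catalog_sql)
job.result()  # block until the query finishes

# Job-level stats, then per-stage plan entries, to see where BigQuery
# itself spends its time.
print("bytes processed:", job.total_bytes_processed)
print("slot ms:", job.slot_millis)
for stage in job.query_plan:
    print(stage.name, "records read:", stage.records_read)
```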
The …
Ah, ok - that's really helpful! Any chance you could also estimate:
- how many rows the catalog query returns, and the total size of that result set?
- how much memory the dbt process uses while generating the catalog?
This might be a …
In my case the query returns 22,144 rows with a total result size of 12.18 MB. The subprocess spawned by the generate command grows to about 3 GB of memory consumption and stays there for the whole run on my machine.
Also, it's worth mentioning that even when I use a single thread for dbt, the log still shows multiple compiled SQL catalog statements one after another before going silent.
My slowdown occurs somewhere in this line, because if I put exception statements before it, dbt crashes quite quickly. I'm finding it harder to modify the insides of dbt-bigquery, however, so I haven't been able to narrow it down further.
I realised that …
Yeah, alright. I did not realise that it launches multiple different catalog queries, one per dataset. I was under the impression all the queries in my log were the same, but this is clearly not the case. One of the queries, for example, creates a result set of 524,236 rows, which is 256 MB. Now, this executes very quickly in the web interface, but what the web interface doesn't do is download the whole result set to you. I know from experience the BigQuery API can be very slow to return lots of rows, so I suspect that is where all the latency comes from: the BigQuery API takes a lot of time to return the results.
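One way to test that hypothesis (a sketch, not dbt's own code; catalog_sql is again a placeholder for a compiled catalog query) is to time query execution and row download separately:

```python
import time

from google.cloud import bigquery

client = bigquery.Client()
catalog_sql = "select 1  -- paste one of the compiled catalog queries here"

t0 = time.perf_counter()
rows = client.query(catalog_sql).result()  # waits for query execution
t1 = time.perf_counter()
n = sum(1 for _ in rows)  # forces the client to page every row over the API
t2 = time.perf_counter()

print(f"execute: {t1 - t0:.1f}s, download {n} rows: {t2 - t1:.1f}s")
```

If the second number dominates, the bottleneck is the result download rather than the query itself.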
@chaos87 would you be able to run each individual catalog query in the BigQuery console?
Hey, thanks for the pointers. I have found the queries in the logs. The queries take around 15 sec to run in the BQ console and produce ~440K rows each (20 MB each).
The bytes size of your result is smaller than mine, but as I mentioned, the BigQuery API seems to mostly suffer from large numbers of rows regardless of how large in bytes the actual result is. It seems there are two possible solutions to this:
- modify the code to pass the relevant table names into the catalog queries, so they return far fewer rows
- use the BigQuery Storage API, which is much faster at returning large result sets
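For the second of those, a sketch of what the Storage API path looks like with the Python client (assumes the optional google-cloud-bigquery-storage and pandas packages are installed; this is not what dbt itself does today):

```python
from google.cloud import bigquery

client = bigquery.Client()
catalog_sql = "select 1  -- placeholder for a compiled catalog query"

# With create_bqstorage_client=True, to_dataframe() downloads result pages
# in parallel over the Storage Read API instead of paging through the
# slower row-based REST endpoint.
df = client.query(catalog_sql).result().to_dataframe(create_bqstorage_client=True)
print(len(df), "rows downloaded")
```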
Thanks both for weighing in! That's a sizeable chunk of (meta)data. I'm not surprised to hear that dbt's processes are slower at that scale, though it's good to know that the bottleneck here may be the BigQuery client's API. The current query is filtered to just the datasets that dbt is interested in, but I take it that you might have either/both of:
- source datasets with many thousands of tables, of which a small subset are actually referenced by dbt models
- …
If you have any control over those, such as by isolating dbt sources and models into dedicated datasets, that would help significantly. Otherwise, we could try filtering further in the query, down to just the objects (tables/views) that map to sources/models/etc. in your dbt project. To do that, we'd need to rework the catalog macro.
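A sketch of what that filtering could look like at the SQL level (the relation list and the project/dataset names are hypothetical; in dbt itself the identifiers would come from the project manifest rather than a hard-coded list):

```python
from google.cloud import bigquery

# Hypothetical list of relations dbt actually tracks; a real implementation
# would pull these from the manifest.
relations = ["orders", "customers", "payments"]
in_list = ", ".join(f"'{name}'" for name in relations)

catalog_sql = f"""
    select table_catalog, table_schema, table_name, table_type
    from `my-project`.`my_dataset`.INFORMATION_SCHEMA.TABLES
    where table_name in ({in_list})
"""

client = bigquery.Client()
for row in client.query(catalog_sql).result():
    print(row.table_name, row.table_type)
```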
The datasets I ingest have many tables, and most of them are indeed untracked by dbt (they are Fivetran connector datasets that ingest everything in the databases they track), so you're right on the money. The filtering proposal sounds good; we can fall back to a full download if it exceeds whatever max number of identifiers we settle on. I'll take a look at how to do this.
"source datasets with many thousands of tables, of which a small subset are actually referenced by dbt models" yes this is the situation I'm in |
I haven't made much progress with this because it seems I am not able to debug across the dbt-bigquery / dbt-core package boundary - namely, I cannot step from dbt-bigquery into code in dbt-core.
@jtcohen6 I am wondering if we actually need to do the code modifications to pass the table names, instead of going with the Storage API? I noticed that in the API docs it looks like the Storage API is now plug-and-play, and it even states: …
I see the …
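One quick thing worth checking in the environment where dbt runs (a sketch; the Python client only takes the plug-and-play Storage API fast path when the optional package is importable):

```python
import importlib.util

# google-cloud-bigquery falls back to the slower REST download when the
# optional google-cloud-bigquery-storage package is missing.
spec = importlib.util.find_spec("google.cloud.bigquery_storage")
print("Storage Read API available:", spec is not None)
```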
Hey, sorry for the long silence. I have found a workaround to solve the slow-to-build catalog on my side: if we make … (i.e. …), then the output of the … But with that said, none of the views were declared as sources or models, so the true problem is still that dbt scans objects that are outside the realm of the dbt project, as @jtcohen6 mentioned.
@chaos87 @muscovitebob @jtcohen6 thanks for your detailed investigation! This has been a helpful resource as we troubleshoot a similar issue on our end (an error when generating BigQuery docs in dbt Cloud). We've been hitting a memory limit error even after the kind folks at dbt Labs helped to increase our dbt Cloud memory. Here are some findings in case they can help draw attention to this open issue. We still do not have docs generating at the moment...
Hope these takeaways can help... and open to any ideas for continuing to improve the BigQuery doc generation process. Thank you!
UPDATE: was able to get docs to generate and open properly on dbt Cloud (without making any changes to the …)
Hey @kevinhoe, it would be good context if you put the total number of tables/views in the sources that you're using in your dbt project, as you told me. Ftr, when you pop in your own custom …
Thanks for the context @jeremyyeo! Yes, your description is spot on for why docs are not serving properly online. I was unaware of the config in Profile Settings where one selects the job for the UI to point to in order to serve docs: https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/artifacts

I tested a few different scenarios, but alas, I do not think we are able to utilize the built-in feature for the UI to serve docs in dbt Cloud. More specifically, I tried checking the box for "Generate Docs" in addition to including command steps for … So for the time being, we are left using the hack-y workaround of replacing the default catalog macro.
@kevinhoe: added a full macro override, https://gist.github.com/jeremyyeo/f83ca852510956ba3f2f96aa079c43d5, that limits the catalog query to only the models/sources that are relevant to our project.
Wow! Thank you @jeremyyeo! Not sure what's more impressive - your jinja expertise or the extremely thorough documentation 😄 I'm excited to try this out and will report back on how it may improve our docs generation, thanks! Also, I'm sharing a link you sent me in case others might find it helpful too: an issue that someone else opened regarding long compilation times for dbt-bigquery, #205. On the surface, that might seem like a separate issue, but my interpretation of the official dbt documentation is that the dbt docs generate command compiles the project in addition to querying the warehouse for catalog metadata. So to summarize, when we run the generic command dbt docs generate, …
Very much appreciate your time and expertise here, Jeremy!
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Is there an existing issue for this?
Current Behavior
Hi there,
I have seen a similar bug here (dbt-labs/dbt-core#1576) and that issue is closed, but it still takes a long time for me to generate docs, because (I believe) of the sharded tables in BQ (the ga_sessions_* from GA360).
Right now it takes around 30 min, and we have a little more than 10K tables.
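For context on that scale, a sketch for counting how many physical shards collapse into each logical table (the project and dataset names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
# __TABLES__ lists one row per physical table; the regex collapses
# date-sharded names like ga_sessions_20220101 into one logical name.
sql = """
    select
      coalesce(regexp_extract(table_id, r'^(.+)_[0-9]{8}$'), table_id) as logical_name,
      count(*) as n_shards
    from `my-project.my_dataset.__TABLES__`
    group by 1
    order by n_shards desc
"""
for row in client.query(sql).result():
    print(row.logical_name, row.n_shards)
```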
Expected Behavior
I would expect the dbt docs generate command to finish generating the catalog within a minute.
Steps To Reproduce
No response
Relevant log output
No response
Environment
What database are you using dbt with?
bigquery
Additional Context
No response