
[CT-171] [CT-60] [Bug] It takes too long to generate docs (BigQuery) #115

Closed
1 task done
chaos87 opened this issue Jan 19, 2022 · 27 comments
Labels: bug, help_wanted, Stale

Comments


chaos87 commented Jan 19, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Hi there

I have seen a similar bug in dbt-labs/dbt-core#1576, and that issue is closed, but docs generation still takes a long time for me, I believe because of the sharded tables in BigQuery (the ga_sessions_* tables from GA360).

Right now it takes around 30 minutes, and we have a little more than 10K tables.

Expected Behavior

I would expect the dbt docs generate command to finish generating the catalog within a minute

Steps To Reproduce

No response

Relevant log output

No response

Environment

- OS:
- Python: 3.7
- dbt: 1.0.0

What database are you using dbt with?

bigquery

Additional Context

No response

@chaos87 chaos87 added bug Something isn't working triage labels Jan 19, 2022
@github-actions github-actions bot changed the title [Bug] It takes too long to generate docs (BigQuery) [CT-60] [Bug] It takes too long to generate docs (BigQuery) Jan 19, 2022
@jtcohen6 jtcohen6 removed the triage label Feb 7, 2022

jtcohen6 commented Feb 7, 2022

@chaos87 Sorry for the delay here! I'm going to transfer this to dbt-bigquery, which is where the code for BigQuery-specific catalog generation lives today (see catalog.sql). That's also where any change to this logic would need to happen.

Given the change we made way back in dbt-labs/dbt-core#1795 — which seems like it should have fixed this, but maybe we missed a spot? — and Drew's note there about the syntax he tapped into, there might also be some interesting overlap with #113...

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Feb 7, 2022
@github-actions github-actions bot changed the title [CT-60] [Bug] It takes too long to generate docs (BigQuery) [CT-171] [CT-60] [Bug] It takes too long to generate docs (BigQuery) Feb 7, 2022

chaos87 commented Feb 9, 2022

Thanks!

I'll take a look and see if I can come up with a fix for this in the macro in catalog.sql.

@muscovitebob

We also observe this: our catalog on a BigQuery-only project takes 10 to 20 minutes to generate.

@jtcohen6

@chaos87 @muscovitebob Ok! Could I ask both of you to try:

  • Finding the catalog query from a past run in your BigQuery query history, or
  • Grabbing the templated SQL for the catalog query (probably from logs/dbt.log) and running it yourself in the console

I'd be interested to know which parts are slowest, based on what you can find in the query planner / your experimentation. If I had to guess, it might be these aggregates to get summary stats for entire tables (aggregating across all shards):

struct(
    min(shard_name) as shard_min,
    max(shard_name) as shard_max,
    count(*) as shard_count
) as table_shards,
sum(size_bytes) as size_bytes,
sum(row_count) as row_count,

Thanks in advance for your help!
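
To see which stages of that query are slowest, the per-stage numbers can also be pulled out of the query plan with the Python client. A minimal sketch, assuming google-cloud-bigquery is installed and the templated catalog SQL from logs/dbt.log has been saved to catalog_query.sql (a hypothetical path):

from google.cloud import bigquery

client = bigquery.Client()
sql = open("catalog_query.sql").read()

job = client.query(sql)
job.result()  # wait for the query to finish so the execution plan is populated

# One entry per stage of the executed plan; comparing records read/written per
# stage should show whether the aggregation across shards dominates.
for stage in job.query_plan:
    print(stage.name, stage.records_read, stage.records_written)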


muscovitebob commented Feb 10, 2022

The logs/dbt.log catalog query returns nearly instantaneously for me when I run it in the web interface. The last entry I see in the logs is the templated query; then there are no logs for 10 minutes while dbt is building the catalog. If I attach lldb to the subprocess dbt spawns, the frame names I see as I periodically interrupt it suggest it is munging dictionaries, tuples, and the like for those 10 minutes. I haven't figured out how to attach a real Python debugger to the subprocess on my M1 Mac yet. My hunch, therefore, is that the issue may lie in how the catalog is actually built, not in the catalog query.


jtcohen6 commented Feb 10, 2022

Ah, ok - that's really helpful! Any chance you could also estimate:

  • The amount of data returned by the catalog query (number of rows, approximate bytes)
  • The amount of memory the dbt process is using / has access to

This might be a dbt-core issue after all, since that's where turning catalog query results into a Catalog is implemented.
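
For anyone else gathering those numbers, a minimal sketch of one way to do it, assuming google-cloud-bigquery and psutil are installed, the compiled catalog SQL from logs/dbt.log is saved to catalog_query.sql (a hypothetical path), and 12345 stands in for the PID of the running dbt process:

from google.cloud import bigquery
import psutil

client = bigquery.Client()

# Row count of the catalog result set and bytes scanned by the query
# (the BigQuery console additionally reports the result size in bytes).
job = client.query(open("catalog_query.sql").read())
rows = job.result()
print(f"rows returned: {rows.total_rows}")
print(f"bytes processed: {job.total_bytes_processed}")

# Resident memory of the running dbt process (replace 12345 with the real PID).
dbt_proc = psutil.Process(12345)
print(f"dbt RSS: {dbt_proc.memory_info().rss / 1024 ** 3:.2f} GB")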


muscovitebob commented Feb 10, 2022

In my case the query returns 22,144 rows with a total result size of 12.18 MB. The subprocess spawned by the generate command grows to about 3 GB of memory consumption and stays there for the whole run on my machine.


muscovitebob commented Feb 10, 2022

Also, it's worth mentioning that even when I use a single thread for dbt, the log still shows multiple compiled catalog SQL statements one after another before going silent:

17:12:38.633863 [debug] [ThreadPool]: Acquiring new bigquery connection "project.information_schema"
17:12:38.637211 [debug] [ThreadPool]: Opening a new connection, currently in state closed
17:12:38.637498 [debug] [ThreadPool]: On project.information_schema: /* {"app": "dbt", "dbt_version": "1.0.0", "profile_name": "default", "target_name": "target", "connection_name": "project.information_schema"} */

    with tables as (
        select
            project_id as table_database,
            dataset_id as table_schema,
            table_id as original_table_name,
....

@muscovitebob

My slowdown occurs somewhere in this line, because if I put exception statements before it, dbt crashes quite quickly. I'm finding it harder to modify the internals of dbt-bigquery, however, so I haven't been able to narrow it down further.


muscovitebob commented Feb 10, 2022

I realised that get_catalog is still a function within dbt-core. Whatever is taking ages is happening inside this multithreaded for loop, which seems to launch a catalog query for each dataset, of which I have about 27 if I print the schema_map array.

@muscovitebob

Yeah, alright. I did not realise that it launches multiple different catalog queries, one per dataset. I was under the impression that all the queries in my log were the same, but this is clearly not the case. One of the queries, for example, produces a result set of 524,236 rows, which is 256 MB. Now, this executes very quickly in the web interface, but what the web interface doesn't do is download the whole result set to you. I know from experience that the BigQuery API can be very slow to return lots of rows, so I suspect that is where all the latency comes from: the BigQuery API takes a lot of time to return the results.
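
A rough way to confirm that split is to time query execution separately from the row download; a minimal sketch, assuming google-cloud-bigquery is installed and one of the compiled catalog queries from logs/dbt.log is saved to catalog_query.sql (a hypothetical path):

import time
from google.cloud import bigquery

client = bigquery.Client()
sql = open("catalog_query.sql").read()

start = time.monotonic()
job = client.query(sql)
rows = job.result()             # returns once the query itself has finished
executed = time.monotonic()

fetched = sum(1 for _ in rows)  # iterating the pages downloads the result set
downloaded = time.monotonic()

print(f"execution: {executed - start:.1f}s")
print(f"download:  {downloaded - executed:.1f}s for {fetched} rows")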


muscovitebob commented Feb 10, 2022

@chaos87 would you be able to run each individual catalog query from your own logs/dbt.log and see if any of them produce north of 100k rows?
Edit: for some more context on BigQuery API speeds, see googleapis/python-bigquery-pandas#133 (comment)


chaos87 commented Feb 11, 2022

Hey, thanks for the pointers.

I have found the queries in logs/dbt.log. I have 4 datasets for the ga_sessions tables listed as sources.

The queries take around 15 seconds to run in the BQ console and produce ~440K rows each (20 MB each).


muscovitebob commented Feb 11, 2022

The byte size of your result is smaller than mine, but as I mentioned, the BigQuery API seems to mostly suffer from large numbers of rows, regardless of how large the actual result is in bytes. There seem to be two possible solutions to this:

  • Somehow push down which tables dbt is actually interested in to the catalog query to reduce the result set we have to download.
  • Provide an option to use the BigQuery Storage Read API, which is how pandas-gbq solved it.

@jtcohen6 jtcohen6 removed the triage label Feb 11, 2022

jtcohen6 commented Feb 11, 2022

Thanks both for weighing in! That's a sizeable chunk of (meta)data. I'm not surprised to hear that dbt's processes are slower at that scale, though it's good to know that the bottleneck here may be the BigQuery client's API.

The current query is filtered to just the datasets that dbt is interested in, but I take it that you might have either/both of:

  • source datasets with many thousands of tables, of which a small subset are actually referenced by dbt models
  • lots of non-dbt objects living in the same datasets as dbt models/seeds/snapshots/etc

If you have any control over those, such as by isolating dbt sources and models into dedicated datasets, that would help significantly.

Otherwise, we could try filtering further in the query, to just the objects (tables/views) that map to sources/models/etc in your dbt project. To do that, we'd need to rework _get_catalog_schemas to return a SchemaSearchMap() that contains a list of specific relation identifiers, and template them into the catalog query, up to X specific identifiers (<1 MB limit for StandardSQL query text).
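
For illustration only, a rough sketch of what that identifier push-down might look like on the Python side; the placeholder predicate, the helper names, and the fallback behaviour are all assumptions, not the actual dbt-core/dbt-bigquery implementation:

BQ_MAX_QUERY_BYTES = 1_000_000  # approximate limit on Standard SQL query text


def identifier_filter(identifiers: list[str], column: str = "table_id") -> str:
    """Render a predicate like: table_id in ('model_a', 'source_b', ...).

    Note: date-sharded sources (e.g. ga_sessions_*) would need a prefix or
    regex match here rather than an exact IN list.
    """
    quoted = ", ".join("'{}'".format(i.replace("'", "\\'")) for i in identifiers)
    return f"{column} in ({quoted})"


def catalog_sql_with_pushdown(base_sql: str, identifiers: list[str]) -> str:
    """Swap a placeholder predicate in the catalog query for the identifier
    filter, falling back to the unfiltered query when the rendered text
    would exceed BigQuery's query-text limit."""
    rendered = base_sql.replace("/*identifier_filter*/ true", identifier_filter(identifiers))
    if len(rendered.encode("utf-8")) > BQ_MAX_QUERY_BYTES:
        return base_sql  # too many identifiers: fall back to the full dataset scan
    return rendered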

@jtcohen6 jtcohen6 added the help_wanted Extra attention is needed label Feb 11, 2022
@muscovitebob

Right on the money: the datasets I ingest have many tables, and most of them are indeed untracked by dbt (they are Fivetran connector datasets that ingest everything in the databases they track). The filtering proposal sounds good; we can fall back to a full download if it exceeds whatever maximum number of identifiers we settle on. I'll take a look at how to do this.


chaos87 commented Feb 14, 2022

"source datasets with many thousands of tables, of which a small subset are actually referenced by dbt models"

Yes, this is the situation I'm in.

@muscovitebob

I haven't made much progress with this because it seems I am not able to debug across the dbt-bigquery / dbt-core package boundary; namely, I cannot step from dbt-bigquery into code in dbt-core.


muscovitebob commented Mar 2, 2022

@jtcohen6 I am wondering if we actually need to do the code modifications to pass the table names, rather than going with the Storage API. I noticed in the API docs that the Storage API now looks plug-and-play; the docs even state:

dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(
        # Optionally, explicitly request to use the BigQuery Storage API. As of
        # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
        # API is used by default.
        create_bqstorage_client=True,
    )
)

I see that setup.py in this project requires google-cloud-bigquery at or above 1.25.0, so perhaps simply upgrading the version of this library will suffice to fix this usability issue. I will test this.
Edit: Ah, hold on, this plug-and-play path appears to be specific to the Arrow and pandas DataFrame methods.
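
For what it's worth, the Arrow path can still pull query results through the Storage API without going through pandas. A rough sketch, assuming google-cloud-bigquery[bqstorage] and a recent pyarrow are installed and the compiled catalog SQL is in catalog_query.sql (a hypothetical path):

from google.cloud import bigquery

client = bigquery.Client()
sql = open("catalog_query.sql").read()

# to_arrow() uses the BigQuery Storage Read API when the bqstorage extra is
# installed; otherwise it falls back to the much slower paged REST download.
arrow_table = client.query(sql).result().to_arrow(create_bqstorage_client=True)

print(f"downloaded {arrow_table.num_rows} rows via Arrow")
rows = arrow_table.to_pylist()  # plain Python dicts, no pandas required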


chaos87 commented Mar 18, 2022

Hey, sorry for the long silence.

I have found a workaround for the slow catalog build on my side.

Here: https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/include/bigquery/macros/catalog.sql#L26

If we make is_date_shard also work for views, i.e.

REGEXP_CONTAINS(table_id, '^.+[0-9]{8}$') and coalesce(type, 0) in (1, 2) as is_date_shard

then the output of the get_catalog query (for one dataset with many sharded tables and views, about 4K views) goes from 400K rows down to 1K rows.

That said, none of those views were declared as sources or models, so the true problem is still that dbt scans objects outside the realm of the dbt project, as @jtcohen6 mentioned.

@kevinhoe

@chaos87 @muscovitebob @jtcohen6 thanks for your detailed investigation! This has been a helpful resource as we troubleshoot a similar issue on our end (an error when generating BigQuery docs in dbt Cloud). We've been hitting a memory limit error even after the kind folks at dbt Labs helped to increase our dbt Cloud memory. Here are some findings in case they help draw attention to this open issue. We still do not have docs generating at the moment...

  • dbt docs generate is throwing a memory limit error
  • dbt compile by itself takes about 35 minutes to complete
  • when I try dbt docs generate, I think the compiling is consuming the majority of the memory, because the logs cut off shortly into the catalog-building phase
  • I then tried to override the default macro bigquery__get_catalog, but haven't quite got it right yet because I'm not sure what the data types are for the two arguments that feed into the macro when it's called. In general, I tried hard-coding some data sources in the macro CTEs to pull only from the BigQuery project information schema that's needed, and separately tested filtering the where clauses to pull metadata from only the BigQuery project we care about (instead of scanning far beyond our needed scope), but this isn't quite working as expected, so I tried the next thing...
  • I started testing dbt docs generate --no-compile (against the default catalog macro), which actually finished running successfully in about 9 minutes, but we still have no docs to show for it... we get a "Not found." page when trying to open documentation in dbt Cloud.
  • I also came across this BigQuery macro override file that Drew wrote, which addresses date shards with "date suffixes that are shorter than 8 characters"... but still not enough performance gains to complete the docs generation

Hope these takeaways can help... and I'm open to any ideas for continuing to improve the BigQuery doc generation process, thank you!

@kevinhoe

UPDATE: I was able to get docs to generate and open properly in dbt Cloud (without making any changes to the bigquery__get_catalog macro)... but I ran into two separate issues and needed a hack-y workaround. For context, we have a ton of models (and unrelated BigQuery tables), which poses a challenge for generating docs.

  • First we hit a memory limit when trying to run dbt docs generate; we can get around that by splitting up some of the nested commands, i.e. starting the run with dbt compile (takes 35 minutes) and then running dbt docs generate --no-compile (9 minutes).
  • Then we run into a separate issue where docs are not “serving” properly online:
  1. Currently, opening docs the normal way in dbt Cloud does not work for us (by clicking on “Documentation” in the dbt Cloud hamburger menu). We get a URL with the following format: https://cloud.getdbt.com/accounts/[account_number]/jobs/[job_number]/docs/
  2. However, I found this other URL format that seems to work: https://cloud.getdbt.com/accounts/[account_number]/runs/[run_number]/docs/#!/overview
  3. What we're doing now as a workaround is to replace the run_number in the URL above with the latest run number of our scheduled docs job.
  4. The interesting/hack-y thing to note is that the working URL format is something I borrowed from the GUI of dbt Cloud v0.21... and I believe this "Run Documentation" link is no longer present in the GUI for dbt Cloud v1.0!


jeremyyeo commented Aug 1, 2022

Hey @kevinhoe, it would be good context if you added the total number of tables/views in the sources you're using in your dbt project, as you told me.

For the record, when you pop in your own custom dbt docs generate step but turn the "Generate Docs" checkbox off, dbt Cloud will not be able to correctly serve your docs, which is probably the reason for your "docs are not “serving” properly online" statement.

[screenshot of the "Generate Docs" checkbox in the dbt Cloud job settings]


kevinhoe commented Aug 1, 2022

Thanks for the context @jeremyyeo! Yes, your description is spot on for why docs are not serving properly online. I was unaware of the setting in Profile Settings where one selects the job that the UI points to in order to serve docs: https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/artifacts

I tested a few different scenarios, but alas, I do not think we are able to use the built-in feature for the UI to serve docs in dbt Cloud. More specifically, I tried checking the "Generate Docs" box in addition to including command steps for dbt compile and dbt docs generate --no-compile, in hopes of the UI picking up the completed manifest and catalog JSON files from the explicit commands. Those two explicit commands completed as expected, and then the built-in command for dbt docs generate failed as expected; unfortunately, the UI was not able to serve the docs via the normal way of opening docs in dbt Cloud.

So for the time being, we are left using the hack-y workaround of replacing the run_number in the url format of: https://cloud.getdbt.com/accounts/[account_number]/runs/[run_number]/docs/#!/overview

@jeremyyeo

@kevinhoe I added a full macro override: https://gist.github.com/jeremyyeo/f83ca852510956ba3f2f96aa079c43d5 that limits the catalog query to only the models/sources that are relevant to the project.


kevinhoe commented Aug 3, 2022

Wow! Thank you @jeremyyeo! Not sure what's more impressive: your Jinja expertise or the extremely thorough documentation 😄 I'm excited to try this out and will report back on how it may improve our docs generation, thanks!

Also, I'm sharing a link you sent me in case others might find it helpful too. This is an issue that someone else opened regarding long compilation times for dbt-bigquery: #205

On the surface, that might seem like a separate issue, but my interpretation of the official dbt documentation for the dbt docs generate command (https://docs.getdbt.com/reference/commands/cmd-docs) is that the compile step is nested within the general dbt docs generate command, hence splitting out dbt compile from dbt docs generate.

So to summarize, when we run the generic command dbt docs generate, I think we may be running into issues with:

  1. the compile step taking a very long time
  2. and the docs generate [--no-compile] step (i.e., the catalog step) also taking a long time

Very much appreciate your time and expertise here, Jeremy!

@github-actions

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
