[CT-832] [Feature] maxBytes output option for dbt artifacts #5461

Closed

charlespicowski opened this issue Jul 12, 2022 · 9 comments
Labels
artifacts, enhancement, help_wanted, stale

Comments

@charlespicowski

Is this your first time opening an issue?

Describe the Feature

An artifactMaxBytes: 16*1024*1024 configuration setting in dbt_project.yml that would control the size of the output artifact files (manifest.json etc.). If the limit were exceeded, multiple files, each less than or equal to that size, could be produced.

The manifest.json file in particular can exceed a limit imposed by Snowflake for uploading VARIANT data - specifically, any compressed (.gz) file cannot contain a single cell entry larger than 16 MB.
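
For context, a quick way to see which top-level manifest sections approach that limit (a minimal sketch, assuming dbt's default target/ output path):

import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

# Report the serialized size of each top-level section
for key, value in manifest.items():
    size_mb = len(json.dumps(value).encode("utf-8")) / (1024 * 1024)
    print(f"{key}: {size_mb:.2f} MB")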

Describe alternatives you've considered

There are many ways this hurdle can be overcome; some work and some do not.

I have seen pre-processing approaches that recommend using jq or Snowflake UDFs. The well-known package by Brooklyn Data seems to address this issue with the V2 method of uploading the artifacts (which first attempts to flatten the JSON before expanding and uploading it into tables) - however, it seems there are still problems with this approach.

Who will this benefit?

Data observability is becoming more of a trend these days, with more people looking to extract value from the dbt artifacts.
I think if this feature were developed, it would save a lot of the time/headache/mess of splitting up the JSON file after it is created.

Are you interested in contributing this feature?

I am happy to try.

Anything else?

No response

charlespicowski added the enhancement and triage labels Jul 12, 2022
github-actions bot changed the title [Feature] maxBytes output option for dbt artifacts [CT-832] [Feature] maxBytes output option for dbt artifacts Jul 12, 2022
jtcohen6 added the help_wanted and artifacts labels and removed the triage label Jul 19, 2022
@jtcohen6
Contributor

@charlespicowski Thanks for opening!

Manifest data has been getting bigger and bigger. This reflects both:

  • The addition of new configuration options within dbt, while preserving backwards compatibility for metadata consumers
  • The increasing scale of dbt projects running in production

For the specific use case identified here (loading into Snowflake, for use in the dbt_artifacts package), we don't need the overall manifest contents to be <16 MB, but we do need the contents of each "row" (NDJSON record) to be <16 MB.

One simple way to do this might be to "flatten" the top-level manifest dictionaries into separate records, and then "flatten" again any record which is over a certain scale (>1k entries, say) — such that we end up with a manifest looking something like:

{"metadata": ...}
{"nodes_1": ...}
{"nodes_2": ...}
{"sources": ...}

More interesting, but a heavier lift, would be turning the manifest into a "manifest of manifests." The top-level manifest would preserve pointers to all resources, with resource unique_id as the keys, but move the detailed entries for those resources into separate files / indexable storage locations.

{
    "nodes": {
       "model.<package_name>.<model_name>": "<lookup value (name of another file?)>",
       ...
    }
}
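
A minimal sketch of what that could look like (the one-file-per-node layout and index file name here are just one hypothetical choice):

import json, os

with open("target/manifest.json") as f:
    manifest = json.load(f)

os.makedirs("target/nodes", exist_ok=True)
index = {"nodes": {}}
for unique_id, node in manifest["nodes"].items():
    # Write each node's detailed entry to its own file...
    path = os.path.join("nodes", f"{unique_id}.json")
    with open(os.path.join("target", path), "w") as out:
        json.dump(node, out)
    # ...and keep only a pointer to it under the node's unique_id
    index["nodes"][unique_id] = path

with open("target/manifest_index.json", "w") as out:
    json.dump(index, out)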

All of this might be doable via custom post-processing (whether fancy jq or a Python script). I understand the desire to have it all-in-dbt, since that means dbt metadata can be loaded into Snowflake and power the dbt_artifacts package, with no additional tooling needed. But I'm also aware that dbt features like dbt-docs wouldn't work with this configuration option... unless we taught it how to do those index lookups, too.

It does raise the question: Should we load this directly into a key-value store, or a database running locally? That feels out of scope for this, but in scope for our imaginations :)

Did you have other ideas of how we might go about this? It feels like we'd want to experiment with a few different approaches, before settling on one.

(cc @barryaron — you might find this interesting)

@charlespicowski
Author

charlespicowski commented Jul 21, 2022

Approach two feels a bit nicer.

How are you handling the splitting of the dbt log files by size?

I understand the data structure is a bit different, and the considerations are different (e.g. dbt-docs does not need to reference the logs), but perhaps there is something to gain by looking into this.

@NiallRees
Contributor

Just came here to say that we've been busy reimplementing the package, resolving this issue in dbt_artifacts. We now process the graph and results context variables, inserting the values straight into the source tables, avoiding the artifacts altogether.

@jtcohen6
Contributor

How are you handling the splitting of the dbt log files by size?

This is much simpler to do, because every log message is self-contained, and their inter-relationship is expressed through linear time. Whereas we'd want to divvy up the JSON artifacts in a way that still produces a self-contained, valid, and conceptually meaningful subset of the overall JSON blob.

The maxBytes option also comes "for free" from RotatingFileHandler:

filename=log_dest, encoding="utf8", maxBytes=10 * 1024 * 1024, backupCount=5 # 10 mb

This isn't configurable today for end users of dbt, but such a thing would be very easy to instrument.
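
For instance (a sketch only; DBT_LOG_MAX_BYTES is a hypothetical setting, not something dbt reads today):

import os
from logging.handlers import RotatingFileHandler

log_dest = "logs/dbt.log"  # wherever dbt writes its log file

# Hypothetical user override, falling back to the current 10 MB default
max_bytes = int(os.environ.get("DBT_LOG_MAX_BYTES", 10 * 1024 * 1024))
handler = RotatingFileHandler(
    filename=log_dest, encoding="utf8", maxBytes=max_bytes, backupCount=5
)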

@charlespicowski
Author

This links to #5096 - if there were a way of removing dependency packages from manifest.json, it could (greatly?) reduce its overall size.

@charlespicowski
Author

Out of interest @NiallRees, can you share more about what you mean by "graph" and "context variables"?

@github-actions bot

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the stale label Feb 15, 2025
@github-actions bot

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned Feb 23, 2025