[CT-832] [Feature] maxBytes output option for dbt artifacts #5461

Closed

charlespicowski opened this issue Jul 12, 2022 · 9 comments
Labels
artifacts, enhancement, help_wanted, stale

Comments

@charlespicowski

Is this your first time opening an issue?

Describe the Feature

An artifactMaxBytes: 16*1024*1024 configuration setting in dbt_project.yml that would control the size of the output artifact files (manifest.json etc.). If the limit were exceeded, multiple files, each less than or equal to that size, could be produced.

The manifest.json file in particular can exceed a limit imposed by Snowflake for uploading VARIANT data - specifically, any compressed (.gz) file cannot contain a single cell entry larger than 16 MB.
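
For context, a quick way to see which top-level manifest sections approach that limit (a minimal sketch, assuming dbt's default target/ output path):

import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

# Report the serialized size of each top-level section
for key, value in manifest.items():
    size_mb = len(json.dumps(value).encode("utf-8")) / (1024 * 1024)
    print(f"{key}: {size_mb:.2f} MB")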

Describe alternatives you've considered

There are many ways this hurdle can be overcome; some work and some do not.

I have seen pre-processing approaches that recommend using jq or Snowflake UDFs. The well-known package by Brooklyn Data seems to address this issue with the V2 method of uploading the artifacts (which first attempts to flatten the JSON before expanding and uploading it into tables) - however, it seems there are still problems with this approach.

Who will this benefit?

Data observability is becoming more of a trend these days, with more people looking to extract value from the dbt artifacts.
I think if this feature were developed, it would save a lot of the time/headache/mess of splitting up the JSON file after it is created.

Are you interested in contributing this feature?

I am happy to try.

Anything else?

No response

charlespicowski added the enhancement and triage labels Jul 12, 2022
github-actions bot changed the title [Feature] maxBytes output option for dbt artifacts [CT-832] [Feature] maxBytes output option for dbt artifacts Jul 12, 2022
jtcohen6 added the help_wanted and artifacts labels and removed the triage label Jul 19, 2022
@jtcohen6
Contributor

@charlespicowski Thanks for opening!

Manifest data has been getting bigger and bigger. This reflects both:

  • The addition of new configuration options within dbt, while preserving backwards compatibility for metadata consumers
  • The increasing scale of dbt projects running in production

For the specific use case identified here (loading into Snowflake, for use in the dbt_artifacts package), we don't need the overall manifest contents to be <16 MB, but we do need the contents of each "row" (NDJSON record) to be <16 MB.

One simple way to do this might be to "flatten" the top-level manifest dictionaries into separate records, and then "flatten" again any record which is over a certain scale (>1k entries, say) — such that we end up with a manifest looking something like:

{"metadata": ...}
{"nodes_1": ...}
{"nodes_2": ...}
{"sources": ...}

More interesting, but a heavier lift, would be turning the manifest into a "manifest of manifests." The top-level manifest would preserve pointers to all resources, with resource unique_id as the keys, but move the detailed entries for those resources into separate files / indexable storage locations.

{
    "nodes": {
       "model.<package_name>.<model_name>": "<lookup value (name of another file?)>",
       ...
    }
}
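
A minimal sketch of what that could look like (the one-file-per-node layout and index file name here are just one hypothetical choice):

import json, os

with open("target/manifest.json") as f:
    manifest = json.load(f)

os.makedirs("target/nodes", exist_ok=True)
index = {"nodes": {}}
for unique_id, node in manifest["nodes"].items():
    # Write each node's detailed entry to its own file...
    path = os.path.join("nodes", f"{unique_id}.json")
    with open(os.path.join("target", path), "w") as out:
        json.dump(node, out)
    # ...and keep only a pointer to it under the node's unique_id
    index["nodes"][unique_id] = path

with open("target/manifest_index.json", "w") as out:
    json.dump(index, out)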

All of this might be doable via custom post-processing (whether fancy jq or a Python script). I understand the desire to have it all-in-dbt, since that means dbt metadata can be loaded into Snowflake and power the dbt_artifacts package, with no additional tooling needed. But I'm also aware that dbt features like dbt-docs wouldn't work with this configuration option... unless we taught it how to do those index lookups, too.

It does raise the question: Should we load this directly into a key-value store, or a database running locally? That feels out of scope for this, but in scope for our imaginations :)

Did you have other ideas of how we might go about this? It feels like we'd want to experiment with a few different approaches, before settling on one.

(cc @barryaron — you might find this interesting)

@charlespicowski
Author

charlespicowski commented Jul 21, 2022

Approach two feels a bit nicer.

How are you handling the splitting of the dbt log files by size?

I understand the data structure is a bit different, and the considerations are different (e.g. dbt-docs does not need to reference the logs), but perhaps there is something to gain by looking into this.

@NiallRees
Contributor

Just came here to say that we've been busy reimplementing the package, resolving this issue in dbt_artifacts. We now process the graph and results context variables, inserting the values straight into the source tables, avoiding the artifacts altogether.

@jtcohen6
Contributor

How are you handling the splitting of the dbt log files by size?

This is much simpler to do, because every log message is self-contained, and their inter-relationship is expressed through linear time. Whereas we'd want to divvy up the JSON artifacts in a way that still produces a self-contained, valid, and conceptually meaningful subset of the overall JSON blob.

The maxBytes option also comes "for free" from RotatingFileHandler:

filename=log_dest, encoding="utf8", maxBytes=10 * 1024 * 1024, backupCount=5 # 10 mb

This isn't configurable today for end users of dbt, but such a thing would be very easy to instrument.
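
For instance (a sketch only; DBT_LOG_MAX_BYTES is a hypothetical setting, not something dbt reads today):

import os
from logging.handlers import RotatingFileHandler

log_dest = "logs/dbt.log"  # wherever dbt writes its log file

# Hypothetical user override, falling back to the current 10 MB default
max_bytes = int(os.environ.get("DBT_LOG_MAX_BYTES", 10 * 1024 * 1024))
handler = RotatingFileHandler(
    filename=log_dest, encoding="utf8", maxBytes=max_bytes, backupCount=5
)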

@charlespicowski
Author

This links to #5096 - if there were a way of removing dependency packages from manifest.json, it could (greatly?) reduce its overall size.

@charlespicowski
Author

Out of interest @NiallRees, can you share more about what you mean by "graph" and "context variables"?

@github-actions bot

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the stale label Feb 15, 2025
@github-actions bot

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned Feb 23, 2025