[CT-832] [Feature] maxBytes output option for dbt artifacts #5461

Comments
@charlespicowski Thanks for opening! Manifest data has been getting bigger and bigger.

For the specific use case identified here (loading into Snowflake), one simple way to do this might be to "flatten" the top-level manifest dictionaries into separate records, and then "flatten" again any record which is over a certain scale (>1k entries, say), such that we end up with a manifest looking something like:

{"metadata": ...}
{"nodes_1": ...}
{"nodes_2": ...}
{"sources": ...} More interesting, but a heavier lift, would be turning the manifest into a "manifest of manifests." The top-level manifest would preserve pointers to all resources, with resource {
"nodes": {
"model.<package_name>.<model_name>": "<lookup value (name of another file?)>",
...
}
}

All of this might be doable via custom post-processing. It does beg the question: should we load this directly into a key-value store, or a database running locally? That feels out of scope for this issue, but in scope for our imaginations :) Did you have other ideas of how we might go about this? It feels like we'd want to experiment with a few different approaches before settling on one. (cc @barryaron, you might find this interesting)
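For illustration, here is a minimal sketch of the "manifest of manifests" idea as post-hoc processing of an existing manifest.json. The file layout, names, and section list are assumptions for the sketch, not anything dbt produces today:

```python
# Post-hoc "manifest of manifests": one file per resource, plus an index file
# holding pointers (unique_id -> filename). Paths, names, and the section list
# are assumptions for this sketch, not anything dbt writes today.
import json
from pathlib import Path

def explode_manifest(manifest_path="target/manifest.json",
                     out_dir="target/manifest_parts"):
    manifest = json.loads(Path(manifest_path).read_text())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    index = {"metadata": manifest.get("metadata", {})}
    for section in ("nodes", "sources", "macros", "exposures"):
        pointers = {}
        for unique_id, resource in manifest.get(section, {}).items():
            filename = unique_id.replace(".", "_") + ".json"
            (out / filename).write_text(json.dumps(resource))
            pointers[unique_id] = filename
        index[section] = pointers

    # The top-level file keeps only lookups, so it stays small.
    (out / "index.json").write_text(json.dumps(index))

if __name__ == "__main__":
    explode_manifest()
```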
Approach two feels a bit nicer. How are you handling the size-based split of the dbt log files? I understand the data structure is a bit different, and the considerations are different.
Just came here to say that we've been busy reimplementing the package, resolving this issue in dbt_artifacts. We now process the graph and results context variables, inserting the values straight into the source tables, avoiding the artifacts altogether.
This is much simpler to do, because every log message is self-contained, and their inter-relationship is expressed through linear time. Whereas we'd want to divvy up the JSON artifacts in a way that still produces a self-contained, valid, and conceptually meaningful subset of the overall JSON blob.

The maxBytes used for log files is set in core/dbt/events/functions.py (line 92 at 2548ba9). This isn't configurable today for end users of dbt, but such a thing would be very easy to instrument.
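For reference, size-based log rotation in the Python standard library looks like the sketch below; dbt wires up something along these lines for its log file, though the filename, 10 MiB threshold, and backup count used here are assumptions:

```python
# Size-based rotation as provided by the standard library. dbt configures
# something similar for dbt.log in core/dbt/events/functions.py; the
# filename, 10 MiB threshold, and backup count below are assumptions.
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "dbt.log",
    maxBytes=10 * 1024 * 1024,  # start a new file once this size is reached
    backupCount=5,              # keep dbt.log.1 ... dbt.log.5 around
    encoding="utf8",
)
logger = logging.getLogger("example")
logger.addHandler(handler)
logger.warning("each line is self-contained, so the file can be cut anywhere")
```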
This links to #5096 - if there is a way of removing the appearance of dep packages in the manifest, that would also help with its size.
Out of interest @NiallRees, can you share more on what you mean by "graph" and "context variables"?
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Is this your first time opening an issue?
Describe the Feature
An artifactMaxBytes: 16*1024*1024 configuration setting in the project.yml that would control the output of the artifact files (manifest.json etc.). If the limit is exceeded, multiple files less than or equal to that size could be produced.

The manifest.json file in particular can exceed a limit imposed by Snowflake for uploading VARIANT data - specifically, any compressed (.gz) file cannot have a single cell entry with size >16MB.
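As an illustration of the requested behaviour, here is a rough post-processing sketch that splits the manifest's largest top-level sections into files that each stay under a byte budget. The section list, output naming, and the 16 MiB default mirror this request but are otherwise assumptions:

```python
# Sketch of size-capped artifact output as a post-processing step: split the
# largest top-level dictionaries of manifest.json into numbered parts, each
# serialized under max_bytes. Section names, output naming, and the 16 MiB
# default are illustrative only.
import json
from pathlib import Path

MAX_BYTES = 16 * 1024 * 1024

def split_section(name, records, max_bytes=MAX_BYTES):
    """Yield (filename, chunk) pairs whose JSON stays under max_bytes (approximate)."""
    chunk, size, part = {}, 2, 1  # 2 accounts for the enclosing braces
    for key, value in records.items():
        entry_size = len(json.dumps({key: value}).encode("utf8"))
        if chunk and size + entry_size > max_bytes:
            yield f"{name}_{part}.json", chunk
            chunk, size, part = {}, 2, part + 1
        chunk[key] = value
        size += entry_size
    if chunk:
        yield f"{name}_{part}.json", chunk

manifest = json.loads(Path("target/manifest.json").read_text())
out_dir = Path("target/split")
out_dir.mkdir(parents=True, exist_ok=True)
for section in ("nodes", "sources"):
    for filename, chunk in split_section(section, manifest.get(section, {})):
        (out_dir / filename).write_text(json.dumps(chunk))
```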
Describe alternatives you've considered
There are several workarounds for this, some of which work and some of which do not.
I have seen pre-processing approaches that recommend using jq or Snowflake UDFs. The well-known package by Brooklyn Data seems to address this issue with its V2 method of uploading the artifacts (which first attempts to flatten the JSON before expanding and uploading it into tables); however, it seems there are still problems with this approach.
Who will this benefit?
Data observability is becoming more of a trend these days, with more people looking into extracting value from the dbt artifacts.
I think if this feature were developed, it would save a lot of the time, headache, and mess of splitting up the JSON file after it is created.
Are you interested in contributing this feature?
I am happy to try.
Anything else?
No response