Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seed blocks #2276

Closed
drewbanin opened this issue Mar 31, 2020 · 8 comments
Closed

Seed blocks #2276

drewbanin opened this issue Mar 31, 2020 · 8 comments
Labels
enhancement New feature or request

Comments

@drewbanin
Copy link
Contributor

Describe the feature

Let's wrap seeds (optionally) inside of blocks. Example implementation:

{% seed my_seed_name %}

{{ config(...) }}

col1,col2,col3
abc,def,ghi

{% endseed %}

Some rough implementation thoughts:

  • This tag should definitely be optional. Wrapping a csv file in a jinja tag is pretty weird to do!
  • Supporting seed configs inline would be really nice!
  • I could imagine supporting non-csv seeds in the future, eg:
{% seed my_json_seed type=json %}

{"col1": "value", "col2": "other value"},
{"col1": "value2", "col2": "other value2"},

{% endseed %}

That is certainly out of scope for this change, but a block here gives us a nucleation point to build new and compelling things on top of in the future.

@drewbanin drewbanin added enhancement New feature or request 1.0.0 Issues related to the 1.0.0 release of dbt labels Mar 31, 2020
@drewbanin drewbanin mentioned this issue Mar 31, 2020
11 tasks
@sumanau7
Copy link
Contributor

sumanau7 commented Apr 7, 2020

@drewbanin can you share the purpose for doing this ?

{% seed my_seed_name %}

col1,col2,col3
abc,def,ghi

{% endseed %}

Is the above required so that more than one seed data can be added in one file, for eg:

{% seed my_seed_name %}

col1,col2,col3
abc,def,ghi

{% endseed %}

{% seed my_seed_name_1 %}

col4,col5,col6
abc,def,ghi

{% endseed %}

And does the {{ config(...) }} denote the same configuration as explained in this doc https://docs.getdbt.com/reference/seed-configs/

@drewbanin
Copy link
Contributor Author

hey @sumanau7 - yep, that's exactly right! These seed tags would be optional, but if they're provided, they would support:

  • multiple seed blocks in a single file
  • configuration of seeds inline (as opposed to in the dbt_project.yml file)
  • seed blocks defined outside of .csv files (eg. next to a model defined with a block: model blocks #184 )

The configs we'd support are indeed the ones you've linked to!

@sumanau7
Copy link
Contributor

sumanau7 commented Apr 9, 2020

@drewbanin Adding jinja tags inside csv files might get weird, few issues that i can see of is:

  1. User can dump data in csv using their tools but those files after that have to be manually updated and jinja tags have to be added.
  2. Let's say there is some formatting issue in csv file now due to jinja tags inside it, user's won't be able to open these files in their tools for eg: Google sheets, Numbers, Excel etc to properly figure out the issue.
  3. Even Github won't be able to format these csv files properly for display purposes because of unknown tags.

There can be many other benefits also of this approach which of course you would know better.

As an alternative may i suggest the following:

  1. Let user define seed data in multiple csv files.
data/file_1.csv
data/file_2.csv
data/file_3.json
  1. In their data folder they can now define a jinja template seed_template.jinja2 which has content like
{% seed my_seed_name file=file1.csv %}
{{ config(...) }}
{% endseed %}

{% seed my_seed_name_1 file=file2.csv %}
{{ config(...) }}
{% endseed %}

{% seed my_seed_name_2 file=file3.json type=json %}
{{ config(...) }}
{% endseed %}

Of course you would be the best person to take this call.

@drewbanin
Copy link
Contributor Author

You make some really good points here @sumanau7! I buy it! Let's still move forwards with a seed block, but let's consider separating the CSV file contents from the block itself :)

@sumanau7
Copy link
Contributor

@drewbanin Interested to pick this up.
I did spend some time in understanding how seeds are getting processed. So if we have to do what we discussed above does the below approach sounds good:

  1. Define dataclass to process seed config, just like a Project class exists in core/dbt/config.project
  2. Make sure RuntimeConfig gets this class.
  3. While processing seed data check if SeedConfig is specified use that else fallback to seed config defined in Project.

Your thoughts on implementation would help here.

@jml
Copy link

jml commented Apr 29, 2020

Came here from #2365. Seems like seed blocks are orthogonal to supporting other seed types? For example, you could configure the seed source format to be either JSON or CSV in the dbm_project.yml without having to implement blocks as described here.

@jtcohen6
Copy link
Contributor

@drewbanin I think we should close this, or at least kick it out of v1.0.

Seed blocks get us:

  • The ability to define multiple seeds in the same file
  • The ability to config() seeds inline

The latter is the more compelling of the two by far, and indeed requiring that all seed configs be specified in dbt_project.yml is not sustainable for especially large projects. I think we can better resolve this by enabling the specification of configs in seeds/whatever.yml (#2401).

If we do take this off our roadmap, I think we should reopen #2365, which is compelling in its own right.

@drewbanin
Copy link
Contributor Author

@jtcohen6 i'm with you! Let's cose this issue, prioritize #2274 and re-open #2365 - sounds like that will get us to a good place :D

@jtcohen6 jtcohen6 removed the 1.0.0 Issues related to the 1.0.0 release of dbt label Sep 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants