Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snowflake catalog/manifest support (#832) #849

Merged
merged 14 commits into from
Jul 17, 2018

Conversation

beckjake
Copy link
Contributor

Implement catalog and manifest creation for snowflake, and created/improved some tests. Implements #832.

@beckjake beckjake requested a review from drewbanin July 13, 2018 15:28
@beckjake beckjake changed the base branch from development to dev/isaac-asimov July 13, 2018 17:34
Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plugged this into the (presently unreleased) docs site, and the catalog worked really well :)

I think we'll probably want to be more aggressive about filtering the catalog, but keen to discuss in the comments.


from information_schema.tables

union
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're in the habit of using union all instead of union. The only difference is that union all preserves duplicates. I don't think that's an issue here, but when I see union, I think "dupes are specifically not wanted here".

Via: https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md#style-guide

results = cls.run_operation(profile, project_cfg, manifest,
GET_CATALOG_OPERATION_NAME,
GET_CATALOG_RESULT_KEY)
schemas = cls.get_existing_schemas(profile, project_cfg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_existing_schemas function just returns all of the schemas in the database. I think that means that the filter on the line below doesn't actually do anything! I think we probably don't want to index every single schema in the db, just insofar as it could be a lot of data.

I imagined the catalog only being generated for the dbt-managed schemas in the database. There's an example of logic like that in the snippet over here: #830

schemas = list({node.to_dict()['schema'] for node in manifest.nodes.values()})

I think eventually, we'll want to allow users to turn on catalog generation for non-db schemas, but we're not quite there yet. Do you agree @cmcarthur?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I just migrated it into the default adapter since it isn't really intended to be postgres-specific and didn't want to make any operational changes.

Does that mean the catalog shouldn't include stuff loaded by dbt seed, or does that count as managed by dbt?

Either way, we will want to make bigquery consistent so I'll have to change one of these!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, totally, I think this is something I missed in the PR that originally added the catalog code for Postgres.

I think it makes sense to include seed schemas in the catalog too. Seeds, like models, can be totally rebuilt from scratch, so I consider them "dbt managed" also. Further, seeds are nodes, so their schemas will be plucked by the code sample above too.

I think that just leaves out archives. Archives are stateful, and definitely cannot be rebuilt from scratch if they are dropped, so we recommend that folks put them in totally different schemas than their models & seeds. Even further: we consider archived tables to be "source data" in that archives don't participate in the graph and cannot be refd. We should someday normalize their behavior, but I think it's ok to leave them out of the catalog for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that really makes things a bit nicer anyway in my opinion, we don't have to re-generate the same data already in the manifest. 👍

@beckjake beckjake merged commit b5a5003 into dev/isaac-asimov Jul 17, 2018
@beckjake beckjake deleted the snowflake-get-catalog branch July 17, 2018 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants