Alias database to allow matching between rows and Manifest #85

Closed · wants to merge 3 commits

Conversation

@Fokko (Contributor) commented May 22, 2020

Currently the docs generation is broken because we need to supply
the database name when fetching the relations:

if dct['table_database'] is None:
    dct['table_database'] = dct['table_schema']

However, when we get the manifest we don't get the database:

{CatalogKey(database='', schema='fokko', name='logistical_configuration_data'):
	['model.dbtlake.logistical_configuration_data']}

Therefore the keys never line up, and we can't match the Catalogs:

https://github.com/fishtown-analytics/dbt/blob/9d0eab630511723cd0bc328f6f11d3ffe6c8f879/core/dbt/task/generate.py#L108

From describing the relations, we instead get:

CatalogKey(database='fokko', schema='fokko', name='logistical_configuration_data')

This mismatch follows from the logic above, so the lookup never succeeds. I think ALIASing the database to the schema is the easiest way out. Making the database non-optional in core would be another option, and cleaner in the long run. Please advise.
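The mismatch can be reduced to a minimal Python sketch (CatalogKey is shown here as a plain NamedTuple for illustration; the real type lives in dbt core):

```python
# Minimal illustration (not dbt code) of why the catalog lookup misses:
# the manifest key has database='' while the key built from the describe
# results has database='fokko', so the tuples never compare equal.
from typing import NamedTuple

class CatalogKey(NamedTuple):
    database: str
    schema: str
    name: str

# key as it appears in the manifest (empty database)
manifest = {
    CatalogKey('', 'fokko', 'logistical_configuration_data'):
        ['model.dbtlake.logistical_configuration_data'],
}

# key as built from the describe results (database filled from schema)
catalog_key = CatalogKey('fokko', 'fokko', 'logistical_configuration_data')
print(catalog_key in manifest)  # False: the keys never line up
```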

@Fokko (Contributor, Author) commented May 22, 2020

cc @jtcohen6

@jtcohen6 (Contributor)

Good find @Fokko. I confirmed that while the docs generation commands work, the resulting docs site is missing information from the catalog.

@beckjake Could you take a look at the ALIAS approach here? It feels related to the changes in #83 around schema/database.

@beckjake (Contributor)

This is a good find, and the approach looks valid.

That said, would it perhaps make sense to change the list_relations_without_caching method's self.Relation.create? I haven't tested that, but it seems like it'd solve the problem effectively the same way. Or perhaps the SparkRelation.__post_init__ I added in #83 should set self.database = self.schema or self.schema = self.database depending upon None-ness. I think that would be reasonable as well.
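The `__post_init__` idea above could look roughly like this (a hypothetical sketch on a plain dataclass; the real `SparkRelation` in #83 has more fields and different machinery):

```python
# Hedged sketch of the suggestion: when only one of database/schema is
# set, mirror it onto the other so catalog keys line up on both sides.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SparkRelationSketch:
    database: Optional[str] = None
    schema: Optional[str] = None

    def __post_init__(self):
        # Fill whichever side is missing from the one that is present.
        if self.database is None and self.schema is not None:
            self.database = self.schema
        elif self.schema is None and self.database is not None:
            self.schema = self.database

rel = SparkRelationSketch(schema='fokko')
print(rel.database)  # → fokko
```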

I would feel very comfortable with that fix, whereas I feel a bit concerned about the knock-on effects downstream of setting an actual value as an alias.

I don't think either this PR or my suggestion will actually conflict with #83 (though I haven't tested). I'm pretty confident #83 totally misses this issue.

As an aside: I feel like a broken record here, but we really need a better test story for plugins. This kind of issue just shouldn't happen, and our test suite isn't even at a point where we can reasonably try to add a test for this. I guess we could modify the db-integration-tests branch we use for spark to support reading from a json file and validating some structural things, but that's a lot to ask on a PR.

@Fokko (Contributor, Author) commented May 22, 2020

Thanks for the insights. I'll try setting it in the __post_init__.

I don't think the database can ever be set, since it is excluded from the accepted connection keys: https://github.com/fishtown-analytics/dbt-spark/blob/master/dbt/adapters/spark/connections.py#L53

@Fokko (Contributor, Author) commented May 22, 2020

The only downstream issue that I've seen so far is that both the schema and the database are shown in the docs:

[screenshot of the docs site omitted]

I don't care so much about that. However, the statistics also seem to be broken again. I might dive into this next week; I'm kinda busy at the moment.

@beckjake (Contributor)

That _connection_keys method actually lists the keys in the credentials that will be logged in dbt debug output. It exists to avoid logging passwords/private keys/etc.
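The pattern being described can be sketched like this (field names and the `debug_info` helper are illustrative, not the real dbt-spark code):

```python
# Hedged sketch: _connection_keys names only the credential fields that
# are safe to echo in `dbt debug` output, so secrets stay out of logs.
class SparkCredentialsSketch:
    def __init__(self, host: str, schema: str, password: str):
        self.host = host
        self.schema = schema
        self.password = password  # deliberately never listed below

    def _connection_keys(self):
        # Only these keys get printed by debug-style output.
        return ('host', 'schema')

    def debug_info(self):
        return {k: getattr(self, k) for k in self._connection_keys()}

creds = SparkCredentialsSketch('localhost', 'fokko', 's3cret')
print(creds.debug_info())  # password omitted
```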

I think in 0.17.0 there will be more things that could have problems with it, because we use translate_aliases in more places. Even if it's fine there, in the long run we'd like to expand the use of aliases quite a lot to exist just about everywhere, and that's a lot harder if they can step on each other.

I'd prefer to add a special flag to core for disabling fields in adapters or even support this specific adapter behavior where database=schema as an option in Relations in core, if it comes down to it. That would be a lot of work, but at least it wouldn't constrain the design space so much.

@jtcohen6 (Contributor)

@beckjake I have a draft PR open (#91) that attempts to follow your recommendations above. I'm still running into issues with catalog generation.

@Fokko I opened a separate issue (#90) re: owner / table stats not showing up. I think this has been broken for a while, and we should absolutely fix it.

I also opened an issue re: the less-than-ideal Relation display in the docs site: dbt-labs/dbt-docs#94
