Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix YAML parsing with anchors and duplicate keys in dbt_project.yml file #5347

Closed
wants to merge 6 commits into from

Conversation

jeremyyeo
Copy link
Contributor

@jeremyyeo jeremyyeo commented Jun 8, 2022

resolves #5268, #5331

Description

Just pondering on a fix for the linked issues.

  1. Seems like safe_load is used in multiple places - so just revert to previous behaviour by default + option to use the newly introduced UniqueKeyLoader.
  2. Looks like a call to flatten_mapping sorts out the YAML anchors.

This seems to work for:

# ~/.dbt/profiles.yml

postgres:
  outputs:
    dev: &profile
      type: postgres
      host: localhost
      user: root
      password: password
      schema: public
      database: postgres
      port: 5432
      threads: 8
    prod:
      <<: *profile
    uat: *profile
  target: dev

and

# dbt_project.yml

...
vars:
  foo: bar
  foo: bar
...

Checklist

@cla-bot cla-bot bot added the cla:yes label Jun 8, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Jun 8, 2022

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching the bugs and taking a swing at this @jeremyyeo! The approach you're taking here looks directionally correct. We want to try validating yaml with UniqueKeyLoader, raise a warning ONLY (which becomes an error if --warn-error), and then proceed to actually use the SafeLoader regardless.

It should be really simple to add unit test cases for the two cases we identified. Here's my go at doing that: f7705a3

@dbt-labs/core-language Let's prioritize review of this PR, since it resolves some known net-new regressions in v1.2, and blocks us from being able to put out a beta prerelease.

@@ -31,6 +31,7 @@ class UniqueKeyLoader(SafeLoader):

def construct_mapping(self, node, deep=False):
mapping = set()
self.flatten_mapping(node) # This processes yaml anchors / merge keys (<<).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 72 to 76
def safe_load(contents, unique=False) -> Optional[Dict[str, Any]]:
if unique:
return yaml.load(contents, Loader=UniqueKeyLoader)
else:
return yaml.load(contents, Loader=SafeLoader)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for @dbt-labs/core-language on how to factor this code: Should we opt for one method with a boolean argument like this, or for two different methods: load_and_validate_unique_keys + safe_load?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether I prefer a 'unique' argument or a separate method name depends on why we're only actually using the 'unique' arg in 'load_yaml_text', but we call 'safe_load' in multiple other places. Do those other places not need the duplicate key validation? Why is it only called there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point - I think we should be validating duplicate key everywhere.. so I reverted this to what it was before.

@jtcohen6 jtcohen6 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Jun 14, 2022
@jeremyyeo
Copy link
Contributor Author

jeremyyeo commented Jun 16, 2022

Not sure what's up with the unit test but standard run with dupe vars keys worked okay

image

Ah okay... because we raise a warning instead of cleanly parsing the dupe:

from core.dbt.clients.yaml_helper import load_yaml_text

text = """
vars:
  foo: bar
  foo: bar
"""

load_yaml_text(text)
$ python check.py  
06:12:47  [WARNING]: Compilation Error
  Duplicate 'foo' key found in yaml file

Change the assertion slightly.

Another thing to consider - a dupe with a different value:

vars:
  foo: bar
  foo: baz

What should that really resolve to (in this scenario it resolve to {'vars': {'foo': 'baz'}})? Perhaps it's not such a bad idea to error as it was doing in the first place?

@jtcohen6
Copy link
Contributor

Perhaps it's not such a bad idea to error as it was doing in the first place?

This wasn't an error in previous versions, so we can't raise an error now. The implicit behavior is to "take the last"; that's what it used to do, and what it will keep doing. It's a huge net improvement that we're raising a warning, so that users can become aware of the duplication/discrepancy and fix it themselves!

One other small aesthetic thing. It's a bit confusing to see:

06:12:47  [WARNING]: Compilation Error

Rather than raising an exception, which we then catch and raise as a warning within load_yaml_text, could construct_mapping just use dbt.exceptions.warn_or_raise to raise the warning message directly? I'm not sure exactly how to make that interact with the try/except, but we should really be falling back to real SafeLoader in all cases, and just using the UniqueKeyLoader for validation/warning.

@jeremyyeo
Copy link
Contributor Author

  1. Use warn_or_raise in construct_mapping instead. warn_or_raise calls the exception - not sure if this is the right way to do this since I couldn't find any other use of warn_or_raise.
  2. Tidied up confusing [WARNING]: Compilation Error print outs.
  3. Not too sure how to get the path of the file to be included in the warning message as before (when we had the path in the load_yaml_text function).

@gshank
Copy link
Contributor

gshank commented Jun 22, 2022

We could add a path arg to 'safe_load', throw an error in 'construct_mapping', catch it and issue a warning in safe_load with the path. I did a quick scan to see how hard it would be to do that, and there's a number of places where it wouldn't really matter (loading samples, etc), so it might not be too hard to do that. Up to you whether you want to go to the trouble, though a warning without the file name would be kind of frustrating as a user.

@jtcohen6
Copy link
Contributor

Is it fair to close this PR for now, since we ended up reverting the change in #5146?

@jeremyyeo jeremyyeo closed this Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-663] YAML anchor expansion failure in 1.2.0a1 / main
3 participants