Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-520] [Enhancement] Schema files with repeated top level keys (e.g. models) results in only the last key being parsed #5114

Open
1 task done
jeremyyeo opened this issue Apr 20, 2022 · 8 comments · Fixed by #5146
Labels
enhancement New feature or request

Comments

@jeremyyeo
Copy link
Contributor

jeremyyeo commented Apr 20, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

If you have a schema file with repeated top level keys:

# models/schema.yml
version: 2
models:
  - name: my_model_a
    description: "This is a"
    columns:
      - name: user_name
        tests:
          - not_null
models:
  - name: my_model_b
    description: "This is b"
    columns:
      - name: user_name
        tests:
          - not_null

Only the last models key will apply to the project when you execute dbt tasks that depend on them (like test, docs generate). We should raise an exception for such cases.

Expected Behavior

Perhaps we should raise an error in this case instead of silently just including the last models key of the schema file in our project's manifest.

Steps To Reproduce

  1. Add the above schema.yml file to your project along with 2 simple models:
-- models/my_model_a.sql
select 1 as user_name
-- models/my_model_b.sql
select 1 as user_name
  1. Do a dbt run to build our models.
  2. Do a dbt test and observe that we only tested my_model_b.
  3. Do a dbt docs generate && dbt docs serve and observe that only my_model_b has a description (my_model_a has no description).

Relevant log output

$ dbt test
04:02:08  Running with dbt=1.0.4
04:02:09  Found 2 models, 1 test, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
04:02:09  
04:02:09  Concurrency: 1 threads (target='dev')
04:02:09  
04:02:09  1 of 1 START test not_null_my_model_b_user_name................................. [RUN]
04:02:09  1 of 1 PASS not_null_my_model_b_user_name....................................... [PASS in 0.07s]
04:02:09  
04:02:09  Finished running 1 test in 0.24s.
04:02:09  
04:02:09  Completed successfully
04:02:09  
04:02:09  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Environment

- OS: macOS
- Python: 3.9.7
- dbt: 1.0.4

What database are you using dbt with?

postgres

Additional Context

image

@jeremyyeo jeremyyeo added bug Something isn't working triage labels Apr 20, 2022
@github-actions github-actions bot changed the title [Bug] Schema files with repeated top level keys (e.g. models) results in only the last key being parsed [CT-520] [Bug] Schema files with repeated top level keys (e.g. models) results in only the last key being parsed Apr 20, 2022
@jeremyyeo
Copy link
Contributor Author

Here's the relevant issue and a fix we could try which extends SafeLoader.

Add the extended class to our yaml_helper client then add in a ValueError in the following tuple:

except (yaml.scanner.ScannerError, yaml.YAMLError) as e:

Like so:

    except (yaml.scanner.ScannerError, yaml.YAMLError, ValueError) as e:

Then our dbt test / dbt docs generate) should result in:

$ dbt test
04:10:34  Running with dbt=1.0.4
04:10:35  Encountered an error:
Parsing Error
  Error reading postgres: schema.yml - Runtime Error
    Duplicate 'models' key found in YAML.

@jtcohen6
Copy link
Contributor

(Maybe a warning instead of an error, for starters? Then we could switch it to error-level, in a future minor version)

@emmyoop
Copy link
Member

emmyoop commented Apr 21, 2022

@jeremyyeo thanks for this amazing write up and suggested solution!

I agree with @jtcohen6 that a warning would be a good place to start here. It would need to be caught separate from theyaml.scanner.ScannerError and yaml.YAMLError since we want to handle it differently. warn_or_error could be leveraged on a custom exception we raise in the fix you mention above as opposed to just a ValueError.

class UniqueKeyLoader(yaml.SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = set()
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in mapping:
                raise DuplicateYamlKeyException(f"Duplicate {key!r} key found in YAML.")
            mapping.add(key)
        return super().construct_mapping(node, deep)

Then define that new exception in exceptions.py.

Additional it would be useful to add a test that checks we trigger a warning as expected. If/when we convert to an error, the test would need to be updated. All of our duplication tests live in 025_duplicate_model_tests but we are currently in the process of converting all of our tests to a more intuitive and consistent framework. The new test would live in /dbt/tests/functional in a new folder, possibly named duplication. It would need to be written in our new testing framework.

If you or another community member want to work on this I would be happy to help give some direction on writing the new test and any other questions there may be.

@emmyoop emmyoop added good_first_issue Straightforward + self-contained changes, good for new contributors! and removed triage labels Apr 21, 2022
@jeremyyeo
Copy link
Contributor Author

Thanks for the guidance @emmyoop + @jtcohen6. Will take a crack at the raising a new warning instead. Possibly will need help adding the test later but will circle back on that.

@jeremyyeo jeremyyeo self-assigned this Apr 21, 2022
@jeremyyeo jeremyyeo reopened this Jun 23, 2022
@gshank gshank removed the good_first_issue Straightforward + self-contained changes, good for new contributors! label Jun 24, 2022
@gshank
Copy link
Contributor

gshank commented Jun 24, 2022

At this point we've reverted the original pull request because we didn't have time to figure out a solution for anchor overrides , which were causing error messages for duplicate keys, even though anchor overrides are legal syntax and should not be considered duplicate keys.

It's still possible that we could come up with a solution, but it requires more research. Some possibilities are 1) fixing in pyyaml and doing a pull request. 2) supporting a different yaml library that does correctly handle duplicate keys (ruamel has the same issue with spurious errors for anchor overrides). 3) expanding the code that looked for duplicate keys to handle anchor overrides correctly.

One possibility is yaml.parse, which does provide information on anchors and such. A fix using this would quite possibly make more sense as a pyyaml pr though. https://stackoverflow.com/questions/64460249/how-to-remove-duplicate-keys-in-yaml-file-automatically. yaml.parse returns information on anchors, so the linked code could possibly be modified to skip duplicates for anchor overrides.

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Dec 22, 2022
@github-actions
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 29, 2022
@Tonayya Tonayya added enhancement New feature or request and removed bug Something isn't working stale Issues that have gone stale labels Nov 27, 2024
@Tonayya Tonayya changed the title [CT-520] [Bug] Schema files with repeated top level keys (e.g. models) results in only the last key being parsed [CT-520] [Enhancement] Schema files with repeated top level keys (e.g. models) results in only the last key being parsed Nov 27, 2024
@Tonayya Tonayya reopened this Nov 27, 2024
@Tonayya
Copy link
Contributor

Tonayya commented Nov 27, 2024

Hi @gshank just re-opened this issue as it's still unresolved and would be really helpful if we could potentially implement your suggestion maybe. Just sharing an additional example of how this could impact users:

If you have a sources.yml file and within it you define a source multiple times (e.g. developer doesn't realize source is defined already so adds another one of the same), the most latest defined source overrides the first ones. So if you have like:

sources:
  - name: some_source
    database: some_database
    schema: some_schema
    tables:
      - name: ac
      - name: ab
      - name: abc
  - name: some_source
    database: some_database
    schema: some_schema
    tables:
      - name: ac
      - name: ab

With the above, you basically lose table abc.

Expected Behavior:
We expect to either:
Flag Duplicate Sources: Clearly indicate duplicate source definitions and prevent overwriting.
Merge Duplicate Sources: Combine table definitions from multiple sources with the same name, database, and schema.
Impact:
This issue can lead to incorrect model definitions, unexpected errors, and wasted development time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants