Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contracts: Handle struct column specified both at root and nested levels + arrays of structs #806

Merged
merged 11 commits into from
Jul 11, 2023

Conversation

MichelleArk
Copy link
Contributor

@MichelleArk MichelleArk commented Jul 1, 2023

resolves #782
resolves #781

Description

Handles scenarios where a nested column type may be specified both at the struct-level (e.g. b.id: int, b.name: string) and parent level b: struct. Additionally handles when parent level data_type is used to specify an array of structs.

This PR resolves two issues -- initially I had aimed to just tackle 782 on its own by ignoring a top-level data_type if specified (preserving its constraints) but the top-level array specification was just a couple additional lines at that point, and figured reviewing / testing it all together would be more straightforward since the issues and affected codepaths were very closely related.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • scope of changes is entirely covered by unit tests 🎉
  • I have opened an issue to add/update docs, or docs changes are not required/relevant for this PR
  • I have run changie new to create a changelog entry

🎩 parent array

-- schema.yml
  - name: nested_array_field
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: my_array
        data_type: array
        constraints:
          - type: not_null
      - name: my_array.id
        data_type: integer
        description: Desc for field_1
      - name: my_array.name
        data_type: string
        description: Desc for field_2

nested_array_field.sql

select
  array( select struct(
    1 as id,
    'name' as name)
  ) as my_array

  UNION ALL 

  select null as my_array

🎩 parent specification with constraints

schema.yml

  - name: nested_fields
    config:
      materialized: table
      contract: {enforced: true}
    columns:
      - name: a
        data_type: string
        description: "field a"
      - name: b.id
        data_type: integer
        description: "nested field b.id"
        tests:
          - not_null
      - name: b.name
        data_type: string
      - name: b
        description: "top-level"
        data_type: struct
        constraints:
          - type: not_null

nested_fields.sql

select
  'string' as a,
  struct(
    1 as id,
    'name' as name
  ) as b

  UNION ALL 

select 'string2' as a, null as b

@cla-bot cla-bot bot added the cla:yes label Jul 1, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Jul 1, 2023

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-bigquery contributing guide.

@MichelleArk MichelleArk force-pushed the 782/parent-subfields-nested-fields branch from bbe8814 to bac87f5 Compare July 1, 2023 01:30
@MichelleArk MichelleArk marked this pull request as ready for review July 5, 2023 17:21
@MichelleArk MichelleArk requested a review from a team as a code owner July 5, 2023 17:21
@MichelleArk MichelleArk requested review from mikealfare and colin-rogers-dbt and removed request for mikealfare July 5, 2023 17:21
@MichelleArk
Copy link
Contributor Author

@colin-rogers-dbt - I've tagged you for review as this is follow-on work from #738.

dbt/adapters/bigquery/column.py Outdated Show resolved Hide resolved
dbt/adapters/bigquery/column.py Outdated Show resolved Hide resolved
dbt/adapters/bigquery/column.py Outdated Show resolved Hide resolved
@MichelleArk MichelleArk requested a review from mikealfare July 6, 2023 14:37
@@ -252,8 +273,22 @@ def _format_nested_data_type(unformatted_nested_data_type: Union[str, Dict[str,
if isinstance(unformatted_nested_data_type, str):
return unformatted_nested_data_type
else:
parent_data_type, *parent_constraints = unformatted_nested_data_type.pop(
_PARENT_DATA_TYPE_KEY, ""
).split() or [None]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need the or [None] here? I thought I tried this in a toy example where .pop() returned "" and it still worked as anticipated. But I might have missed some edge case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately not :( it broke a couple unit tests with the following error:

>>> foo, *bar = "".split()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not enough values to unpack (expected at least 1, got 0)

{%- endfor -%}
{%- if (col_err | length) > 0 -%}
{{ exceptions.column_type_missing(column_names=col_err) }}
{%- endif -%}
Copy link
Contributor Author

@MichelleArk MichelleArk Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces some unsatisfying duplication in the bigquery version of the get_empty_schema_sql macro.

However, it's necessary to do this before the columns get nested to provide granular error messages as opposed to deferring it to the handling in default__get_empty_schema_sql once the columns are already nested and it is not possible to determine which nested field had a missing data_type value.

Once the nested handling of model contracts moves into core, this duplication should be resolved and only present in the default__get_empty_schema_sql implementation.

@MichelleArk
Copy link
Contributor Author

🎩 - Missing data_type:

❯ dbt run --select nested_fields --project-dir ~/basic-dbt
17:16:08  Running with dbt=1.6.0-b8
17:16:08  Registered adapter: bigquery=1.6.0-b4
17:16:08  Found 5 models, 1 test, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
17:16:08  
17:16:10  Concurrency: 1 threads (target='dev')
17:16:10  
17:16:10  1 of 1 START sql table model dbt_marky.nested_fields ........................... [RUN]
17:16:10  1 of 1 ERROR creating sql table model dbt_marky.nested_fields .................. [ERROR in 0.83s]
17:16:10  
17:16:10  Finished running 1 table model in 0 hours 0 minutes and 2.01 seconds (2.01s).
17:16:11  
17:16:11  Completed with 1 error and 0 warnings:
17:16:11  
17:16:11  Compilation Error in model nested_fields (models/nested_fields.sql)
17:16:11    Contracted models require data_type to be defined for each column. Please ensure that the column name and data_type are defined within the YAML configuration for the ['a', 'b.name'] column(s).
17:16:11    
17:16:11    > in macro bigquery__get_empty_schema_sql (macros/utils/get_columns_spec_ddl.sql)
...

@MichelleArk MichelleArk merged commit 291713c into main Jul 11, 2023
@MichelleArk MichelleArk deleted the 782/parent-subfields-nested-fields branch July 11, 2023 16:36
github-actions bot pushed a commit that referenced this pull request Jul 11, 2023
…els + arrays of structs (#806)

(cherry picked from commit 291713c)
MichelleArk added a commit that referenced this pull request Jul 12, 2023
…els + arrays of structs (#806) (#817)

(cherry picked from commit 291713c)

Co-authored-by: Michelle Ark <MichelleArk@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants