Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in incremental models when using struct columns #202

Closed
SCouto opened this issue Aug 13, 2021 · 1 comment · Fixed by #204
Closed

Error in incremental models when using struct columns #202

SCouto opened this issue Aug 13, 2021 · 1 comment · Fixed by #204
Labels
bug Something isn't working

Comments

@SCouto
Copy link
Contributor

SCouto commented Aug 13, 2021

Describe the bug

Hi everyone

I think I've found an issue when using structs and incremental model

Find below a create table statement from the source table

CREATE TABLE `<someSchema>`.`<sourceTable>` (
  `properties` STRUCT<`site`: STRING>,
  `channel` STRING,
  `timestamp` STRING,
  `anotherDate` STRING,
  `aDate ` STRING)
  USING parquet
  PARTITIONED BY (aDate)
  LOCATION 's3a://<someBucket>'

If the model moves the struct field as is to the sink table (with a simple select). It works the first time, but fails the second one

Here is the create table of the example sink table

CREATE TABLE `<someSchema>`.`dbtsink` (
      `properties` STRUCT<`site`: STRING>,
      `channel` STRING,
      `timestamp` STRING,
      `anotherDate ` STRING,
      `aDate` STRING)
USING parquet
PARTITIONED BY (anotherDate)

As I said, second execution raise this error:

Runtime Error in model dbtsink (models/anotherDate/dbtsink.sql)
  Database Error
    Error running query: org.apache.spark.sql.AnalysisException: cannot resolve '`site`' given input columns: [dbtsink__dbt_tmp.channel, dbtsink__dbt_tmp.anotherDate, dbtsink__dbt_tmp.aDate, dbtsink__dbt_tmp.properties, dbtsink__dbt_tmp.timestamp]; line 4 pos 25;
    'InsertIntoStatement 'UnresolvedRelation [someSchema, dbtsink], false, false
    +- 'Project [properties#6526, 'site, channel#6527, timestamp#6528, aDate#6541, anotherDate#6540]
       +- SubqueryAlias dbtsink__dbt_tmp
          +- Project [properties#6526, channel#6527, timestamp#6528, anotherDate#6540, aDate#6541]
             +- Filter (((aDate#6541 > 2021060100) AND (aDate#6541 <= 2021070609)) AND (anotherDate#6540 = 2021070609))
                +- SubqueryAlias spark_catalog.someSchema.sourceTable
                   +- Relation[context#6524,traits#6525,properties#6526,channel#6527,timestamp#6528,projectId#6529,integrations#6530,messageId#6531,originalTimestamp#6532,receivedAt#6533,sentAt#6534,userId#6535,anonymousId#6536,type#6537,providerId#6538,version#6539,anotherDate#6540,aDate#6541] parquet

Steps To Reproduce

  • Create table with struct is shown in the section above
  • Populate that table and run a model as follows twice
  • First run will work just fine, second one it's where the error is found
select 
properties,
channel,
timestamp,
anotherDate,
aDate
from {{ source('someSchema', 'sourceTable') }}
where aDate > '{{ var("aDateLowerLimit") }}' and aDate <= '{{ var("aDateUpperLimit") }}'
and anotherDate = '{{ var("anotherDate") }}'

Expected behavior

It should move data to the corresponding partition without raising an error

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

The output of dbt --version:

installed version: 0.20.1
   latest version: 0.20.1

Up to date!

Plugins:
  - bigquery: 0.20.1
  - snowflake: 0.20.1
  - redshift: 0.20.1
  - postgres: 0.20.1
  - spark: 0.20.1

The operating system you're using:
Mac Os

The output of python --version:

python --version
Python 3.9.6

Additional context

After checking the issue I think problem may be here which is using a regex to parse column names
https://github.com/dbt-labs/dbt-spark/blob/master/dbt/adapters/spark/impl.py#L222-L249

the regex

    INFORMATION_COLUMNS_REGEX = re.compile(
        r"\|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE)

It's retrieving all the columns, even the inner ones from the struct

image

I'm working on a fix already.

@SCouto SCouto added bug Something isn't working triage labels Aug 13, 2021
@jtcohen6
Copy link
Contributor

Nice digging @SCouto! And thanks for working on the fix :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants