Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent Agate from coercing values in query result sets #3499

Merged
merged 3 commits into from
Jul 1, 2021

Conversation

drewbanin
Copy link
Contributor

@drewbanin drewbanin commented Jun 26, 2021

resolves #2984, #2666
related to #3413

Description

dbt uses Agate to represent internal dataframes (eg. seed file tabular representations, or in-memory query result sets). Agate's type coercions are interesting and helpful for identifying column types in csv files, but they can lead to incorrect/confusing results when applied to query result sets. This PR prevents type coercion from happening when marshaling a query resultset into an Agate table.

Example

I used the following macro to debug / reproduce some type coercion issues locally (using Snowflake):

{% macro debug() %}

    {% set query %}

        select '0005' as stringnum, 'false' as stringbool, 10 as intfield, false as boolfield, '01T00000aabbccdd' as dtstring
        union all
        select '0006' as stringnum, 'true' as stringbool, 20 as intfield, true as boolfield, '01T00000aabbccde' as dtstring

    {% endset %}

    {% set res = run_query(query) %}
    {% do res.print_table() %}

{% endmacro %}

Before this change:

STRINGNUM (bad) STRINGBOOL (bad) INTFIELD BOOLFIELD DTSTRING (would be funny if it wasn't so bad)
5 False 10 False 0101-01-01 00:00:00
6 True 20 True 0101-01-01 00:00:00

After this change:

STRINGNUM STRINGBOOL INTFIELD BOOLFIELD DTSTRING
0005 false 10 False 01T00000aabbccdd
0006 true 20 True 01T00000aabbccde

About the implementation

This PR assumes that we never want to coerce types in tables returned by the database, which I think is correct, but would welcome any holes that can be poked in that assumption. I was originally going to try to use the cursor.description provided by dbapi implementation to map columns in a query resultset to Python types and then use those Python types to "force" types for all of the columns in the Agate table. In practice, this would require us to add some logic to every adapter plugin which felt both onerous and unnecessary. Instead, we can just peek at the actual query results and "force" string columns in the resultset to be retained as strings in the Agate table that dbt produces. This works because:

  • If the database is returning a string value in a field, we can feel 100% confident that the right type for the column in Agate is a string
  • Columns in a database result cannot be of mixed types, so if we find one string value in a column, we know the whole column should be treated as a string
  • Agate won't coerce a not-string into an unexpected type (eg. an int returned by the database can't become a bool in Agate)

What else

I tried to make a surgical change here. By putting this logic in the agate_helper client, all adapter plugins should be able to take advantage of the change. Longer term, I think we should consider making a less-surgical change that removes Agate entirely. Let's pick up that discussion in #3413 though :)

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Jun 26, 2021
@drewbanin drewbanin temporarily deployed to Redshift June 26, 2021 17:13 Inactive
@drewbanin drewbanin temporarily deployed to Redshift June 26, 2021 17:13 Inactive
@drewbanin drewbanin temporarily deployed to Postgres June 26, 2021 17:13 Inactive
@drewbanin drewbanin temporarily deployed to Snowflake June 26, 2021 17:13 Inactive
@drewbanin drewbanin temporarily deployed to Snowflake June 26, 2021 17:13 Inactive
@drewbanin drewbanin force-pushed the fix/agate-undesirable-casting branch 2 times, most recently from 621647d to a5a2dc7 Compare June 26, 2021 17:18
for _row in data:
row = []
for value in list(_row.values()):
for col_name in column_names:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our existing implementation assumed that the order of keys in a dictionary was guaranteed... but IIRC that's not true for all versions of Python. My Python is rusty, so maybe this change is not necessary. I can revert if need be :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not my area, but I do remember hearing that this changed in recent versions of python. Indeed this blog post suggests that dictionaries are ordered by insertion as of python 3.6, and that this is officially guaranteed as of python 3.7. dbt currently requires py36 or higher.

row.append(value)
# Represent container types as json strings
value = json.dumps(value, cls=dbt.utils.JSONEncoder)
text_only_columns.add(col_name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was originally just going to peek at the first row in the result set, but it's possible that the value in a string column returned by the database could be NULL. This implementation looks at all of the values in the result set, which we're actually already doing anyway to json-encode dict/list/tuple types.

CHANGELOG.md Outdated
@@ -21,6 +21,7 @@ Contributors:
### Fixes

- Handle quoted values within test configs, such as `where` ([#3458](https://github.com/fishtown-analytics/dbt/issues/3458), [#3459](https://github.com/fishtown-analytics/dbt/pull/3459))
- Fix type coercion issues when fetching query result sets ([#2984](https://github.com/fishtown-analytics/dbt/issues/2984), [#3499](https://github.com/fishtown-analytics/dbt/pull/3499))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 0.20.0 the right home for this? Or should I slot it into 0.21.0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer 0.21, rather than sneaking it in during RC time. Even though it is technically a bug fix, I imagine it might be breaking behavior for folks who do a lot of custom query-results munging with Jinja + agate.

@drewbanin drewbanin temporarily deployed to Redshift June 26, 2021 17:23 Inactive
@drewbanin drewbanin temporarily deployed to Redshift June 26, 2021 17:23 Inactive
@drewbanin drewbanin temporarily deployed to Bigquery June 26, 2021 17:23 Inactive
@drewbanin drewbanin temporarily deployed to Bigquery June 26, 2021 17:23 Inactive
@drewbanin drewbanin temporarily deployed to Postgres June 26, 2021 17:23 Inactive
@drewbanin drewbanin temporarily deployed to Snowflake June 26, 2021 17:23 Inactive
@drewbanin drewbanin temporarily deployed to Snowflake June 26, 2021 17:23 Inactive
@drewbanin drewbanin force-pushed the fix/agate-undesirable-casting branch from a5a2dc7 to 1ad1c83 Compare June 26, 2021 17:24
@drewbanin drewbanin marked this pull request as ready for review June 26, 2021 17:27
@drewbanin drewbanin temporarily deployed to Postgres June 26, 2021 17:29 Inactive
@drewbanin drewbanin temporarily deployed to Redshift June 26, 2021 17:29 Inactive
@drewbanin drewbanin temporarily deployed to Redshift June 26, 2021 17:29 Inactive
@drewbanin drewbanin temporarily deployed to Snowflake June 26, 2021 17:29 Inactive
@drewbanin drewbanin temporarily deployed to Snowflake June 26, 2021 17:29 Inactive
@drewbanin drewbanin temporarily deployed to Postgres June 26, 2021 19:42 Inactive
@drewbanin drewbanin temporarily deployed to Redshift June 26, 2021 19:42 Inactive
@jtcohen6 jtcohen6 requested a review from leahwicz June 28, 2021 22:59
@leahwicz leahwicz requested a review from gshank June 29, 2021 12:20
Copy link
Contributor

@gshank gshank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@drewbanin
Copy link
Contributor Author

Right on, thanks. I updated the changelog to include this fix in the 0.21 section. What's the right base branch for this PR? Is it develop or something else?

@drewbanin drewbanin temporarily deployed to Bigquery June 30, 2021 00:28 Inactive
@drewbanin drewbanin temporarily deployed to Bigquery June 30, 2021 00:28 Inactive
@drewbanin drewbanin temporarily deployed to Snowflake June 30, 2021 00:28 Inactive
@drewbanin drewbanin temporarily deployed to Snowflake June 30, 2021 00:28 Inactive
@drewbanin drewbanin temporarily deployed to Postgres June 30, 2021 00:28 Inactive
@drewbanin drewbanin temporarily deployed to Redshift June 30, 2021 00:28 Inactive
@drewbanin drewbanin temporarily deployed to Redshift June 30, 2021 00:28 Inactive
@jtcohen6
Copy link
Contributor

develop is right!

@drewbanin drewbanin merged commit 83b98c8 into develop Jul 1, 2021
@drewbanin drewbanin deleted the fix/agate-undesirable-casting branch July 1, 2021 14:53
@2rara 2rara mentioned this pull request Oct 11, 2021
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RPC server sometimes returns the wrong type
3 participants