Load CSV files fails on date and numeric types #139
Hey @chinwobble, thanks for the detailed writeup and reproduction case. I assume you're running with Spark 3.x, since that's where we saw this issue first crop up; could you confirm? I reproduced this with the statements dbt generated for the seed load.

This works in our local dockerized version, which uses Apache Spark 2.4.5, but it does not work when I run it against a Databricks cluster running Spark 3.0.1. It does work, however, if I add a `column_types` override for the seed. Then, in a staging model, I can cast these columns to the types I want.

I agree, however, that the best solution here is to add those castings directly within the seed's row-loading macro. Crucially, we need to prefer the user-configured column type when one is supplied, and only fall back to the inferred type:

```
{%- set column_override = model['config'].get('column_types', {}) -%}
...
{% set sql %}
    insert into {{ this.render() }} values
    {% for row in chunk -%}
        ({%- for col_name in agate_table.column_names -%}
            {%- set inferred_type = adapter.convert_type(agate_table, loop.index0) -%}
            {%- set type = column_override.get(col_name, inferred_type) -%}
            cast(%s as {{type}})
            {%- if not loop.last%},{%- endif %}
        {%- endfor -%})
        {%- if not loop.last%},{%- endif %}
    {%- endfor %}
{% endset %}
```

Is that a fix you'd be interested in contributing?
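For illustration, a minimal sketch of what this macro would render for a hypothetical two-column, two-row seed chunk, where the user overrides the second column to `date` (the table and column names here are made up; the `%s` placeholders are bound to row values at execute time):

```sql
-- hypothetical rendered output for a seed with columns (id, issue_date),
-- where issue_date is overridden to `date` via column_types
insert into my_schema.my_seed values
(cast(%s as bigint), cast(%s as date)),
(cast(%s as bigint), cast(%s as date))
```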
@jtcohen6 Yes, I was testing on vanilla Spark 3.0.0.

Also, unrelated: how were you able to get all the SQL that was generated? Is there an easier way that I'm not aware of?
I tried adding those two lines of code in the macro and noticed some issues:

```
{%- set inferred_type = adapter.convert_type(agate_table, loop.index0) -%}
{%- set type = column_override.get(col_name, inferred_type) -%}
```

For numeric data types we probably want to use `decimal` rather than the inferred `double`.
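To make the scoping concrete, here is a sketch of those two lines in their surrounding loop, with comments on how `loop.index0` resolves (a reconstruction for illustration, not the exact macro source):

```
{%- for col_name in agate_table.column_names -%}
    {#- inside this inner loop, `loop` refers to the column loop,
        so loop.index0 is the current column's position -#}
    {%- set inferred_type = adapter.convert_type(agate_table, loop.index0) -%}
    {#- a user-supplied entry in column_types wins over the inferred type -#}
    {%- set type = column_override.get(col_name, inferred_type) -%}
    cast(%s as {{ type }}){%- if not loop.last %},{% endif %}
{%- endfor -%}
```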
@chinwobble I grabbed that SQL from `logs/dbt.log`.

This is why we prefer the user-configured column type (if available), and only use the inferred type if there is no type supplied by the user.

You're right, sorry about that! There was a line missing in the code snippet pasted above. I just updated it. Here's what I get back in the logs, which yields a successful load:

```sql
insert into dbt_jcohen.loan_account values
(cast(%s as bigint),cast(%s as string),cast(%s as date),cast(%s as date),cast(%s as bigint),cast(%s as bigint))
```

An alternative approach here might also be worth exploring.
@jtcohen6 I'm trying to contribute this fix back to the repository. I'm starting by creating unit tests for the seed macros. I've created this test class:

```python
# test_seed_macros.py
import os
import re
import unittest
from unittest import mock
from tempfile import mkdtemp

from jinja2 import Environment, FileSystemLoader

from dbt.clients import agate_helper
from dbt.clients.jinja import MaterializationExtension

SAMPLE_CSV_DATA = """a,b,c,d,e,f,g
1,n,test,3.2,20180806T11:33:29.320Z,True,NULL
2,y,asdf,900,20180806T11:35:29.320Z,False,a string"""


class TestSparkSeedMacros(unittest.TestCase):

    def setUp(self):
        # load the seed macros straight from the adapter's source tree
        self.jinja_env = Environment(
            loader=FileSystemLoader([
                'dbt/include/spark/macros',
                'dbt/include/spark/macros/materializations',
            ]),
            extensions=['jinja2.ext.do', MaterializationExtension],
        )
        self.config = {}
        self.default_context = {
            'validation': mock.Mock(),
            'model': mock.Mock(),
            'exceptions': mock.Mock(),
            'config': mock.Mock(),
        }
        # route config.get(...) lookups to the plain dict above
        self.default_context['config'].get = \
            lambda key, default=None, **kwargs: self.config.get(key, default)
        self.tempdir = mkdtemp()

    def __get_template(self, template_filename):
        return self.jinja_env.get_template(
            template_filename, globals=self.default_context)

    def __run_macro(self, template, name, agate_table):
        self.default_context['model'].alias = 'test'
        value = getattr(template.module, name)(
            self.default_context['model'], agate_table)
        # collapse runs of whitespace so assertions are stable
        return re.sub(r'\s\s+', ' ', value)

    def test_macros_load(self):
        self.jinja_env.get_template('seed.sql')

    def test_macros_create_csv_table(self):
        template = self.__get_template('seed.sql')
        path = os.path.join(self.tempdir, 'input.csv')
        with open(path, 'wb') as fp:
            fp.write(SAMPLE_CSV_DATA.encode('utf-8'))
        agate_table = agate_helper.from_csv(path, [])
        sql = self.__run_macro(
            template, 'spark__create_csv_table', agate_table).strip()
        # placeholder assertion while I get the template to render at all
        self.assertEqual(sql, "create table my_table as select 1")
```

When I run this code I get an error; I assume it's not working because of the `{% materialization %}` block in seed.sql.

Ideally I would just want integration tests that exercise the seed on Spark 2.4.x and Spark 3.0.0, but I'm not familiar with CircleCI and it seems tricky to set up locally.
@chinwobble I really appreciate your instinct to start by writing unit tests for these macros. Unfortunately, I think materialization testing requires a few too many dbt-specific syntactical pieces, like the `{% materialization %}` block and `statement` calls, to work in a plain Jinja environment.

Instead, we've historically had success in dbt-core (and the "core four" adapters) writing targeted integration tests, with projects reflecting the use cases and edge cases we'd expect users to encounter. As discussed over in #141, we should include these cases there as well.
@jtcohen6 As you suggested, I've tried to create some integration tests for the materialization. First I need to make sure that dbt-spark is tested on both Spark 3 and Spark 2.4, so I've created this PR to try to achieve this. However, the Hive metastore in Spark 3 isn't quite working. Could you have a look at it?
Describe the bug
When you try to seed a CSV into Spark (Hive), the load fails on date and numeric columns because every value is inserted as a string.
Steps To Reproduce
Expected behavior
Should be able to insert CSV data easily.

If you specify a column type as `double` or `decimal` in `column_types`, the CSV load still treats the values as strings.
https://github.com/fishtown-analytics/dbt/blob/v0.18.1/core/dbt/context/providers.py#L732
I want `loan_amount` to be treated as a decimal. If I omit it from the `+column_types` config, it gets treated as a double. There doesn't seem to be any easy way to override the CSV loading behaviour.
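As a workaround in the meantime, the cast can live in a staging model, as mentioned earlier in the thread (a sketch; the seed name `loan_account` comes from the thread, while the decimal precision is an assumed example):

```sql
-- hypothetical staging model: re-cast the seeded column explicitly;
-- decimal(18, 2) is an assumed precision, not from the issue
select
    cast(loan_amount as decimal(18, 2)) as loan_amount
from {{ ref('loan_account') }}
```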
Screenshots and log output
The SQL that is generated, versus the desired output:
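A sketch of the difference, based on the `loan_account` example earlier in the thread (the exact column list is illustrative; `%s` placeholders are bound at execute time):

```sql
-- generated today: no casts, so the typed columns receive bare string values
insert into dbt_jcohen.loan_account values
(%s, %s, %s, %s, %s, %s)

-- desired: wrap each bound value in a cast to the configured or inferred type
insert into dbt_jcohen.loan_account values
(cast(%s as bigint), cast(%s as string), cast(%s as date),
 cast(%s as date), cast(%s as bigint), cast(%s as bigint))
```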
Additional info
I've customised the macro a little to support casting to `date`, but it's messy.