Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Unclear Error When Loading Synthea Vocabularies #90

Closed
TheCedarPrince opened this issue Oct 24, 2024 · 4 comments
Closed

[BUG] Unclear Error When Loading Synthea Vocabularies #90

TheCedarPrince opened this issue Oct 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@TheCedarPrince
Copy link
Collaborator

When running the following lines

file_dict=$(python3 scripts/python/get_csv_filepaths.py path/to/vocab/csvs)
dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"

I get multiple errors with the run-operation and had to patch the macros/load_data_duckdb.sql as follows:

    {% for n, p in file_dict.items() %}
        {% do run_query("DROP TABLE IF EXISTS " ~ target_schema ~ "." ~ n.lower() ~ ";") %}
        {% do run_query("CREATE TABLE IF NOT EXISTS " ~ target_schema ~ "." ~ n.lower() ~ " AS SELECT * FROM read_csv('" ~ p ~ "', quote = \'\"\', delim = '\t', ignore_errors = true);") %}
    {% endfor %}

Otherwise, I see errors like this:

(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"
01:41:34  Running with dbt=1.8.7
01:41:34  Registered adapter: duckdb=1.8.0
01:41:35  Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
01:41:39  Encountered an error while running operation: Runtime Error
  Invalid Input Error: CSV Error on Line: 245521
  Original Line: 45883506	'Baby' or low-dose aspirin	Meas Value	LOINC	Answer	S	LA14345-5	19700101	20991231	
  Value with unterminated quote found.
  
  Possible fixes:
  * Enable ignore errors (ignore_errors=true) to skip this row
  * Set quote do empty or to a different value (e.g., quote='')
  
    file=/home/thecedarprince/Programs/SYNTHEA/vocabulary/CONCEPT.csv
    delimiter = 	 (Auto-Detected)
    quote = ' (Auto-Detected)
    escape = \ (Auto-Detected)
    new_line = \n (Auto-Detected)
    header = true (Auto-Detected)
    skip_rows = 0 (Auto-Detected)
    comment = \0 (Auto-Detected)
    date_format =  (Auto-Detected)
    timestamp_format =  (Auto-Detected)
    null_padding=0
    sample_size=20480
    ignore_errors=false
    all_varchar=0

Note: even with the patching, I still face errors later when loading vocabularies into DuckDB so this patching doesn't really work.

@lawrenceadams
Copy link
Collaborator

lawrenceadams commented Oct 24, 2024

I've never used this macro @TheCedarPrince - but I have used duckdb/postgres a lot to do this and spent AGES trying to figure out how to do this.

Turns out all you need to do is turn off quote parsing - i.e.

read_csv('" ~ p ~ "', quote = '')

and it should work neatly from memory

@TheCedarPrince
Copy link
Collaborator Author

Yup! That fixed it on my side! I can make a PR later today unless if you beat me first @lawrenceadams !

@lawrenceadams
Copy link
Collaborator

You fire on! You caught it! Glad to hear

@TheCedarPrince
Copy link
Collaborator Author

Closed thanks to #93

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants