[AP-953] Add parquet support #149

koszti · 2021-03-07T20:57:05Z

Problem

Target-snowflake is using CSV files to load data from s3 into tables, but ~~CSV is not the most efficient file format when loading into columnar tables.~~

Proposed changes

Add optional Parquet support. To select between the CSV and Parquet file formats, create the Snowflake file format object with one of the following SQL statements:

To use CSV:

CREATE FILE FORMAT {database}.{schema}.{file_format_name}
TYPE = 'CSV' ESCAPE='\\' FIELD_OPTIONALLY_ENCLOSED_BY='"';

To use Parquet:

CREATE FILE FORMAT {database}.{schema}.{file_format_name} TYPE = 'PARQUET';

The required file-format-specific functions are selected automatically. The code detects the file format type by running the SHOW FILE FORMATS LIKE <file_format_name> SQL command.
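For reviewers, the detection step could look roughly like the sketch below. It is not the code in this PR: it assumes `query_fn` takes a SQL string and returns a list of dict rows, and that each row exposes the format type under a `type` key, as in the SHOW FILE FORMATS output. The name-splitting rule and the error handling are illustrative assumptions.

```python
SUPPORTED_TYPES = ('csv', 'parquet')


def detect_file_format_type(file_format_name, query_fn):
    """Return 'csv' or 'parquet' for a named Snowflake file format.

    Assumption: query_fn(sql) returns a list of dict rows.
    """
    # Assumption: only the unqualified object name goes into the LIKE pattern
    short_name = file_format_name.split('.')[-1]
    rows = query_fn("SHOW FILE FORMATS LIKE '{}'".format(short_name))

    if len(rows) != 1:
        raise ValueError(
            'Expected exactly one file format named {}, found {}'.format(
                file_format_name, len(rows)))

    file_format_type = rows[0]['type'].lower()
    if file_format_type not in SUPPORTED_TYPES:
        raise ValueError(
            'Unsupported file format type: {}'.format(file_format_type))

    return file_format_type
```

Passing `query_fn` in keeps the function testable without a Snowflake connection, since a stub returning canned rows is enough.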

Key functions in the code:

parquet.py: records_to_dataframe: Transforms a batch of Singer record messages into a pandas DataFrame.
parquet.py: dataframe.to_parquet: Writes the DataFrame to the binary Parquet format.
parquet.py: create_merge_sql: Snowflake MERGE SQL generator for Parquet files.
csv.py: create_merge_sql: Snowflake MERGE SQL generator for CSV files.
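To illustrate what a Parquet MERGE generator has to do differently from the CSV one, here is a minimal sketch. It is an assumption-laden illustration, not the implementation in parquet.py: the parameter list, the `columns` shape (a list of dicts with a `name` key) and the exact statement layout are hypothetical. The one format-specific point is that a staged Parquet row is exposed as a single VARIANT column, so fields are selected by name as `$1:<column>` rather than by position.

```python
def create_merge_sql(table_name, stage_name, s3_key, file_format_name,
                     columns, pk_merge_condition):
    """Build a MERGE statement that reads one staged Parquet file (sketch)."""
    # Parquet rows arrive as one VARIANT column ($1); pick fields by name
    source_columns = ', '.join(
        '$1:{0} {0}'.format(c['name']) for c in columns)
    update_set = ', '.join(
        '{0}=s.{0}'.format(c['name']) for c in columns)
    insert_columns = ', '.join(c['name'] for c in columns)
    insert_values = ', '.join('s.{}'.format(c['name']) for c in columns)

    return (
        "MERGE INTO {table} t USING ("
        "SELECT {source_columns} "
        "FROM '@{stage}/{key}' (FILE_FORMAT => '{file_format}')) s "
        "ON {condition} "
        "WHEN MATCHED THEN UPDATE SET {update_set} "
        "WHEN NOT MATCHED THEN INSERT ({insert_columns}) "
        "VALUES ({insert_values})"
    ).format(table=table_name,
             source_columns=source_columns,
             stage=stage_name,
             key=s3_key,
             file_format=file_format_name,
             condition=pk_merge_condition,
             update_set=update_set,
             insert_columns=insert_columns,
             insert_values=insert_values)
```

A CSV variant would differ mainly in the SELECT list, where columns are addressed by position (`$1`, `$2`, ...) instead of by field name.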

Types of changes

What types of changes does your code introduce to PipelineWise?

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation Update (if none of the other choices apply)

Checklist

Description above provides context of the change
I have added tests that prove my fix is effective or that my feature works
Unit tests for changes (not needed for documentation changes)
CI checks pass with my changes
Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
Commits follow "How to write a good git commit message"
Relevant documentation is updated including usage instructions

target_snowflake/db_sync.py

target_snowflake/file_format.py

Samira-El · 2021-03-09T09:00:23Z

target_snowflake/file_format.py

+ self.file_format_type = self._detect_file_format_type(file_format, query_fn)
+
+ # Map file format specific functions dynamically
+ file_format_module = getattr(target_snowflake.file_formats, self.file_format_type)


I think there is a cleaner way of doing this. The factory design pattern is perfect to use here.

Addressed by 2cca9fe

A small shortening was also added in 42758b4.
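For context on the factory suggestion in this thread, a minimal sketch of the idea follows. It does not reproduce commit 2cca9fe: `FileFormatType`, `get_merge_sql_builder` and the two stand-in builder functions are hypothetical names chosen for illustration. The point is that an explicit registry replaces the `getattr` lookup on the module, so unsupported types fail with a clear error instead of an AttributeError.

```python
from enum import Enum


class FileFormatType(Enum):
    """File format types the target can load (hypothetical enum)."""
    CSV = 'csv'
    PARQUET = 'parquet'


def csv_merge_sql(table):
    # Stand-in for the CSV-specific create_merge_sql
    return 'MERGE INTO {} /* from CSV */'.format(table)


def parquet_merge_sql(table):
    # Stand-in for the Parquet-specific create_merge_sql
    return 'MERGE INTO {} /* from Parquet */'.format(table)


# Explicit registry: every supported type maps to its implementation
_MERGE_SQL_BUILDERS = {
    FileFormatType.CSV: csv_merge_sql,
    FileFormatType.PARQUET: parquet_merge_sql,
}


def get_merge_sql_builder(file_format_type):
    """Factory: return the builder for a detected file format type."""
    try:
        return _MERGE_SQL_BUILDERS[FileFormatType(file_format_type.lower())]
    except ValueError:
        raise ValueError(
            'Unsupported file format type: {}'.format(file_format_type)
        ) from None
```

Adding another format then means adding one enum member and one registry entry, with no change to the calling code.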

target_snowflake/file_formats/parquet.py

target_snowflake/upload_clients/s3_upload_client.py

target_snowflake/file_formats/parquet.py

Samira-El · 2021-03-09T09:15:28Z

target_snowflake/file_formats/csv.py

+ s3_key,
+ file_format_name,
+ pk_merge_condition,
+ ', '.join(['{}=s.{}'.format(c['name'],


You can avoid having to duplicate c['name'] by doing '{0}=s.{0}'.format(c['name'])

Addressed by 02bf2d0
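A quick standalone check of the suggestion above, using a made-up `columns` list (the real list comes from the stream schema): an explicit positional index lets `str.format` reuse one argument in several places, so both spellings produce the same UPDATE SET clause.

```python
# Hypothetical column definitions, for illustration only
columns = [{'name': 'id'}, {'name': 'updated_at'}]

# Argument repeated once per placeholder
repeated = ', '.join(
    ['{}=s.{}'.format(c['name'], c['name']) for c in columns])

# Index 0 reused for both placeholders
indexed = ', '.join(['{0}=s.{0}'.format(c['name']) for c in columns])

assert repeated == indexed
print(indexed)  # id=s.id, updated_at=s.updated_at
```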

[AP-898] Add parquet support

05af1e7

koszti changed the title ~~[AP-898] Add parquet support~~ [AP-953] Add parquet support Mar 7, 2021

koszti added 3 commits March 7, 2021 21:15

[AP-953] Fix comments

2e217b1

[AP-953] Fix comments

bc5bdeb

Add more integration tests for parquet and handle some edge cases

68f0fe0

Samira-El reviewed Mar 9, 2021

koszti added 3 commits March 15, 2021 09:41

[AP-953] Use shorter string formatter

02bf2d0

[AP-953] Use factory pattern

2cca9fe

[AP-953] Remove not required comments

42758b4

Samira-El approved these changes Mar 16, 2021

koszti merged commit 8b5f756 into master Mar 17, 2021

koszti deleted the AP-898 branch March 17, 2021 09:31

aaronsteers mentioned this pull request Apr 22, 2021

adds retain_s3_files and s3_file_naming_scheme #77

Closed
