This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

[AP-953] Add parquet support #149

Merged
merged 7 commits into from
Mar 17, 2021

Conversation

koszti
Contributor

@koszti koszti commented Mar 7, 2021

Problem

Target-snowflake uses CSV files to load data from S3 into tables, but CSV is not the most efficient file format for loading into columnar tables.

Proposed changes

Add optional Parquet support. To select between the CSV and Parquet file formats, create the Snowflake file format object with one of the following SQL statements:

To use CSV:

CREATE FILE FORMAT {database}.{schema}.{file_format_name}
TYPE = 'CSV' ESCAPE='\\' FIELD_OPTIONALLY_ENCLOSED_BY='"';

To use Parquet:

CREATE FILE FORMAT {database}.{schema}.{file_format_name} TYPE = 'PARQUET';

The required file-format-specific functions are selected automatically: the code detects the file format type by running the SHOW FILE FORMATS LIKE <file_format_name> SQL command.
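
A minimal sketch of how that detection could work. It assumes a hypothetical `query_fn` callable that runs a SQL string and returns rows as dictionaries keyed by lowercase column names; `type` is one of the columns in the `SHOW FILE FORMATS` output. The function name, the stripping of the database/schema qualifier, and the error handling are illustrative and not necessarily what this PR merged:

```python
from typing import Callable, Dict, List

SUPPORTED_TYPES = {"csv", "parquet"}


def detect_file_format_type(
    file_format: str,
    query_fn: Callable[[str], List[Dict[str, str]]],
) -> str:
    """Return the lowercase type ('csv' or 'parquet') of a named file format."""
    # SHOW ... LIKE matches on the object name only, so drop any
    # database/schema qualifier from a name such as db.schema.my_format.
    # Scoping the lookup to a specific schema is omitted in this sketch.
    name = file_format.split(".")[-1]
    rows = query_fn(f"SHOW FILE FORMATS LIKE '{name}'")
    if len(rows) != 1:
        raise ValueError(
            f"Expected exactly one file format named {name!r}, found {len(rows)}"
        )

    format_type = rows[0]["type"].lower()
    if format_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file format type: {format_type}")
    return format_type


# Usage with a stubbed query function instead of a live Snowflake connection.
def fake_query(sql: str) -> List[Dict[str, str]]:
    return [{"name": "MY_FORMAT", "type": "PARQUET"}]


print(detect_file_format_type("my_db.my_schema.my_format", fake_query))
```

Passing the query function in as a parameter keeps the detection logic testable without a database connection.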

Key functions in the code:

Types of changes

What types of changes does your code introduce to PipelineWise?

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
  • Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

@koszti koszti changed the title [AP-898] Add parquet support [AP-953] Add parquet support Mar 7, 2021
target_snowflake/db_sync.py (resolved)
target_snowflake/file_format.py (resolved)
self.file_format_type = self._detect_file_format_type(file_format, query_fn)

# Map file format specific functions dynamically
file_format_module = getattr(target_snowflake.file_formats, self.file_format_type)
Contributor

I think there is a cleaner way of doing this. The factory design pattern is perfect to use here.

Contributor Author

@koszti koszti Mar 15, 2021

Addressed by 2cca9fe, with a little shortening added by 42758b4.
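
For readers unfamiliar with the suggestion, here is a rough sketch of a factory replacing the `getattr` lookup on the module. Every name in it (`FileFormatType`, `CsvFormatter`, `ParquetFormatter`, `formatter_for`) is hypothetical and chosen for illustration; the commits referenced above may be structured differently:

```python
from enum import Enum


class FileFormatType(str, Enum):
    CSV = "csv"
    PARQUET = "parquet"


class CsvFormatter:
    """Stand-in for the CSV-specific functions."""
    extension = ".csv"


class ParquetFormatter:
    """Stand-in for the Parquet-specific functions."""
    extension = ".parquet"


# Explicit registry: only types listed here can ever be selected,
# unlike getattr(), which resolves any attribute name on the module.
_FORMATTERS = {
    FileFormatType.CSV: CsvFormatter,
    FileFormatType.PARQUET: ParquetFormatter,
}


def formatter_for(file_format_type: str):
    """Factory: map a detected type string to a formatter instance."""
    try:
        key = FileFormatType(file_format_type.lower())
    except ValueError:
        raise ValueError(f"Unsupported file format type: {file_format_type}") from None
    return _FORMATTERS[key]()


formatter = formatter_for("PARQUET")
print(type(formatter).__name__, formatter.extension)
```

The explicit mapping fails fast on an unknown type and makes the set of supported formats visible in one place.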

target_snowflake/file_formats/parquet.py (resolved)
target_snowflake/file_formats/parquet.py (resolved)
s3_key,
file_format_name,
pk_merge_condition,
', '.join(['{}=s.{}'.format(c['name'],
Contributor

You can avoid having to duplicate c['name'] by doing '{0}=s.{0}'.format(c['name'])

Contributor Author

@koszti koszti Mar 15, 2021

Addressed by 02bf2d0
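
A short illustration of the reviewer's point about indexed placeholders. The `columns` list is made-up sample data, not taken from the PR:

```python
# Hypothetical column metadata in the same shape as the snippet under review.
columns = [{"name": "ID"}, {"name": "NAME"}]

# Original style: the same argument is passed twice.
repeated = ", ".join("{}=s.{}".format(c["name"], c["name"]) for c in columns)

# Suggested style: {0} reuses the first positional argument.
indexed = ", ".join("{0}=s.{0}".format(c["name"]) for c in columns)

assert repeated == indexed
print(indexed)  # ID=s.ID, NAME=s.NAME
```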

@koszti koszti merged commit 8b5f756 into master Mar 17, 2021
@koszti koszti deleted the AP-898 branch March 17, 2021 09:31