This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

WIP archive_load_files #174

Closed
wants to merge 11 commits into from

Conversation

llehtinen
Contributor

Problem

https://transferwise.atlassian.net/browse/AP-1011

Proposed changes

https://docs.google.com/document/d/11UTlmWVJS9aGickmyxpXOUhrv4RTPnm8zip_IqeR2J0/edit?ts=609e4106

Types of changes

What types of changes does your code introduce to PipelineWise?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
  • Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

@llehtinen added the Draft label (PR or issue still in draft mode) on May 27, 2021
@@ -24,3 +24,9 @@ def delete_object(self, stream: str, key: str) -> None:
        """
        Delete object
        """

    @abstractmethod
    def copy_object(self, source_key: str, target_key: str, target_metadata: dict) -> None:
Contributor

The SnowflakeUploadClient doesn't implement this.

copy_object is applicable only to the s3_upload_client, but I think an implementation is still required in the SnowflakeUploadClient as well. Maybe we can raise NotImplementedError with a human-readable error message.
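
A minimal sketch of that, assuming the method would live in SnowflakeUploadClient and mirror the abstract signature above (the error message text is illustrative):

    def copy_object(self, source_key: str, target_key: str, target_metadata: dict) -> None:
        """Archiving load files is not possible with Snowflake-managed table stages."""
        raise NotImplementedError(
            'copy_object is only supported with external S3 stages; '
            'the Snowflake upload client uses table stages and cannot archive load files.'
        )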

if config.get('archive_load_files.enabled'):
    # keep track of min and max
    archive_load_files_primary_column =\
        config.get('archive_load_files.primary_column') or primary_key_string
Contributor
@koszti May 27, 2021

this should be done only once, when processing the SCHEMA message, and not in every RECORD message.

The primary key column(s) are available in the SCHEMA message as o['key_properties'] and the incremental key(s) as o['bookmark_properties'].
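
A rough sketch of what that could look like in the SCHEMA branch of persist_lines (the names t, o and archive_load_files_data follow this PR; the exact surrounding structure is assumed):

        elif t == 'SCHEMA':
            stream = o['stream']
            # Resolve the archive column once per stream here, instead of on every RECORD:
            # prefer the incremental (bookmark) key, fall back to the first primary key.
            bookmark_keys = o.get('bookmark_properties') or []
            primary_keys = o.get('key_properties') or []
            archive_load_files_data[stream] = {
                'column': (bookmark_keys or primary_keys or [None])[0],
            }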

's3_key': 'tbd', # TODO: Turn archive_load_files.naming_convention into actual key
'database': 'tbd', # TODO: Where to get this
'schema': 'tbd', # TODO: Where to get this
'table': stream,
Contributor
@koszti May 27, 2021

these values, especially the s3_key, should be defined in another place, ideally only when loading the batch to S3.

The stream is a concatenated string of db-schema and db-table, for example public-table_one, and that's already sent to flush_streams as part of the records_to_load list.

The database is not applicable in target-snowflake in general. One ppw stream can read from only one database, and the DB-schema-table hierarchy is not applicable to every data source type, hence target-snowflake does not use the concept of a database (pg vs mariadb, for example).

Contributor Author

@koszti One issue with s3_key is that it requires access to the config. From here, config is currently not passed further than flush_streams. Would you prefer to propagate the entire config further from there, add the archive_load_files config as an init param to the S3UploadClient, or something else?
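
For reference, the second option might look roughly like this; the constructor signature shown here is an assumption, not the actual S3UploadClient code:

    class S3UploadClient:
        def __init__(self, connection_config: dict, archive_load_files_config: dict = None):
            self.connection_config = connection_config
            # Archive settings (e.g. naming convention) injected at construction time,
            # so persist_lines/flush_streams don't have to forward the whole config.
            self.archive_load_files_config = archive_load_files_config or {}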

@@ -170,6 +171,31 @@ def persist_lines(config, lines, table_cache=None, file_format_type: FileFormatT
            else:
                records_to_load[stream][primary_key_string] = o['record']

            if config.get('archive_load_files.enabled'):
Contributor
@koszti May 27, 2021

we also need to check that we're using the S3UploadClient and not the SnowflakeUploadClient. Archiving load files is possible only if we use the S3UploadClient.

The SnowflakeUploadClient uses Snowflake-managed table stages and we can't archive anything onto those.

There is a check for that around here.
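
A possible sketch of that guard, assuming (as elsewhere in this thread) that an external stage is indicated by s3_bucket being present in the config:

    # Only archive load files when an external S3 stage (S3UploadClient) is in use;
    # with Snowflake-managed table stages there is nowhere to archive to.
    archive_load_files_enabled = (
        config.get('archive_load_files', {}).get('enabled', False)
        and bool(config.get('s3_bucket'))
    )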

# Determine archive_load_files_primary_column
# 1) Use incremental replication key if defined
# 2) Otherwise use primary key
if False and 'bookmark_properties' in o and len(o['bookmark_properties']) > 0:
Contributor Author
@llehtinen Jun 3, 2021

TBD: What's the proper way to identify the incremental replication key? The unit test data had lsn here.

Contributor
@koszti Jun 4, 2021

Check messages-simple-table.json, maybe that's a better example. The incremental key is defined in detail only in the STATE message, but taps don't necessarily send it after the SCHEMA and before the RECORD messages; the order is not defined by the singer spec.

Taps copy the replication key into bookmark_properties. If it's logical replication then tap-postgres uses the lsn keyword. The best we can do is to get o['bookmark_properties'], and if the result is one of the columns in the schema then use it as archive_load_files_primary_column. If it doesn't exist then we use the PK: o['key_properties'][0].

This could all get complicated, so it's better to introduce a new function in stream_utils.py. We can implement and unit test the new function separately, and here it would look like this:

archive_load_files_primary_column = stream_utils.get_archive_load_files_primary_columns(o)

The new function will use three properties of the input SCHEMA message (here it's called o); a rough sketch follows the list below:

  • o['key_properties']: list of primary keys in the stream. What should we return if it's a composite key? And how should we track it?
  • o['bookmark_properties']: list of incremental keys. Only one value is supported, so taking the first value is OK.
  • o['schema']['properties']: list of columns in the schema
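
A possible sketch of that helper; the function name comes from the suggestion above, and returning a single column name (leaving composite keys open) is an assumption:

    def get_archive_load_files_primary_columns(schema_message: dict):
        """Pick the column to min/max-track when archiving load files.

        Prefer the incremental (bookmark) key if it is an actual column in the schema,
        otherwise fall back to the primary key. Composite PKs are left as an open question.
        """
        schema_columns = schema_message.get('schema', {}).get('properties', {})
        bookmark_properties = schema_message.get('bookmark_properties') or []
        key_properties = schema_message.get('key_properties') or []

        # Only one incremental key is supported, so the first value is enough
        if bookmark_properties and bookmark_properties[0] in schema_columns:
            return bookmark_properties[0]

        # TODO: composite keys - returning only the first PK column for now
        return key_properties[0] if key_properties else None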

@@ -113,6 +113,9 @@ def persist_lines(config, lines, table_cache=None, file_format_type: FileFormatT
    batch_size_rows = config.get('batch_size_rows', DEFAULT_BATCH_SIZE_ROWS)
    batch_wait_limit_seconds = config.get('batch_wait_limit_seconds', None)
    flush_timestamp = datetime.utcnow()
    archive_load_files_enabled =\
Contributor Author

Think of a better way to check the config: if archive_load_files.enabled is set but the S3 bucket isn't defined, raise an exception.

Contributor
@koszti Jun 4, 2021

Config validation is at https://github.com/transferwise/pipelinewise-target-snowflake/blob/master/target_snowflake/db_sync.py#L65, where we can do something like this:

    # Check if archive load files option is using external stages
    archive_load_files = config.get('archive_load_files', {})
    if archive_load_files.get('enabled') and not config.get('s3_bucket', None):
        errors.append('Archive load files option can be used only with external s3 stages. Please define s3_bucket')

Once the validation is done we can get the values here by:

archive_load_files_enabled = config.get('archive_load_files', {}).get('enabled', None)

Please note that separating config variables with dots (db_conn.s3_bucket and archive_load_files.enabled) does not work. We need to access them as nested dictionaries.

Also, the db_conn dictionary keys are accessible directly in the config dict, so we don't need the prefix.
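
To illustrate with made-up values, the generated config is a flat dict where the db_conn keys sit at the top level and archive_load_files is a nested dict:

    config = {
        's3_bucket': 'my-bucket',                # from db_conn, no 'db_conn.' prefix
        'archive_load_files': {'enabled': True},
    }
    config.get('archive_load_files.enabled')                     # None - dotted keys don't exist
    config.get('archive_load_files', {}).get('enabled', False)   # True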

archive_load_files_primary_column = o['key_properties'][0] # Behavior when key multi-col?

archive_load_files_data[stream] = {
'tap': config.get('id'),
Contributor

Suggested change
- 'tap': config.get('id'),
+ 'tap': config.get('tap_id'),

target-snowflake can't access the original YAML file. The config is a generated JSON of the items under db_conn and some other properties listed here.

# Keep track of min and max of the designated column
values = archive_load_files_data[stream]
archive_primary_column_name = values['column']
archive_primary_column_value = o['record'][archive_primary_column_name]
Contributor

What should we return if archive_primary_column_name is a composite primary key? How should we track it?

self.config['id'] = 'test-tap-id'
self.config['archive_load_files.enabled'] = True
self.config['db_conn.s3_bucket'] = 'dummy_bucket'
Contributor

Suggested change
- self.config['db_conn.s3_bucket'] = 'dummy_bucket'
+ self.config['s3_bucket'] = 'dummy_bucket'

Separating by dots is not valid, and here we don't use the original YAML file. The config is a generated JSON of the items under db_conn and some other properties listed here.

self.config['id'] = 'test-tap-id'
self.config['archive_load_files.enabled'] = True
self.config['db_conn.s3_bucket'] = 'dummy_bucket'
Contributor

same as above

self.config['id'] = 'test-tap-id'
self.config['archive_load_files.enabled'] = True
self.config['db_conn.s3_bucket'] = 'dummy_bucket'
Contributor

same as above

@@ -32,7 +32,7 @@ jobs:
command: |
. venv/bin/activate
export LOGGING_CONF_FILE=$(pwd)/sample_logging.conf
-pytest tests/integration/ -vv --cov target_snowflake --cov-fail-under=86
+pytest tests/integration/ -k test_archive_load_files -vv
Contributor Author

To be reverted; this was used for faster turnaround when testing.

@koszti
Contributor

koszti commented Jun 23, 2021

this PR is obsolete and covered by #178

@koszti closed this on Jun 23, 2021