
Commit

WP-9041 merge latest change upstream (#10)
* Fix/config parsing (#21)

* allow search_prefix to be None

* handle both list and string for key_properties and date_overrides

* pylint

* Bump to v1.2.2 (#22)

* Bump to v1.2.2

* Changelog

* Check if search_prefix is present before popping (#23)

* Bump to v1.2.3 (#24)

* TDL-13258 move tests from tap-tester to tap-s3-csv (#29)

* TDL-13258:Added integration tests and resources to tap-s3-csv from tap-tester

* Add context and triggers to circleci config

* Run nosetests on the correct folder

* Remove nose tests because there are no unit tests

* Fix test properties

* TDL-13258:Updated non_rectangular_files test case in types_and_data

* Combine related tests into one

Co-authored-by: Savan Chovatiya <savan.chovatiya@CDSYS.LOCAL>
Co-authored-by: Collin Simon <csimon@talend.com>

* TDL-12589: Added the support of JSONL files (#31)

* TDL-12589: Added the support of JSONL files

* TDL-12589: Formatted code

* TDL-12589: test updated

* TDL-12589: Updated config.yml to expect failures

* TDL-12589: added Stitch API token

* TDL-12589: Updated config and conversion of datatype

* TDL-12589: Updated priority of datatype like:
list
date-time
dict
integer
number
null - default in everyone
string - default in everyone

* TDL-12589: Updated as per priority

* TDL-12589: removed pylint failures

* TDL-12589: replaced

* TDL-12589: Added warning message for list inside list

* TDL-12589: Optimized code

* TDL-12589: Removed white space

* TDL-12589: Skipping a row of the JSONL file if it is empty instead of raising an error.

* TDL: Removed extra white space

* TDL-12589: Updated test files

* TDL-12589: Updated code as per review comments changes

* TDL-12589: Added Unittests for the same

* TDL-12589: Pylint error resolved

* TDL-12589: Changed remove fields log from info to debug

* TDL-12589: Updated conversion code to support + sign

Co-authored-by: dbshah1212 <root@ubuntu16.cdsys.local>

* TDL-12464: Added support for handling the duplicate headers in the CS… (#30)

* TDL-12464: Added support for handling the duplicate headers in the CSV file

* Changed warning message

* Updated unit tests according to the warning message

* TDL-12464: Adding code to leverage duplicate headers support provided in singer-encodings library

* TDL-12464: Removed the unwanted code and made compatible with master repo

* TDL-12464: Upgraded singer-encodings library to fetch the latest version

* TDL-12464: Changing the data type of 'sdc_extra' key in the event

* TDL-12464: Updating test cases as per the code optimization

* TDL-12464: Updating version of singer-encoding library

* TDL-12464: Updating version of singer-python and backoff modules

Co-authored-by: Karan Panchal (C) <karan.panchal@CDSYS.LOCAL>
Co-authored-by: harshpatel4_crest <harsh.patel4@crestdatasys.com>

* TDL-12486: Added support of compressed files (#32)

* TDL-12486: Added support of compressed files

* TDL-12486: Updated singer encoding dependency

* TDL-12486: Added more doc strings.

* TDL-12486: Upgraded dependencies changed the logic of taking samples from zip

* TDL-12486: Increase coverage to test compressed files

* TDL-12486: Upgraded the singer-encoding version to 0.1.0

* TDL-12486: Removed trailing-whitespace

* TDL-12486: Updated test case of S3AllFilesSupport

* TDL-12486: Removed common self.conn_id

* TDL-12486: Changes reverted.

* TDL-12486: Changed start date format

* TDL-12486: Updated date format in test_All_supported_files.

* TDL-12486: Change in logger messages

Co-authored-by: dbshah1212 <root@ubuntu16.cdsys.local>

* Tdl 12589 change sdc extra logs from debug to warn (#33)

* TDL-12589: Changed sdc_extra log from debug to warn

* TDL-12589: Changed message to sync with csv message

* TDL-12589: Updated message

Co-authored-by: dbshah1212 <root@ubuntu16.cdsys.local>

* version bump to 1.3.0 (#34)

* Strictly enforce the ordering of type checking for integer vs number (#35)

* Strictly enforce the ordering of type checking for integer vs number

* Bump to v1.3.1 (#36)

* TDL-14068:fixed key-error exception (#38)

* TDL-14068:fixed key-error exception

* Added unit test cases and integration tests

* Running one integration test for debugging

* Debugging integration test case

* Updated integration test

* Updated integration test expected output

* Updated config.yml for running all integration test again

* Fix/tdl 14038 filename issue (#37)

* TDL-14038: Skipping .gz files that were gzipped using --no-name

* TDL-14038: Added final count of total skipped files for discover mode and sync mode

* tdl-14038: Updated warning message and added unit test for the same

* TDL-14038: Removed global variable and added integration test

* TDL-14038: Updated comments

* TDL-14038: Added blank line

* TDL-14038: Removed: trailing-whitespace

* TDL-14038: Added comment of pylint disable

* TDL-14038: Updated pylint comment

* TDL-14038: Updated the test file class name

* TDL-14038: Removed self file call and added global.

* TDL: Remove warning message for 0 file skipped

* TDL-14038: Removed trailing white space

* TDL-14068: Fixed key error exception.

* TDL-14038: Reverted another bug changes

* TDL-14038: updated skipped_files_count

* TDL-14038: Updated message, comments and counts

* TDL-14038: Removed trailing-whitespace

* TDL-14038: Updated unit test cases

* TDL-14038: Updated sync file code.

* Resolved: use-maxsplit-arg

* Refactor how we handle nameless files

* Fix comment placement

* Mention tar as a problem too

* Make pylint happy

Co-authored-by: dbshah1212 <root@ubuntu16.cdsys.local>
Co-authored-by: Andy Lu <andy@stitchdata.com>

* Bump to v1.3.2, update changelog (#39)

* Bump to v1.3.2, update changelog

* Update changelog

* bump singer-encodings 0.1.1 (#41)

* bump 1.3.3 (#42)

* TDL-14228: Generate catalog file with the properties key if no samples found for sampling. (#40)

* Updated sampled schema when no samples found

* Running one integration test for debugging

* Debugging integration test

* Debugging integration test

* Updated integration test for catalog_with_empty_properties

* Running all integration test again

* Fix/wrong file extension error handling (#43)

* fix: Handled Unicode and JsonDecoder Error for files with the wrong extension.

* fix: Updated sync code and test case

* Fix: Handled StopIteration error for empty csv file.

* fix: Added unit test of StopIteration code handling

* fix: Resolved pylint errors

* Fix: removed trailing white space

* fix: disabled use-maxsplit-arg as we haven't changed the code as part of this branch

* fix: Removed exception and added Warning for empty Jsonl file.

* fix: Handled pylint error

* fix: Skipping records with empty json

* fix: Added unit tests and integration tests for empty json jsonl file.

* fix: Skipping empty JSON while syncing as well

* Skipping empty lines of CSV in sampling and sync

* fix: Upgraded latest version of singer-encoding.

* fix: Added some test files

* fix: Removed unused variable declaration

* fix: Added UnicodeDecodeError and JSONDecodeError handling scenario in comment.

* fix: Final touch

* Update spell mistake

* Corrected typo

* Updated warning messages and empty jsonl file in skip count

* fix: Put warning of skipping empty jsonl files.

* fix: Updated comment

Co-authored-by: dbshah1212 <root@ubuntu16.cdsys.local>
Co-authored-by: savan-chovatiya <savan.chovatiya@crestdatasys.com>
Co-authored-by: Kyle Allan <KAllan357@gmail.com>

* Bump to version 1.3.4 (#45)

* Bump to version 1.3.4

* Bump to version 1.3.4

* Bump to version 1.3.4

* Bump to version 1.3.4

* Bump to version 1.3.4

Co-authored-by: KrishnanG <kgurusamy@talend.com>

* Bump csv field width (#47)

* Maybe increase the field width we can handle

* Fix typo

* Just use sys.maxsize

* Make pylint happy

* Version bump to `v1.3.5` (#48)

* Make entry consistent with others

* Bump to v1.3.5, update changelog

Co-authored-by: Nick McCoy <33731945+nick-mccoy@users.noreply.github.com>
Co-authored-by: cosimon <cosimon@users.noreply.github.com>
Co-authored-by: savan-chovatiya <80703490+savan-chovatiya@users.noreply.github.com>
Co-authored-by: Savan Chovatiya <savan.chovatiya@CDSYS.LOCAL>
Co-authored-by: Collin Simon <csimon@talend.com>
Co-authored-by: dbshah1212 <35164219+dbshah1212@users.noreply.github.com>
Co-authored-by: dbshah1212 <root@ubuntu16.cdsys.local>
Co-authored-by: karanpanchal-crest <karan.panchal@crestdatasys.com>
Co-authored-by: Karan Panchal (C) <karan.panchal@CDSYS.LOCAL>
Co-authored-by: harshpatel4_crest <harsh.patel4@crestdatasys.com>
Co-authored-by: Leslie VanDeMark <38043390+leslievandemark@users.noreply.github.com>
Co-authored-by: Andy Lu <andy@stitchdata.com>
Co-authored-by: zachharris1 <69470481+zachharris1@users.noreply.github.com>
Co-authored-by: savan-chovatiya <savan.chovatiya@crestdatasys.com>
Co-authored-by: Kyle Allan <KAllan357@gmail.com>
Co-authored-by: KrisPersonal <66801357+KrisPersonal@users.noreply.github.com>
Co-authored-by: KrishnanG <kgurusamy@talend.com>
18 people authored May 11, 2022
1 parent ebacc13 commit ee70416
Showing 6 changed files with 152 additions and 72 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Changelog

## 1.3.1

- Merge in upstream changes, below:
- Increase the limit on the width of a field in the CSV files read by the tap [#47](https://github.com/singer-io/singer-encodings/pull/47)

## 1.3.0

- Reintroduce ability to assume role for external AWS account
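The 1.3.1 entry mentions raising the limit on CSV field width, and the commit message for #47 says "Just use sys.maxsize". That change is not part of the diff shown here, so the following is only a minimal sketch of how such a limit can be raised with the standard library's `csv.field_size_limit`. The helper name `raise_csv_field_limit` and the `OverflowError` fallback are assumptions for illustration, not code from the tap.

```python
import csv
import io
import sys


def raise_csv_field_limit():
    """Raise the csv module's per-field size limit as high as the platform allows.

    Hypothetical helper: tries sys.maxsize first and backs off if the
    platform's C long cannot hold that value.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms use a 32-bit C long; shrink and retry.
            limit = limit // 10


limit = raise_csv_field_limit()

# A field longer than the default limit of 131072 characters now parses.
wide_row = "id," + "x" * 200000 + "\n"
row = next(csv.reader(io.StringIO(wide_row)))
```

Without the raised limit, the same read fails with `_csv.Error: field larger than field limit (131072)`.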
4 changes: 2 additions & 2 deletions setup.py
@@ -3,15 +3,15 @@
from setuptools import setup

setup(name='tap-s3-csv',
version='1.3.0',
version='1.3.1',
description='Singer.io tap for extracting CSV files from S3',
author='Stitch',
url='https://singer.io',
classifiers=['Programming Language :: Python :: 3 :: Only'],
py_modules=['tap_s3_csv'],
install_requires=[
'backoff==1.8.0',
'boto3==1.17.0',
'boto3==1.9.57',

Review comment on the line above (yi-varicent, May 11, 2022): expected?

'singer-encodings==0.1.2',
'singer-python==5.12.1',
'voluptuous==0.10.5'
17 changes: 11 additions & 6 deletions tap_s3_csv/__init__.py
@@ -11,7 +11,8 @@
LOGGER = singer.get_logger()

REQUIRED_CONFIG_KEYS = ["bucket"]
REQUIRED_CONFIG_KEYS_EXTERNAL_SOURCE = ["bucket", "account_id", "external_id", "role_name"]
REQUIRED_CONFIG_KEYS_EXTERNAL_SOURCE = [
"bucket", "account_id", "external_id", "role_name"]


def do_discover(config):
@@ -34,7 +35,8 @@ def do_sync(config, catalog, state):
for stream in catalog['streams']:
stream_name = stream['tap_stream_id']
mdata = metadata.to_map(stream['metadata'])
table_spec = next(s for s in config['tables'] if s['table_name'] == stream_name)
table_spec = next(
s for s in config['tables'] if s['table_name'] == stream_name)
if not stream_is_selected(mdata):
LOGGER.info("%s: Skipping - not selected", stream_name)
continue
@@ -50,6 +52,7 @@ def do_sync(config, catalog, state):

LOGGER.info('Done syncing.')


def validate_table_config(config):
# Parse the incoming tables config as JSON
tables_config = config['tables']
@@ -59,17 +62,19 @@ def validate_table_config(config):
table_config.pop('search_prefix')
if table_config.get('key_properties') == "" or table_config.get('key_properties') is None:
table_config['key_properties'] = []
elif table_config.get('key_properties'):
table_config['key_properties'] = [s.strip() for s in table_config['key_properties']]

elif table_config.get('key_properties') and isinstance(table_config['key_properties'], str):
table_config['key_properties'] = [s.strip()
for s in table_config['key_properties'].split(',')]
if table_config.get('date_overrides') == "" or table_config.get('date_overrides') is None:
table_config['date_overrides'] = []
elif table_config.get('date_overrides') and isinstance(table_config['date_overrides'], str):
table_config['date_overrides'] = [s.strip() for s in table_config['date_overrides'].split(',')]
table_config['date_overrides'] = [s.strip()
for s in table_config['date_overrides'].split(',')]

# Reassign the config tables to the validated object
return CONFIG_CONTRACT(tables_config)


@singer.utils.handle_top_exception(LOGGER)
def main():
args = singer.utils.parse_args(REQUIRED_CONFIG_KEYS)
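The `validate_table_config` change above lets `key_properties` and `date_overrides` arrive either as a list or as a comma-separated string. The rule can be sketched as a standalone helper; `normalize_list_option` is a hypothetical name used only for illustration and does not exist in the tap.

```python
def normalize_list_option(value):
    """Normalize a config option that may be None, "", a string, or a list.

    Hypothetical helper mirroring the branches in the diff:
    - None or an empty string becomes an empty list,
    - a comma-separated string is split and each part stripped,
    - a list is passed through unchanged.
    """
    if value is None or value == "":
        return []
    if isinstance(value, str):
        return [s.strip() for s in value.split(',')]
    return list(value)


from_string = normalize_list_option("id, updated_at")
from_list = normalize_list_option(["id", "updated_at"])
```

The `isinstance(..., str)` check matters because iterating over a string directly yields single characters, so the removed line would have turned `"id"` into `["i", "d"]`.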
16 changes: 10 additions & 6 deletions tap_s3_csv/conversion.py
@@ -2,7 +2,9 @@

LOGGER = singer.get_logger()

#pylint: disable=too-many-return-statements
# pylint: disable=too-many-return-statements


def infer(key, datum, date_overrides, check_second_call=False):
"""
Returns the inferred data type
@@ -50,7 +52,7 @@ def process_sample(sample, counts, lengths, table_spec):
for key, value in sample.items():
if key not in counts:
counts[key] = {}

length = len(value)
if key not in lengths or length > lengths[key]:
lengths[key] = length
@@ -95,6 +97,7 @@ def pick_datatype(counts):

return to_return


def generate_schema(samples, table_spec, string_max_length: bool):
counts, lengths = {}, {}
for sample in samples:
@@ -125,7 +128,8 @@ def generate_schema(samples, table_spec, string_max_length: bool):
if string_max_length:
schema[key]['anyOf'][1]['maxLength'] = lengths[key]
else:
schema[key] = datatype_schema(datatype, lengths[key], string_max_length)
schema[key] = datatype_schema(
datatype, lengths[key], string_max_length)

return schema

@@ -139,7 +143,7 @@ def datatype_schema(datatype, length, string_max_length: bool):
]
}
if string_max_length:
schema['anyOf'][1]['maxLength'] = length
schema['anyOf'][1]['maxLength'] = length
elif datatype == 'dict':
schema = {
'anyOf': [
@@ -148,7 +152,7 @@ def datatype_schema(datatype, length, string_max_length: bool):
]
}
if string_max_length:
schema['anyOf'][1]['maxLength'] = length
schema['anyOf'][1]['maxLength'] = length
else:
types = ['null', datatype]
if datatype != 'string':
@@ -157,5 +161,5 @@ def datatype_schema(datatype, length, string_max_length: bool):
'type': types,
}
if string_max_length:
schema['maxLength'] = length
schema['maxLength'] = length
return schema
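The hunks above show two cooperating pieces: `process_sample` records the longest value seen per key, and `datatype_schema` attaches that length as `maxLength` when `string_max_length` is set. A minimal sketch of that flow for plain string columns follows. The function names `track_max_lengths` and `string_schema` are hypothetical, and the sketch assumes every sampled value is a string, as it would be for raw CSV rows.

```python
def track_max_lengths(samples):
    """Return the longest observed value length for each key across samples."""
    lengths = {}
    for sample in samples:
        for key, value in sample.items():
            length = len(value)
            if key not in lengths or length > lengths[key]:
                lengths[key] = length
    return lengths


def string_schema(length, string_max_length):
    """Build a nullable string schema, optionally capped at the observed length."""
    schema = {'type': ['null', 'string']}
    if string_max_length:
        schema['maxLength'] = length
    return schema


samples = [
    {'name': 'Ada', 'city': 'London'},
    {'name': 'Grace', 'city': 'NYC'},
]
lengths = track_max_lengths(samples)
schema = {key: string_schema(lengths[key], True) for key in lengths}
```

Because `maxLength` comes only from the sampled rows, later rows with longer values could exceed it, which is presumably why the behavior is gated behind the `string_max_length` flag.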