File-based CDK + Source S3 (v4): Pass configured file encoding to stream reader #29110
Conversation
…te into alex/file_based_encoding
```diff
 try:
     params = {"client": self.s3_client}
 except Exception as exc:
     raise exc

 logger.debug(f"try to open {file.uri}")
 try:
-    result = smart_open.open(f"s3://{self.config.bucket}/{file.uri}", transport_params=params, mode=mode.value)
+    result = smart_open.open(f"s3://{self.config.bucket}/{file.uri}", transport_params=params, mode=mode.value, encoding=encoding)
```
It's a little tricky to test this until we have the adapter. I tested by merging the codebase with this branch to run the v4 source and reading s3://airbyte-acceptance-test-source-s3/csv_tests/csv_encoded_as_cp1252.csv, which is encoded as cp1252.
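For illustration, a minimal sketch of what the patched reader does with that file (the bucket/key come from the comment above; the boto3 client setup is an assumption):

```python
import boto3
import smart_open

s3_client = boto3.client("s3")  # assumed: credentials configured in the environment

# Without encoding="cp1252", smart_open decodes text as utf-8 by default,
# so a cp1252-encoded file can raise UnicodeDecodeError on non-ASCII bytes.
with smart_open.open(
    "s3://airbyte-acceptance-test-source-s3/csv_tests/csv_encoded_as_cp1252.csv",
    mode="r",
    transport_params={"client": s3_client},
    encoding="cp1252",
) as fp:
    for line in fp:
        print(line.rstrip())
```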
```diff
@@ -95,7 +95,7 @@ def validate_quote_char(cls, v: str) -> str:

     @validator("escape_char")
     def validate_escape_char(cls, v: str) -> str:
-        if len(v) != 1:
+        if v is not None and len(v) != 1:
```
escape_char is an optional field
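Since the field is optional, the validator receives `None` when it is unset; a self-contained sketch of the fixed behavior (the model name and error message here are illustrative, not the CDK's exact code):

```python
from typing import Optional

from pydantic import BaseModel, validator


class CsvFormat(BaseModel):
    escape_char: Optional[str] = None

    @validator("escape_char")
    def validate_escape_char(cls, v: Optional[str]) -> Optional[str]:
        # len(None) would raise a TypeError, so skip the check when the field is unset.
        if v is not None and len(v) != 1:
            raise ValueError("escape_char should only be one character")
        return v


CsvFormat()                  # ok: escape_char left unset
CsvFormat(escape_char="\\")  # ok: a single character
```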
```diff
@@ -16,6 +16,7 @@
 class JsonlParser(FileTypeParser):

     MAX_BYTES_PER_FILE_FOR_SCHEMA_INFERENCE = 1_000_000
+    ENCODING = "utf8"
```
Encoding isn't configurable in the legacy S3 source. We can move this to a config option if needed.
```diff
@@ -50,8 +53,8 @@ def parse_records(
 ) -> Iterable[Dict[str, Any]]:
     parquet_format = config.format[config.file_type] if config.format else ParquetFormat()
     if not isinstance(parquet_format, ParquetFormat):
+        raise ValueError(f"Expected ParquetFormat, got {parquet_format}")  # FIXME test this branch!
```
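The FIXME asks for coverage of this branch. One hedged sketch of such a unit test, using mocks so no real file is needed (the positional arguments after `config` follow the `parse_records` signature as I understand it and are an assumption):

```python
from unittest.mock import Mock

import pytest


def test_parse_records_raises_on_non_parquet_format():
    config = Mock()
    config.file_type = "parquet"
    config.format = {"parquet": object()}  # anything that is not a ParquetFormat

    parser = ParquetParser()
    # parse_records is a generator, so iterate to trigger the isinstance check.
    with pytest.raises(ValueError):
        list(parser.parse_records(config, Mock(), Mock(), Mock()))
```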
```python
        self._expected_encoding = expected_encoding

    def open_file(self, file: RemoteFile, mode: FileReadMode, encoding: Optional[str], logger: logging.Logger) -> io.IOBase:
        assert encoding == self._expected_encoding
```
Not a great test, but the actual decoding is done outside of the CDK.
Why are we defining a class `MockFileBasedStreamReader` when we could just use `Mock(spec=AbstractFileBasedStreamReader)`? Having the mock would allow us to assert calls (and therefore validate arguments, as we're doing with this assert).
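A minimal sketch of the suggested alternative (import paths are my assumption about the file-based CDK layout):

```python
from unittest.mock import Mock

from airbyte_cdk.sources.file_based.file_based_stream_reader import (
    AbstractFileBasedStreamReader,
    FileReadMode,
)

reader = Mock(spec=AbstractFileBasedStreamReader)

# ... exercise the parser under test with `reader` ...

# Then assert the configured encoding was forwarded, rather than asserting
# inside a hand-rolled subclass. `remote_file` and `logger` stand for
# whatever the test passed to the parser.
reader.open_file.assert_called_once_with(remote_file, FileReadMode.READ, "utf8", logger)
```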
source-s3 test report (commit …)
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ✅ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ❌ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:
`airbyte-ci connectors --name=source-s3 test`
…te into alex/file_based_encoding
source-s3 test report (commit …)
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ✅ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ❌ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ❌ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:
`airbyte-ci connectors --name=source-s3 test`
source-s3 test report (commit …)
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ✅ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ❌ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Acceptance tests | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:
`airbyte-ci connectors --name=source-s3 test`
source-s3 test report (commit …)
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ✅ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Acceptance tests | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:
`airbyte-ci connectors --name=source-s3 test`
🥇
No blocker, as my comments only relate to testing.
```python
)
.set_stream_reader(TemporaryParquetFilesStreamReader(files=_single_parquet_file, file_type="parquet"))
.set_file_type("parquet")
.set_expected_read_error(ConfigValidationError, "Error creating stream config object. Contact Support if you need assistance.")
```
I guess this is invalid because of the CSV, right? Could we have a better error message? Also, could this be a unit test?
source-s3 test report (commit …)
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ✅ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Acceptance tests | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:
`airbyte-ci connectors --name=source-s3 test`
/approve-and-merge reason="S3 pipeline will not pass because the connector version isn't incremented"
…eam reader (airbytehq#29110)

* Add encoding to open_file interface
* pass the encoding set in the config
* cleanup
* cleanup
* Automated Commit - Formatting Changes
* Add missing test
* Automated Commit - Formatting Changes
* Update infer_schema too
* Automated Commit - Formatting Changes
* Update unit test
* add a unit test
* fix
* format
* format
* remove newline
* use a mock
* fix
* format

---------

Co-authored-by: girarda <girarda@users.noreply.github.com>
What
Enable S3 v4 to read files with various encodings.
How
Add an `encoding` parameter to the stream reader's `open_file` interface and pass the encoding set in the config.
Recommended reading order
1. airbyte-cdk/python/airbyte_cdk/sources/file_based/file_based_stream_reader.py
2. airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/csv_parser.py
3. airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/avro_parser.py
4. airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/jsonl_parser.py
5. airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/parquet_parser.py
6. airbyte-integrations/connectors/source-s3/source_s3/v4/stream_reader.py
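For context, a sketch of the interface change under review, reconstructed from the `open_file` signature shown in the review comments (the `FileReadMode` values are an assumption, consistent with `mode.value` being passed to smart_open):

```python
import io
import logging
from abc import ABC, abstractmethod
from enum import Enum
from typing import Optional


class FileReadMode(Enum):
    READ = "r"
    READ_BINARY = "rb"


class AbstractFileBasedStreamReader(ABC):
    @abstractmethod
    def open_file(
        self, file: "RemoteFile", mode: FileReadMode, encoding: Optional[str], logger: logging.Logger
    ) -> io.IOBase:
        """Open `file` in `mode`, decoding text with the configured `encoding` when provided."""
```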