File cdk parser and cursor updates #28900

clnoll · 2023-08-01T02:19:43Z

What

This PR consists of a few updates to handle some issues encountered during integration testing.

Updates the parquet parser so that we're returning all partitions. Unfortunately I haven't found a great way to add local test coverage for the exact bug that was encountered, but it is covered by S3 CATs. I created an issue to add it at a later time.
Updates the DefaultFileBasedCursor datetime format to agree with the format expected by CATs.
Returns the cursor in the state message, as required by CATs.

Recommended reading order

Parquet parser
File-based cursor

brianjlai

Mostly non-blocking comments!

But wanted to discuss the switch from milliseconds to seconds precision. Are we converting the entire file based CDK to the less precise type to satisfy the S3 CAT tests? Because if so it does feel a little bit backwards that we're reverting precision in the CDK to support a specific connector. Not taking CAT backwards compatibility into account, we do have the ability to add more transformations from a legacy config to the new config since we already have to do other ones.

brianjlai · 2023-08-01T17:12:41Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/cursor/default_file_based_cursor.py

@@ -13,7 +13,7 @@

 class DefaultFileBasedCursor(FileBasedCursor):
    DEFAULT_DAYS_TO_SYNC_IF_HISTORY_IS_FULL = 3
-    DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
+    DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


interesting, okay i'll revert my start_date PR back to this then. I was originally converting back out to milliseconds

So do our CAT tests in general just not support anything more granular than seconds? I'm curious because I do see some configs where we use more granular precisions. Or is this specifically to retain the precision being used by S3?

Hey sorry, I took a closer look at where the granularity came from that was causing CATs to fail, and realized that we just need to be internally consistent between the cursor granularity and the records that we're emitting. I updated the code & tests to use %f with everything.

got it. thanks for checking!
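
To illustrate the consistency point from this thread, here is a minimal sketch (not the CDK code itself) of why the record value and the cursor value need to share one format string. The names `SECONDS_ONLY_FORMAT`, `record_value`, `cursor_value` and `truncated_cursor` are made up for the example; only the two format strings come from the diff.

```python
from datetime import datetime

# Format strings taken from the diff; the constant names are illustrative only.
DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
SECONDS_ONLY_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

last_modified = datetime(2023, 8, 1, 2, 19, 43, 123456)

# What a record would carry and what the cursor would store, using one format.
record_value = last_modified.strftime(DATE_TIME_FORMAT)
cursor_value = last_modified.strftime(DATE_TIME_FORMAT)

# What the cursor would store if it dropped the fractional seconds.
truncated_cursor = last_modified.strftime(SECONDS_ONLY_FORMAT)

print(record_value == cursor_value)      # same format on both sides
print(record_value == truncated_cursor)  # mixed formats no longer match

# With "%f" the serialized value parses back to the original datetime.
roundtrip = datetime.strptime(record_value, DATE_TIME_FORMAT)
```

Either precision would work as long as both sides agree; the sketch only shows that mixing them produces values that no longer compare equal.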

brianjlai · 2023-08-01T17:15:08Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/parquet_parser.py

        schema = {field.name: ParquetParser.parquet_type_to_schema_type(field.type, parquet_format) for field in parquet_schema}
+        # Inferred partition schema
+        partition_columns = {x.split("=")[0]: {"type": "string"} for x in self._extract_partitions(file.uri)}


nit can we use a more descriptive variable name for the current partition than x
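
For reference, a sketch of what the suggested rename could look like. `infer_partition_schema` is a hypothetical standalone helper written for this example; in the PR the comprehension lives inline in the parser.

```python
from typing import Dict, List


def infer_partition_schema(partitions: List[str]) -> Dict[str, Dict[str, str]]:
    """Map each "key=value" partition segment to a string-typed column."""
    # "partition" replaces the single-letter "x" from the diff.
    return {partition.split("=")[0]: {"type": "string"} for partition in partitions}


partition_columns = infer_partition_schema(["year=2023", "month=08"])
print(partition_columns)
```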

brianjlai · 2023-08-01T17:17:50Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/parquet_parser.py

+
+    @staticmethod
+    def _extract_partitions(filepath: str) -> List[str]:
+        return [unquote(x) for x in filepath.split(os.sep) if "=" in x]
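
A standalone sketch of the partition extraction above, for readers who want to try it outside the parser. It assumes Hive-style `key=value` path segments and, unlike the diff, splits on `"/"` instead of `os.sep`; that swap is an assumption made here so the example behaves the same on every platform. The sample path is invented.

```python
from typing import List
from urllib.parse import unquote


def extract_partitions(filepath: str) -> List[str]:
    """Return the URL-decoded "key=value" segments of a file path."""
    # Assumption for this sketch: paths are URI-style and always use "/".
    return [unquote(segment) for segment in filepath.split("/") if "=" in segment]


partitions = extract_partitions("bucket/data/year=2023/city=New%20York/file.parquet")
print(partitions)
```

Segments without an `=` (the bucket, the directory and the file name) are ignored, and percent-encoded characters in partition values are decoded.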


brianjlai · 2023-08-01T17:46:14Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/cursor/default_file_based_cursor.py

-        state = {
-            "history": self._file_to_datetime_history,
-        }
+        state = {"history": self._file_to_datetime_history, "_ab_source_file_last_modified": self._get_cursor()}


In your CAT testing that had discrepancies, was _ab_source_file_last_modified always present in the state message even if the cursor was None? Or should it be omitted if so?

Hmm that isn't actually covered by the CATs. Do you know what we usually do in that situation?

WDYT about setting it to datetime.min?

Actually I think I'll leave it as-is for now, since we don't actually use the cursor field for deciding what to sync; it's just being used as a tool for integration tests.
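
The two options discussed in this thread can be sketched as follows. This is an illustration, not the cursor implementation: `build_state` and its `include_empty_cursor` flag are hypothetical, and deriving the cursor as the latest timestamp in the history is an assumption made for the example.

```python
from typing import Any, Dict, Optional

CURSOR_FIELD = "_ab_source_file_last_modified"


def build_state(history: Dict[str, str], include_empty_cursor: bool = True) -> Dict[str, Any]:
    """Build a state dict from a {file name: last-modified string} history."""
    # Assumption: fixed-width timestamps, so string max is also the latest time.
    cursor: Optional[str] = max(history.values()) if history else None
    state: Dict[str, Any] = {"history": history}
    # As-is behaviour keeps the key even when the cursor is None;
    # the alternative raised in review would omit it.
    if cursor is not None or include_empty_cursor:
        state[CURSOR_FIELD] = cursor
    return state


history = {
    "a.csv": "2023-08-01T00:00:00.000000Z",
    "b.csv": "2023-08-02T00:00:00.000000Z",
}
print(build_state(history))
print(build_state({}))
print(build_state({}, include_empty_cursor=False))
```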

brianjlai

looks good to me!

brianjlai · 2023-08-01T23:36:45Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

@@ -78,7 +78,7 @@ def read_records_from_slice(self, stream_slice: StreamSlice) -> Iterable[Mapping
        parser = self.get_parser(self.config.file_type)
        for file in stream_slice["files"]:
            # only serialize the datetime once
-            file_datetime_string = file.last_modified.strftime("%Y-%m-%dT%H:%M:%SZ")
+            file_datetime_string = file.last_modified.strftime("%Y-%m-%dT%H:%M:%S.%fZ")


nit: let's put this as a constant on the class just in case we do end up using time in multiple places
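
A minimal sketch of that suggestion, assuming a simplified stand-in class. `FileStreamSketch` and `format_last_modified` are hypothetical names, not the CDK's stream API.

```python
from datetime import datetime


class FileStreamSketch:
    # Single source of truth for the serialized timestamp format.
    DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

    def format_last_modified(self, last_modified: datetime) -> str:
        """Serialize a file's last-modified time with the class-level format."""
        return last_modified.strftime(self.DATE_TIME_FORMAT)


formatted = FileStreamSketch().format_last_modified(datetime(2023, 8, 1, 23, 36, 45))
print(formatted)
```

Keeping the format as a class attribute means any other method that serializes a timestamp can reuse it, and a subclass can override it in one place.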

* File-based CDK: update parquet parser to handle partitions * File-based CDK: make the record output & cursor date time format consistent

clnoll requested a review from a team as a code owner August 1, 2023 02:19

octavia-squidington-iii added the CDK Connector Development Kit label Aug 1, 2023

clnoll requested review from girarda, brianjlai and maxi297 August 1, 2023 02:19

brianjlai reviewed Aug 1, 2023

clnoll force-pushed the file-cdk-parser-and-cursor-updates branch from bb2ff24 to fef04b8 Compare August 1, 2023 22:14

clnoll requested a review from brianjlai August 1, 2023 22:18

clnoll added 3 commits August 1, 2023 18:22

File-based CDK: update parquet parser to handle partitions

1563998

File-based CDK: make the record output & cursor date time format cons…

16af96d

…istent

CI fix

d921f11

clnoll force-pushed the file-cdk-parser-and-cursor-updates branch from fef04b8 to d921f11 Compare August 1, 2023 22:32

brianjlai approved these changes Aug 1, 2023

Make file-based stream date time format a class attribute

ef3c3f4

clnoll merged commit 09ebb47 into master Aug 2, 2023

clnoll deleted the file-cdk-parser-and-cursor-updates branch August 2, 2023 01:47

clnoll restored the file-cdk-parser-and-cursor-updates branch August 2, 2023 01:48

bnchrch pushed a commit that referenced this pull request Aug 3, 2023

File cdk parser and cursor updates (#28900)

031b6b2

* File-based CDK: update parquet parser to handle partitions * File-based CDK: make the record output & cursor date time format consistent
