File cdk parser and cursor updates #28900
Conversation
Mostly non-blocking comments!
But I wanted to discuss the switch from millisecond to second precision. Are we converting the entire file-based CDK to the less precise type just to satisfy the S3 CAT tests? If so, it feels a little backwards to revert precision in the CDK to support a specific connector. Setting CAT backward compatibility aside, we do have the ability to add more transformations from a legacy config to the new config, since we already have to do other ones.
@@ -13,7 +13,7 @@
 class DefaultFileBasedCursor(FileBasedCursor):
     DEFAULT_DAYS_TO_SYNC_IF_HISTORY_IS_FULL = 3
-    DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
+    DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
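To make the precision trade-off concrete, here is a small standalone sketch (not the CDK's actual code; the variable names are illustrative) showing how the two format strings in the diff round-trip a timestamp:

```python
from datetime import datetime

# The two formats discussed above: with and without sub-second precision.
MICROS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
SECONDS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

ts = datetime(2023, 8, 1, 12, 30, 45, 123456)

with_micros = ts.strftime(MICROS_FORMAT)      # keeps sub-second precision
without_micros = ts.strftime(SECONDS_FORMAT)  # truncates to whole seconds

print(with_micros)     # 2023-08-01T12:30:45.123456Z
print(without_micros)  # 2023-08-01T12:30:45Z

# Parsing back with the coarser format loses the microseconds entirely:
round_tripped = datetime.strptime(without_micros, SECONDS_FORMAT)
assert round_tripped.microsecond == 0
```

Any two timestamps within the same second compare equal after the coarser round trip, which is why cursor comparisons can behave differently under the two formats.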
interesting, okay i'll revert my start_date PR back to this then. I was originally converting back out to milliseconds
So do our CAT tests in general just not support anything more granular than seconds? I'm curious because I do see some configs where we use more granular precisions. Or is this specifically to retain the precision being used by S3?
Hey sorry, I took a closer look at where the granularity came from that was causing CATs to fail, and realized that we just need to be internally consistent between the cursor granularity and the records that we're emitting. I updated the code & tests to use %f with everything.
got it. thanks for checking!
        schema = {field.name: ParquetParser.parquet_type_to_schema_type(field.type, parquet_format) for field in parquet_schema}
        # Inferred partition schema
        partition_columns = {x.split("=")[0]: {"type": "string"} for x in self._extract_partitions(file.uri)}
nit: can we use a more descriptive variable name for the current partition than x?
👍 done.
    @staticmethod
    def _extract_partitions(filepath: str) -> List[str]:
        return [unquote(x) for x in filepath.split(os.sep) if "=" in x]
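For illustration, the helper above can be exercised standalone. This sketch mirrors the diff, but the free function and the sample path are assumptions made for the example; the real method is a @staticmethod on the stream class:

```python
import os
from typing import List
from urllib.parse import unquote

def _extract_partitions(filepath: str) -> List[str]:
    # Hive-style partitions appear as "key=value" segments in the path.
    return [unquote(part) for part in filepath.split(os.sep) if "=" in part]

# Hypothetical partitioned file URI, built with os.sep for portability.
uri = os.sep.join(["bucket", "year=2023", "month=08", "data.parquet"])
partitions = _extract_partitions(uri)
print(partitions)  # ['year=2023', 'month=08']

# The inferred partition schema from the parser diff above, restated with
# a descriptive loop variable per the review nit:
partition_columns = {partition.split("=")[0]: {"type": "string"} for partition in partitions}
print(partition_columns)  # {'year': {'type': 'string'}, 'month': {'type': 'string'}}
```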
same here
👍 done.
-        state = {
-            "history": self._file_to_datetime_history,
-        }
+        state = {"history": self._file_to_datetime_history, "_ab_source_file_last_modified": self._get_cursor()}
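A minimal sketch of the state shape being discussed, using a stand-in build_state helper and sample history (both assumptions; the real logic lives on DefaultFileBasedCursor). It shows that with the as-is behavior, the cursor key is present with a None value when no cursor exists yet:

```python
from typing import Any, Dict, Optional

def build_state(history: Dict[str, str], cursor: Optional[str]) -> Dict[str, Any]:
    # Mirrors the diff above: the cursor value is always included,
    # even when it is None (i.e. no files have been synced yet).
    return {"history": history, "_ab_source_file_last_modified": cursor}

# Before any sync, the history is empty and there is no cursor:
state = build_state({}, None)
print(state)  # {'history': {}, '_ab_source_file_last_modified': None}

# After syncing a file, both fields are populated:
state = build_state({"a.csv": "2023-08-01T12:30:45.123456Z"}, "2023-08-01T12:30:45.123456Z")
print(state["_ab_source_file_last_modified"])  # 2023-08-01T12:30:45.123456Z
```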
In your CAT testing that had discrepancies, was _ab_source_file_last_modified always present in the state message even if the cursor was None? Or should it be omitted in that case?
Hmm that isn't actually covered by the CATs. Do you know what we usually do in that situation?
WDYT about setting it to datetime.min?
Actually I think I'll leave it as-is for now, since we don't actually use the cursor field for deciding what to sync; it's just being used as a tool for integration tests.
looks good to me!
@@ -78,7 +78,7 @@ def read_records_from_slice(self, stream_slice: StreamSlice) -> Iterable[Mapping
     parser = self.get_parser(self.config.file_type)
     for file in stream_slice["files"]:
         # only serialize the datetime once
-        file_datetime_string = file.last_modified.strftime("%Y-%m-%dT%H:%M:%SZ")
+        file_datetime_string = file.last_modified.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
nit: let's put this format string as a constant on the class, in case we end up using it in multiple places
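A sketch of the suggestion above: hoist the format string into a class-level constant so the cursor format and the record serialization cannot drift apart. The class and method names here are illustrative stand-ins, not the actual CDK code:

```python
from datetime import datetime

class FileBasedStreamSketch:
    # Single source of truth for the timestamp format, shared by the
    # cursor and by record serialization.
    DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

    def serialize_last_modified(self, last_modified: datetime) -> str:
        # Every caller goes through this constant instead of repeating
        # the format string inline.
        return last_modified.strftime(self.DATE_TIME_FORMAT)

stream = FileBasedStreamSketch()
formatted = stream.serialize_last_modified(datetime(2023, 8, 1, 12, 30, 45, 123456))
print(formatted)  # 2023-08-01T12:30:45.123456Z
```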
* File-based CDK: update parquet parser to handle partitions
* File-based CDK: make the record output & cursor date time format consistent

What

This PR consists of a few updates to handle some issues encountered during integration testing:

* Updated the DefaultFileBasedCursor datetime format to agree with the format expected by CATs.

Recommended reading order