-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
File-based CDK: handle legacy path_prefix + globs #29389
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test cases and the code change looks good!
Questions:
- are we losing any expressivity going from two fields to one?
- do we need to document a migration path for users to re-write their config in the v4 format? For instance, if a user wants to update the config of an existing connector configured with the legacy spec
@girarda to answer your questions - I don't think we're losing expressivity by collapsing 2 fields to 1. In v3, the path prefix is just passed straight to S3, and then the globs are matched against the result. So in some configs there is redundancy between the prefix and the globs, because if you want the glob to match a URI with a particular prefix your glob may have to repeat the prefix. Users can also only specify one prefix in v3, whereas now we allow multiple. As far as a migration path goes, I think we could script that using the config conversion tools. |
6f15d0d
to
d7dad51
Compare
d7dad51
to
7b95e0e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as a migration path goes, I think we could script that using the config conversion tools
@clnoll can you write a follow up issue to migrate existing configs to v4?
When doing some testing I noticed that we weren't always matching the same files in v3 and v4 of the S3 connector.
This is because we weren't quite handling the differences in semantics between having separate
path_prefix
andpath_pattern
(from v3) to simply usingglobs
(and extracting a path prefix for more efficient S3 querying), when handling v3 config in v4 code.The simplest way that I found to honor the legacy config was to keep
path_prefix
as-is in the v4 stream reader; if it's present we'll use it, but otherwise we'll just infer the prefix from the glob as we have been doing in v4.I'm mapping
path_prefix
tolegacy_prefix
and making it a hidden field because I don't think this is something we should support in v4. It's much cleaner and resolves some ambiguities that users have found confusing - to just have a single place,globs
, where we determine which files are matched.Testing
I used this script to run the tests. Prior to this I ran a separate script to create and populate all the files in the
FILEPATHS
list in theclnoll-file-based-cdk-test-files
bucket (the filepaths,path_prefixes
andpath_patterns
were created from patterns identified in user configs). Please let me know if there are cases you feel like aren't sufficiently covered and I'll add the files and prefixes/patterns.