-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ETL-561] Unquote S3 key name before submitting to Glue #81
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏼 LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a comment!
@@ -114,7 +115,7 @@ def lambda_handler(event, context) -> None: | |||
else: | |||
for s3_event in s3_event_records["Records"]: | |||
bucket_name = s3_event["s3"]["bucket"]["name"] | |||
object_key = s3_event["s3"]["object"]["key"] | |||
object_key = parse.unquote(s3_event["s3"]["object"]["key"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have some test cases (or some actual pilot data with the special characters in their name) to test this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on test case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that would be testing the behavior of urllib.parse.unquote, which we can assume behaves well. If we look at the behavior of this function:
Replace %xx escapes with their single-character equivalent.
there aren't any realistic side-effects of using this function vs passing the key as we receive it from SQS. (Unless Care Evolution changed the format of their exports to include unicode encodings? Do we really have to consider that?)
Additionally, this will be tested as part of our integration tests once we complete https://sagebionetworks.jira.com/browse/ETL-562
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's not necessarily testing urllib.parse.unquote
directly but rather testing that when we run this lambda_handler
function entirely, we are submitting object keys to the glue workflow that are expected (in this particular scenario: decoded) i.e: all that other code is not doing something to the object key that is unexpected for example. This way in the future if we add more code to this lambda_handler
function, we should still expect that whatever that code is, what we submit to the glue workflow is as we expect
Now whether that is through our CI/CD integration tests with test data like you mentioned in a future ticket and/or through tests in our test_s3_to_glue_lambda
, that's fine on either end. It feels like though, if something were to break, it may be worthwhile catching it in the test_s3_to_glue_lambda
rather than later in our integration tests (though may be harder to mimic the behavior)
Maybe that warrants a bigger discussion of if we want to enforce an expected, consistent format for the names of the S3 files before it gets processed so we don't have a lot of variability in the names and have to handle it each time it pops up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I refactored app.py and added a test to see if we are handling quoted key names appropriately.
c03e844
to
d497ed8
Compare
See ticket for details.