Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ETL-561] Unquote S3 key name before submitting to Glue #81

Merged
merged 1 commit into from
Sep 29, 2023
Merged

Conversation

philerooski
Copy link
Contributor

See ticket for details.

@philerooski philerooski requested a review from a team as a code owner September 27, 2023 20:24
@philerooski philerooski temporarily deployed to develop September 27, 2023 20:27 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop September 27, 2023 20:32 — with GitHub Actions Inactive
Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼 LGTM!

Copy link
Contributor

@rxu17 rxu17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a comment!

@@ -114,7 +115,7 @@ def lambda_handler(event, context) -> None:
else:
for s3_event in s3_event_records["Records"]:
bucket_name = s3_event["s3"]["bucket"]["name"]
object_key = s3_event["s3"]["object"]["key"]
object_key = parse.unquote(s3_event["s3"]["object"]["key"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have some test cases (or some actual pilot data with the special characters in their name) to test this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on test case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that would be testing the behavior of urllib.parse.unquote, which we can assume behaves well. If we look at the behavior of this function:

Replace %xx escapes with their single-character equivalent.

there aren't any realistic side-effects of using this function vs passing the key as we receive it from SQS. (Unless Care Evolution changed the format of their exports to include unicode encodings? Do we really have to consider that?)

Additionally, this will be tested as part of our integration tests once we complete https://sagebionetworks.jira.com/browse/ETL-562

Copy link
Contributor

@rxu17 rxu17 Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not necessarily testing urllib.parse.unquote directly but rather testing that when we run this lambda_handler function entirely, we are submitting object keys to the glue workflow that are expected (in this particular scenario: decoded) i.e: all that other code is not doing something to the object key that is unexpected for example. This way in the future if we add more code to this lambda_handler function, we should still expect that whatever that code is, what we submit to the glue workflow is as we expect

Now whether that is through our CI/CD integration tests with test data like you mentioned in a future ticket and/or through tests in our test_s3_to_glue_lambda, that's fine on either end. It feels like though, if something were to break, it may be worthwhile catching it in the test_s3_to_glue_lambda rather than later in our integration tests (though may be harder to mimic the behavior)

Maybe that warrants a bigger discussion of if we want to enforce an expected, consistent format for the names of the S3 files before it gets processed so we don't have a lot of variability in the names and have to handle it each time it pops up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I refactored app.py and added a test to see if we are handling quoted key names appropriately.

@philerooski philerooski temporarily deployed to develop September 27, 2023 23:44 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop September 27, 2023 23:44 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop September 27, 2023 23:44 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop September 27, 2023 23:44 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop September 27, 2023 23:48 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop September 27, 2023 23:48 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop September 27, 2023 23:54 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop September 27, 2023 23:58 — with GitHub Actions Inactive
@philerooski philerooski merged commit ab93796 into main Sep 29, 2023
14 checks passed
@philerooski philerooski deleted the etl-561 branch September 29, 2023 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants