[ETL-673] Upload compressed JSON as part of S3 to JSON #135
Conversation
LGTM!
There was a bug where I was uploading the files before closing their file objects, but I believe that's resolved now.
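The fix described above boils down to closing (and thereby flushing) the file object before handing the path to the upload step. A minimal sketch of that pattern, with `upload_fn` standing in for the actual S3 upload call (hypothetical name, not from this PR):

```python
import gzip
import json
import tempfile


def write_and_upload(records, upload_fn):
    """Write records as gzipped NDJSON to a temp file, then upload it.

    `upload_fn` is a stand-in for an S3 upload call. The key point is
    that the gzip file object is closed -- flushing its buffers -- before
    the file path is handed off for upload.
    """
    # delete=False so the closed file persists for the upload step.
    tmp = tempfile.NamedTemporaryFile(suffix=".json.gz", delete=False)
    tmp.close()
    with gzip.open(tmp.name, "wt") as gz:
        for record in records:
            gz.write(json.dumps(record) + "\n")
    # The file is fully written and closed here, so the upload
    # sees complete data rather than a partially flushed file.
    return upload_fn(tmp.name)
```

Uploading inside the `with` block instead would risk shipping a truncated or empty file, which matches the bug described.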
current_output_path = output_path
line_count = 0
# fmt: off
# <python3.10 requires this backslash syntax, we currently run 3.7
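For context on the comment above: before Python 3.10, a `with` statement with multiple context managers cannot be wrapped in parentheses, so long statements need backslash continuations (which formatters like Black dislike, hence the `# fmt: off`). A small runnable illustration, using throwaway temp files rather than anything from this PR:

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "input.json")
dst = os.path.join(tmp, "part-0.json")
with open(src, "w") as f:
    f.write('{"ok": true}\n')

# Python < 3.10: multiple context managers on one `with` must be
# continued with backslashes if the line is too long.
with open(src) as f_in, \
     open(dst, "w") as f_out:
    f_out.write(f_in.read())

# Python 3.10+ accepts the parenthesized form instead:
# with (
#     open(src) as f_in,
#     open(dst, "w") as f_out,
# ):
#     ...
```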
Thanks. We do have a ticket to update to Python 3.10 — do you think we should do that sooner rather than later?
No. We may end up abandoning Glue altogether, so it doesn't make sense to spend time upgrading Python just to avoid these minor inconveniences.
🔥 LGTM! Thanks for the review and work here!
In addition to partitioning and writing JSON to JSON datasets as "part files", write the JSON as a gzip archive to a separate `compressed_json` S3 prefix. This PR is meant to ease the transition towards exclusively using compressed JSON data as the input to JSON to Parquet. Hence, there are a few changes here which are meant to be refactored later on once we remove the logic for writing uncompressed part files. The primary changes are to the function `write_file_to_json_dataset`, with some supporting changes in helper functions. You can get a more concrete feel for how this affects the behavior of S3 to JSON by examining the result of the integration tests in `s3://recover-dev-intermediate-data/etl-673/`.
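A rough sketch of what writing under a separate `compressed_json` prefix might look like. The prefix layout, the dataset name, and both helper names below are illustrative assumptions — the actual behavior lives in `write_file_to_json_dataset` and its helpers:

```python
import gzip
import json
import posixpath


def compressed_json_key(dataset: str, filename: str) -> str:
    """Build an S3 key under a `compressed_json` prefix (layout assumed)."""
    return posixpath.join("compressed_json", dataset, filename + ".gz")


def gzip_json_lines(records) -> bytes:
    """Serialize records as newline-delimited JSON and gzip the result."""
    body = "".join(json.dumps(r) + "\n" for r in records)
    return gzip.compress(body.encode("utf-8"))
```

The resulting bytes could then be uploaded with a boto3 `put_object` call against the intermediate-data bucket, alongside (for now) the uncompressed part files.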