Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ETL-673] Upload compressed JSON as part of S3 to JSON #135

Merged
merged 2 commits into from
Aug 21, 2024
Merged

Conversation

philerooski
Copy link
Contributor

In addition to partitioning and writing JSON to JSON datasets as "part files", e.g.,

FitbitIntradayCombined_20220111-20230103.part0.ndjson
FitbitIntradayCombined_20220111-20230103.part1.ndjson
...

Write the JSON as a gzip archive to a separate compressed_json S3 prefix.

FitbitIntradayCombined_20220111-20230103.ndjson.gz

This PR is meant to ease the transition towards exclusively using compressed JSON data as the input to JSON to Parquet. Hence, there are a few changes here which are meant to be refactored later on once we remove the logic for writing uncompressed part files. The primary changes are to the function write_file_to_json_dataset with some supporting changes in helper functions.

You can get a more concrete feel for how this affects the behavior of S3 to JSON by examining the result of the integration tests in s3://recover-dev-intermediate-data/etl-673/.

@philerooski philerooski requested a review from a team as a code owner August 20, 2024 18:28
Copy link
Contributor

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link

@philerooski
Copy link
Contributor Author

There was a bug where I was uploading the files before closing their file objects, but I believe that's resolved now.

current_output_path = output_path
line_count = 0
# fmt: off
# <python3.10 requires this backslash syntax, we currently run 3.7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, we do have a ticket to update to python 3.10, do you think we should do that sooner than later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we may end up abandoning Glue altogether so it doesn't make sense spending time upgrading python just to avoid these minor inconveniences.

Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 LGTM! Thanks for the review and work here!

@philerooski philerooski merged commit eadb2a9 into main Aug 21, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants