[ETL-673] Upload compressed JSON as part of S3 to JSON #135

philerooski · 2024-08-20T18:28:33Z

In addition to partitioning and writing JSON to JSON datasets as "part files", e.g.,

FitbitIntradayCombined_20220111-20230103.part0.ndjson
FitbitIntradayCombined_20220111-20230103.part1.ndjson
...

write the JSON as a gzip-compressed file to a separate compressed_json S3 prefix, e.g.,

FitbitIntradayCombined_20220111-20230103.ndjson.gz

This PR is meant to ease the transition toward using compressed JSON exclusively as the input to JSON to Parquet. Hence, a few of the changes here are meant to be refactored once we remove the logic for writing uncompressed part files. The primary changes are to the function write_file_to_json_dataset, with some supporting changes in helper functions.
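As a rough illustration of the dual-write behavior described above, here is a minimal sketch that writes the same records both as NDJSON part files and as one gzip-compressed NDJSON file. It is not the code in this PR: the helper name write_records, the max_lines_per_part rule, and the use of a local directory instead of S3 are all assumptions made for the example.

```python
import gzip
import json
import os
import tempfile


def write_records(records, output_dir, basename, max_lines_per_part=2):
    """Hypothetical helper: write NDJSON part files plus one .ndjson.gz file.

    Partitioning by line count is an assumption; the real job may split
    part files by a different rule (for example, by size).
    """
    part_paths = []
    gz_path = os.path.join(output_dir, f"{basename}.ndjson.gz")
    part_file = None
    try:
        with gzip.open(gz_path, "wt", encoding="utf-8") as gz_file:
            for i, record in enumerate(records):
                if i % max_lines_per_part == 0:
                    # Start a new part file, closing the previous one first.
                    if part_file is not None:
                        part_file.close()
                    part_path = os.path.join(
                        output_dir, f"{basename}.part{len(part_paths)}.ndjson"
                    )
                    part_paths.append(part_path)
                    part_file = open(part_path, "w", encoding="utf-8")
                line = json.dumps(record) + "\n"
                part_file.write(line)  # uncompressed part file
                gz_file.write(line)    # same line into the gzip stream
    finally:
        if part_file is not None:
            part_file.close()
    return part_paths, gz_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        parts, gz = write_records(
            [{"value": n} for n in range(5)], tmp_dir, "Example_20220111-20230103"
        )
        print([os.path.basename(p) for p in parts], os.path.basename(gz))
```

In this sketch both outputs are closed before the function returns, so a caller could upload them afterwards without reading partially written files.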

You can get a more concrete feel for how this affects the behavior of S3 to JSON by examining the result of the integration tests in s3://recover-dev-intermediate-data/etl-673/.

BryanFauble

LGTM!

sonarqubecloud · 2024-08-20T22:08:28Z

Quality Gate passed

Issues
14 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
15.4% Duplication on New Code

See analysis details on SonarCloud

philerooski · 2024-08-20T22:40:08Z

There was a bug where I was uploading the files before closing their file objects, but I believe that's resolved now.
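To make the ordering issue concrete, here is a minimal sketch of why a gzip file must be closed before it is uploaded. It does not reproduce the code in this PR: fake_upload is a hypothetical stand-in for the S3 upload that simply reads the bytes currently on disk, and both function names are assumptions.

```python
import gzip
import os
import tempfile


def fake_upload(path):
    """Hypothetical stand-in for an S3 upload: return the bytes on disk now."""
    with open(path, "rb") as f:
        return f.read()


def upload_before_close(path, lines):
    """Buggy ordering: the 'upload' runs while the gzip stream is still open."""
    with gzip.open(path, "wt", encoding="utf-8") as gz_file:
        for line in lines:
            gz_file.write(line + "\n")
        # The gzip trailer has not been written yet, and buffered data
        # may not have reached the disk either.
        return fake_upload(path)


def upload_after_close(path, lines):
    """Fixed ordering: leave the with block (closing the file) first."""
    with gzip.open(path, "wt", encoding="utf-8") as gz_file:
        for line in lines:
            gz_file.write(line + "\n")
    return fake_upload(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        early = upload_before_close(os.path.join(tmp_dir, "early.ndjson.gz"), ["a", "b"])
        final = upload_after_close(os.path.join(tmp_dir, "final.ndjson.gz"), ["a", "b"])
        print(len(early), len(final))
```

The bytes captured by upload_before_close are always shorter than the finished file, because closing the stream is what flushes the remaining compressed data and appends the gzip trailer.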

thomasyu888 · 2024-08-20T22:52:25Z

src/glue/jobs/s3_to_json.py

-        current_output_path = output_path
-        line_count = 0
+    # fmt: off
+    # <python3.10 requires this backslash syntax, we currently run 3.7


Thanks. We do have a ticket to update to Python 3.10; do you think we should do that sooner rather than later?

philerooski · 2024-08-21

No. We may end up abandoning Glue altogether, so it doesn't make sense to spend time upgrading Python just to avoid these minor inconveniences.
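For context on the syntax being discussed: the comment in the diff presumably refers to a with statement that opens several files at once. Parenthesized multi-line context managers are officially documented starting with Python 3.10, so on 3.7 the line break needs a backslash, and "# fmt: off" presumably keeps a formatter such as Black from rewriting it. The sketch below is an assumption about the shape of that code, not a copy of it; the function name and paths are hypothetical.

```python
import gzip
import os
import tempfile


def write_plain_and_gzip(lines, plain_path, gz_path):
    """Hypothetical helper: write the same lines to a plain and a gzip file."""
    # fmt: off
    # Backslash continuation works on Python 3.7. On 3.10+ this could be
    # written as: with (open(...) as plain_file, gzip.open(...) as gz_file):
    with open(plain_path, "w", encoding="utf-8") as plain_file, \
         gzip.open(gz_path, "wt", encoding="utf-8") as gz_file:
        for line in lines:
            plain_file.write(line + "\n")
            gz_file.write(line + "\n")
    # fmt: on


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        write_plain_and_gzip(
            ['{"a": 1}'],
            os.path.join(tmp_dir, "example.ndjson"),
            os.path.join(tmp_dir, "example.ndjson.gz"),
        )
```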

thomasyu888

🔥 LGTM! Thanks for the review and work here!

philerooski requested a review from a team as a code owner August 20, 2024 18:28

Upload compressed JSON in S3 to JSON

cd9e5e7

philerooski force-pushed the etl-673 branch from 355ea44 to cd9e5e7 Compare August 20, 2024 18:46

philerooski temporarily deployed to develop August 20, 2024 18:46 — with GitHub Actions Inactive

philerooski temporarily deployed to develop August 20, 2024 18:49 — with GitHub Actions Inactive

philerooski temporarily deployed to develop August 20, 2024 19:04 — with GitHub Actions Inactive

philerooski temporarily deployed to develop August 20, 2024 19:06 — with GitHub Actions Inactive

BryanFauble approved these changes Aug 20, 2024

philerooski temporarily deployed to develop August 20, 2024 21:54 — with GitHub Actions Inactive

philerooski had a problem deploying to develop August 20, 2024 21:57 — with GitHub Actions Failure

philerooski had a problem deploying to develop August 20, 2024 21:57 — with GitHub Actions Error

squash bug where we upload file before buffer is closed

77048b5

philerooski force-pushed the etl-673 branch from ea50f7b to 77048b5 Compare August 20, 2024 22:08

philerooski temporarily deployed to develop August 20, 2024 22:08 — with GitHub Actions Inactive

philerooski temporarily deployed to develop August 20, 2024 22:11 — with GitHub Actions Inactive

philerooski temporarily deployed to develop August 20, 2024 22:17 — with GitHub Actions Inactive

philerooski temporarily deployed to develop August 20, 2024 22:19 — with GitHub Actions Inactive

philerooski temporarily deployed to develop August 20, 2024 22:20 — with GitHub Actions Inactive

thomasyu888 reviewed Aug 20, 2024

thomasyu888 approved these changes Aug 20, 2024

philerooski merged commit eadb2a9 into main Aug 21, 2024
18 checks passed
