[ETL-580] JSON to Parquet - Write record count of each export to S3 #93
Conversation
Enables us to merge nested structures in `logger_context` with local logging info, particularly the `labels` object in `logger_context`
src/glue/jobs/json_to_parquet.py
```
@@ -332,13 +414,133 @@ def write_table_to_s3(
        format = "parquet",
        transformation_ctx="write_dynamic_frame")


def count_records_for_event(
        table: "pyspark.sql.dataframe.DataFrame",
        event: str,
```
Since `event` is restricted to a set of strings, would it make sense to convert this to a string Enum?
Technically I can't. StrEnum was introduced in 3.11 and this runs in 3.10.
I've never used StrEnum as a type hint before. From the Enum tutorial it seems like it was designed to provide syntactic sugar for working with classes, rather than as a type hint. Is there a benefit to adding the type hint in addition to the docstring and the check we do inside the function itself?
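(For reference, the usual pre-3.11 workaround is to mix `str` into a plain `Enum`, which behaves much like `StrEnum`. A minimal sketch, not code from this PR:)

```python
from enum import Enum

# Mixing `str` into Enum makes each member *be* a string, which is
# essentially what StrEnum formalizes in Python 3.11+.
class EventType(str, Enum):
    READ = "READ"
    WRITE = "WRITE"

assert EventType.READ == "READ"               # members compare equal to plain strings
assert EventType("WRITE") is EventType.WRITE  # lookup by value still works
```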
I was thinking more of Enum in general, not specifically a StrEnum (I didn't even know that was a thing).
```python
from enum import Enum

class EventType(Enum):
    """The event associated with a count."""

    READ = "READ"
    """This table has just now been read from the Glue table catalog
    and has not yet had any transformations done to it."""

    DROP_DUPLICATES = "DROP_DUPLICATES"
    """This table has just now had duplicate records dropped
    (see the function `drop_table_duplicates`)."""

    DROP_DELETED_SAMPLES = "DROP_DELETED_SAMPLES"
    """This table has just now had records which are
    present in its respective "Deleted" table dropped (see the function
    `drop_deleted_healthkit_data`)."""

    WRITE = "WRITE"
    """This table has just now been written to S3."""


def do_thing(event_type: EventType):
    print(event_type)


do_thing(EventType.READ)
do_thing(EventType.DROP_DUPLICATES)
do_thing(EventType.DROP_DELETED_SAMPLES)
do_thing(EventType.WRITE)
```
The point is that I don't need to know the string constant to use. I could check the docstring, but I can also just pick one of the available Enum values (e.g., `EventType.READ`).
That makes sense. It has some nifty features. I'll try it out.
Excellent work!
LGTM! Just a few comments.
src/glue/jobs/s3_to_json.py
```
@@ -643,6 +646,41 @@ def _upload_file_to_json_dataset(
    os.remove(file_path)
    return s3_output_key


def merge_dicts(x: dict, y: dict) -> Generator:
```
Nit: just out of curiosity, do you know if this feature already exists in some package?
Not in the standard library afaik. IMO, it's not worth adding another dependency.
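(For illustration: the standard-library options only merge shallowly, which is exactly the overwriting problem `merge_dicts` avoids. A minimal example, not code from this PR; the dict contents are made up:)

```python
logger_context = {"labels": {"job": "s3_to_json", "process": "count"}}
extra = {"labels": {"process": "merge"}}

# Both `{**a, **b}` and the 3.9+ union operator replace nested values wholesale:
merged = logger_context | extra
print(merged)  # {'labels': {'process': 'merge'}} -- the 'job' label is lost
```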
💯 Awesome work!
Quality Gate passed: The SonarCloud Quality Gate passed, but some issues were introduced. 3 new issues.
Summary of changes:
src/glue/jobs/json_to_parquet.py
tests/test_json_to_parquet.py
Whenever a read, write, or transform operation is done with the data, we record the number of records for each `export_end_date` (which acts as an identifier for the original export). To support this functionality, functions which did any reading/writing/transforming were split up to be more atomic. Each of these functions requires a `record_counts` and a `logger_context` object, which is passed to the function responsible for doing the counting: `count_records_for_event`. Since there are typically multiple read/write/transform events per data type, we accumulate these within `record_counts` throughout the job. As the very last step in the job, we concatenate each of these event-specific counts within each data type and write them as a CSV file to S3. Oftentimes there is only one data type which needs counts in a job (after all, the jobs are data-type specific), but for data types which have an associated "deleted" table containing deleted samples, we do counts for those under a separate data type. The docstrings for `count_records_for_event` and `store_record_counts` go into the nitty-gritty details. A sketch of the flow is shown below.

Originally I had tried logging the count information in the ECS format, hence there being some ancillary changes related to ECS logging. This didn't work out (see the Jira ticket), but the ECS work will support future work on https://sagebionetworks.jira.com/browse/ETL-573 .
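To make the flow above concrete, here is a hedged sketch of what `count_records_for_event` and `store_record_counts` might look like; the exact signatures, the `data_type` parameter, and the S3 key layout are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical sketch of the counting/storing flow described above.
import boto3
import pandas
from pyspark.sql import DataFrame

def count_records_for_event(
    table: DataFrame,
    event: str,
    data_type: str,
    record_counts: dict[str, list],
) -> dict[str, list]:
    """Record the number of records per export_end_date for one event."""
    counts = table.groupBy("export_end_date").count().toPandas()
    counts["event"] = event
    # Accumulate this event's counts under its data type
    record_counts.setdefault(data_type, []).append(counts)
    return record_counts

def store_record_counts(record_counts: dict[str, list], bucket: str, prefix: str) -> None:
    """Concatenate each data type's event counts and write a CSV to S3."""
    s3 = boto3.client("s3")
    for data_type, counts in record_counts.items():
        body = pandas.concat(counts).to_csv(index=False)
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{data_type}.csv", Body=body)
```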
Also, a note on tests: the JSON to Parquet tests were originally written as a blend of unit and integration tests. These will eventually need to be rewritten as pure unit tests, so any additional tests added here have been written as pure unit tests. Fixing the rest of the tests was outside the scope of this ticket (which already took me ~3 weeks to complete), so those are mostly unchanged.
tests/test_json_to_parquet/TestFlatInsertedDateDataType_20230512.ndjson
tests/test_json_to_parquet/TestFlatInsertedDateDataType_20230612.ndjson
We expect our JSON to always contain an `export_end_date` field since we use that field while counting records. This adds that field to these test data.

config/develop/namespaced/glue-job-JSONToParquet.yaml
config/prod/namespaced/glue-job-JSONToParquet.yaml
templates/glue-job-JSONToParquet.j2
tests/Dockerfile.aws_glue_4
Add support for additional Python modules in the JSON to Parquet job (specifically, ECS logging).
src/glue/jobs/s3_to_json.py
Added a `merge_dicts` function which resolves the potential issue where, if a dict is passed to the logger, it could overwrite the same dict in the `logger_context`. Merging dictionaries in a smart way is not trivial, but I think this handles the most obvious case pretty well. Specifically: when passing additional `labels` (a dict), we merge this with the `labels` in the `logger_context`. Previously, the locally defined `labels` would overwrite the `labels` in the `logger_context`.
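A minimal sketch of what such a recursive merge can look like (an illustration under assumptions, matching the `dict, dict -> Generator` signature shown in the diff but not necessarily the PR's implementation; the example dicts are made up):

```python
from typing import Generator

def merge_dicts(x: dict, y: dict) -> Generator:
    """Yield (key, value) pairs of a recursive merge of `y` into `x`.

    Where both sides hold a dict, merge recursively; otherwise `y` wins.
    """
    for key in x.keys() | y.keys():
        if isinstance(x.get(key), dict) and isinstance(y.get(key), dict):
            yield key, dict(merge_dicts(x[key], y[key]))
        elif key in y:
            yield key, y[key]
        else:
            yield key, x[key]

logger_context = {"labels": {"job": "s3_to_json", "process": "count"}}
extra = {"labels": {"process": "merge"}}
print(dict(merge_dicts(logger_context, extra)))
# {'labels': {'job': 's3_to_json', 'process': 'merge'}} -- 'job' is preserved
```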