
[ETL-287] JSON to Parquet job #11

Merged
merged 9 commits into main from etl-287
Feb 10, 2023

Conversation

philerooski
Contributor

  • Implementation of JSON to Parquet job
  • There is one JSON to Parquet job per data type. I spent some time trying to get the trigger to start many instances of the same job with different (--glue-table) arguments, but as it turns out triggers cannot start more than one instance of the same job.
  • JSON and Parquet datasets are now written to a namespaced S3 prefix.
  • The S3 to JSON and JSON to Parquet jobs are not namespaced, although the workflow they run within is.
  • The Glue workflow was modified so that the successful completion of the S3 to JSON job triggers every JSON to Parquet job.
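
The fan-out described in the last bullet can be sketched as a single conditional Glue trigger with one action per data-type job. This is a minimal illustration, not the PR's actual template: the job and trigger names are hypothetical, and the dict mirrors the shape boto3's `glue.create_trigger` expects.

```python
# Hypothetical sketch: a conditional trigger that starts one JSON-to-Parquet
# job per data type once the S3-to-JSON job succeeds. Names are illustrative.
data_types = ["fitbitactivitylogs", "healthkitv2workouts", "enrolledparticipants"]

trigger = {
    "Name": "s3-to-json-success",
    "Type": "CONDITIONAL",
    "Predicate": {
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "S3ToJsonJob",
            "State": "SUCCEEDED",
        }]
    },
    # One distinct job per data type, because a trigger cannot start
    # multiple concurrent instances of the same job.
    "Actions": [{"JobName": f"JsonToParquet-{dt}"} for dt in data_types],
}
```

A dict like this could be passed to `glue.create_trigger(**trigger, WorkflowName=...)`; the key point is that each action targets a *different* job, which is why the PR creates one JSON-to-Parquet job per data type.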

@philerooski philerooski requested a review from a team as a code owner February 1, 2023 02:42
Contributor

@rxu17 rxu17 left a comment


Otherwise, lgtm!

config/develop/namespaced/glue-workflow.yaml (resolved)
Comment on lines +47 to +54
{% set datasets = [] %}
{% for v in sceptre_user_data.dataset_schemas.tables.keys() if not "Deleted" in v %}
{% set dataset = {} %}
{% do dataset.update({"type": v}) %}
{% do dataset.update({"table_name": "dataset_" + v.lower()}) %}
{% do dataset.update({"stackname_prefix": v.replace("_", "")}) %}
{% do datasets.append(dataset) %}
{% endfor %}
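
For readers less familiar with Jinja's `do` extension, the loop above is equivalent to the following plain Python; the table names here are illustrative stand-ins for the `sceptre_user_data.dataset_schemas.tables` keys.

```python
# Plain-Python equivalent of the Jinja loop above. Table names are
# hypothetical examples, not taken from this repository's schemas.
tables = ["FitbitActivityLogs", "HealthKitV2Workouts_Deleted", "Enrolled_Participants"]

datasets = []
for v in tables:
    if "Deleted" in v:  # skip deleted-record tables, as the template's filter does
        continue
    datasets.append({
        "type": v,
        "table_name": "dataset_" + v.lower(),
        "stackname_prefix": v.replace("_", ""),
    })
```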
Contributor

Out of curiosity, given that we use the same code across this file, glue_tables.j2, and glue-workflow.j2, is there a way for us to do this once somewhere and save the result for the other scripts to use? I know everything was previously in one giant .j2 file, where we only needed this snippet once, but that wasn't great when we wanted to test/run individual components.

Contributor Author

I don't think there's a way to set Jinja "globals" that can be accessed from any stack.

@rxu17 rxu17 self-requested a review February 1, 2023 23:20
Member

@thomasyu888 thomasyu888 left a comment

🔥 Great work!

)
return table

def write_table_to_s3(
Member

Are some of these functions used across BridgeDownstream and RECOVER? I wonder if it'd be worth creating a package. Can you install custom packages in AWS Glue?

Contributor Author

@philerooski philerooski Feb 6, 2023

Are some of these functions used across BridgeDownstream and RECOVER? I wonder if it'd be worth creating a package.

Yes, we will want to maintain a separate Python library once we have a more generalized framework for converting data to Parquet. I'm not sure it makes sense to do this for RECOVER, considering the tight deadline we're on, but it's worth keeping in mind for future projects that require some sort of data-to-compute pipeline.
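
On the custom-packages question: AWS Glue (2.0 and later) can install additional pip-installable Python packages at job start via the `--additional-python-modules` default argument. A minimal sketch, where the library name "our-etl-helpers" is hypothetical:

```python
# Sketch of Glue job DefaultArguments that install extra Python packages
# at job start. "our-etl-helpers" is a hypothetical internal library;
# Glue 2.0+ supports --additional-python-modules for pip-installable packages.
job_default_arguments = {
    "--additional-python-modules": "our-etl-helpers==0.1.0,pyarrow==10.0.1",
}

# This dict would be passed to boto3 as
# glue.create_job(..., DefaultArguments=job_default_arguments).
modules = job_default_arguments["--additional-python-modules"].split(",")
```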

@philerooski philerooski merged commit 2975325 into main Feb 10, 2023
@philerooski philerooski deleted the etl-287 branch February 10, 2023 22:31