
[ETL-287] JSON to Parquet job #11

Merged
merged 9 commits into main from etl-287
Feb 10, 2023

Conversation

philerooski
Contributor

  • Implementation of JSON to Parquet job
  • There is one JSON to Parquet job per data type. I spent some time trying to get the trigger to start many instances of the same job with different (--glue-table) arguments, but as it turns out triggers cannot start more than one instance of the same job.
  • JSON and Parquet datasets are now written to a namespaced S3 prefix.
  • The S3 to JSON and JSON to Parquet jobs are not namespaced, although the workflow they run within is.
  • The Glue workflow was modified so that the successful completion of the S3 to JSON job triggers every JSON to Parquet job.
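
The fan-out described in the last bullet can be sketched as a single conditional Glue trigger with one action per data-type job. This is a minimal illustration, not the PR's actual template: the job and trigger names are hypothetical, and the dict mirrors the shape boto3's `glue.create_trigger` expects.

```python
# Hypothetical sketch: a conditional trigger that starts one JSON-to-Parquet
# job per data type once the S3-to-JSON job succeeds. Names are illustrative.
data_types = ["fitbitactivitylogs", "healthkitv2workouts", "enrolledparticipants"]

trigger = {
    "Name": "s3-to-json-success",
    "Type": "CONDITIONAL",
    "Predicate": {
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "S3ToJsonJob",
            "State": "SUCCEEDED",
        }]
    },
    # One distinct job per data type, because a trigger cannot start
    # multiple concurrent instances of the same job.
    "Actions": [{"JobName": f"JsonToParquet-{dt}"} for dt in data_types],
}
```

A dict like this could be passed to `glue.create_trigger(**trigger, WorkflowName=...)`; the key point is that each action targets a *different* job, which is why the PR creates one JSON-to-Parquet job per data type.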

@philerooski philerooski requested a review from a team as a code owner February 1, 2023 02:42
Contributor

@rxu17 rxu17 left a comment


Otherwise, lgtm!

config/develop/namespaced/glue-workflow.yaml (resolved)
Comment on lines +47 to +54
{% set datasets = [] %}
{% for v in sceptre_user_data.dataset_schemas.tables.keys() if not "Deleted" in v %}
{% set dataset = {} %}
{% do dataset.update({"type": v}) %}
{% do dataset.update({"table_name": "dataset_" + v.lower()}) %}
{% do dataset.update({"stackname_prefix": v.replace("_", "")}) %}
{% do datasets.append(dataset) %}
{% endfor %}
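
For readers less familiar with Jinja's `do` extension, the loop above is equivalent to the following plain Python; the table names here are illustrative stand-ins for the `sceptre_user_data.dataset_schemas.tables` keys.

```python
# Plain-Python equivalent of the Jinja loop above. Table names are
# hypothetical examples, not taken from this repository's schemas.
tables = ["FitbitActivityLogs", "HealthKitV2Workouts_Deleted", "Enrolled_Participants"]

datasets = []
for v in tables:
    if "Deleted" in v:  # skip deleted-record tables, as the template's filter does
        continue
    datasets.append({
        "type": v,
        "table_name": "dataset_" + v.lower(),
        "stackname_prefix": v.replace("_", ""),
    })
```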
Contributor

Out of curiosity, given that we use the same code across this file, glue_tables.j2, and glue-workflow.j2, is there a way for us to do this once somewhere and save the result for the other scripts to use? I know everything was previously in one giant .j2 file, where we only needed this snippet once, but that wasn't great when we wanted to test/run individual components.

Contributor Author

I don't think there's a way to set Jinja "globals" that can be accessed from any stack.

@rxu17 rxu17 self-requested a review February 1, 2023 23:20
Member

@thomasyu888 thomasyu888 left a comment

🔥 Great work!

)
return table

def write_table_to_s3(
Member

Are some of these functions used across BridgeDownstream and RECOVER? I wonder if it'd be worth creating a package. Can you install custom packages in AWS Glue?

Contributor Author

@philerooski philerooski Feb 6, 2023

Are some of these functions used across BridgeDownstream and RECOVER? I wonder if it'd be worth creating a package.

Yes, we will want to maintain a separate Python library once we have a more generalized framework for converting data to Parquet. I'm not sure it makes sense to do this for RECOVER, considering the tight deadline we're on, but it's worth keeping in mind for future projects that require some sort of data-to-compute pipeline.
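
On the custom-packages question: AWS Glue (2.0 and later) can install additional pip-installable Python packages at job start via the `--additional-python-modules` default argument. A minimal sketch, where the library name "our-etl-helpers" is hypothetical:

```python
# Sketch of Glue job DefaultArguments that install extra Python packages
# at job start. "our-etl-helpers" is a hypothetical internal library;
# Glue 2.0+ supports --additional-python-modules for pip-installable packages.
job_default_arguments = {
    "--additional-python-modules": "our-etl-helpers==0.1.0,pyarrow==10.0.1",
}

# This dict would be passed to boto3 as
# glue.create_job(..., DefaultArguments=job_default_arguments).
modules = job_default_arguments["--additional-python-modules"].split(",")
```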

@philerooski philerooski merged commit 2975325 into main Feb 10, 2023
@philerooski philerooski deleted the etl-287 branch February 10, 2023 22:31