[ETL-287] JSON to Parquet job #11
Conversation
Otherwise, lgtm!
{% set datasets = [] %}
{% for v in sceptre_user_data.dataset_schemas.tables.keys() if not "Deleted" in v %}
{% set dataset = {} %}
{% do dataset.update({"type": v}) %}
{% do dataset.update({"table_name": "dataset_" + v.lower()}) %}
{% do dataset.update({"stackname_prefix": "{}".format(v.replace("_",""))}) %}
{% do datasets.append(dataset) %}
{% endfor %}
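For readers less familiar with Jinja's `{% do %}` extension, the loop above is equivalent to the following plain-Python sketch (the function name and the example table keys are hypothetical; the real keys come from `sceptre_user_data.dataset_schemas.tables`):

```python
def build_datasets(table_keys):
    """Mirror of the Jinja loop: one dict per non-deleted table key."""
    datasets = []
    for v in table_keys:
        # The Jinja filter `if not "Deleted" in v` skips soft-deleted tables.
        if "Deleted" in v:
            continue
        datasets.append({
            "type": v,
            "table_name": "dataset_" + v.lower(),
            "stackname_prefix": v.replace("_", ""),
        })
    return datasets
```
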
Out of curiosity, given that we use the same code across this file, glue_tables.j2, and glue-workflow.j2, is there a way to define it once and save the result for the other templates to use? I know previously everything was in one giant .j2 file, so this snippet was only needed once, but that setup made it hard to test/run individual components.
I don't think there's a way to set Jinja "globals" that can be accessed from any stack.
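One partial workaround (an assumption on my part, not something the Sceptre docs promise to support in every setup) is Jinja's own `{% import %}`: top-level `{% set %}` variables in an imported template are exposed as attributes of the import, so the loop could live in one shared file. A minimal sketch using the `jinja2` library directly, with hypothetical template names and table keys:

```python
import jinja2

# "common.j2" holds the shared loop once; "consumer.j2" imports its result.
# `with context` passes the caller's variables (e.g. sceptre_user_data) through.
templates = {
    "common.j2": (
        "{% set datasets = [] %}"
        "{% for v in tables if 'Deleted' not in v %}"
        "{% set _ = datasets.append({'type': v, 'table_name': 'dataset_' + v.lower()}) %}"
        "{% endfor %}"
    ),
    "consumer.j2": (
        "{% import 'common.j2' as common with context %}"
        "{{ common.datasets | map(attribute='table_name') | join(',') }}"
    ),
}
env = jinja2.Environment(loader=jinja2.DictLoader(templates))
rendered = env.get_template("consumer.j2").render(
    tables=["Fitbit_Activity", "Deleted_Records"]
)
```

Whether Sceptre's template loader resolves cross-file imports like this would need to be verified against its Jinja environment configuration.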
🔥 Great work!
)
    return table

def write_table_to_s3(
Are some of these functions used across BridgeDownstream and RECOVER? I wonder if it'd be worth creating a package. Can you install custom packages in AWS Glue?
> Are some of these functions used across BridgeDownstream and RECOVER? I wonder if it'd be worth creating a package.

Yes, we will want to maintain a separate Python library once we have a more generalized data-to-parquet framework. I'm not sure it makes sense to do this for RECOVER given the tight deadline we're on, but it's worth keeping in mind for future projects that require data-to-parquet compute of some sort.
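On the question above about installing custom packages in AWS Glue: yes, Glue 2.0+ supports the `--additional-python-modules` default argument, which pip-installs the listed packages when the job starts. A minimal sketch (the job name, role, script path, and package name are all hypothetical):

```shell
# --additional-python-modules pip-installs packages at job start (Glue 2.0+).
aws glue create-job \
  --name json-to-parquet \
  --role GlueJobRole \
  --command '{"Name": "glueetl", "ScriptLocation": "s3://my-bucket/json_to_parquet.py"}' \
  --glue-version "3.0" \
  --default-arguments '{"--additional-python-modules": "my-shared-lib==0.1.0"}'
```

Private packages can also be referenced by an S3 path to a wheel file in the same argument.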
--glue-table
) arguments, but as it turns out triggers cannot start more than one instance of the same job.