[ETL-499] JSON to Parquet support for Garmin data types #66
This requires us to use the sceptre keyword `template_bucket_name` in the stack config so that we don't hit the `--templateBody` file size validation limit in `aws cloudformation`.
LGTM! Just a couple of comments
spark_df = table.toDF()
if "InsertedDate" in field_names:
Is it a recent development that duplicates in any data type with `InsertedDate`, not just Garmin data, should be dropped based on `InsertedDate`?
There was an email with Care Evolution. Sorry I forgot to cc you! Here are their words:
For files that contain "InsertedDate", the "InsertedDate" should be used to identify the most recent values and resolve duplicates. The "InsertedDate" is included in those files because it is possible for there to be duplicates within the same export file. For the files without InsertedDate, you should continue to use the date included in the filename to resolve duplicates.
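To make the quoted rule concrete, here is a minimal sketch in plain Python rather than the Spark code in this PR. The function name `drop_duplicates`, the `export_end_date` field (standing in for the date taken from the filename), and the sample index fields are all hypothetical. It also assumes the date values are ISO 8601 strings in a single consistent format, so that string comparison matches chronological order.

```python
def drop_duplicates(records, index_fields, fallback_field="export_end_date"):
    """Keep the most recent record for each combination of index fields.

    `fallback_field` is a hypothetical name for the date derived from the
    export filename; it is only used when the records lack "InsertedDate".
    """
    if not records:
        return []
    # Mirror the check in the diff: decide once per data type, based on
    # whether the field exists, which date resolves duplicates.
    field_names = records[0].keys()
    sort_field = "InsertedDate" if "InsertedDate" in field_names else fallback_field
    latest = {}
    for record in records:
        key = tuple(record[field] for field in index_fields)
        current = latest.get(key)
        # Assumes ISO 8601 strings in one format, so ">" means "more recent".
        if current is None or record[sort_field] > current[sort_field]:
            latest[key] = record
    return list(latest.values())


# Illustrative data only; the field names are assumptions for this sketch.
records = [
    {"ParticipantIdentifier": "A", "Date": "2023-01-01",
     "InsertedDate": "2023-01-02T00:00:00Z", "Steps": 10},
    {"ParticipantIdentifier": "A", "Date": "2023-01-01",
     "InsertedDate": "2023-01-03T00:00:00Z", "Steps": 12},
    {"ParticipantIdentifier": "B", "Date": "2023-01-01",
     "InsertedDate": "2023-01-02T00:00:00Z", "Steps": 7},
]
deduplicated = drop_duplicates(records, ["ParticipantIdentifier", "Date"])
```

Records like the ones above, where two rows share the same index fields but differ in `InsertedDate`, could also serve as a starting point for test cases covering this behavior.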
Do the tests need to be updated with examples that include `InsertedDate` for this new change?
Yes 🙃
Blocked by #65
Changes fall into three categories:
- `config/*` files - This replaces references to the S3 bucket where we store our templates so that we don't run into the file size limitation on the `--templateBody` parameter when the `aws cloudformation` tool validates the templates.
- `src/glue/jobs/json_to_parquet.py` - This adds the index fields for Garmin data types to our `INDEX_FIELD_MAP` global, adds additional logic to reference the `InsertedDate` field when dropping duplicates, and does a small amount of refactoring.
- `src/glue/resources/table_columns.yaml` - This adds the schema definitions for Garmin data types while reflecting the transformations we do for these data types in [ETL-505] Add Garmin transforms to S3 to JSON job #65.