Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ETL-499] JSON to Parquet support for Garmin data types #66

Merged
merged 3 commits into from
Jul 27, 2023
Merged

Conversation

philerooski
Copy link
Contributor

Blocked by #65

Changes fall into three categories:

  • Changes to config/* files - this replaces references to the S3 bucket where we store our templates so that we don't run into the file size limitation on the --templateBody parameter when the aws cloudformation tool validates the templates.
  • Changes to src/glue/jobs/json_to_parquet.py - This adds the index fields for Garmin data types to our INDEX_FIELD_MAP global, adds additional logic to reference the InsertedDate field when dropping duplicates, and does a small amount of refactoring.
  • Changes to src/glue/resources/table_columns.yaml - This adds the schema definitions for Garmin data types while reflecting the transformations we do for these data types in [ETL-505] Add Garmin transforms to S3 to JSON job #65 .

This requires us to use the sceptre keyword template_bucket_name
in stack config so that we don't hit the --templateBody file size
validation limit in `aws cloudformation`.
@philerooski philerooski requested a review from a team as a code owner July 25, 2023 21:15
Copy link
Contributor

@rxu17 rxu17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a couple of comments

spark_df = table.toDF()
if "InsertedDate" in field_names:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a recent development that any data type with InsertedDate that has duplicates should be dropped based on InsertedDate? Besides just garmin data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was an email with Care Evolution. Sorry I forgot to cc you! Here is their words:

For files that contain "InsertedDate", the "InsertedDate" should be used to identify the most recent values and resolve duplicates. The "InsertedDate" is included in those files because it is possible for there to be duplicates within the same export file. For the files without InsertedDate, you should continue to use the date included in the filename to resolve duplicates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does there need to be updates to the tests to have examples with InsertedDate for this new change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 🙃

@philerooski philerooski temporarily deployed to develop July 26, 2023 20:54 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 20:54 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 20:54 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 20:54 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 20:59 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 21:23 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 21:23 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 21:23 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 21:23 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 21:28 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 21:28 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 21:35 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 26, 2023 21:37 — with GitHub Actions Inactive
@philerooski philerooski merged commit e7fe57a into main Jul 27, 2023
14 checks passed
@thomasyu888 thomasyu888 deleted the etl-499 branch September 24, 2023 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants