
[ETL-138] Integrate JSON schemas #84

Merged (20 commits) on Jul 18, 2022

Conversation

philerooski
Contributor

Lots of changes in this one, but it's not too complicated once you understand that all this work was done to move away from using file names as our dataset identifier and instead use a value derived from the associated JSON Schema $id (as described in the docs here). Older assessments don't have an associated JSON Schema, so we use a legacy mapping (src/glue/resources/dataset_mapping.json) to directly map file names to dataset identifiers.
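The two-path lookup described above could look roughly like this. This is a hypothetical sketch for illustration only; the function name and the exact shapes of the mapping dictionaries are assumptions, not the actual script code:

```python
def get_dataset_identifier(json_schema_id, schema_mapping, dataset_mapping,
                           assessment_id, assessment_revision, file_name):
    """Resolve a dataset identifier, preferring the JSON Schema $id and
    falling back to the legacy file-name mapping for older assessments."""
    if json_schema_id is not None and json_schema_id in schema_mapping:
        # Future-ready path: schema_mapping.json maps $id -> dataset identifier.
        return schema_mapping[json_schema_id]
    # Legacy path: dataset_mapping.json is keyed on assessment ID/revision,
    # then on file name (structure assumed here for illustration).
    legacy_key = f"{assessment_id}/{assessment_revision}"
    return dataset_mapping.get(legacy_key, {}).get(file_name)
```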

Other changes include:

  • Constructing Glue tables in the stack by referencing src/glue/resources/table_columns.yaml rather than src/glue/resources/dataset_mapping.json.
  • Bootstrap trigger crontab is taken from the namespaced S3 artifact location, not the latest_version sceptre user data S3 location -- which is really only meant to version glue job scripts. The crontab was also updated to reference the new dataset identifiers.
  • [FYI] src/glue/resources/schema_mapping.json is the future-ready mapping, which maps JSON Schema $id values to their dataset identifier. src/glue/resources/dataset_mapping.json is the legacy mapping which contains mappings for older assessment revisions.
  • dataset_mapping.json (the legacy mapping) now uses assessment ID/revision to map files to dataset identifiers. Previously we used osName/appVersion, but a specific app build can still produce different revisions of the same assessment if a newer revision is published on Bridge.
  • We ignore info.json, answers.json, and taskResult.json files. This was requested by the Science team as part of ETL-110. A possibly unexpected side effect is that we now support writing some files in a .zip archive to a JSON dataset but not others (for example, metadata.json will always be written to a JSON dataset, since it matches the metadata.json file name in the universal 'anyOf' category in archive-map.json).
  • The lambda test events were updated to use the latest test dataset revision.
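The ETL-110 file filtering mentioned above amounts to a simple skip list. The three ignored file names come from this PR description; the helper itself is a hypothetical sketch, not the actual Glue job code:

```python
# Files the Science team asked us to skip (ETL-110).
IGNORED_FILES = {"info.json", "answers.json", "taskResult.json"}

def should_process(file_name):
    """Return False for file names on the ETL-110 skip list."""
    return file_name not in IGNORED_FILES
```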

Before merging we want to do some actions in prod:

  1. Disable bootstrap trigger job until we have reprocessed all existing study data (after merging) and produced a metadata Parquet dataset that can be diff'd upon.
  2. Archive existing parquet datasets (see /src/scripts/archive_dataset/archive_dataset.py).
  3. Because of the way we use Jinja to derive stack resources from configuration files (like the Glue tables derived from table_columns.yaml), I'm not confident that sceptre can deploy new resources and delete outdated resources effectively. We should delete the study stacks in config/prod/studies/ before merging so that we can start with a clean slate.

After merging and reprocessing data:

  1. Delete outdated JSON datasets.

@philerooski requested a review from a team as a code owner on July 13, 2022 00:08
    for a in app["assessments"]])
    if is_valid_assessment:
        for default_file in app["default"]["files"]:
            if default_file["filename"] == file_name:
Member:
No need to do the same deprecated check here?

Contributor (Author):

I looked deeper into the schema for archive-map.json and it turns out that all of these file arrays use the same FileInfo object schema. So I decided to refactor this function and that turned into a refactor of the entire script.

To answer your question -- no. And when I refactored this I wrote a function for getting a JSON Schema from any FileInfo object and left out any deprecation check. We could be processing any data at any time, whether that data has been deprecated or not, so it shouldn't make a difference to us if something is deprecated. According to Shannon the "deprecated" flag was only meant to be used to mark old mPower 2 files like info.json and taskResult.json, and those don't have JSON Schema included in archive-map.json... long story short, the "deprecated" flag is irrelevant to us.
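Since every file array in archive-map.json uses the same FileInfo object schema, the refactor described above can pull the schema reference from any FileInfo with one helper. This is a hedged sketch, not the actual refactored function; the "jsonSchema" field name follows archive-map.json conventions but is an assumption here:

```python
def get_json_schema_uri(file_info):
    """Return the JSON Schema reference from a FileInfo dict, or None when
    the file (e.g., legacy info.json or taskResult.json) has none.
    Intentionally ignores the "deprecated" flag, which is irrelevant to us."""
    return file_info.get("jsonSchema")
```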

@tthyer tthyer (Contributor) left a comment:

LGTM

@@ -2,7 +2,7 @@ template_path: ec2-bootstrap-trigger.yaml
 stack_name: ec2-bootstrap-trigger
 parameters:
   SsmParameterName: synapse-bridgedownstream-auth
-  CrontabURI: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.latest_version }}/ec2/resources/crontab
+  CrontabURI: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.namespace }}/ec2/resources/crontab
Contributor:

Noting here the passing of latest_version -- @philerooski & @thomasyu888 do we want to prioritize the discussion about our versioning system (and potential conflict with namespacing system), or just file a ticket and worry about it later?

@thomasyu888 thomasyu888 (Member) commented on Jul 13, 2022:

My initial thought is let's file a ticket and worry about it later unless it will be a huge shift later on to resolve this. I don't necessarily want to block this from being pulled in.

Contributor:

I agree, it definitely shouldn't block this PR from being merged.

Contributor (Author):

Yes, I'd like to sort out versioning. It's not urgent.

@BrunoGrandePhD BrunoGrandePhD left a comment:

Looks good to me. I did my best to understand the changes being introduced here. I think I'll need a few more of these PRs before I can provide any substantive feedback.

@philerooski temporarily deployed to develop on July 14, 2022 20:57
@philerooski temporarily deployed to develop on July 14, 2022 21:00
@philerooski temporarily deployed to develop on July 18, 2022 18:31
@philerooski temporarily deployed to develop on July 18, 2022 18:34