
[ETL-138] Integrate JSON schemas #84

Merged (20 commits) on Jul 18, 2022

Conversation

philerooski
Contributor

Lots of changes in this one, but it's not too complicated once you understand that all this work was done to move away from using file names as our dataset identifier and instead use a value derived from the associated JSON Schema $id (as described in the docs here). Older assessments don't have an associated JSON Schema, so we use a legacy mapping (src/glue/resources/dataset_mapping.json) to directly map file names to dataset identifiers.
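The two-path lookup described above could look roughly like this. This is a hypothetical sketch for illustration only; the function name and the exact shapes of the mapping dictionaries are assumptions, not the actual script code:

```python
def get_dataset_identifier(json_schema_id, schema_mapping, dataset_mapping,
                           assessment_id, assessment_revision, file_name):
    """Resolve a dataset identifier, preferring the JSON Schema $id and
    falling back to the legacy file-name mapping for older assessments."""
    if json_schema_id is not None and json_schema_id in schema_mapping:
        # Future-ready path: schema_mapping.json maps $id -> dataset identifier.
        return schema_mapping[json_schema_id]
    # Legacy path: dataset_mapping.json is keyed on assessment ID/revision,
    # then on file name (structure assumed here for illustration).
    legacy_key = f"{assessment_id}/{assessment_revision}"
    return dataset_mapping.get(legacy_key, {}).get(file_name)
```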

Other changes include:

  • Constructing Glue tables in the stack by referencing src/glue/resources/table_columns.yaml rather than src/glue/resources/dataset_mapping.json.
  • Bootstrap trigger crontab is taken from the namespaced S3 artifact location, not the latest_version sceptre user data S3 location -- which is really only meant to version glue job scripts. The crontab was also updated to reference the new dataset identifiers.
  • [FYI] src/glue/resources/schema_mapping.json is the future-ready mapping, which maps JSON Schema $id values to their dataset identifier. src/glue/resources/dataset_mapping.json is the legacy mapping which contains mappings for older assessment revisions.
  • dataset_mapping.json (the legacy mapping) now uses assessment ID/revision to map files to dataset identifiers. Previously we used osName/appVersion, but a specific app build can still produce different revisions of the same assessment if a newer revision is published on Bridge.
  • We ignore info.json, answers.json, and taskResult.json files. This was requested by the Science team as part of ETL-110. A possibly unexpected side effect is that we now support writing some files in a .zip archive to a JSON dataset but not others (for example, metadata.json will always be written to a JSON dataset, since it matches the metadata.json file name in the universal 'anyOf' category in archive-map.json).
  • The lambda test events were updated to use the latest test dataset revision.
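The ETL-110 file filtering mentioned above amounts to a simple skip list. The three ignored file names come from this PR description; the helper itself is a hypothetical sketch, not the actual Glue job code:

```python
# Files the Science team asked us to skip (ETL-110).
IGNORED_FILES = {"info.json", "answers.json", "taskResult.json"}

def should_process(file_name):
    """Return False for file names on the ETL-110 skip list."""
    return file_name not in IGNORED_FILES
```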

Before merging we want to do some actions in prod:

  1. Disable bootstrap trigger job until we have reprocessed all existing study data (after merging) and produced a metadata Parquet dataset that can be diff'd upon.
  2. Archive existing parquet datasets (see /src/scripts/archive_dataset/archive_dataset.py).
  3. Because of the way we use Jinja to derive stack resources from configuration files (like the Glue tables derived from table_columns.yaml), I'm not confident that sceptre can deploy new resources and delete outdated resources effectively. We should delete the study stacks in config/prod/studies/ before merging so that we can start with a clean slate.

After merging and reprocessing data:

  1. Delete outdated JSON datasets.

@philerooski requested a review from a team as a code owner on July 13, 2022 00:08
    for a in app["assessments"]])
    if is_valid_assessment:
        for default_file in app["default"]["files"]:
            if default_file["filename"] == file_name:
Member:
No need to do the same deprecated check here?

Contributor (Author):

I looked deeper into the schema for archive-map.json and it turns out that all of these file arrays use the same FileInfo object schema. So I decided to refactor this function and that turned into a refactor of the entire script.

To answer your question -- no. And when I refactored this I wrote a function for getting a JSON Schema from any FileInfo object and left out any deprecation check. We could be processing any data at any time, whether that data has been deprecated or not, so it shouldn't make a difference to us if something is deprecated. According to Shannon the "deprecated" flag was only meant to be used to mark old mPower 2 files like info.json and taskResult.json, and those don't have JSON Schema included in archive-map.json... long story short, the "deprecated" flag is irrelevant to us.
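Since every file array in archive-map.json uses the same FileInfo object schema, the refactor described above can pull the schema reference from any FileInfo with one helper. This is a hedged sketch, not the actual refactored function; the "jsonSchema" field name follows archive-map.json conventions but is an assumption here:

```python
def get_json_schema_uri(file_info):
    """Return the JSON Schema reference from a FileInfo dict, or None when
    the file (e.g., legacy info.json or taskResult.json) has none.
    Intentionally ignores the "deprecated" flag, which is irrelevant to us."""
    return file_info.get("jsonSchema")
```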

@tthyer tthyer (Contributor) left a comment:

LGTM

@@ -2,7 +2,7 @@ template_path: ec2-bootstrap-trigger.yaml
 stack_name: ec2-bootstrap-trigger
 parameters:
   SsmParameterName: synapse-bridgedownstream-auth
-  CrontabURI: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.latest_version }}/ec2/resources/crontab
+  CrontabURI: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.namespace }}/ec2/resources/crontab
Contributor:

Noting here the passing of latest_version -- @philerooski & @thomasyu888 do we want to prioritize the discussion about our versioning system (and potential conflict with namespacing system), or just file a ticket and worry about it later?

@thomasyu888 thomasyu888 (Member) commented on Jul 13, 2022:

My initial thought is let's file a ticket and worry about it later unless it will be a huge shift later on to resolve this. I don't necessarily want to block this from being pulled in.

Contributor:

I agree, it definitely shouldn't block this PR from being merged.

Contributor (Author):

Yes, I'd like to sort out versioning. It's not urgent.

@BrunoGrandePhD BrunoGrandePhD left a comment:

Looks good to me. I did my best to understand the changes being introduced here. I think I'll need a few more of these PRs before I can provide any substantive feedback.

@philerooski temporarily deployed to develop on July 14, 2022 20:57
@philerooski temporarily deployed to develop on July 14, 2022 21:00
@philerooski temporarily deployed to develop on July 18, 2022 18:31
@philerooski temporarily deployed to develop on July 18, 2022 18:34