[ETL-138] Integrate JSON schemas #84
Conversation
Our data resource identifiers in practice are just the glue table names, as specified in table_columns.yaml.
src/glue/jobs/s3_to_json_s3.py (outdated)

                             for a in app["assessments"]])
        if is_valid_assessment:
            for default_file in app["default"]["files"]:
                if default_file["filename"] == file_name:
No need to do the same deprecated check here?
I looked deeper into the schema for archive-map.json and it turns out that all of these file arrays use the same FileInfo object schema. So I decided to refactor this function and that turned into a refactor of the entire script.
To answer your question -- no. And when I refactored this I wrote a function for getting a JSON Schema from any FileInfo object and left out any deprecation check. We could be processing any data at any time, whether that data has been deprecated or not, so it shouldn't make a difference to us if something is deprecated. According to Shannon the "deprecated" flag was only meant to be used to mark old mPower 2 files like info.json and taskResult.json, and those don't have JSON Schema included in archive-map.json... long story short, the "deprecated" flag is irrelevant to us.
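The refactor described above could look roughly like this: one helper that yields every FileInfo object from an app entry in archive-map.json, regardless of which array it lives in and regardless of any "deprecated" flag. A minimal sketch, assuming the entry shape visible in this PR's diff (`assessments` and `default.files` arrays); all other key names and the sample data are illustrative, not the repo's actual API.

```python
def iter_file_info(app):
    """Yield every FileInfo dict from one app entry of archive-map.json.

    Sketch only: `assessments` and `default.files` appear in this PR's
    diff; any other FileInfo arrays would be handled the same way. The
    `deprecated` flag is deliberately ignored, since deprecation does
    not affect whether we process a file.
    """
    for assessment in app.get("assessments", []):
        yield from assessment.get("files", [])
    yield from app.get("default", {}).get("files", [])


# Hypothetical app entry shaped like the diff above
app = {
    "assessments": [{"files": [{"filename": "motion.json"}]}],
    "default": {"files": [{"filename": "metadata.json", "deprecated": True}]},
}
file_info = list(iter_file_info(app))
```

Because every array holds the same FileInfo object schema, a single generator like this replaces the per-array loops in the old version of the script.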
LGTM
@@ -2,7 +2,7 @@ template_path: ec2-bootstrap-trigger.yaml
 stack_name: ec2-bootstrap-trigger
 parameters:
   SsmParameterName: synapse-bridgedownstream-auth
-  CrontabURI: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.latest_version }}/ec2/resources/crontab
+  CrontabURI: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.namespace }}/ec2/resources/crontab
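For clarity, here is how the two template variables render into different S3 URIs. Sceptre resolves these with Jinja at deploy time; the plain string formatting below only illustrates the resulting paths, and the bucket name and variable values are made up.

```python
# Hypothetical stack_group_config values, for illustration only
cfg = {
    "artifact_bucket_name": "example-bucket",
    "latest_version": "v1.2.3",
    "namespace": "main",
}

# Old URI, keyed by release version
old_uri = ("s3://{artifact_bucket_name}/BridgeDownstream/"
           "{latest_version}/ec2/resources/crontab").format(**cfg)

# New URI, keyed by deployment namespace
new_uri = ("s3://{artifact_bucket_name}/BridgeDownstream/"
           "{namespace}/ec2/resources/crontab").format(**cfg)
```

The change means each namespace gets its own crontab, rather than every stack sharing the artifact published under the latest version tag.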
Noting here the passing of latest_version -- @philerooski & @thomasyu888, do we want to prioritize the discussion about our versioning system (and its potential conflict with the namespacing system), or just file a ticket and worry about it later?
My initial thought is: let's file a ticket and worry about it later, unless it will be a huge shift to resolve later on. I don't necessarily want to block this from being pulled in.
I agree, it definitely shouldn't block this PR from being merged.
Yes, I'd like to sort out versioning. It's not urgent.
Looks good to me. I did my best to understand the changes being introduced here. I think I'll need a few more of these PRs before I can provide any substantive feedback.
Lots of changes in this one, but it's not too complicated once you understand that all this work was done to move away from using file names as our dataset identifier and instead use a value derived from the associated JSON Schema $id (as described in the docs here). Older assessments don't have an associated JSON Schema, so we use a legacy mapping (src/glue/resources/dataset_mapping.json) to directly map file names to dataset identifiers.

Other changes include:

- References src/glue/resources/table_columns.yaml rather than src/glue/resources/dataset_mapping.json.
- The crontab is no longer stored under the latest_version sceptre user data S3 location -- which is really only meant to version glue job scripts. The crontab was also updated to reference the new dataset identifiers.
- src/glue/resources/schema_mapping.json is the future-ready mapping, which maps JSON Schema $id values to their dataset identifier. src/glue/resources/dataset_mapping.json is the legacy mapping, which contains mappings for older assessment revisions.
- dataset_mapping.json (the legacy mapping) now uses assessment ID/revision to map files to dataset identifiers. Previously we used osName/appVersion, but a specific app build can still produce different revisions of the same assessment if a newer revision is published on Bridge.

Before merging we want to do some actions in prod:

- Archive the existing datasets (/src/scripts/archive_dataset/archive_dataset.py).
- Clean up config/prod/studies/ before merging so that we can start with a clean slate.

After merging and reprocessing data:
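The two-mapping lookup described above could be sketched like this: prefer the future-ready schema_mapping.json keyed by JSON Schema $id, then fall back to the legacy dataset_mapping.json keyed by assessment ID/revision and file name. All function names, key shapes, and $id values below are illustrative, not the repo's actual API.

```python
def get_dataset_identifier(schema_mapping, dataset_mapping,
                           json_schema_id, assessment_id,
                           assessment_revision, file_name):
    """Resolve a dataset identifier for one file. Sketch only.

    `schema_mapping`: JSON Schema $id -> dataset identifier (future-ready).
    `dataset_mapping`: "assessmentId/revision" -> file name -> identifier
    (legacy; key format here is a guess for illustration).
    """
    # Future-ready path: files with an associated JSON Schema
    if json_schema_id is not None and json_schema_id in schema_mapping:
        return schema_mapping[json_schema_id]
    # Legacy path: older assessment revisions with no JSON Schema
    legacy_key = f"{assessment_id}/{assessment_revision}"
    return dataset_mapping.get(legacy_key, {}).get(file_name)


# Hypothetical mapping contents
schema_mapping = {"https://example.org/schema/motion/v1": "dataset_motion_v1"}
dataset_mapping = {"flanker/2": {"taskResult.json": "dataset_taskresult_v1"}}
```

Keying the legacy mapping by assessment ID/revision (rather than osName/appVersion) matters because, as noted above, one app build can produce different revisions of the same assessment once a newer revision is published on Bridge.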