[ETL-602] Compile documentation for recover #105

Merged · 4 commits · Feb 13, 2024

Changes from all commits
CONTRIBUTING.md (39 changes: 35 additions & 4 deletions)
@@ -9,6 +9,8 @@ By contributing, you are agreeing that we may redistribute your work under this
* [Developing and testing locally](#developing-and-testing-locally)
* [Testing remotely](#testing-remotely)
* [Code review](#code-review)
* [Post review](#post-review)
* [Release process](#release-process)
- [Code style](#code-style)

# Getting started as a developer
@@ -54,7 +56,7 @@ Assuming that the namespaced [S3 Event Config Lambda](config/develop/namespaced/

### Submitting test events to the S3 to Glue Lambda

To submit SQS events directly to the [S3 to Glue Lambda](src/lambda_function/s3_to_glue), follow the instructions in its [README](src/lambda_function/s3_to_glue/README.md).
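
If you prefer to script this rather than use the console, a boto3 sketch along these lines can invoke the Lambda directly with an SQS-style test event. The function name, bucket, and object key below are hypothetical, and the event shape is an assumption; the Lambda's README is the authoritative reference.

```python
# Minimal sketch of submitting an SQS-style test event to the S3 to Glue
# Lambda with boto3. Names and the event shape are assumptions; see the
# Lambda's README for the real values.
import json

import boto3

# Hypothetical S3 event notification wrapped in an SQS message, the shape
# the Lambda consumes from its SQS trigger.
s3_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "recover-dev-input-data"},  # hypothetical bucket
                "object": {"key": "my-namespace/pilot-data/export.zip"},  # hypothetical key
            }
        }
    ]
}
sqs_event = {"Records": [{"body": json.dumps(s3_event)}]}

lambda_client = boto3.client("lambda")
response = lambda_client.invoke(
    FunctionName="my-namespace-s3-to-glue",  # hypothetical function name
    Payload=json.dumps(sqs_event),
)
print(response["StatusCode"], response["Payload"].read())
```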

### Submitting test events to the S3 to JSON Glue workflow

@@ -72,7 +74,7 @@ We do not yet have a complete, automated data quality framework for our integrat

To evaluate what effect this branch's changes had on the test data, we have Glue jobs which run as part of the JSON to Parquet workflow which compare the Parquet datasets within this branch's `{namespace}` to those of the `main` namespace. Data reports are written to `s3://recover-dev-processed-data/{namespace}/comparison_result`.
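
For a quick look at those reports without leaving your terminal, a boto3 sketch like the following lists (and can download) everything under the comparison prefix. The namespace value is hypothetical, and the object layout beneath `comparison_result` is an assumption.

```python
# List the data comparison reports written for a feature-branch namespace.
# The namespace is a placeholder; the bucket and prefix come from the
# documented path above.
import boto3

namespace = "etl-602"  # hypothetical feature branch namespace
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(
    Bucket="recover-dev-processed-data",
    Prefix=f"{namespace}/comparison_result/",
)
for page in pages:
    for obj in page.get("Contents", []):
        print(obj["Key"])
        # To inspect an individual report locally:
        # s3.download_file("recover-dev-processed-data", obj["Key"], "report.txt")
```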

#### Record count comparison

We use a structured logging framework in our Glue jobs so that logs can be consumed in a programmatic way. At the time of writing, this is limited to [a script](src/scripts/consume_logs/consume_logs.py) that will compare the count of lines read versus the count of lines written for each NDJSON file in each S3 to JSON workflow run within a given time frame.
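
The script is the supported entry point, but the idea behind it can be sketched in a few lines of boto3: pull the log events, parse each structured (JSON) message, and compare read counts against write counts per file. The log group name and the field names below are assumptions for illustration, not the script's actual schema.

```python
# A hedged sketch of the consume_logs idea: compare lines read vs. lines
# written per NDJSON file, as reported by structured Glue job logs.
# The log group and JSON field names are assumptions.
import json

import boto3

logs = boto3.client("logs")
read_counts: dict = {}
write_counts: dict = {}
paginator = logs.get_paginator("filter_log_events")
for page in paginator.paginate(logGroupName="/aws-glue/jobs/output"):  # assumed log group
    for event in page["events"]:
        try:
            message = json.loads(event["message"])
        except json.JSONDecodeError:
            continue  # skip unstructured log lines
        if message.get("event") == "lines_read":
            read_counts[message["file_name"]] = message["count"]
        elif message.get("event") == "lines_written":
            write_counts[message["file_name"]] = message["count"]

# Flag any NDJSON file whose read and write counts disagree.
for file_name, n_read in read_counts.items():
    n_written = write_counts.get(file_name)
    if n_written != n_read:
        print(f"{file_name}: read {n_read}, wrote {n_written}")
```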

@@ -82,9 +84,38 @@ Once integration testing is complete, submit a pull request against the `main`

Before and/or after code review, clean up your commit history. If the `main` branch has changed since you last pushed your branch, [rebase](https://git-scm.com/docs/git-rebase) on main. If you have multiple commits, make sure that each commit is focused on delivering one complete feature. If you need to consolidate commits, consider doing an interactive rebase or a [`git merge --squash`](https://git-scm.com/docs/git-merge#Documentation/git-merge.txt---squash) if you are more comfortable with that method.

## Post review

**Overview:**
Merging an approved pull request pulls your commits into the `main` branch, but it does not yet deploy your changes to the `main` production pipeline. RECOVER data has FISMA restrictions, but only our production account is FISMA compliant. Since there is no guarantee that the test data provided to us (which doesn't have FISMA restrictions) perfectly models the production dataset, we maintain a `staging` namespace in the production account that lets us test changes on production data before pulling those changes into the `main` namespace. There is a diagram of this process [here](https://sagebionetworks.jira.com/wiki/spaces/IBC/pages/2863202319/ETL-390+RECOVER+Integration+Testing#Implementation).

**Instructions:**

1. After pull request approval, [squash and merge your pull request](https://sagebionetworks.jira.com/wiki/spaces/IBC/pages/2741797027/GitHub+Git#Merging-a-Pull-Request) into `main`. Merging into `main` will deploy the changes to the `staging` namespace.
1. Wait for the changes to deploy to the `staging` namespace in the production account.
1. After a successful deployment, wait for the staging S3 to JSON workflow to finish.
1. Trigger the staging JSON to Parquet workflow manually to produce the Parquet datasets in the `staging` namespace (see the sketch after this list).
1. Review the staging datasets for expected differences and similarities.
1. To complete deployment to the `main` namespace of production, follow the [release process](#release-process).
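
Triggering the workflow can be done from the Glue console, or scripted with boto3 as in this sketch; the workflow name here is a placeholder, not necessarily the real staging workflow name.

```python
# Manually start the staging JSON to Parquet Glue workflow and check on
# the run. The workflow name is a hypothetical placeholder.
import boto3

glue = boto3.client("glue")
workflow_name = "staging-json-to-parquet"  # hypothetical workflow name

run = glue.start_workflow_run(Name=workflow_name)
print("Started workflow run:", run["RunId"])

# Check the run's status (poll until it reaches COMPLETED).
status = glue.get_workflow_run(Name=workflow_name, RunId=run["RunId"])["Run"]["Status"]
print("Current status:", status)
```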

## Release process

To complete deployment to the `main` namespace of production, we push a new tag with a specific format to this repository, which will trigger [this GitHub action](.github/workflows/upload-and-deploy-to-prod-main.yaml).

This package uses [semantic versioning](https://semver.org/) for releasing new versions.

**Background Pages:**

- See [About releases - GitHub Docs](https://docs.github.com/en/repositories/releasing-projects-on-github/about-releases) for more information about GitHub releases and their relationship to Git tags
- See [Git - Tagging](https://git-scm.com/book/en/v2/Git-Basics-Tagging) for more information about Git tags

**Instructions:**

1. Once there are features ready to ship, create a new tag and release; once released, the changes will be applied to production data.
1. [Draft a new release for the repo here](https://github.com/Sage-Bionetworks/recover/releases), specify the `main` branch as the target, and choose a tag name. You can specify a tag name that doesn't exist yet to invoke GitHub's automatic tag creation.
1. Click `Generate release notes`, review the content, and be sure to note any known bugs for the release.
1. Wait for the CI/CD pipeline (specifically the [upload-and-deploy-to-prod-main](https://github.com/Sage-Bionetworks/recover/blob/main/.github/workflows/upload-and-deploy-to-prod-main.yaml) GitHub action, which deploys the changes to `main`) to finish successfully.
1. Your features are now deployed to production; the next production data run on `main` will include them!

# Code style
