Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore cache if Nextclade or dataset version is different #466

Merged
merged 5 commits into from
Jul 29, 2024

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Jul 24, 2024

Description of proposed changes

Update workflow to ignore the Nextclade cache if the current Nextclade version or the Nextclade dataset version is different than the version in the cache.

Currently checks Nextclade and dataset versions of the first row of the nextclade.tsv file and formats them as the proposed JSON in #458. Once the version JSON file is in place, it should be easy to swap out the check for the new file.

Related issue(s)

Resolves #457

Checklist

  • Checks pass

@joverlee521
Copy link
Contributor Author

Tested that use-nextclade-cache works as expected

Setting up the trial S3 URL

Uploaded a nextclade.tsv to trial S3 URL that includes the current Nextclade version + Nextclade dataset version.

$ aws s3 cp s3://nextstrain-data/files/ncov/open/nextclade.tsv.zst - | zstd -T0 -dcq | head -n 2 > data/nextclade.tsv
$ ./vendored/upload-to-s3 data/nextclade.tsv s3://nextstrain-data/files/ncov/open/trial/ignore-cache/nextclade.tsv.zst

Testing the workflow would use cache as expected

Ran workflow up to the use_nextclade-cache rule and use_nextclade_cache correctly outputs true

$ nextstrain build --envdir ~/Repos/env.d/aws/ --image nextstrain/ncov-ingest . data/genbank/use_nextclade_cache.txt --configfile config/genbank.yaml --forceall --config keep_all_files=True s3_dst="s3://nextstrain-data/files/ncov/open/trial/ignore-cache"
$ cat data/genbank/use_nextclade_cache.txt
true

Testing the renew flag still works as expected

  1. Uploaded renew flag to trial S3 URL
$ aws s3 cp - s3://nextstrain-data/files/ncov/open/trial/ignore-cache/nextclade.tsv.zst.renew < /dev/null
  1. Workflow found the renew flag and use_nextclade_cache correctly outputs false
$ nextstrain build --envdir ~/Repos/env.d/aws/ --image nextstrain/ncov-ingest . data/genbank/use_nextclade_cache.txt --configfile config/genbank.yaml --forceall --config keep_all_files=True s3_dst="s3://nextstrain-data/files/ncov/open/trial/ignore-cache"
...
[INFO] Found renew flag
...
$ cat data/genbank/use_nextclade_cache.txt
false

Testing the Nextclade version check works as expected

  1. Removed the renew flag
$ aws s3 rm s3://nextstrain-data/files/ncov/open/trial/ignore-cache/nextclade.tsv.zst.renew
  1. Edited the nextclade_version locally to 3.8.1 and re-uploaded the nextclade.tsv
  2. Workflow found different Nextclade versions and use_nextclade_cache correctly outputs false
$ nextstrain build --envdir ~/Repos/env.d/aws/ --image nextstrain/ncov-ingest . data/genbank/use_nextclade_cache.txt --configfile config/genbank.yaml --forceall --config keep_all_files=True s3_dst="s3://nextstrain-data/files/ncov/open/trial/ignore-cache"
...
[INFO] Current Nextclade version (nextclade 3.8.2) is different from cache version (nextclade 3.8.1)
...
$ cat data/genbank/use_nextclade_cache.txt
false

Testing the Nextclade dataset version check works as expected

  1. Change the nextclade_version back to 3.8.2
  2. Edit the dataset_version to 2024-07-18--12-57-03Z and re-upload the nextclade.tsv
  3. Workflow found different dataset versions and use_nextclade_cache correctly outputs false
$ nextstrain build --envdir ~/Repos/env.d/aws/ --image nextstrain/ncov-ingest . data/genbank/use_nextclade_cache.txt --configfile config/genbank.yaml --forceall --config keep_all_files=True s3_dst="s3://nextstrain-data/files/ncov/open/trial/ignore-cache"
...
[INFO] Current Nextclade dataset version (2024-07-17--12-57-03Z) is different from cache version (2024-07-18--12-57-03Z)
...
$ cat data/genbank/use_nextclade_cache.txt
false

bin/fetch-cache-version Show resolved Hide resolved
joverlee521 and others added 5 commits July 26, 2024 13:49
Doing this in preparation for adding version checks to the decision
tree of whether we should use the Nextclade cache.

Replaces download of the empty .renew file with just a check that the
S3 object exists to limit shuffling of files.
Currently checks Nextclade and dataset versions of the first row of the
nextclade.tsv file and formats them as the propose JSON. Once the
version JSON file is in place, it should be easy to swap out the check
for the new file.
Document why we are not using `set -euo pipefail`

Co-authored-by: John SJ Anderson <janders4@fredhutch.org>
Avoids clash of downloaded Nextclade executable with the Nextclade
command available in the environment.

Includes the side-effect of the downloaded executable being removed
as part of `bin/clean` when running the workflow without the
`keep_all_files=True` config param. This ensures that the workflow will
start from a clean slate.
@joverlee521
Copy link
Contributor Author

I plan to merge this on Monday so I can monitor the workflows during the week.

@joverlee521 joverlee521 mentioned this pull request Jul 27, 2024
1 task
@joverlee521 joverlee521 merged commit f9bca07 into master Jul 29, 2024
1 check passed
@joverlee521 joverlee521 deleted the ignore-cache branch July 29, 2024 17:40
@joverlee521
Copy link
Contributor Author

Confirmed that yesterday's run's completed successfully after updates (GenBank and GISAID).

Confirmed that the nextclade TSVs all contain a single version of Nextclade and the dataset

$ aws s3 cp s3://nextstrain-ncov-private/nextclade.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version	dataset_version
nextclade 3.8.2	2024-07-17--12-57-03Z
$ aws s3 cp s3://nextstrain-ncov-private/nextclade_21L.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version	dataset_version
nextclade 3.8.2	2024-07-17--12-57-03Z
$ aws s3 cp s3://nextstrain-data/files/ncov/open/nextclade_21L.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version	dataset_version
nextclade 3.8.2	2024-07-17--12-57-03Z
$ aws s3 cp s3://nextstrain-data/files/ncov/open/nextclade.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version	dataset_version
nextclade 3.8.2	2024-07-17--12-57-03Z

@joverlee521
Copy link
Contributor Author

There hadn't been a release of Nextclade or the SARS-CoV-2 Nextclade dataset since this was merged so I wasn't able to fully confirm this was working in production...

There was a release of the SARS-CoV-2 dataset on 2024-09-25 and on 2024-09-26 the automated workflows for GISAID and GenBank both ignored the cache and did a full Nextclade run as expected 🎉

joverlee521 added a commit that referenced this pull request Oct 4, 2024
Updating instructions as a follow up to
#466.
joverlee521 added a commit that referenced this pull request Oct 4, 2024
Updating instructions as a follow up to
#466.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ignore Nextclade cache on new Nextclade version or dataset version
2 participants