Ignore cache if Nextclade or dataset version is different #466

joverlee521 · 2024-07-24T23:46:23Z

Description of proposed changes

Update the workflow to ignore the Nextclade cache if the current Nextclade version or the Nextclade dataset version differs from the version recorded in the cache.

Currently checks Nextclade and dataset versions of the first row of the nextclade.tsv file and formats them as the proposed JSON in #458. Once the version JSON file is in place, it should be easy to swap out the check for the new file.
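
A minimal sketch of the first-row check described above, written in Python rather than the shell used by the workflow. It assumes the TSV carries `nextclade_version` and `dataset_version` columns. The function name, the JSON key names, and the `seqName` sample row are illustrative placeholders; the schema proposed in #458 is not reproduced here.

```python
import csv
import io
import json


def cache_versions_from_tsv(tsv_text: str) -> dict:
    """Return the versions recorded in the first data row of a Nextclade TSV.

    Hypothetical helper: the dictionary keys are placeholders, not the
    schema proposed in #458.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    first_row = next(reader, None)
    if first_row is None:
        raise ValueError("nextclade.tsv has no data rows")
    return {
        "nextclade_version": first_row["nextclade_version"],
        "nextclade_dataset_version": first_row["dataset_version"],
    }


# Illustrative input only; a real nextclade.tsv has many more columns.
example = (
    "seqName\tnextclade_version\tdataset_version\n"
    "sample1\tnextclade 3.8.2\t2024-07-17--12-57-03Z\n"
)
print(json.dumps(cache_versions_from_tsv(example), indent=2))
```

Reading only the first row keeps the check cheap, but it relies on every row in the cached TSV sharing the same versions.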

Related issue(s)

Resolves #457

Checklist

Checks pass

joverlee521 · 2024-07-25T00:17:35Z

Tested that use_nextclade_cache works as expected

Setting up the trial S3 URL

Uploaded a nextclade.tsv to trial S3 URL that includes the current Nextclade version + Nextclade dataset version.

$ aws s3 cp s3://nextstrain-data/files/ncov/open/nextclade.tsv.zst - | zstd -T0 -dcq | head -n 2 > data/nextclade.tsv
$ ./vendored/upload-to-s3 data/nextclade.tsv s3://nextstrain-data/files/ncov/open/trial/ignore-cache/nextclade.tsv.zst

Testing the workflow would use cache as expected

Ran workflow up to the use_nextclade_cache rule and use_nextclade_cache correctly outputs true

$ nextstrain build --envdir ~/Repos/env.d/aws/ --image nextstrain/ncov-ingest . data/genbank/use_nextclade_cache.txt --configfile config/genbank.yaml --forceall --config keep_all_files=True s3_dst="s3://nextstrain-data/files/ncov/open/trial/ignore-cache"
$ cat data/genbank/use_nextclade_cache.txt
true

Testing the renew flag still works as expected

Uploaded renew flag to trial S3 URL

$ aws s3 cp - s3://nextstrain-data/files/ncov/open/trial/ignore-cache/nextclade.tsv.zst.renew < /dev/null

Workflow found the renew flag and use_nextclade_cache correctly outputs false

$ nextstrain build --envdir ~/Repos/env.d/aws/ --image nextstrain/ncov-ingest . data/genbank/use_nextclade_cache.txt --configfile config/genbank.yaml --forceall --config keep_all_files=True s3_dst="s3://nextstrain-data/files/ncov/open/trial/ignore-cache"
...
[INFO] Found renew flag
...
$ cat data/genbank/use_nextclade_cache.txt
false

Testing the Nextclade version check works as expected

Removed the renew flag

$ aws s3 rm s3://nextstrain-data/files/ncov/open/trial/ignore-cache/nextclade.tsv.zst.renew

Edited the nextclade_version locally to 3.8.1 and re-uploaded the nextclade.tsv
Workflow found different Nextclade versions and use_nextclade_cache correctly outputs false

$ nextstrain build --envdir ~/Repos/env.d/aws/ --image nextstrain/ncov-ingest . data/genbank/use_nextclade_cache.txt --configfile config/genbank.yaml --forceall --config keep_all_files=True s3_dst="s3://nextstrain-data/files/ncov/open/trial/ignore-cache"
...
[INFO] Current Nextclade version (nextclade 3.8.2) is different from cache version (nextclade 3.8.1)
...
$ cat data/genbank/use_nextclade_cache.txt
false

Testing the Nextclade dataset version check works as expected

Changed the nextclade_version back to 3.8.2
Edited the dataset_version to 2024-07-18--12-57-03Z and re-uploaded the nextclade.tsv
Workflow found different dataset versions and use_nextclade_cache correctly outputs false

$ nextstrain build --envdir ~/Repos/env.d/aws/ --image nextstrain/ncov-ingest . data/genbank/use_nextclade_cache.txt --configfile config/genbank.yaml --forceall --config keep_all_files=True s3_dst="s3://nextstrain-data/files/ncov/open/trial/ignore-cache"
...
[INFO] Current Nextclade dataset version (2024-07-17--12-57-03Z) is different from cache version (2024-07-18--12-57-03Z)
...
$ cat data/genbank/use_nextclade_cache.txt
false
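
The four test cases above exercise one decision, which can be sketched roughly as follows. This is an illustration under assumptions, not the workflow's actual implementation: the function name, the dictionary keys, the order of the two version checks, and the "Versions match" message are hypothetical. The other messages mirror the log lines shown in the test output.

```python
def decide_use_cache(renew_flag_exists: bool, current: dict, cached: dict):
    """Return (use_cache, reason) for the Nextclade cache decision.

    Hypothetical sketch: a renew flag or any version mismatch means the
    cache is ignored and a full Nextclade run is needed.
    """
    if renew_flag_exists:
        return False, "Found renew flag"
    if current["nextclade_version"] != cached["nextclade_version"]:
        return False, (
            f"Current Nextclade version ({current['nextclade_version']}) "
            f"is different from cache version ({cached['nextclade_version']})"
        )
    if current["dataset_version"] != cached["dataset_version"]:
        return False, (
            f"Current Nextclade dataset version ({current['dataset_version']}) "
            f"is different from cache version ({cached['dataset_version']})"
        )
    return True, "Versions match"  # placeholder message, not from the workflow


current = {
    "nextclade_version": "nextclade 3.8.2",
    "dataset_version": "2024-07-17--12-57-03Z",
}
stale = dict(current, nextclade_version="nextclade 3.8.1")

use_cache, reason = decide_use_cache(False, current, stale)
print(str(use_cache).lower(), "-", reason)
```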

bin/fetch-cache-version

workflow/snakemake_rules/nextclade.smk

Doing this in preparation for adding version checks to the decision tree of whether we should use the Nextclade cache. Replaces download of the empty .renew file with just a check that the S3 object exists to limit shuffling of files.
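
As a rough illustration of an existence check that downloads nothing, the sketch below shells out to `aws s3api head-object`, which exits non-zero when the object cannot be retrieved. It is not the script used in this PR; both function names are hypothetical, and a non-zero exit can also mean a permissions or network error rather than a missing object.

```python
import subprocess
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def s3_object_exists(url: str) -> bool:
    """Return True if `aws s3api head-object` succeeds for the URL.

    Assumes the AWS CLI is installed and credentials are configured.
    """
    bucket, key = split_s3_url(url)
    result = subprocess.run(
        ["aws", "s3api", "head-object", "--bucket", bucket, "--key", key],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


print(split_s3_url(
    "s3://nextstrain-data/files/ncov/open/trial/ignore-cache/nextclade.tsv.zst.renew"
))
```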

Avoids clash of downloaded Nextclade executable with the Nextclade command available in the environment. Includes the side-effect of the downloaded executable being removed as part of `bin/clean` when running the workflow without the `keep_all_files=True` config param. This ensures that the workflow will start from a clean slate.

joverlee521 · 2024-07-26T21:11:14Z

I plan to merge this on Monday so I can monitor the workflows during the week.

joverlee521 · 2024-07-30T21:15:43Z

Confirmed that yesterday's runs completed successfully after the updates (GenBank and GISAID).

Confirmed that the Nextclade TSVs each contain a single Nextclade version and a single dataset version:

$ aws s3 cp s3://nextstrain-ncov-private/nextclade.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version	dataset_version
nextclade 3.8.2	2024-07-17--12-57-03Z
$ aws s3 cp s3://nextstrain-ncov-private/nextclade_21L.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version	dataset_version
nextclade 3.8.2	2024-07-17--12-57-03Z
$ aws s3 cp s3://nextstrain-data/files/ncov/open/nextclade_21L.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version	dataset_version
nextclade 3.8.2	2024-07-17--12-57-03Z
$ aws s3 cp s3://nextstrain-data/files/ncov/open/nextclade.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version	dataset_version
nextclade 3.8.2	2024-07-17--12-57-03Z
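
For readers without `tsv-select` and `tsv-uniq`, the same uniqueness check can be sketched in Python. The function name and the sample rows are illustrative; only the two column names come from the commands above.

```python
import csv
import io


def unique_version_pairs(tsv_text: str) -> list:
    """Return distinct (nextclade_version, dataset_version) pairs in file order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # dict.fromkeys de-duplicates while preserving first-seen order.
    pairs = dict.fromkeys(
        (row["nextclade_version"], row["dataset_version"]) for row in reader
    )
    return list(pairs)


# Illustrative rows only; a real nextclade.tsv has many more columns.
sample = (
    "seqName\tnextclade_version\tdataset_version\n"
    "a\tnextclade 3.8.2\t2024-07-17--12-57-03Z\n"
    "b\tnextclade 3.8.2\t2024-07-17--12-57-03Z\n"
)
pairs = unique_version_pairs(sample)
print(pairs)
print("single version:", len(pairs) == 1)
```

A result with more than one pair would mean the cached TSV mixes rows from different Nextclade or dataset versions, in which case a first-row check alone would not be reliable.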

joverlee521 · 2024-10-04T16:43:54Z

There hadn't been a release of Nextclade or the SARS-CoV-2 Nextclade dataset since this was merged so I wasn't able to fully confirm this was working in production...

There was a release of the SARS-CoV-2 dataset on 2024-09-25 and on 2024-09-26 the automated workflows for GISAID and GenBank both ignored the cache and did a full Nextclade run as expected 🎉

joverlee521 force-pushed the ignore-cache branch from 3367f56 to 5999035 on July 24, 2024 23:50

Base automatically changed from update-vendored to master July 25, 2024 17:27

joverlee521 mentioned this pull request Jul 25, 2024

feat(nextclade): workaround for https://github.com/nextstrain/nextclade/issues/1422 no longer needed #441

Merged

3 tasks

genehack approved these changes Jul 26, 2024

joverlee521 and others added 5 commits July 26, 2024 13:49

Ignore cache if Nextclade or dataset version is different

325d4bd

Currently checks Nextclade and dataset versions of the first row of the nextclade.tsv file and formats them as the proposed JSON. Once the version JSON file is in place, it should be easy to swap out the check for the new file.

bin/fetch-cache-version: Add comment

988fbec

Document why we are not using `set -euo pipefail`
Co-authored-by: John SJ Anderson <janders4@fredhutch.org>

Simplify nextclade_dataset input

6c6a4ff

joverlee521 force-pushed the ignore-cache branch from 9f34d75 to 9a2ca57 on July 26, 2024 21:09

joverlee521 mentioned this pull request Jul 27, 2024

Surface Nextclade versions #467

Merged

1 task

joverlee521 merged commit f9bca07 into master Jul 29, 2024
1 check passed

joverlee521 deleted the ignore-cache branch July 29, 2024 17:40

joverlee521 added a commit that referenced this pull request Oct 4, 2024

Update README instructions for Nextclade re-runs

9ac6ce5

Updating instructions as a follow up to #466.

joverlee521 added a commit that referenced this pull request Oct 4, 2024

Update README instructions for Nextclade re-runs

efb17d5

Updating instructions as a follow up to #466.

joverlee521 mentioned this pull request Oct 4, 2024

Update README instructions for Nextclade re-runs #479

Merged
