Make nextclade tree #28

Merged 8 commits on May 9, 2024
1 change: 1 addition & 0 deletions CHANGES.md
@@ -1,4 +1,5 @@
# CHANGELOG
* 9 May 2024: Create an N450 tree that can be used as part of a Nextclade dataset to assign genotypes to measles samples based on criteria outlined by the WHO [PR #28](https://github.com/nextstrain/measles/pull/28)
* 25 April 2024: Add specific sequences and metadata to the measles trees, including WHO reference sequences, vaccine strains, and genotypes reported on NCBI [PR #26](https://github.com/nextstrain/measles/pull/26)
* 10 April 2024: Add a single GH Action workflow to automate the ingest and phylogenetic workflows [PR #22](https://github.com/nextstrain/measles/pull/22)
* 2 April 2024: Add nextstrain-automation build-configs for deploying the final Auspice dataset of the phylogenetic workflow [PR #21](https://github.com/nextstrain/measles/pull/21)
36 changes: 36 additions & 0 deletions nextclade/README.md
@@ -0,0 +1,36 @@

# Measles Nextclade Dataset Tree

This workflow creates a phylogenetic tree that can be used as part of a Nextclade dataset to assign genotypes to measles samples based on [criteria outlined by the WHO](https://www.who.int/publications/i/item/WER8709).

The WHO has defined 24 measles genotypes based on N gene and H gene sequences from 28 reference strains. For new measles samples, genotypes can be assigned based on genetic similarity to the reference strains at the "N450" region (a 450 bp region of the N gene).

The tree created here includes N450 sequences for the 28 reference strains, along with other representative strains for each genotype.

The workflow includes the following steps:
* Build a tree using samples from the `ingest` output, with the following sampling criteria:
    * Exclude samples for which a genotype is NOT present on NCBI (indicated in the metadata column "genotype_ncbi")
    * Force-include the following samples:
        * WHO genotype reference strains
        * Vaccine strains
        * All available samples for genotypes that are poorly represented on NCBI (i.e., genotypes that have fewer than 10 samples on NCBI)
    * Subsampling criteria:
        * group_by: "region genotype_ncbi year"
        * subsample_max_sequences: 500
        * min_date: 1950
        * min_length: 400
* Assign genotypes to each sample and internal nodes of the tree with `augur clades`, using clade-defining mutations in `defaults/clades.tsv`
* Provide the following coloring options on the tree:
    * WHO reference strains ("True" or "False")
    * Genotype assignment from `augur clades`
    * Genotype assignment reported on NCBI
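
The "poorly represented" rule in the list above (fewer than 10 samples on NCBI) can be checked with a short script. This is a minimal sketch, not the workflow's own code: it assumes the metadata has already been loaded as a list of dicts with a `genotype_ncbi` key, and `rare_genotypes` and its `threshold` argument are hypothetical names chosen for illustration.

```python
from collections import Counter


def rare_genotypes(records, threshold=10):
    """Return the genotypes that have fewer than `threshold` samples.

    `records` is an iterable of dicts with a "genotype_ncbi" key. Rows with
    an empty genotype are ignored, mirroring the exclusion rule above.
    """
    counts = Counter(
        row["genotype_ncbi"] for row in records if row.get("genotype_ncbi")
    )
    return sorted(g for g, n in counts.items() if n < threshold)


# Toy metadata (made-up rows, for illustration only).
example = (
    [{"genotype_ncbi": "D8"}] * 12
    + [{"genotype_ncbi": "G2"}] * 2
    + [{"genotype_ncbi": ""}]
)
print(rare_genotypes(example))
```

Samples from the genotypes returned here are candidates for `defaults/include_strains.txt`. A genotype with no samples at all does not appear in the counts, so it is not reported.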

## How to create a new tree:
* Run the workflow: `nextstrain build .`
* Inspect the output tree by comparing genotype assignments from the following sources:
    * WHO reference strains
    * `augur clades` output
    * NCBI Datasets output
* If unwanted samples are present in the tree, add them to `defaults/dropped_strains.txt` and re-run the workflow
* If any changes are needed to the clade-defining mutations, edit `defaults/clades.tsv` and re-run the workflow
* Repeat as needed
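
The inspection step can be partly automated by listing samples whose `augur clades` genotype disagrees with the genotype reported on NCBI. The sketch below is an assumption about how that comparison might be done, not part of the workflow: it expects a per-sample table already loaded as dicts with `accession`, `genotype_ncbi` and `clade_membership` keys, and `genotype_mismatches` is a hypothetical helper name.

```python
def genotype_mismatches(records):
    """Return (accession, ncbi_genotype, clade) for samples that disagree.

    Samples missing either value are skipped, since there is nothing
    to compare for them.
    """
    mismatches = []
    for row in records:
        ncbi = row.get("genotype_ncbi", "")
        clade = row.get("clade_membership", "")
        if ncbi and clade and ncbi != clade:
            mismatches.append((row["accession"], ncbi, clade))
    return mismatches


# Made-up sample rows, for illustration only.
samples = [
    {"accession": "SAMPLE_1", "genotype_ncbi": "D8", "clade_membership": "D8"},
    {"accession": "SAMPLE_2", "genotype_ncbi": "B3", "clade_membership": "D4"},
    {"accession": "SAMPLE_3", "genotype_ncbi": "", "clade_membership": "A"},
]
for accession, ncbi, clade in genotype_mismatches(samples):
    print(f"{accession}: NCBI={ncbi} augur={clade}")
```

Each reported sample then needs a manual decision: drop it via `defaults/dropped_strains.txt`, or revisit the clade-defining mutations.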
24 changes: 24 additions & 0 deletions nextclade/Snakefile
@@ -0,0 +1,24 @@
configfile: "defaults/config.yaml"

rule all:
    input:
        auspice_json = "auspice/measles.json"

include: "rules/prepare_sequences.smk"
include: "rules/construct_phylogeny.smk"
include: "rules/annotate_phylogeny.smk"
include: "rules/export.smk"

# Include custom rules defined in the config.
if "custom_rules" in config:
    for rule_file in config["custom_rules"]:

        include: rule_file

rule clean:
    """Removing directories: {params}"""
    params:
        "results",
        "auspice"
    shell:
        "rm -rfv {params}"
62 changes: 62 additions & 0 deletions nextclade/defaults/auspice_config.json
@@ -0,0 +1,62 @@
{
  "title": "Real-time tracking of measles virus evolution",
  "maintainers": [
    {"name": "Kim Andrews", "url": "https://bedford.io/team/kim-andrews/"},
    {"name": "the Nextstrain team", "url": "https://nextstrain.org/team"}
  ],
  "build_url": "https://github.com/nextstrain/measles",
  "colorings": [
    {
      "key": "gt",
      "title": "Genotype",
      "type": "categorical"
    },
    {
      "key": "num_date",
      "title": "Date",
      "type": "continuous"
    },
    {
      "key": "clade_membership",
      "title": "MeV Genotype (Nextstrain)",
      "type": "categorical"
    },
    {
      "key": "region",
      "title": "Region",
      "type": "categorical"
    },
    {
      "key": "country",
      "title": "Country",
      "type": "categorical"
    },
    {
      "key": "genotype_ncbi",
      "title": "MeV Genotype (GenBank metadata)",
      "type": "categorical"
    },
    {
      "key": "is_reference",
      "title": "WHO Reference",
      "type": "categorical"
    }
  ],
  "geo_resolutions": [
    "country",
    "region"
  ],
  "display_defaults": {
    "map_triplicate": true,
    "color_by": "clade_membership"
  },
  "filters": [
    "clade_membership",
    "region",
    "country",
    "author"
  ],
  "metadata_columns": [
    "author"
  ]
}
64 changes: 64 additions & 0 deletions nextclade/defaults/clades.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
clade gene site alt
A nuc 89 A
A nuc 126 A
A nuc 422 T
A nuc 439 A
B1 nuc 36 G
B1 nuc 79 G
B1 nuc 111 T
B1 nuc 261 A
B1 nuc 279 G
B2 nuc 323 T
B3 nuc 351 C
C1 nuc 75 C
C1 nuc 246 T
C1 nuc 317 T
C1 nuc 354 A
C2 nuc 81 A
C2 nuc 118 T
D1 nuc 23 G
D1 nuc 45 C
D1 nuc 330 T
D10 nuc 123 G
D10 nuc 218 T
D10 nuc 323 G
D10 nuc 444 A
D11 nuc 54 A
D11 nuc 214 T
D2 nuc 207 G
D2 nuc 255 C
D2 nuc 15 A
D2 nuc 250 A
D3 nuc 21 C
D3 nuc 186 C
D3 nuc 275 C
D3 nuc 287 T
D4 nuc 133 A
D4 nuc 367 T
D4 nuc 416 T
D5 nuc 48 G
D5 nuc 105 C
D6 nuc 316 A
D7 nuc 339 C
D7 nuc 343 C
D8 nuc 226 A
D8 nuc 251 T
D9 nuc 216 A
E nuc 224 T
E nuc 329 A
E nuc 341 A
F nuc 222 A
F nuc 297 T
F nuc 332 T
G1 nuc 27 T
G1 nuc 177 G
G1 nuc 419 T
G2 nuc 242 A
G2 nuc 271 G
G3 nuc 3 A
H1 nuc 126 C
H1 nuc 224 A
H2 nuc 84 A
H2 nuc 171 A
H2 nuc 245 A
H2 nuc 387 T
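
To make the table above concrete, here is a simplified sketch of how clade-defining mutations could be matched against a single N450 sequence. It is not how `augur clades` works internally (that command annotates nodes of a tree); it only checks whether one sequence carries every listed allele. The function names are hypothetical, the input is assumed to be tab-separated, and sites are assumed to be 1-based positions.

```python
import csv
import io
from collections import defaultdict


def load_clade_definitions(handle):
    """Parse a clades table (columns: clade, gene, site, alt) into
    {clade: [(site, alt), ...]}, keeping nucleotide rows only."""
    definitions = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["gene"] == "nuc":
            definitions[row["clade"]].append((int(row["site"]), row["alt"]))
    return dict(definitions)


def matching_clades(sequence, definitions):
    """Return the clades whose defining alleles are all present in `sequence`.

    Sites are treated as 1-based; a site beyond the end of the sequence
    counts as a non-match.
    """
    return sorted(
        clade
        for clade, mutations in definitions.items()
        if all(
            site <= len(sequence) and sequence[site - 1] == alt
            for site, alt in mutations
        )
    )


# Toy table with made-up clades X and Y (not real measles genotypes).
toy_table = "clade\tgene\tsite\talt\nX\tnuc\t2\tA\nX\tnuc\t4\tT\nY\tnuc\t1\tG\n"
toy_definitions = load_clade_definitions(io.StringIO(toy_table))
print(matching_clades("CAGT", toy_definitions))
print(matching_clades("GACC", toy_definitions))
```

A sequence matching more than one clade, or none, would be a signal to revisit the defining mutations.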
59 changes: 59 additions & 0 deletions nextclade/defaults/colors.tsv
@@ -0,0 +1,59 @@
# Regions: These fields are identical to those in phylogenetic/defaults/colors.tsv. Any changes to one should also be made to the other.
region Asia #447CCD
region Oceania #5EA9A1
region Africa #8ABB6A
region Europe #BEBB48
region South America #E29E39
region North America #E2562B
#
# MeV Genotypes reported in NCBI GenBank metadata: These fields are identical to those in phylogenetic/defaults/colors.tsv. Any changes to one should also be made to the other.
genotype_ncbi A #5E1D9D
genotype_ncbi B1 #4B26B1
genotype_ncbi B2 #4138C3
genotype_ncbi B3 #3F4FCC
genotype_ncbi C1 #4065CF
genotype_ncbi C2 #447ACD
genotype_ncbi D1 #4A8BC3
genotype_ncbi D2 #529AB6
genotype_ncbi D3 #5BA6A6
genotype_ncbi D4 #66AE95
genotype_ncbi D5 #73B583
genotype_ncbi D6 #81B973
genotype_ncbi D7 #91BC64
genotype_ncbi D8 #A1BE58
genotype_ncbi D9 #B1BD4E
genotype_ncbi D10 #C0BA47
genotype_ncbi D11 #CEB541
genotype_ncbi E #DAAD3D
genotype_ncbi F #E19F3A
genotype_ncbi G1 #E68E36
genotype_ncbi G2 #E67832
genotype_ncbi G3 #E35F2D
genotype_ncbi H1 #DF4328
genotype_ncbi H2 #DB2823
#
# MeV Genotypes assigned by augur clades
clade_membership A #5E1D9D
clade_membership B1 #4B26B1
clade_membership B2 #4138C3
clade_membership B3 #3F4FCC
clade_membership C1 #4065CF
clade_membership C2 #447ACD
clade_membership D1 #4A8BC3
clade_membership D2 #529AB6
clade_membership D3 #5BA6A6
clade_membership D4 #66AE95
clade_membership D5 #73B583
clade_membership D6 #81B973
clade_membership D7 #91BC64
clade_membership D8 #A1BE58
clade_membership D9 #B1BD4E
clade_membership D10 #C0BA47
clade_membership D11 #CEB541
clade_membership E #DAAD3D
clade_membership F #E19F3A
clade_membership G1 #E68E36
clade_membership G2 #E67832
clade_membership G3 #E35F2D
clade_membership H1 #DF4328
clade_membership H2 #DB2823
25 changes: 25 additions & 0 deletions nextclade/defaults/config.yaml
@@ -0,0 +1,25 @@
strain_id_field: "accession"
files:
  exclude: "defaults/dropped_strains.txt"
  include: "defaults/include_strains.txt"
  reference_N450: "defaults/measles_reference_N450.gb"
  reference_N450_fasta: "defaults/measles_reference_N450.fasta"
  clades: "defaults/clades.tsv"
  colors: "defaults/colors.tsv"
  auspice_config: "defaults/auspice_config.json"
align_and_extract_N450:
  min_length: 400
  min_seed_cover: 0.01
filter:
  group_by: "region genotype_ncbi year"
  subsample_max_sequences: 500
  min_date: 1950
  min_length: 400
refine:
  coalescent: "opt"
  date_inference: "marginal"
  clock_filter_iqd: 4
ancestral:
  inference: "joint"
export:
  metadata_columns: "strain division location"
31 changes: 31 additions & 0 deletions nextclade/defaults/dropped_strains.txt
@@ -0,0 +1,31 @@
HM562901 # temara.MOR/24.03
HM562900 # Mvs/Toulon.FRA/08.07
#
# Incorrect genotypes reported on NCBI:
KY941950
KX610825
KX603680
KX455930
KX420624
KX024471
KP856709
KP734120
KP734117
KP056769
KM017095
KJ556875
KJ556859
KC139079
JX556685
JX556684
JN005809
GU937234
AF410987
#
# Genotype assignment is ambiguous based on tree structure:
KY678430
AB453044
#
# Clock rate outliers:
AB573812
HM562905
53 changes: 53 additions & 0 deletions nextclade/defaults/include_strains.txt
@@ -0,0 +1,53 @@
# Vaccine strain information from Parks et al. Comparison of predicted amino acid
# sequences of measles virus strains in the Edmonston vaccine lineage
# https://doi.org/10.1128/jvi.75.2.910-920.2001
AF266288
AF266287
AF266290
AF266289
AF266291
AF266286
#
# WHO genotype reference strains
# Information from https://www.who.int/publications/i/item/WER8709
AF045212
AF045217
AF079555
AF171232
AF243450
AF280803
AF481485
AJ232203
AY037020
AY043459
AY184217
AY923185
D01005
GU440571
L46750
L46753
L46758
M89921
U01974
U01976
U01977
U01987
U01994
U01998
U64582
X84865
X84872
X84879
#
# Rare genotypes
# Including these to boost representation of these genotypes in the nextclade tree
AF410989 #Rare genotype: E
MG912591 #Rare genotype: G2
AY037009 #Rare genotype: G2
AY037043 #Rare genotype: H2
AY037026 #Rare genotype: H2
AY037028 #Rare genotype: D2
FJ668380 #Rare genotype: D10
MN017369 #Rare genotype: D11
KC968467 #Rare genotype: D11
KC968354 #Rare genotype: D11
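
The include and exclude lists mix accessions with `#` comments, both on their own lines and after an accession. A minimal sketch for reading such a file outside the workflow follows; it assumes `#` starts a comment anywhere on a line, as the files in this PR suggest, and `read_strain_list` is a hypothetical helper rather than an Augur function.

```python
def read_strain_list(lines):
    """Return the accessions from a strain list, dropping comments and blanks."""
    strains = []
    for line in lines:
        # Keep only the text before the first "#", then trim whitespace.
        name = line.split("#", 1)[0].strip()
        if name:
            strains.append(name)
    return strains


# A few lines in the same style as the include list above.
sample_lines = [
    "# Rare genotypes",
    "AF410989 #Rare genotype: E",
    "#",
    "MG912591 #Rare genotype: G2",
    "",
]
print(read_strain_list(sample_lines))
```

This is handy for quick checks such as confirming that no accession appears in both the include and the exclude list.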
8 changes: 8 additions & 0 deletions nextclade/defaults/measles_reference_N450.fasta
@@ -0,0 +1,8 @@
>lcl|NC_001498.1_cds_NP_056918.1_1 [gene=N] [locus_tag=MeVgp1] [db_xref=GeneID:1489804] [protein=nucleocapsid protein] [protein_id=NP_056918.1] [location=1233..1682] [gbkey=CDS]
GTCAGTTCCACATTGGCATCCGAACTCGGTATCACTGCCGAGGATGCAAGGCTTGTTTCAGAGAT
TGCAATGCATACTACTGAGGACAGGATCAGTAGAGCGGTCGGACCCAGACAAGCCCAAGTGTCATTTCTA
CACGGTGATCAAAGTGAGAATGAGCTACCAGGATTGGGGGGCAAGGAAGATAGGAGGGTCAAACAGGGTC
GGGGAGAAGCCAGGGAGAGCTACAGAGAAACCGGGTCCAGCAGAGCAAGTGATGCGAGAGCTGCCCATCC
TCCAACCAGCATGCCCCTAGACATTGACACTGCATCGGAGTCAGGCCAAGATCCGCAGGACAGTCGAAGG
TCAGCTGACGCCCTGCTCAGGCTGCAAGCCATGGCAGGAATCTTGGAAGAACAAGGCTCAGACACGGACA
CCCCTAGGGTATACAATGACAGAGATCTTCTAGAC
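
A quick sanity check on a reference like the one above is to confirm that the sequence length matches the span given in the header (`location=1233..1682`, which is 450 positions when both ends are counted). The sketch below is illustrative only; `fasta_lengths` and `span_length` are hypothetical helper names, and the toy record is made up.

```python
def fasta_lengths(text):
    """Return {record_id: sequence_length} for FASTA-formatted text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


def span_length(location):
    """Length of an inclusive 'start..end' span such as '1233..1682'."""
    start, end = (int(part) for part in location.split(".."))
    return end - start + 1


toy_fasta = ">ref [location=3..8]\nACGT\nAC\n"
print(fasta_lengths(toy_fasta))
print(span_length("1233..1682"))
```

If the two numbers disagree for the real file, the reference was probably truncated or wrapped incorrectly during extraction.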