-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clade Specification by AA and Nuc #206
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is excellent and very helpful for the migration of flu to modular augur. I haven't run the code yet, but I've included some (sort of pedantic) comments below to improve code readability.
augur/clades.py
Outdated
''' | ||
Determines whether a node contains all mutations that define a clade | ||
''' | ||
isClade = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, isClade
should be snakecased like is_clade
.
augur/clades.py
Outdated
isClade = False | ||
for gene, muts in clade_mutations.items(): | ||
if (gene in node_muts and node_muts[gene] != []): | ||
prs_muts = [(int(tu[1:-1]), tu[-1]) for tu in node_muts[gene]] #get mutations in right format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is hard to tell what is happening here based on the variable names and operations alone. Can you rename these variables (especially prs_muts
) to be more descriptive and include an inline comment above this line with an example of what tu
looks like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comparison with empty list is not ideal -- I am always worried it might be comparing pointers or similar. The != []
isn't actually needed, empty lists evaluate to false.
augur/translate.py
Outdated
@@ -170,6 +170,17 @@ def construct_mut(start, pos, end): | |||
def assign_aa_vcf(tree, translations): | |||
aa_muts = {} | |||
|
|||
#get mutations on the root | |||
root = tree.get_nonterminals()[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use tree.root
to get the root node.
augur/clades.py
Outdated
|
||
def read_in_clade_definitions(clade_file): | ||
''' | ||
Reads in tab-seperated file that defines clades by amino-acid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This docstring made me think clades couldn't be defined by nucleotides, but seeing the real TB clade definitions clarified this. Could you update the docstring comment and example to mention nucleotides, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, pending minor improvements.
augur/clades.py
Outdated
isClade = False | ||
for gene, muts in clade_mutations.items(): | ||
if (gene in node_muts and node_muts[gene] != []): | ||
prs_muts = [(int(tu[1:-1]), tu[-1]) for tu in node_muts[gene]] #get mutations in right format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comparison with empty list is not ideal -- I am always worried it might be comparing pointers or similar. The != []
isn't actually needed, empty lists evaluate to false.
augur/clades.py
Outdated
node_data = read_node_data(args.mutations, args.tree) | ||
if node_data is None: | ||
print("ERROR: could not read node data (incl sequences)") | ||
return -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
standard error code would be +1
http://tldp.org/LDP/abs/html/exitcodes.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this from translate
, so I'll change there as well
Thank you both - happy to make these changes. |
If all is ok, can we merge this? It has changes to the TB run, so would be easier to merge before I pull TB out to a separate repo. |
This PR should allow specification of clades by nucleotide and/or AA, as was previously done (hopefully) in old augur.
I based this from what I found in the flu builds, but it would be good for someone to check this does all it needs to - particularly for flu! I preserved the method of specifying the mutations by 'gene position alt' to make moving these definitions easier.
One change is that
translate
now writes out mutations on the root node. For Fasta-input this is empty, but for VCF input it will contain mutations from the reference. This was mostly to makeutils
read_node_data
work - otherwise the tree andaa_muts
JSON don't have the same number of nodes. But it has a nice side-effect of being able to see AA (and in particular DRM) mutations that apply to the whole dataset with VCF input - matching what's often shown in TB papers, for example.All the rest slots nicely into existing
auspice
functionality (clade_annotation
sets node labels,clade_membership
can be a colorBy), and if 633 is merged inauspice
, the clades specified here can allow zooming on the tree by clade via URL.I've changed the TB build so that it now uses this rather than the 'trait' 'cluster', and I've added it to the TB test as well - though it doesn't work perfectly as a node cannot be a clade, so it looks a bit off for a couple of nodes.
Clades in TB (rather than cluster, as previously):
Clade labels persist even when you change the colorBy.