Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: explicitly specify how the root and ambiguous states are handled… #1690

Merged
merged 11 commits into from
Dec 26, 2024

Conversation

rneher
Copy link
Member

@rneher rneher commented Nov 25, 2024

… during sequence reconstruction and mutation counting.

Fixes #1689

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

@rneher rneher requested a review from jameshadfield November 25, 2024 20:08
@jameshadfield
Copy link
Member

jameshadfield commented Nov 25, 2024

Running this on a mpox clade-I build which currently assigns all root mutations to the basal branch leading to clade Ib.

This PR: Branch leading to clade Ia

  • Augur refine divergence: 13 mutations.
  • Inferred mutations from ATGC to ATGC: 14 (e.g. A2963G)
  • Gap to base mutations: 205 (e.g. -19105T)
  • Base to gap mutations: 0

This PR: Branch leading to clade Ib

  • Augur refine divergence: 52 mutations.
  • Inferred mutations from ATGC to ATGC: 59
  • Gap to base mutations: 0
  • Base to gap mutations: 677

Pre-PR behaviour

  • Same divergence from augur refine (I didn't re-run the rule)
  • All mutations are on the branch leading to clade-Ib (not good, obviously)
    • Inferred mutations from ATGC to ATGC: 73 (all on the branch leading to clade-Ib)
    • Gap to base mutations: 0
    • Base to gap mutations: 882

So it's a lot better - and perhaps this PR is correct within its remit - but something's not quite right. The total number of inferred ATGC-ATGC mutations (73) is unchanged, but this doesn't match the refined branch lengths (65). Perhaps this is due to masking being used in the refine alignment but not considered for the conversion of subs/site to number of mutations?

Secondly, and this is just for my own interest, how are the gap mutations being allocated among sister branches? I'd have thought clade IIb gets 882 * 52/(52+13) = 706 but that branch gets 677.

I've got some tests here which use small contrived (50nt-ish) genomes where we can reason with the mutations so I'll take a look at them later.

@jameshadfield
Copy link
Member

jameshadfield commented Nov 26, 2024

Perhaps this is due to masking being used in the refine alignment but not considered for the conversion of subs/site to number of mutations?

I thought this was simply the subs/site scaled by some scalar, but it's actually doing the full inference and counting mutations, and this inference is modified by this PR (I missed this).

Rerunning mpox including the refine step makes the assigned mutations match the mutation count. The allocation of gap-mutations also follow the branch-lengths.

@jameshadfield
Copy link
Member

I looked into the tests, starting with the first test here which infers ancestral mutations given the reference via --root-sequence. This is easy to reason with because it uses the toy genome I introduced here:
image

The results using this PR are stochastic, I observed the following three reconstructions back-to-back-to-back:

  • Run 1: Test succeeded (including the non-parsimonious reconstruction at pos 18 noted in the test)
  • Run 2: pos 14 inferred to be T at the root with 2 mutations needed (the parsimonious reconstruction with C at the root needs only 1). Similarly we now infer pos 7 to be G at the root (this is the same mutation pattern across the tree as pos 18).
  • Run 3: Similar to run 2 however inferring pos 7 to be G at the root but using the parsimonious reconstruction of pos 18 (i.e. it matches the root).

For the same test but looking at the case where we don't supply the root-sequence we see similar stochasticity, with differences in root-sequence reconstruction observed at pos 7, 18, 33, 43.

@rneher
Copy link
Member Author

rneher commented Nov 27, 2024

thanks for digging into this. This makes all sense to me and is the expected result as far as i can tell. providing a --root-sequence does not actually affect the reconstruction, it just decorates the branch to root with all mutations necessary to construct the inferred root sequence from the provided root sequence.

@rneher
Copy link
Member Author

rneher commented Nov 27, 2024

note that the sample C is a lot closer to the root (0.02) than Node AB (0.06). So in most cases, the root will agree with C.

and yes, we infer the actual mutations for the branch length in mutation units. this made sense from the perspective of very similar genomes, where you want branch length to correspond to the mutations in a direct discrete way.

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look into the main issue, but noticed a small Python syntax change to be addressed.

augur/refine.py Outdated Show resolved Hide resolved
@huddlej
Copy link
Contributor

huddlej commented Dec 17, 2024

I tried to make the augur ancestral tests deterministic by adding a --seed argument to the ancestral command that controls TreeTime's random seed. I updated the failing tests to match the new expected "random" behavior and was able to consistently get those tests passing locally. See the commit messages for the details. I expect at least one other test to remain broken and this does feel a bit like whack-a-mole when such a fundamental part of augur (TreeTime) becomes stochastic. Another approach would be to provide an interface through refine and ancestral to select the original behavior of those tools prior to this PR's change, just to enable deterministic behavior.

Do folks feel like the --seed approach is a step in the right direction or should we try something else?

@victorlin
Copy link
Member

I think seeding is reasonable. It's also used in many augur filter functional tests with --subsample-seed.

@jameshadfield
Copy link
Member

Do folks feel like the --seed approach is a step in the right direction or should we try something else?

Yes, this seems like a good direction, both for testing and facilitating reproducible builds. I'm also happy to skip certain tests (or parts of tests) where expedient.

Copy link

codecov bot commented Dec 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.82%. Comparing base (77ae31e) to head (d618f53).
Report is 12 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1690      +/-   ##
==========================================
+ Coverage   72.79%   72.82%   +0.02%     
==========================================
  Files          79       79              
  Lines        8271     8272       +1     
  Branches     1691     1691              
==========================================
+ Hits         6021     6024       +3     
+ Misses       1961     1960       -1     
+ Partials      289      288       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@huddlej
Copy link
Contributor

huddlej commented Dec 18, 2024

Thanks, @victorlin and @jameshadfield! I managed to get the remaining broken tests to pass (stably?). If this looks good to you both and @rneher, then I can update the CHANGELOG to reflect the additional ancestral feature and we can merge.

rneher and others added 11 commits December 23, 2024 13:06
… during sequence reconstruction and mutation counting
Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>
Adds a new `--seed` argument to the ancestral command, following the
existing pattern in the refine command, and passes the user-provided
value to TreeTime's `TreeAnc` class to ensure deterministic outcomes for
random samples. Since this PR changes ancestral's root sequence
inference to be stochastic, this new feature allows us to get the same
"random" result with each run of our functional tests.

Correspondingly, this commit updates the non-VCF functional tests of
ancestral to use a fixed seed. This seed should fix the random behavior
to a specific outcome, but we need to update our expected JSON outputs
to match that outcome here. In this case, the change is an introduction
of a "T" at position 14 in the node_root of the "simple genome" data.
As in the previous commit, this commit updates the expected VCF output
to match the fixed "random" outcome for the given seed where position 14
is a T in the root node.
Updates the functional test of multiallele VCF behavior when one of the
alleles is a "N" character. The multiallele VCF in the original
test (setting sample B to genotype of 30G) caused the inference of
alleles at sites 7 and 14 to change substantially in this PR's new
implementation. This commit stabilizes the test by assigning both
samples A and B to the 30G genotype such that all nodes are inferred to
have a 30G and the root node gets a A30G mutation. It should maintain
the originally intended behavior of the test [1].

As part of this change, I've replaced the dynamically modified VCF used
by this test with a static version of the VCF in version control. This
change simplified test debugging by allowing me to inspect the input
file outside of the cram temporary environment and it also fixes an
issue with the original dynamic VCF where an additional line for site 30
was added but the original line was not being removed.

[1] #1380
The change to how augur refine can stochastically assign the root node
mutations led to a change in inferred branch lengths for our functional
tests. Since the test uses a random seed argument for augur refine, the
results of subsequent runs should be deterministically successful after
this change.
Updates data for augur translate functional tests to match the updated
test data for the ancestral command.
Updates SNPs in manually-curated VCF used to test augur translate so
they match the new expected genotype for the "simple genome" data at
position 14. This fixes the vcf.t translate test. Also updates the
expected amino acid translations JSON for the vcf with root mutation
test. As part of this update, I've created a complete copy of the
"truth" aa_muts.json with the root mutation and used this instead of the
version that was previously created by modifying the original aa_muts
file with sed. Although this change duplicates test data, I was able to
debug the test more easily with a complete version of the JSON living
outside of the cram tests.
Adds random seed argument to all functional tests of the ancestral
command. This should hopefully resolve remaining stochastic changes in
cram runs during CI.
@huddlej huddlej force-pushed the fix/root-state-in-sequence-reconstruction branch from f556cc9 to d618f53 Compare December 23, 2024 21:09
@huddlej
Copy link
Contributor

huddlej commented Dec 23, 2024

I tried to resolve a conflict in the change log with GitHub's web UI which ended up making a merge commit from master that I didn't want. To fix this and the change log, I rebased this branch onto master and force-pushed. Sorry for the mess! 🤦🏻

@rneher rneher merged commit a57d50c into master Dec 26, 2024
36 checks passed
@rneher rneher deleted the fix/root-state-in-sequence-reconstruction branch December 26, 2024 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistency of lengths and mutations assigned to branches that are children of the root.
5 participants