AY.96 maybe belongs inside AY.46, or even AY.46.6 #435

Closed
corneliusroemer opened this issue Feb 13, 2022 · 8 comments
Labels
correction Highlight an error in the description or definition

Comments

@corneliusroemer
Contributor

When building the Nextclade reference tree, containing among others 2 randomly chosen sequences from the designation list for AY.96, I noticed that the two sequences didn't cluster together.

One sequence seems to sit inside AY.46.

Could it be that AY.96 is either not monophyletic or that it actually belongs inside AY.46? Or do we think AY.46 defining mutation nuc: C10977T = ORF1ab: A3571V homoplasically appeared within AY.96?

How would one investigate? One option is to look at where all the AY.96 designated sequences get placed by Usher (and maybe also Nextclade?). In addition, it'd be handy to look at the mutations that differentiate AY.96 from base-Delta-21J. Something is up there, otherwise why would pangoLEARN misclassify so many as AY.46(.6)?

Here's Usher with all sequences classified as AY.96 by pangoLEARN: https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/singleSubtreeAuspice_genome_902b_937d50.json?c=pango_lineage_usher&label=nuc%20mutations:A28254C


@corneliusroemer
Contributor Author

I just uploaded all the AY.96 designation strain names to Usher, and (almost) all of the non-Botswana sequences are classified as AY.46.6 by Usher, whereas all the Botswana ones are AY.96.


https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/singleSubtreeAuspice_genome_2127b_93eea0.json?c=country&label=nuc%20mutations:A28254C,G28461A

In fact, AY.96 (proper) seems to be defined by the following 3 muts:

nuc: A28254C = ORF8: I121L
nuc: G28461A = N:G63D
nuc: C10977T = ORF1ab: A3571V

The confusion with AY.46 happens because that lineage is defined by:

nuc: C10977T = ORF1ab: A3571V

And the following AY.96 mutation seems to have arisen (homoplasically?) in AY.46.6:

nuc: G28461A = N:G63D

So there are a number of AY.46.6 (maybe 2k) that share 2 out of 3 mutations with AY.96. That'd be ok, if this hadn't been mixed up in designations.
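The overlap described above can be made concrete with a small set comparison. This is only a sketch built from the nucleotide mutations listed in this comment: the names `DEFINING` and `shared` are hypothetical, and the "AY.46.6 subset" entry contains just the two mutations mentioned here (C10977T inherited from AY.46, plus the possibly homoplasic G28461A), not a full lineage definition.

```python
# Nucleotide mutations relative to base Delta (21J), as listed in this comment.
# Illustrative only: these are not complete lineage definitions.
DEFINING = {
    "AY.96": {"A28254C", "G28461A", "C10977T"},
    "AY.46": {"C10977T"},
    # The AY.46.6 sequences that also carry N:G63D (assumed label for this sketch)
    "AY.46.6 subset": {"C10977T", "G28461A"},
}


def shared(a: str, b: str) -> set[str]:
    """Return the mutations that two entries of DEFINING have in common."""
    return DEFINING[a] & DEFINING[b]


for other in ("AY.46", "AY.46.6 subset"):
    common = shared("AY.96", other)
    total = len(DEFINING["AY.96"])
    print(f"AY.96 vs {other}: {len(common)}/{total} shared -> {sorted(common)}")
```

With these inputs, the AY.46.6 subset shares two of the three AY.96 mutations, which is the overlap that seems to have confused the designations.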

It's still not clear to me whether these are independent events or whether AY.96 is somehow related to AY.46.6, maybe through recombination?

But until this is clear we should probably clean up AY.96 and remove the false designations. This should help unconfuse pangoLEARN, too.

@AngieHinrichs
Member

G28461A is a reversion, ugh. A28461G is common to Delta. Those 'homoplasic reversions' in Delta (and even more so in Omicron) make building a good tree a lot harder than it would be if the genome sequences could just be perfect.

I wonder if perhaps we should mask 28461 in Delta in the UCSC/UShER tree. Looks like the reversion also affects AY.122.2. For the record, here's what we mask currently:

And also for reference: lineages are annotated on nodes of the UCSC/UShER tree using mutations in this file.

  • AY.96 is 21J + A28254C + optional reversion at G28461A (the file has A28461N instead of A28461G), not C10977T, although that doesn't mean that descendants of the AY.96 node couldn't add C10977T. (C21846T used to be required before it was masked; it's possible that this affected which sequences were selected as AY.96 representatives.)

  • AY.46 is 21J + C10977T (looks like C21846T was never a requirement, so probably representatives without C21846T would have been selected)

  • AY.46.6 is AY.46 + C6310T
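To see why these definitions leave some sequences ambiguous, here is a toy sketch of flat "required mutation" matching. It is not how the UCSC/UShER tree is annotated (that happens on tree nodes, not per sample); `REQUIRED`, `candidate_lineages`, and the two example mutation sets are hypothetical and only reuse the mutations named in this thread.

```python
# Required mutations on top of 21J, as summarized in the list above.
# Toy, flat matching: the real annotation is done on nodes of the tree.
REQUIRED = {
    "AY.96": {"A28254C"},
    "AY.46": {"C10977T"},
    "AY.46.6": {"C10977T", "C6310T"},
}


def candidate_lineages(sample_muts: set[str]) -> list[str]:
    """Lineages whose required mutations are all present in the sample."""
    return sorted(name for name, req in REQUIRED.items() if req <= sample_muts)


# Hypothetical samples, with mutations given relative to 21J.
botswana_like = {"A28254C", "G28461A", "C10977T"}
europe_like = {"C10977T", "C6310T", "G28461A"}

print(candidate_lineages(botswana_like))  # matches both AY.46 and AY.96
print(candidate_lineages(europe_like))    # matches AY.46 and AY.46.6
```

A sample carrying both A28254C and C10977T satisfies two unrelated definitions at once, so its placement depends on which mutation the tree treats as having arisen first.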

The daily build process includes a matOptimize run so when it's not super-clear in which order mutations arose, it's possible for branches to hop around a bit, and that has caused some problems for lineage designation like this. 🙁 A lineage designated using one version of the UCSC/UShER tree might be split in a later version.

So yeah, cleaning up the designations sounds reasonable to me. @chrisruis any other recollections/observations?

@chrisruis
Collaborator

Looking at the recent UShER tree, the Botswana AY.96 sequences cluster separately from AY.46 - their clade is defined by A28254C (Orf8:I121L). C10977T (Orf1ab:A3571V) then occurs one branch into the clade, after 1 sequence has diverged. Orf8:I121L doesn't look that reliable - it changes backwards and forwards quite a bit within Delta, including reverting within AY.96, and actually changes backwards and forwards in other lineages too. So this potentially isn't an ideal marker for a lineage.

I expect that if we masked Orf8:I121L, AY.96 would cluster with AY.46 due to the shared Orf1ab:A3571V. So whether we think it's a separate lineage or part of AY.46 potentially comes down to whether or not we trust Orf8:I121L, and that doesn't look like a really reliable defining mutation.

I think we definitely want to update the designations for the non-Botswana AY.96 sequences. And then @AngieHinrichs should we try masking Orf8:I121L and confirm whether the Botswana sequences also cluster within AY.46?

corneliusroemer added the correction label Feb 14, 2022
@AngieHinrichs
Member

> And then @AngieHinrichs should we try masking Orf8:I121L and confirm whether the Botswana sequences also cluster within AY.46?

Yes, I will give this a try today.

@AngieHinrichs
Member

Masking 28254 (ORF8:121) or 28461 (N:63) alone didn't merge the AY.96 Botswana samples into AY.46, but masking both of those sites did. It also seemed harmless, maybe even helpful, for other little branches in Delta with mutations (or probably false reversions) at those sites. So as of today's build, which hopefully will be ready by the end of tomorrow, I will mask both of those sites in the Delta branch.
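The set logic behind this result can be sketched as follows. This is a simplification that assumes the Botswana AY.96 sequences differ from AY.46 only by the two mutations named earlier in this thread; the real tree build (including matOptimize) is far more involved, and the helper names `position` and `mask` are hypothetical.

```python
# Toy mutation sets relative to 21J, taken from earlier comments in this thread.
AY96_BOTSWANA = {"A28254C", "G28461A", "C10977T"}
AY46 = {"C10977T"}


def position(mut: str) -> int:
    """Extract the genome position from a mutation string like 'A28254C'."""
    return int(mut[1:-1])


def mask(muts: set[str], masked_sites: set[int]) -> set[str]:
    """Drop every mutation that falls on a masked position."""
    return {m for m in muts if position(m) not in masked_sites}


for sites in ({28254}, {28461}, {28254, 28461}):
    # Mutations that would still separate the Botswana sequences from AY.46.
    remaining = mask(AY96_BOTSWANA, sites) - AY46
    print(sorted(sites), "-> still distinguishing:", sorted(remaining))
```

Masking either site alone leaves one mutation that still separates the two groups; only masking both leaves nothing but the shared C10977T.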

@corneliusroemer
Contributor Author

What's our conclusion now? Remove AY.96 entirely?

@AngieHinrichs
Member

Yep, with the new masking, AY.96 is gone from the 2022-02-17 tree (GISAID+public tree is on the main site; public-only tree will be updated in a few hours). The AY.96 designated sequences now have this breakdown of UShER lineage assignment:

      3 B.1.617.2   Botswana/R19B69_BHP_000534685/2021, Botswana/R21B2_BHP_AAB76120/2021, Botswana/R23B87_BHP_AAB117732/2021
     88 AY.46       all other Botswana, Finland/THL-202122692/2021, Germany/BE-RKI-I-222540/2021
      1 AY.46.2     Finland/12626/2021
     30 AY.46.6     Finland/THL-202127179/2021, all other Germany, all Italy, Slovenia, Sweden, Switzerland

chrisruis added a commit that referenced this issue Feb 21, 2022
@chrisruis
Collaborator

AY.96 has been withdrawn in v1.2.129
