Add t-SNE embeddings and clusters to workflow #176

Merged: 3 commits, Aug 29, 2024

@huddlej huddlej commented Aug 27, 2024

Description of proposed changes

Adds logic and scripts to the core workflow to create a t-SNE embedding per build with all segments and create HDBSCAN clusters from the embeddings. Updates the Auspice config JSONs for the public and private builds to include the new t-SNE coordinate fields and the t-SNE cluster field.

Although we select the same strains for each gene segment, not all segment sequences can be aligned. As a result, different sets of strains can appear in the final Nextclade alignment for each gene segment. Since the t-SNE embedding requires the same strains to be present for all of its inputs and in the same order, we need to find the list of strains shared across all segments, extract only those strain sequences from each segment alignment, and sort the alignments prior to calculating pairwise distances and the embedding. We could perform the necessary set operations on multiple segment alignments with shell commands only, but that shell-only implementation was not as readable as the simple Python script included here.
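For readers who want a feel for the set operations described above, here is a minimal sketch. It is not the script included in this PR; the function names (`read_fasta`, `shared_sorted_alignments`, `write_fasta`) are hypothetical, and it assumes plain FASTA alignments where the first whitespace-delimited token of each header is the strain name.

```python
from pathlib import Path


def read_fasta(path):
    """Return a dict mapping strain name to aligned sequence."""
    records = {}
    name = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first token of the header is the strain name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {strain: "".join(parts) for strain, parts in records.items()}


def shared_sorted_alignments(alignments):
    """Keep only strains present in every segment, in one shared sorted order.

    `alignments` maps a segment name to a {strain: sequence} dict.
    Returns a dict mapping segment name to a list of (strain, sequence) pairs.
    """
    shared = sorted(
        set.intersection(*(set(records) for records in alignments.values()))
    )
    return {
        segment: [(strain, records[strain]) for strain in shared]
        for segment, records in alignments.items()
    }


def write_fasta(records, path):
    """Write (strain, sequence) pairs to a FASTA file in the given order."""
    with open(path, "w") as handle:
        for strain, sequence in records:
            handle.write(f">{strain}\n{sequence}\n")
```

Sorting by strain name is just one way to guarantee the same order in every segment; any deterministic order shared by all outputs would satisfy the requirement.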

This approach is based on our work in Nanduri et al. showing that HDBSCAN clusters of t-SNE embeddings from HA and NA alignments can reliably detect phylogenetic clusters including reassortment groups.
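To illustrate the pairwise distance step that feeds the embedding, the sketch below computes Hamming distances per segment and sums them into one matrix. This is an illustration under stated assumptions, not the implementation used by the workflow: the helper names are hypothetical, summing per-segment distances is one plausible way to combine HA and NA, and skipping positions with gaps or `N` is an assumed convention. The strain list must already be shared and sorted across segments.

```python
from itertools import combinations


def hamming(a, b, ignored="-N"):
    """Count mismatched positions, skipping any position where either
    sequence has an ignored character (assumed here: gaps and N)."""
    return sum(
        1
        for x, y in zip(a, b)
        if x != y and x not in ignored and y not in ignored
    )


def summed_distance_matrix(segments, strains):
    """Sum per-segment Hamming distances into one symmetric matrix.

    `segments` maps a segment name to a {strain: sequence} dict and
    `strains` is the shared, sorted strain list used for rows and columns.
    """
    n = len(strains)
    matrix = [[0] * n for _ in range(n)]
    for records in segments.values():
        for i, j in combinations(range(n), 2):
            distance = hamming(records[strains[i]], records[strains[j]])
            matrix[i][j] += distance
            matrix[j][i] += distance
    return matrix
```

A matrix like this could then be passed to a t-SNE implementation that accepts precomputed distances, with HDBSCAN run on the resulting two-dimensional coordinates.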

Examples

H1N1pdm HA and NA clusters showing reassortment within subclade D.1:
image

H3N2 HA and NA clusters (unfortunately both in yellow) showing reassortment within a recent HA clade carrying an HA1:145N substitution:
image 2

Vic HA and NA clusters showing two clusters that completely lack monophyly in the HA tree but which appear monophyletic in the NA tree:
image 3

Related issue(s)

Closes #174

Checklist

  • Checks pass

Make the introduction of t-SNE embeddings more backward-compatible by
making them an opt-in feature of the workflow like titer models, LBI,
etc. Enables embeddings for CI (to catch errors) and for public and
private builds for nextstrain.org. We may further decide that the main
value of these embeddings is in the 2y builds and modify the
nextstrain-public config to only run embeddings for those builds.
@huddlej huddlej merged commit 08efd02 into master Aug 29, 2024
3 checks passed
@huddlej huddlej deleted the use-pathogen-embed branch August 29, 2024 17:09
Issue closed by this pull request: Run pathogen-embed on HA/NA alignments to flag putative reassortant clades