Add t-SNE embeddings and clusters to workflow #176
Merged
Description of proposed changes
Adds logic and scripts to the core workflow to create one t-SNE embedding per build from all segments and to create HDBSCAN clusters from those embeddings. Updates the Auspice config JSONs for the public and private builds to include the new t-SNE coordinate fields and the t-SNE cluster field.
Although we select the same strains for each gene segment, not all segment sequences can be aligned. As a result, different sets of strains can appear in the final Nextclade alignment for each gene segment. Since the t-SNE embedding requires all of its inputs to contain the same strains in the same order, we need to find the list of strains shared across all segments, extract only those strains' sequences from each segment alignment, and sort the alignments before calculating pairwise distances and the embedding. We could perform the necessary set operations on multiple segment alignments with shell commands alone, but that implementation was less readable than the simple Python script included here.
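For readers who want a feel for the set operations involved, here is a minimal sketch of the idea using only the standard library. It is not the script added in this PR; the function names `read_fasta` and `shared_sorted_alignments` and the in-memory FASTA parsing are assumptions made for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping strain name to sequence.

    Hypothetical helper: the strain name is taken to be the first
    whitespace-delimited token of each header line.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def shared_sorted_alignments(alignments):
    """Keep only strains present in every segment, sorted by strain name.

    `alignments` maps a segment name (e.g. "ha") to a dict of
    strain name -> aligned sequence. Returns a dict mapping each segment
    to a list of (strain, sequence) pairs in the same order for every
    segment.
    """
    # Strains shared across all segment alignments.
    shared = set.intersection(*(set(records) for records in alignments.values()))
    # Sorting gives every segment the same strain order.
    ordered = sorted(shared)
    return {
        segment: [(strain, records[strain]) for strain in ordered]
        for segment, records in alignments.items()
    }
```

With one alignment per segment loaded this way, `shared_sorted_alignments({"ha": ha, "na": na})` returns lists that can be written back out as FASTA files with matching strain order.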
This approach is based on our work in Nanduri et al. showing that HDBSCAN clusters of t-SNE embeddings from HA and NA alignments can reliably detect phylogenetic clusters including reassortment groups.
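The sketch below shows one way pairwise distances from several segment alignments could be combined into a single matrix before embedding. It is an illustration under stated assumptions, not the workflow's actual distance code: summing per-segment Hamming distances, skipping positions with gaps or ambiguous bases, and the names `hamming` and `combined_distance_matrix` are all hypothetical choices.

```python
VALID_BASES = set("ACGT")


def hamming(a, b):
    """Count mismatches between two aligned sequences.

    Assumed convention: positions where either sequence has a gap or an
    ambiguous base are skipped.
    """
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x in VALID_BASES and y in VALID_BASES
    )


def combined_distance_matrix(segments):
    """Sum per-segment pairwise distances into one symmetric matrix.

    `segments` is a list of sequence lists, one list per gene segment.
    Every list must hold the same strains in the same order, which is
    why the alignments are filtered and sorted beforehand.
    """
    n = len(segments[0])
    if any(len(sequences) != n for sequences in segments):
        raise ValueError("all segments must contain the same number of strains")

    matrix = [[0] * n for _ in range(n)]
    for sequences in segments:
        for i in range(n):
            for j in range(i + 1, n):
                distance = hamming(sequences[i], sequences[j])
                matrix[i][j] += distance
                matrix[j][i] += distance
    return matrix
```

A matrix like this could then be passed to a t-SNE implementation that accepts precomputed distances, and the resulting coordinates clustered with HDBSCAN.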
Examples
H1N1pdm HA and NA clusters showing reassortment within subclade D.1:
H3N2 HA and NA clusters (unfortunately both in yellow) showing reassortment within a recent HA clade carrying a HA1:145N substitution:
Vic HA and NA clusters showing two clusters that completely lack monophyly in the HA tree but which appear monophyletic in the NA tree:
Related issue(s)
Closes #174
Checklist