Add t-SNE embeddings and clusters to workflow #176

huddlej · 2024-08-27T19:51:05Z

Description of proposed changes

Adds logic and scripts to the core workflow to create a t-SNE embedding per build with all segments and create HDBSCAN clusters from the embeddings. Updates the Auspice config JSONs for the public and private builds to include the new t-SNE coordinate fields and the t-SNE cluster field.

Although we select the same strains for each gene segment, not all segment sequences can be aligned. As a result, different sets of strains can appear in the final Nextclade alignment for each gene segment. Since the t-SNE embedding requires the same strains to be present in all of its inputs and in the same order, we need to find the list of strains shared across all segments, extract only those strains' sequences from each segment alignment, and sort the alignments prior to calculating pairwise distances and the embedding. We could perform the necessary set operations on multiple segment alignments with shell commands only, but that implementation was less readable than the simple Python script included here.
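For reviewers who want a concrete picture of the set operations described above, here is a minimal sketch using only the standard library. It is not the script added in this PR; the function names (`parse_fasta`, `shared_strains`, `to_fasta`) and the toy HA/NA sequences are hypothetical and exist only to illustrate the intersect, filter, and sort steps.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a {strain_name: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the strain name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def shared_strains(alignments: List[Dict[str, str]]) -> List[str]:
    """Return the sorted names of strains present in every alignment."""
    if not alignments:
        return []
    common = set(alignments[0])
    for alignment in alignments[1:]:
        common &= set(alignment)
    return sorted(common)


def to_fasta(alignment: Dict[str, str], strains: List[str]) -> str:
    """Write only the given strains, in the given order, as FASTA text."""
    return "".join(f">{strain}\n{alignment[strain]}\n" for strain in strains)


# Toy alignments (hypothetical data): strain B is missing from NA.
ha = parse_fasta(">B\nACGT\n>A\nAC-T\n>C\nACGA\n")
na = parse_fasta(">C\nTTGA\n>A\nTTGC\n")

strains = shared_strains([ha, na])
ha_shared = to_fasta(ha, strains)
na_shared = to_fasta(na, strains)
```

After this step both outputs contain the same strains in the same order, which is the precondition the embedding needs before pairwise distances are calculated.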

This approach is based on our work in Nanduri et al. showing that HDBSCAN clusters of t-SNE embeddings from HA and NA alignments can reliably detect phylogenetic clusters including reassortment groups.

Examples

H1N1pdm HA and NA clusters showing reassortment within subclade D.1:

H3N2 HA and NA clusters (unfortunately both in yellow) showing reassortment within a recent HA clade carrying an HA1:145N substitution:

Vic HA and NA clusters showing two clusters that completely lack monophyly in the HA tree but which appear monophyletic in the NA tree:

Related issue(s)

Closes #174

Checklist

Checks pass

Make the introduction of t-SNE embeddings more backward-compatible by making them an opt-in feature of the workflow like titer models, LBI, etc. Enables embeddings for CI (to catch errors) and for public and private builds for nextstrain.org. We may further decide that the main value of these embeddings is in the 2y builds and modify the nextstrain-public config to only run embeddings for those builds.

huddlej requested review from trvrb, jameshadfield and rneher August 27, 2024 19:53

huddlej added 2 commits August 28, 2024 12:32

Fix typo in benchmark path

654b438

huddlej mentioned this pull request Aug 28, 2024

Add t-SNE embeddings and clusters nextstrain/avian-flu#88

huddlej merged commit 08efd02 into master Aug 29, 2024
3 checks passed

huddlej deleted the use-pathogen-embed branch August 29, 2024 17:09
