Various masked LM ideas using ESM-2.
- masked_lm_clustering shows how to perform hierarchical clustering of latent embeddings of proteins using the masked protein language model ESM-2. It takes a sequence with one or more masked residues and computes the top $m$ most likely and least likely protein sequences, conditioned on all masked positions being masked simultaneously. It then clusters the sequences using persistent homology, DBSCAN, and HDBSCAN (along with $k$-Means and Agglomerative Clustering for comparison). HDBSCAN returns a clustering hierarchy reminiscent of an evolutionary tree for the protein sequences generated by the model.
- ems2_mutations implements part of the paper "Language models enable zero-shot prediction of the effects of mutations on protein function" using ESM-2 instead of ESM-1v. See also the Meta repo.
- scoring_mutations computes the `masked_marginal_score`, the `wild_type_marginal_score`, the `mutant_type_marginal_score`, and the `pseudolikelihood_score` for a list of mutated sequences predicted to be the most and least likely by ESM-2, based on a fixed wild-type sequence and a fixed target mutation sequence. This is closely related to the previous notebook, and finishes implementing the scoring functions described in "Language models enable zero-shot prediction of the effects of mutations on protein function", using ESM-2. You can swap out `facebook/esm2_t6_8M_UR50D` for one of the larger models.
- sequence_classification builds a basic protein sequence classifier with three labels: enzymes, receptor proteins, and structural proteins. It uses `facebook/esm2_t6_8M_UR50D`, and is thus lightweight and easy to train, yet accurate.
- residue_classification trains a small residue classifier using `facebook/esm2_t6_8M_UR50D` to classify residues into three classes: Exposed to Solvent, Binding Site, or Transmembrane Region.
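
The candidate-selection step in masked_lm_clustering can be illustrated without loading a model. The sketch below is not taken from the notebook. It assumes that, because all masked positions are predicted in one forward pass, a candidate's log-likelihood is the sum of its per-position log-probabilities. The function name `top_m_fillings` and the toy probability tables are hypothetical stand-ins for the softmaxed ESM-2 output at each masked position.

```python
import heapq
import itertools
import math


def top_m_fillings(position_probs, m, most_likely=True):
    """Rank joint fillings of several masked positions.

    position_probs: one dict per masked position mapping residue -> probability
    (a hypothetical stand-in for the softmaxed model output at that position).
    Returns a list of (filling, log_prob) pairs, best match first.
    """
    candidates = []
    residue_lists = [sorted(probs) for probs in position_probs]
    for combo in itertools.product(*residue_lists):
        # Positions are scored independently here, so log-probabilities add.
        log_prob = sum(
            math.log(probs[residue])
            for probs, residue in zip(position_probs, combo)
        )
        candidates.append(("".join(combo), log_prob))
    pick = heapq.nlargest if most_likely else heapq.nsmallest
    return pick(m, candidates, key=lambda item: item[1])


# Toy distributions over three residues at two masked positions (made-up values).
toy = [
    {"A": 0.7, "G": 0.2, "L": 0.1},
    {"K": 0.6, "R": 0.3, "E": 0.1},
]
best = top_m_fillings(toy, m=2)
worst = top_m_fillings(toy, m=2, most_likely=False)
```

The brute-force product grows as 20 to the power of the number of masked positions, so this form only suits a handful of masks; a beam or best-first search would be needed beyond that.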
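
The three marginal scores in scoring_mutations share one formula in the cited paper: sum, over the mutated positions, the log-probability of the mutant residue minus that of the wild-type residue. They differ only in the input the model sees (mutated positions masked, the unmasked wild type, or the unmasked mutant). The sketch below shows that shared formula on toy data. The names `marginal_score` and `log_softmax` and the logits are hypothetical, and the notebook's own functions may be organized differently.

```python
import math


def log_softmax(logits):
    """Convert a dict of residue -> logit into residue -> log-probability."""
    peak = max(logits.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {residue: v - log_norm for residue, v in logits.items()}


def marginal_score(logits_by_position, wild_type, mutant):
    """Sum log p(mutant residue) - log p(wild-type residue) over mutated sites.

    logits_by_position: hypothetical mapping from 0-based position to the
    model's logits there. The input that produced them decides the variant:
    mutated sites masked -> masked marginal; unmasked wild type -> wild-type
    marginal; unmasked mutant -> mutant marginal.
    """
    if len(wild_type) != len(mutant):
        raise ValueError("substitution scoring needs equal-length sequences")
    score = 0.0
    for pos, (wt, mt) in enumerate(zip(wild_type, mutant)):
        if wt != mt:
            log_probs = log_softmax(logits_by_position[pos])
            score += log_probs[mt] - log_probs[wt]
    return score


# Toy logits at position 2 of a 4-residue sequence (values are made up).
toy_logits = {2: {"A": 2.0, "V": 1.0, "L": 0.0}}
score = marginal_score(toy_logits, wild_type="MKAT", mutant="MKVT")
unchanged = marginal_score(toy_logits, wild_type="MKAT", mutant="MKAT")
```

A negative score means the model finds the mutant residue less likely than the wild-type residue at the mutated sites. With a real model, `logits_by_position` would be read from the masked LM output for the chosen input.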