What is the significance of the numbering system for the recommended mutations in the python bin/recommend.py output? #40

danielguion · 2024-01-17T02:19:16Z

For example, in:

(base) user efficient-evolution-master % python bin/recommend.py QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSYNAVWNWIRQSPSRGLEWLGRTYYRSGWYNDYAESVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCARSGHITVFGVNVDAFDMWGQGTMVTVSS
/User/opt/anaconda3/lib/python3.8/site-packages/esm/pretrained.py:215: UserWarning: Regression weights not found, predicting contacts will not produce correct results.
  warnings.warn(
S44G	6
N74S	4
V35Y	3
E65P	3
D27F	2
W59S	2
M117Y	2
T53I	2
S42P	1
R56S	1
N61T	1
M123T	1
D27G	1
A34Y	1
P75R	1
I24V	1
P75V	1
P91A	1

What do the numbers 1, 2, 3, 4, and 6 mean? And why do some of the numbers appear more often than others? Are these recommended mutations meant to be applied individually to the protein I want to "evolve", or all at once?

I could not find an answer in the paper - my apologies if it is there and I'm not seeing it.

jamesrgraham · 2024-03-12T19:01:27Z

I'd like an answer to this as well. I've looked through the code, and in the amis.py script I found the deep mutational scanning function.

Deep Mutational Scanning (deep_mutational_scan function): This function prints the results of scanning mutations across a protein sequence. For each position in the sequence, it predicts how different mutations (amino acid substitutions) at that position could affect the protein based on the model's predictions. The specific print statements within this function output the position (pos), the mutation (mt), and the value (val), which likely represents the predicted impact or score of that mutation. The output format for each mutation would be pos mt val, providing detailed insights into how each possible mutation could affect the protein's function or stability.

So, maybe some sort of prioritization? Since they output it in large->small numbers, I would assume the ones with higher numbers are "better" in some way.
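For anyone who wants to rank or filter the output programmatically, here is a minimal sketch that parses the tab-separated `mutation<TAB>count` lines shown in the question. The function names (`parse_recommendations`, `apply_mutation`) and the threshold of 2 are my own assumptions for illustration and are not part of the repository. The positions in the output above line up with 1-based indexing into the input sequence (for example, residue 44 is S), which is what the sketch relies on.

```python
import re

# Matches lines such as "S44G<TAB>6"; anything else (e.g. the UserWarning) is skipped.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])\s+(\d+)$")


def parse_recommendations(text):
    """Return (wildtype, position, mutant, count) tuples from recommend.py-style output."""
    recs = []
    for line in text.splitlines():
        match = MUTATION_RE.match(line.strip())
        if match:
            wt, pos, mt, count = match.groups()
            recs.append((wt, int(pos), mt, int(count)))
    return recs


def apply_mutation(seq, wt, pos, mt):
    """Apply a single substitution, checking the 1-based wildtype residue first."""
    if seq[pos - 1] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[pos - 1]}")
    return seq[: pos - 1] + mt + seq[pos:]


# A few lines copied from the output in the question, including the warning text.
output = (
    "  warnings.warn(\n"
    "S44G\t6\n"
    "N74S\t4\n"
    "T53I\t        2\n"
    "S42P\t1\n"
)
recs = parse_recommendations(output)

# Hypothetical cutoff: keep substitutions whose number is at least 2.
top = [rec for rec in recs if rec[3] >= 2]
for wt, pos, mt, count in top:
    print(f"{wt}{pos}{mt}\t{count}")
```

The cutoff is arbitrary here. Whether a higher number really means "better" is exactly what I'm asking the devs.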

But I'd prefer an answer from the devs. What should we do with these numbers?

brianhie Mar 12, 2024
Maintainer

This is documented in the README: "the script will output a list of substitutions and the number of recommending language models."

The number indicates the count of language models for which the corresponding mutation has higher LM likelihood than wildtype. We use an ensemble of six language models, which is why the number is out of 6. We use these counts to prioritize mutations that have a consensus across multiple language models, as described in the methods of the paper.
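As a rough illustration of the counting described above (not the actual code in the repository), the sketch below tallies, for each substitution, how many models assign it a higher likelihood than the wildtype residue at that position. The per-model probability dictionaries, the two toy models, and the function name `count_recommendations` are all hypothetical. The real ensemble has six models and scores all amino acids.

```python
from collections import Counter


def count_recommendations(wildtype, per_model_probs):
    """Count, per substitution, the models that score it above wildtype.

    per_model_probs holds one entry per language model; each entry is a list
    (one item per sequence position) of {amino_acid: likelihood} dicts.
    """
    votes = Counter()
    for probs in per_model_probs:
        for pos, wt in enumerate(wildtype):
            wt_likelihood = probs[pos][wt]
            for aa, likelihood in probs[pos].items():
                if aa != wt and likelihood > wt_likelihood:
                    # Report positions 1-based, e.g. "S1G".
                    votes[f"{wt}{pos + 1}{aa}"] += 1
    # Highest consensus first; ties broken alphabetically for stable output.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy two-residue sequence scored by two made-up models.
wildtype = "SN"
model_a = [{"S": 0.2, "G": 0.7, "N": 0.1}, {"N": 0.6, "S": 0.3, "G": 0.1}]
model_b = [{"S": 0.3, "G": 0.6, "N": 0.1}, {"N": 0.4, "S": 0.5, "G": 0.1}]

ranked = count_recommendations(wildtype, [model_a, model_b])
for mutation, count in ranked:
    print(f"{mutation}\t{count}")
```

In this toy case both models prefer G over S at position 1, so S1G gets a count of 2, while only the second model prefers S over N at position 2, so N2S gets a count of 1.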

Hope that helps!

Answer selected by brianhie

jamesrgraham Mar 13, 2024

That’s the answer! I totally missed it.

Thank you.
