The best probability and LDDT score to filter in easy-search #243

Jigyasa3 · 2024-02-17T03:09:25Z

Thank you again for a great resource!
I am using the foldseek easy-search command to annotate some proteins of interest, selecting the annotation with the highest prob and LDDT score for each protein. Is there a filter I can use to say confidently what the putative annotation of a protein of interest is?
For example, several hits have prob >0.7 but an LDDT score <0.3, while most of the proteins have prob >0.7 and an LDDT score >0.5. What is the "best" cutoff for annotating proteins using Foldseek?

At the same time, where can I find the target protein description? If my target protein is MGYP001275795760, where can I find its full name?

Any suggestions?

milot-mirdita · 2024-02-17T07:21:04Z

The safest cut-off is neither prob nor LDDT/TM-score (in our opinion), since neither has a built-in multiple testing correction. When searching against potentially hundreds of millions of entities, the E-value will likely be the most (or only) reliable indicator of homology for annotation. In your range, it's probably not possible to say for certain that any of the hits are reliable annotations. Do all of them have high E-values? With high E-values and uncertain LDDT/TM-score/prob, we can only establish that there is some structural similarity to be found; stronger statements require additional evidence.
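
To illustrate why a per-pair score without multiple testing correction becomes less trustworthy as the database grows, here is a minimal sketch. The false-positive rate and database sizes are hypothetical numbers chosen for illustration; they are not measured Foldseek properties.

```python
def expected_false_hits(per_pair_false_positive_rate: float, database_size: int) -> float:
    """Expected number of unrelated targets that pass a per-pair cutoff by chance.

    Assumes each query-target comparison is an independent trial with the
    same chance of passing the cutoff (a simplification).
    """
    return per_pair_false_positive_rate * database_size


# Hypothetical rate: one unrelated pair in a million passes the cutoff.
rate = 1e-6
for n_targets in (10_000, 1_000_000, 100_000_000):
    print(n_targets, expected_false_hits(rate, n_targets))
```

The same per-pair cutoff that is almost never wrong against a small database is expected to let through many chance hits against hundreds of millions of targets. An E-value already accounts for this scaling.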

The MGYP proteins come from MGnify. You can find the source assembly from the metadata on the MGnify download server: http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/

Specifically the [mgy_assemblies.tsv.gz](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/mgy_assemblies.tsv.gz) file. I don't think that the EBI offers a service yet to map MGYP accessions to their source.
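
A rough sketch of how the downloaded file could be scanned for one accession. The column layout of mgy_assemblies.tsv.gz is not described in this thread, so the function (a hypothetical helper, `find_accession_rows`) only assumes the file is tab-separated and that the accession appears as a whole field; it returns every matching row unchanged.

```python
import gzip
from typing import Iterator, List


def find_accession_rows(path: str, accession: str) -> Iterator[List[str]]:
    """Stream a gzipped TSV and yield rows containing `accession` as a field.

    The file is read line by line, so it never has to fit in memory.
    Assumption: tab-separated fields; no particular column order is assumed.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if accession in fields:
                yield fields


# Usage (after downloading the file to the working directory):
#   for row in find_accession_rows("mgy_assemblies.tsv.gz", "MGYP001275795760"):
#       print(row)
```

A single pass like this may take a while on a large file. For many lookups it would be better to collect all accessions of interest into a set and test each row against it once.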

Jigyasa3 · 2024-02-19T20:16:12Z

Hi @milot-mirdita ,

Thank you for replying! I wanted to confirm another thing: while the E-values of the results are high, the alignment length of the matches varies a lot. Some proteins have an alignment length of less than 50 amino acids (but high probability, LDDT score, and E-value).
I was wondering if these proteins can be considered as remote homologs?
Or would you suggest a more stringent filtering criterion for defining remote homologs?

Regards,
Jigyasa

milot-mirdita · 2024-02-20T05:12:44Z

Just to clarify and make sure that there are no miscommunications or typos: a high E-value is bad. E-values should be as low and as close to 0 as possible. Hits with E-values < 10^-3 are normally very certain homologs. For higher values you'd need other evidence to establish homology.
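
As a sketch of how the 10^-3 threshold could be applied to search results: the snippet below assumes the search was run with `--format-output "query,target,evalue,prob,lddt,alnlen"`, so that each line is a tab-separated record in that column order. The example rows are made up for illustration, and `best_hits` is a hypothetical helper, not part of Foldseek.

```python
import csv
import io
from typing import Dict, Iterable

# Assumed column order, matching the --format-output string in the lead-in.
COLUMNS = ["query", "target", "evalue", "prob", "lddt", "alnlen"]


def best_hits(lines: Iterable[str], max_evalue: float = 1e-3) -> Dict[str, dict]:
    """Keep, per query, the hit with the lowest E-value below `max_evalue`."""
    best: Dict[str, dict] = {}
    reader = csv.DictReader(lines, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        if evalue >= max_evalue:
            continue  # too uncertain to call a homolog on E-value alone
        current = best.get(row["query"])
        if current is None or evalue < float(current["evalue"]):
            best[row["query"]] = row
    return best


# Made-up rows: q1 has one confident and one weak hit, q2 has only a weak hit.
example = io.StringIO(
    "q1\tt1\t2.1e-10\t1.0\t0.81\t210\n"
    "q1\tt2\t5.0e-02\t0.74\t0.28\t45\n"
    "q2\tt3\t1.3e+00\t0.71\t0.52\t48\n"
)
hits = best_hits(example)
for query, row in hits.items():
    print(query, row["target"], row["evalue"])
```

Queries with no hit under the threshold (q2 here) are absent from the result. That means they are left unannotated; it is not evidence that no homolog exists.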

Jigyasa3 · 2024-02-20T05:19:16Z

Hi @milot-mirdita ,
I am comparing the output from Foldseek with hh-suite to find remote homologs, and I observe that none of the hits have E-values less than 1e-3.
Link to the open issue. Is there a way to examine false negatives?
