The best probability and LDDT score to filter in easy-search #243
Comments
In our opinion, the safest cut-off is neither prob nor LDDT/TM-score, since neither has a built-in multiple-testing correction. When searching against potentially hundreds of millions of entities, the E-value will likely be the most reliable (possibly the only reliable) indicator of homology for annotation. In your range, it's probably not possible to say for certain that any of the hits are reliable annotations. Do all of them have high E-values? With high E-values and uncertain LDDT/TM-score/prob, we can only establish that there is some structural similarity; stronger statements require additional evidence.
The MGYP proteins come from MGnify. You can find the source assembly in the metadata on the MGnify download server: http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/ Specifically the
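To illustrate why a score without multiple-testing correction becomes less informative as the database grows, here is a minimal sketch. It is a generic illustration only, not Foldseek's actual E-value computation: the function name and the per-comparison probability are hypothetical, and the only relation used is that the expected number of chance hits is roughly the per-comparison false-positive probability times the number of comparisons.

```python
def expected_chance_hits(per_comparison_p: float, database_size: int) -> float:
    """Expected number of hits that reach a given score purely by chance.

    Assumption (illustrative, not Foldseek's formula): each of the
    `database_size` comparisons independently has probability
    `per_comparison_p` of reaching the score by chance.
    """
    return per_comparison_p * database_size


# The same per-comparison probability looks very different
# depending on how many targets are searched.
for size in (1_000, 1_000_000, 300_000_000):
    print(size, expected_chance_hits(1e-6, size))
```

An E-value already folds the database size into the number it reports, which is why it stays comparable across small and very large target databases, whereas a fixed prob or LDDT threshold does not.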
Hi @milot-mirdita, Thank you for replying! I wanted to confirm one more thing: while the E-values of the results are high, the alignment length of the matches varies a lot. Some proteins have an alignment length of less than 50 amino acids (but high probability, LDDT score, and E-value). Regards,
Just to clarify and make sure that there is no miscommunication or typo: a high E-value is bad. E-values should be as low and as close to 0 as possible. Hits with E-values of < 10^-3 are normally very certain homologs. For higher values you'd need other evidence to establish homology.
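The E-value rule of thumb above can be applied to the tabular search output with a few lines of Python. This is a minimal sketch under stated assumptions: it assumes the default BLAST-tab-like column order (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits), the function name `best_confident_hits` is hypothetical, and the example rows are made up for illustration rather than taken from a real search.

```python
import csv
import io

# Assumed default column order of the tabular output (adjust if
# a custom output format was requested).
COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
           "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def best_confident_hits(tsv_text, max_evalue=1e-3):
    """Return {query: row} with the lowest-E-value hit below max_evalue."""
    best = {}
    reader = csv.DictReader(io.StringIO(tsv_text),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        if evalue >= max_evalue:
            continue  # too likely to be a chance match; needs other evidence
        current = best.get(row["query"])
        if current is None or evalue < float(current["evalue"]):
            best[row["query"]] = row
    return best


# Made-up rows: one strong hit and two weak, short hits.
example = (
    "protA\ttargetA\t0.35\t210\t120\t5\t1\t210\t3\t215\t2.1e-12\t480\n"
    "protA\ttargetB\t0.20\t45\t30\t2\t10\t54\t8\t52\t1.5\t40\n"
    "protB\ttargetC\t0.18\t48\t35\t1\t5\t52\t7\t54\t0.4\t55\n"
)
hits = best_confident_hits(example)
for query, row in hits.items():
    print(query, row["target"], row["evalue"], row["alnlen"])
```

Queries that have no hit below the threshold (here the hypothetical `protB`) are simply absent from the result, which marks them as needing additional evidence rather than forcing an annotation onto them.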
Hi @milot-mirdita,
Hi @martin-steinegger,
Thank you again for a great resource!
I am using the foldseek easy-search command to annotate some proteins of interest. I am selecting the annotation with the highest prob and LDDT score for each protein. Is there a filter that I can use to say confidently what the putative annotation of a protein of interest is?
For example, I have several hits with a prob of >0.7 but an LDDT score of <0.3, while most of the proteins have a prob of >0.7 and an LDDT score of >0.5. What is the "best" cutoff for annotating proteins using Foldseek?
Also, where can I find the target protein description? If my target protein is MGYP001275795760, where can I find its full name?
Any suggestions?