the first draft for protein structure prediction #191
### Protein secondary and tertiary structure

*Jinbo Xu is writing this*

Proteins play fundamental roles in all biological processes including the | ||
maintenance of cellular integrity, metabolism, transcription/translation, and
cell-cell communication. A complete description of protein structures and
functions is a fundamental step towards understanding biological life and is
also highly relevant to the development of therapeutics and drugs. UniProt
currently contains about 94 million protein sequences. Even after removing
redundancy at the 50% sequence identity level, UniProt still contains about
20 million protein sequences. However, fewer than 100,000 proteins have
experimentally-solved structures in the Protein Data Bank (PDB). As a result,
computational structure prediction is essential for the vast majority of
protein sequences. However, predicting protein 3D structure from sequence alone
is very challenging, especially when similar solved structures (called templates)
are not available in the PDB. Over the past decades, various computational methods
have been developed to predict different aspects of protein structure,
including secondary structure, torsion angles, solvent accessibility,
inter-residue contact maps, disordered regions and side-chain packing.

Machine learning has been extensively applied to protein structure prediction,
with some success. For example, secondary structure can be predicted with
about 80% 3-state (i.e., Q3) accuracy by PSIPRED, a neural network
method [@doi:10.1093/bioinformatics/16.4.404]. Since 2012, deep learning
has gradually been introduced to protein structure prediction. The deep
learning models adopted include deep belief networks, long short-term
memory (LSTM) networks, deep convolutional neural networks (DCNN)
and deep convolutional neural fields [@doi:10.1007/978-3-319-46227-1_1
@doi:10.1038/srep18962]. Here we focus on deep learning methods for
two representative subproblems: secondary structure prediction and
contact map prediction. Secondary structure refers to the local conformation of
a sequence segment, while a contact map contains information about the global
conformation. Secondary structure prediction is a basic problem and almost
an essential module of any protein structure prediction package. It has also
been used as a sequence labeling benchmark in the machine learning community.
Contact prediction is much more challenging than secondary structure prediction,
but it has a much larger impact on tertiary structure prediction.
In recent years, contact prediction has made good progress and its accuracy
has improved significantly [@doi:10.1371/journal.pcbi.1005324
@doi:10.1093/bioinformatics/btu791 @doi:10.1073/pnas.0805923106
@doi:10.1371/journal.pone.0028766].

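Because secondary structure prediction is framed as per-residue sequence labeling, classical predictors such as PSIPRED feed a fixed window of evolutionary profile (PSSM) columns around each residue into a feed-forward network. A minimal sketch of that input encoding (the `window_features` helper, the window size, and the zero-padding scheme are our illustrative assumptions, not taken from a specific paper):

```python
import numpy as np

# Illustrative sketch (window size and padding are assumptions): window-based
# predictors represent each residue by the evolutionary profile (PSSM)
# columns of a symmetric window around it, then classify the central residue.

def window_features(profile, window=15):
    """Stack a symmetric window of profile columns for each residue.

    profile: (L, 20) array with one row of profile features per residue.
    Returns an (L, window * 20) array; positions beyond either terminus
    are zero-padded.
    """
    L, d = profile.shape
    half = window // 2
    padded = np.vstack([np.zeros((half, d)), profile, np.zeros((half, d))])
    return np.stack([padded[i:i + window].ravel() for i in range(L)])
```

Each row would then be fed to a classifier that outputs 3-state (or 8-state) probabilities for the central residue; deep models such as DeepCNF instead convolve over the entire sequence and couple the per-residue outputs with a conditional random field layer.
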
Protein secondary structure can exhibit three different states (alpha helix,
beta strand and loop regions) or eight finer-grained states. More methods have
been developed to predict 3-state secondary structure than 8-state. A predictor
is typically evaluated by 3-state (i.e., Q3) or 8-state (i.e., Q8) accuracy.
Qi et al. developed a multi-task deep learning method to simultaneously predict several
local structure properties including secondary structure [@doi:10.1371/journal.pone.0032235].
Spencer, Eickholt and Cheng predicted secondary structure using deep belief networks
[@doi:10.1109/TCBB.2014.2343960]. Heffernan and Zhou et al. developed an iterative
deep learning framework to simultaneously predict secondary structure, backbone torsion
angles and solvent accessibility [@doi:10.1038/srep11476]. However, none of these deep
learning methods achieved significant improvement over PSIPRED [@doi:10.1006/jmbi.1999.3091]
in terms of Q3 accuracy. In 2014, Zhou and Troyanskaya demonstrated that they could
improve Q8 accuracy over a shallow learning architecture, conditional neural fields [@doi:10.1002/pmic.201100196],
by using a deep supervised and convolutional generative stochastic network [@arxiv:1403.1347],
but did not report any results in terms of Q3 accuracy. In 2016, Wang and Xu et al. developed a deep
convolutional neural fields (DeepCNF) model that significantly improves secondary
structure prediction in terms of both Q3 and Q8 accuracy [@doi:10.1038/srep18962].
DeepCNF is possibly the first method to report a Q3 accuracy of 84-85%, much higher than
the ~80% accuracy maintained by PSIPRED for more than 10 years.
It has also been reported that DeepCNF can improve prediction of solvent accessibility
and disordered regions [@doi:10.1007/978-3-319-46227-1_1]. This improvement may be mainly
due to the introduction of convolutional neural fields to capture long-range
sequential information, which is important for beta strand prediction. Nevertheless,
improving secondary structure prediction from 80% to 84-85% accuracy is unlikely to
result in a similar amount of improvement in tertiary structure prediction, since secondary
structure mainly reflects the coarse-grained local conformation of a protein structure.

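The Q3 metric above is simply per-residue accuracy over the three states (the same computation over eight states gives Q8). A toy sketch, with made-up labels for illustration:

```python
# Toy sketch of the Q3 metric: the fraction of residues whose predicted
# 3-state label (H = helix, E = strand, C = coil/loop) matches the
# experimentally assigned label. Labels below are made up for illustration.

def q3_accuracy(predicted: str, observed: str) -> float:
    """Per-residue accuracy over the three secondary structure states."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

pred = "CCHHHHHCCEEEECC"  # hypothetical prediction
true = "CCHHHHCCCEEEECC"  # hypothetical experimentally derived labels
print(f"Q3 = {q3_accuracy(pred, true):.3f}")  # 14 of 15 residues correct
```
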
Protein contact prediction and contact-assisted folding (i.e., folding proteins using
predicted contacts as restraints) represent a promising new direction for ab initio folding
of proteins without good templates in the PDB.
Evolutionary coupling analysis (ECA) is an effective contact prediction method for
proteins with a very large number (>1000) of sequence homologs [@doi:10.1371/journal.pone.0028766],
but ECA fares poorly for proteins without many sequence homologs. Since (soluble) proteins with
many sequence homologs are likely to have a good template in the PDB, to make contact-assisted
folding practically useful for ab initio folding it is essential to predict accurate contacts
for proteins without many sequence homologs. By combining ECA with a few other protein features,
shallow neural network methods such as MetaPSICOV [@doi:10.1093/bioinformatics/btu791] and
CoinDCA-NN [@doi:10.1093/bioinformatics/btv472] have shown some advantage over ECA
for proteins with a small number of sequence homologs, but their accuracy is still not very good.
In recent years, deep learning methods have been explored for contact prediction. For example,
Di Lena et al. introduced a deep spatio-temporal neural network (up to 100 layers) that utilizes both
spatial and temporal features to predict protein contacts [@doi:10.1093/bioinformatics/bts475].
Eickholt and Cheng combined deep belief networks and boosting techniques to predict protein contacts
[@doi:10.1093/bioinformatics/bts598], training deep networks by layer-wise unsupervised
learning followed by fine-tuning of the entire network. Skwark and Elofsson et al. developed
an iterative deep learning technique for contact prediction by stacking a series of Random Forests
[@doi:10.1371/journal.pcbi.1003889]. However, when blindly tested in the well-known CASP competitions,
these methods did not show any advantage over MetaPSICOV [@doi:10.1093/bioinformatics/btu791], a method
using two cascaded neural networks. Very recently, Wang and Xu et al. proposed a novel deep learning
method, RaptorX-Contact [@doi:10.1371/journal.pcbi.1005324], that significantly improves contact prediction
over MetaPSICOV, especially for proteins without many sequence homologs. RaptorX-Contact employs a network
architecture formed by one 1D residual neural network and one 2D residual neural network.
Blindly tested in the latest CASP competition (i.e., CASP12 [@url:http://www.predictioncenter.org/casp12/rrc_avrg_results.cgi]),
RaptorX-Contact ranked first in terms of total F1 score (a widely-used performance metric) on
free-modeling targets as well as on the whole set of targets. In the CASP12 test, the group ranked second
also employed a deep learning method, and even MetaPSICOV, which ranked third, employed more
and wider hidden layers than its old version. Wang and Xu et al. have also
demonstrated in another blind test, CAMEO (which can be interpreted as a fully-automated
CASP) [@url:http://www.cameo3d.org/], that their predicted contacts can help fold quite a few proteins
with a novel fold and only 65-330 sequence homologs, and that their method also works well on membrane
protein contact prediction even though it was trained mostly on non-membrane proteins. In fact, most of
the top 10 contact prediction groups in CASP12 employed some kind of deep learning technique. The
RaptorX-Contact method performed better mainly due to the introduction of residual neural networks
and the exploitation of contact occurrence patterns through simultaneous prediction of all contacts
in a single protein.
It is still possible to further improve contact prediction by studying new deep network architectures.
However, current methods fail when the protein in question has almost no sequence homologs, and it is
unclear whether there is an effective way to deal with such proteins other than waiting for more
sequence homologs. Finally, the deep learning methods summarized above also apply to interfacial
contact prediction for protein complexes, but may be less effective since on average protein
complexes have fewer sequence homologs.

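Contact predictors are commonly scored by the precision of their top-ranked long-range predictions: residue pairs are counted as true contacts when their C-beta atoms are within 8 Angstroms, and only the top L/k predictions (L being sequence length) among pairs separated by at least ~24 positions are assessed. A hedged sketch of that metric (the function name and default parameters are our choices for illustration, not taken from the cited papers):

```python
import numpy as np

# Illustrative sketch of top-L/k contact precision. `probs` is an L x L
# matrix of predicted contact probabilities; `contacts` is the true binary
# contact map (1 when two residues' C-beta atoms are within 8 Angstroms).
# Sequence separation >= 24 is a conventional long-range cutoff.

def top_lk_precision(probs, contacts, k=5, min_sep=24):
    """Precision of the top L/k predicted contacts among residue pairs
    separated by at least min_sep positions in sequence."""
    L = probs.shape[0]
    i, j = np.triu_indices(L, k=min_sep)   # long-range pairs only
    order = np.argsort(probs[i, j])[::-1]  # highest predicted probability first
    top = order[: max(1, L // k)]
    return float(contacts[i[top], j[top]].mean())
```

Assessments such as CASP's repeat this computation at several list lengths (e.g., L/10, L/5, L/2); F1, the headline CASP12 metric, additionally accounts for recall.
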
### Signaling