
Commit

the first draft for protein structure prediction (#191)
This build is based on
b3c72b6.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/219705004
https://travis-ci.org/greenelab/deep-review/jobs/219705005

[ci skip]

The full commit message that triggered this build is copied below:

the first draft for protein structure prediction (#191)

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

Now each line has <80 chars (including space)

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Line wrap to trigger CI build

* Fix doi tag

* Fix arxiv reference
j3xugit committed Apr 7, 2017
1 parent f3cbfe0 commit 711df8c
Showing 5 changed files with 4,025 additions and 1 deletion.
109 changes: 108 additions & 1 deletion all-sections.md
Original file line number Diff line number Diff line change
Expand Up @@ -679,7 +679,114 @@ particularly notable in this area?*

### Protein secondary and tertiary structure

*Jinbo Xu is writing this*
Proteins play fundamental roles in all biological processes, including the
maintenance of cellular integrity, metabolism, transcription/translation, and
cell-cell communication. A complete description of protein structures and
functions is a fundamental step towards understanding biological life and is
also highly relevant to the development of therapeutics and drugs. UniProt
currently contains about 94 million protein sequences; even after removing
redundancy at the 50% sequence identity level, about 20 million remain.
However, fewer than 100,000 proteins have experimentally solved structures in
the Protein Data Bank (PDB). As a result, computational structure prediction
is essential for the vast majority of protein sequences. Predicting protein
3D structure from sequence alone is very challenging, however, especially
when similar solved structures (called templates) are not available in the
PDB. Over the past decades, various computational methods have been developed
to predict different aspects of protein structure, including secondary
structure, torsion angles, solvent accessibility, inter-residue contact maps,
disordered regions, and side-chain packing.
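
The per-residue prediction tasks listed above all start from a numerical
encoding of the amino-acid sequence. As a minimal illustration (our own
sketch, not taken from any cited method; real predictors add evolutionary
profiles such as PSSMs), a sequence can be one-hot encoded into an L x 20
feature matrix:

```python
import numpy as np

# The 20 standard amino acids; each residue becomes a 20-dimensional
# indicator vector. Helper names here are ours, for illustration only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return an (L, 20) matrix with a 1 in the column of each residue."""
    features = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        features[pos, AA_INDEX[aa]] = 1.0
    return features

x = one_hot_encode("MKTAYIAK")  # (8, 20) input features for an 8-mer
```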

Machine learning has been applied extensively to protein structure prediction
with some success. For example, secondary structure can be predicted with
about 80% three-state (Q3) accuracy by the neural network method PSIPRED
[@ref_160]. Starting from 2012, deep learning has gradually been introduced
to protein structure prediction. The adopted deep learning models include
deep belief networks, long short-term memory (LSTM) networks, deep
convolutional neural networks (DCNN), and deep convolutional neural fields
[@ref_157 @ref_37]. Here we focus on deep learning methods for two
representative subproblems: secondary structure prediction and contact map
prediction. Secondary structure refers to the local conformation of a
sequence segment, while a contact map encodes global conformation. Secondary
structure prediction is a basic problem and an essential module of almost any
protein structure prediction package; it has also been used as a sequence
labeling benchmark in the machine learning community. Contact prediction is
much more challenging than secondary structure prediction, but it has a much
larger impact on tertiary structure prediction. In recent years, contact
prediction has made good progress and its accuracy has been significantly
improved [@ref_166 @ref_163 @ref_159 @ref_167].
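
For concreteness, the Q3 metric mentioned above is simply per-residue
labeling accuracy over the three secondary structure states (here written H
for helix, E for strand, C for coil/loop); Q8 is the same formula over eight
states. A minimal sketch, with a helper name of our own choosing:

```python
def q_accuracy(predicted, true):
    """Fraction of residues whose predicted state matches the true state.
    Over 3-state labels this is Q3; over 8-state labels it is Q8."""
    assert len(predicted) == len(true)
    correct = sum(p == t for p, t in zip(predicted, true))
    return correct / len(true)

q3 = q_accuracy("HHHECCC", "HHHEECC")  # 6 of 7 residues correct
```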

Protein secondary structure can be described with three states (alpha helix,
beta strand, and loop regions) or eight finer-grained states. More methods
have been developed to predict 3-state than 8-state secondary structure, and
predictors are typically evaluated by 3-state (Q3) and 8-state (Q8) accuracy.
Qi et al. developed a multi-task deep learning method to simultaneously
predict several local structure properties, including secondary structure
[@ref_168]. Spencer, Eickholt, and Cheng predicted secondary structure using
deep belief networks [@ref_57]. Heffernan and Zhou et al. developed an
iterative deep learning framework to simultaneously predict secondary
structure, backbone torsion angles, and solvent accessibility [@ref_158].
However, none of these deep learning methods achieved significant improvement
over PSIPRED [@ref_156] in terms of Q3 accuracy. In 2014, Zhou and
Troyanskaya demonstrated that a deep supervised and convolutional generative
stochastic network [@ref_154] could improve Q8 accuracy over a shallow
learning architecture, conditional neural fields [@ref_155], but they did not
report results in terms of Q3 accuracy. In 2016, Wang and Xu et al. developed
a deep convolutional neural fields (DeepCNF) model that significantly
improves secondary structure prediction in terms of both Q3 and Q8 accuracy
[@ref_37]. DeepCNF is possibly the first method to report a Q3 accuracy of
84-85%, much higher than the roughly 80% accuracy that PSIPRED has maintained
for more than 10 years. DeepCNF has also been reported to improve prediction
of solvent accessibility and disordered regions [@ref_157]. This improvement
may be mainly due to the introduction of convolutional neural fields to
capture long-range sequential information, which is important for beta strand
prediction. Nevertheless, improving secondary structure prediction from 80%
to 84-85% is unlikely to yield a similar improvement in tertiary structure
prediction, since secondary structure mainly reflects coarse-grained local
conformation of a protein structure.
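
The convolutional component of models like DeepCNF lets each residue's
prediction depend on a window of neighboring residues. The numpy sketch
below is our own heavily simplified illustration (DeepCNF additionally
stacks many layers and a conditional random field on top): a single 1D
convolution over per-residue features.

```python
import numpy as np

def conv1d(features, kernels):
    """features: (L, d_in); kernels: (w, d_in, d_out) with odd window w.
    Returns (L, d_out), zero-padded at the sequence ends so every residue
    gets an output."""
    L, d_in = features.shape
    w, _, d_out = kernels.shape
    half = w // 2
    padded = np.vstack([np.zeros((half, d_in)),
                        features,
                        np.zeros((half, d_in))])
    out = np.zeros((L, d_out))
    for i in range(L):
        # (w, d_in) window centered on residue i, contracted against kernels
        out[i] = np.tensordot(padded[i:i + w], kernels,
                              axes=([0, 1], [0, 1]))
    return out
```

With a window size of, say, 11, each output already mixes information from
11 consecutive residues; stacking such layers widens the receptive field,
which is one way long-range context for beta strands can be captured.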

Protein contact prediction and contact-assisted folding (i.e., folding
proteins using predicted contacts as restraints) represent a promising new
direction for ab initio folding of proteins without good templates in the
PDB. Evolutionary coupling analysis (ECA) is an effective contact prediction
method for proteins with a very large number (>1000) of sequence homologs
[@ref_167], but it fares poorly for proteins without many sequence homologs.
Since (soluble) proteins with many sequence homologs are likely to have a
good template in the PDB, making contact-assisted folding practically useful
for ab initio folding requires accurate contact prediction for proteins
without many sequence homologs. By combining ECA with a few other protein
features, shallow neural network methods such as MetaPSICOV [@ref_163] and
CoinDCA-NN [@ref_164] have shown some advantage over ECA for proteins with a
small number of sequence homologs, but their accuracy is still limited.

In recent years, deep learning methods have been explored for contact
prediction. For example, Di Lena et al. introduced a deep spatio-temporal
neural network (up to 100 layers) that utilizes both spatial and temporal
features to predict protein contacts [@ref_161]. Eickholt and Cheng combined
deep belief networks and boosting techniques to predict protein contacts
[@ref_162], training deep networks by layer-wise unsupervised learning
followed by fine-tuning of the entire network. Skwark and Elofsson et al.
developed an iterative deep learning technique for contact prediction by
stacking a series of Random Forests [@ref_165]. However, when blindly tested
in the well-known CASP competitions, these methods did not show any advantage
over MetaPSICOV [@ref_163], a method using two cascaded neural networks. Very
recently, Wang and Xu et al. proposed a novel deep learning method,
RaptorX-Contact [@ref_166], that significantly improves contact prediction
over MetaPSICOV, especially for proteins without many sequence homologs.
RaptorX-Contact employs a network architecture formed by one 1D residual
neural network and one 2D residual neural network.

Blindly tested in the latest CASP competition (CASP12 [@ref_170]),
RaptorX-Contact ranked first in terms of total F1 score (a widely used
performance metric) on free-modeling targets as well as on the whole set of
targets. The group ranked second in CASP12 also employed a deep learning
method, and even MetaPSICOV, ranked third, employed more and wider hidden
layers than its older version. In fact, most of the top 10 contact prediction
groups in CASP12 employed some kind of deep learning technique. Wang and Xu
et al. have also demonstrated in another blind test, CAMEO (which can be
interpreted as a fully automated CASP) [@ref_169], that their predicted
contacts can help fold quite a few proteins with a novel fold and only 65-330
sequence homologs, and that their method also works well on membrane protein
contact prediction even when trained mostly on non-membrane proteins. The
RaptorX-Contact method performed better mainly due to its residual neural
networks and its exploitation of contact occurrence patterns through
simultaneous prediction of all contacts in a single protein. It may still be
possible to further improve contact prediction by studying new deep network
architectures, but current methods fail when the proteins in question have
almost no sequence homologs, and it is unclear whether there is an effective
way to handle such proteins other than waiting for more sequence homologs to
accumulate. Finally, the deep learning methods summarized above also apply to
interfacial contact prediction for protein complexes, but may be less
effective there since, on average, protein complexes have fewer sequence
homologs.
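
For readers unfamiliar with the F1 metric mentioned above: contact
predictions are typically scored by ranking residue pairs by predicted
probability, taking the top-scoring pairs, and computing precision, recall,
and F1 against the true contact map. The helper below is our own simplified
sketch (the official CASP evaluation further restricts to long-range pairs
and top-L/k lists):

```python
import numpy as np

def contact_f1(pred_prob, true_map, top_k):
    """Score the top_k highest-probability residue pairs (i < j) against a
    binary true contact map; returns (precision, recall, F1)."""
    L = true_map.shape[0]
    # Rank all upper-triangular pairs by predicted contact probability.
    pairs = [(pred_prob[i, j], i, j)
             for i in range(L) for j in range(i + 1, L)]
    pairs.sort(reverse=True)
    selected = pairs[:top_k]
    tp = sum(true_map[i, j] for _, i, j in selected)
    total_true = sum(true_map[i, j]
                     for i in range(L) for j in range(i + 1, L))
    precision = tp / top_k
    recall = tp / total_true if total_true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```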

### Signaling

Expand Down
32 changes: 32 additions & 0 deletions bibliography.bib
Original file line number Diff line number Diff line change
Expand Up @@ -969,3 +969,35 @@ @article{ref_151
year = {2016}
}


@article{ref_154,
abstract = {Predicting protein secondary structure is a fundamental problem in protein
structure prediction. Here we present a new supervised generative stochastic
network (GSN) based method to predict local secondary structure with deep
hierarchical representations. GSN is a recently proposed deep learning
technique (Bengio & Thibodeau-Laufer, 2013) to globally train deep generative
model. We present the supervised extension of GSN, which learns a Markov chain
to sample from a conditional distribution, and applied it to protein structure
prediction. To scale the model to full-sized, high-dimensional data, like
protein sequences with hundreds of amino acids, we introduce a convolutional
architecture, which allows efficient learning across multiple layers of
hierarchical representations. Our architecture uniquely focuses on predicting
structured low-level labels informed with both low and high-level
representations learned by the model. In our application this corresponds to
labeling the secondary structure state of each amino-acid residue. We trained
and tested the model on separate sets of non-homologous proteins sharing less
than 30% sequence identity. Our model achieves 66.4% Q8 accuracy on the CB513
dataset, better than the previously reported best performance 64.9% (Wang et
al., 2011) for this challenging secondary structure prediction problem.},
archiveprefix = {arXiv},
author = {Jian Zhou and Olga G. Troyanskaya},
eprint = {1403.1347v1},
file = {1403.1347v1.pdf},
link = {http://arxiv.org/abs/1403.1347v1},
month = {Mar},
primaryclass = {q-bio.QM},
title = {Deep Supervised and Convolutional Generative Stochastic Network for
Protein Secondary Structure Prediction},
year = {2014}
}

