
Commit

the first draft for protein structure prediction (#191)
This build is based on
b3c72b6.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/219705004
https://travis-ci.org/greenelab/deep-review/jobs/219705005

[ci skip]

The full commit message that triggered this build is copied below:

the first draft for protein structure prediction (#191)

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

Now each line has <80 chars (including space)

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Line wrap to trigger CI build

* Fix doi tag

* Fix arxiv reference
j3xugit committed Apr 7, 2017
1 parent f3cbfe0 commit 711df8c
Showing 5 changed files with 4,025 additions and 1 deletion.
109 changes: 108 additions & 1 deletion all-sections.md
Original file line number Diff line number Diff line change
Expand Up @@ -679,7 +679,114 @@ particularly notable in this area?*

### Protein secondary and tertiary structure

*Jinbo Xu is writing this*
Proteins play fundamental roles in all biological processes, including the
maintenance of cellular integrity, metabolism, transcription/translation, and
cell-cell communication. A complete description of protein structures and
functions is a fundamental step towards understanding biological life and is
also highly relevant to the development of therapeutics and drugs. UniProt
currently contains about 94 million protein sequences; even after removing
redundancy at the 50% sequence identity level, about 20 million remain.
However, fewer than 100,000 proteins have experimentally solved structures in
the Protein Data Bank (PDB). As a result, computational structure prediction
is essential for the vast majority of protein sequences. Predicting protein
3D structure from sequence alone is very challenging, however, especially
when similar solved structures (called templates) are not available in the
PDB. Over the past decades, various computational methods have been developed
to predict different aspects of protein structure, including secondary
structure, torsion angles, solvent accessibility, inter-residue contact maps,
disordered regions, and side-chain packing.
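
The per-residue prediction tasks listed above all start from a numerical
encoding of the amino-acid sequence. As a minimal illustration (our own
sketch, not taken from any cited method; real predictors add evolutionary
profiles such as PSSMs), a sequence can be one-hot encoded into an L x 20
feature matrix:

```python
import numpy as np

# The 20 standard amino acids; each residue becomes a 20-dimensional
# indicator vector. Helper names here are ours, for illustration only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return an (L, 20) matrix with a 1 in the column of each residue."""
    features = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        features[pos, AA_INDEX[aa]] = 1.0
    return features

x = one_hot_encode("MKTAYIAK")  # (8, 20) input features for an 8-mer
```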

Machine learning has been applied extensively to protein structure prediction
with some success. For example, secondary structure can be predicted with
about 80% three-state (Q3) accuracy by the neural network method PSIPRED
[@ref_160]. Starting from 2012, deep learning has gradually been introduced
to protein structure prediction. The adopted deep learning models include
deep belief networks, long short-term memory (LSTM) networks, deep
convolutional neural networks (DCNN), and deep convolutional neural fields
[@ref_157 @ref_37]. Here we focus on deep learning methods for two
representative subproblems: secondary structure prediction and contact map
prediction. Secondary structure refers to the local conformation of a
sequence segment, while a contact map encodes global conformation. Secondary
structure prediction is a basic problem and an essential module of almost any
protein structure prediction package; it has also been used as a sequence
labeling benchmark in the machine learning community. Contact prediction is
much more challenging than secondary structure prediction, but it has a much
larger impact on tertiary structure prediction. In recent years, contact
prediction has made good progress and its accuracy has been significantly
improved [@ref_166 @ref_163 @ref_159 @ref_167].
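
For concreteness, the Q3 metric mentioned above is simply per-residue
labeling accuracy over the three secondary structure states (here written H
for helix, E for strand, C for coil/loop); Q8 is the same formula over eight
states. A minimal sketch, with a helper name of our own choosing:

```python
def q_accuracy(predicted, true):
    """Fraction of residues whose predicted state matches the true state.
    Over 3-state labels this is Q3; over 8-state labels it is Q8."""
    assert len(predicted) == len(true)
    correct = sum(p == t for p, t in zip(predicted, true))
    return correct / len(true)

q3 = q_accuracy("HHHECCC", "HHHEECC")  # 6 of 7 residues correct
```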

Protein secondary structure can be described with three states (alpha helix,
beta strand, and loop regions) or eight finer-grained states. More methods
have been developed to predict 3-state than 8-state secondary structure, and
predictors are typically evaluated by 3-state (Q3) and 8-state (Q8) accuracy.
Qi et al. developed a multi-task deep learning method to simultaneously
predict several local structure properties, including secondary structure
[@ref_168]. Spencer, Eickholt, and Cheng predicted secondary structure using
deep belief networks [@ref_57]. Heffernan and Zhou et al. developed an
iterative deep learning framework to simultaneously predict secondary
structure, backbone torsion angles, and solvent accessibility [@ref_158].
However, none of these deep learning methods achieved significant improvement
over PSIPRED [@ref_156] in terms of Q3 accuracy. In 2014, Zhou and
Troyanskaya demonstrated that a deep supervised and convolutional generative
stochastic network [@ref_154] could improve Q8 accuracy over a shallow
learning architecture, conditional neural fields [@ref_155], but they did not
report results in terms of Q3 accuracy. In 2016, Wang and Xu et al. developed
a deep convolutional neural fields (DeepCNF) model that significantly
improves secondary structure prediction in terms of both Q3 and Q8 accuracy
[@ref_37]. DeepCNF is possibly the first method to report a Q3 accuracy of
84-85%, much higher than the roughly 80% accuracy that PSIPRED has maintained
for more than 10 years. DeepCNF has also been reported to improve prediction
of solvent accessibility and disordered regions [@ref_157]. This improvement
may be mainly due to the introduction of convolutional neural fields to
capture long-range sequential information, which is important for beta strand
prediction. Nevertheless, improving secondary structure prediction from 80%
to 84-85% is unlikely to yield a similar improvement in tertiary structure
prediction, since secondary structure mainly reflects coarse-grained local
conformation of a protein structure.
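
The convolutional component of models like DeepCNF lets each residue's
prediction depend on a window of neighboring residues. The numpy sketch
below is our own heavily simplified illustration (DeepCNF additionally
stacks many layers and a conditional random field on top): a single 1D
convolution over per-residue features.

```python
import numpy as np

def conv1d(features, kernels):
    """features: (L, d_in); kernels: (w, d_in, d_out) with odd window w.
    Returns (L, d_out), zero-padded at the sequence ends so every residue
    gets an output."""
    L, d_in = features.shape
    w, _, d_out = kernels.shape
    half = w // 2
    padded = np.vstack([np.zeros((half, d_in)),
                        features,
                        np.zeros((half, d_in))])
    out = np.zeros((L, d_out))
    for i in range(L):
        # (w, d_in) window centered on residue i, contracted against kernels
        out[i] = np.tensordot(padded[i:i + w], kernels,
                              axes=([0, 1], [0, 1]))
    return out
```

With a window size of, say, 11, each output already mixes information from
11 consecutive residues; stacking such layers widens the receptive field,
which is one way long-range context for beta strands can be captured.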

Protein contact prediction and contact-assisted folding (i.e., folding
proteins using predicted contacts as restraints) represent a promising new
direction for ab initio folding of proteins without good templates in the
PDB. Evolutionary coupling analysis (ECA) is an effective contact prediction
method for proteins with a very large number (>1000) of sequence homologs
[@ref_167], but it fares poorly for proteins without many sequence homologs.
Since (soluble) proteins with many sequence homologs are likely to have a
good template in the PDB, making contact-assisted folding practically useful
for ab initio folding requires accurate contact prediction for proteins
without many sequence homologs. By combining ECA with a few other protein
features, shallow neural network methods such as MetaPSICOV [@ref_163] and
CoinDCA-NN [@ref_164] have shown some advantage over ECA for proteins with a
small number of sequence homologs, but their accuracy is still limited.

In recent years, deep learning methods have been explored for contact
prediction. For example, Di Lena et al. introduced a deep spatio-temporal
neural network (up to 100 layers) that utilizes both spatial and temporal
features to predict protein contacts [@ref_161]. Eickholt and Cheng combined
deep belief networks and boosting techniques to predict protein contacts
[@ref_162], training deep networks by layer-wise unsupervised learning
followed by fine-tuning of the entire network. Skwark and Elofsson et al.
developed an iterative deep learning technique for contact prediction by
stacking a series of Random Forests [@ref_165]. However, when blindly tested
in the well-known CASP competitions, these methods did not show any advantage
over MetaPSICOV [@ref_163], a method using two cascaded neural networks. Very
recently, Wang and Xu et al. proposed a novel deep learning method,
RaptorX-Contact [@ref_166], that significantly improves contact prediction
over MetaPSICOV, especially for proteins without many sequence homologs.
RaptorX-Contact employs a network architecture formed by one 1D residual
neural network and one 2D residual neural network.

Blindly tested in the latest CASP competition (CASP12 [@ref_170]),
RaptorX-Contact ranked first in terms of total F1 score (a widely used
performance metric) on free-modeling targets as well as on the whole set of
targets. The group ranked second in CASP12 also employed a deep learning
method, and even MetaPSICOV, ranked third, employed more and wider hidden
layers than its older version. In fact, most of the top 10 contact prediction
groups in CASP12 employed some kind of deep learning technique. Wang and Xu
et al. have also demonstrated in another blind test, CAMEO (which can be
interpreted as a fully automated CASP) [@ref_169], that their predicted
contacts can help fold quite a few proteins with a novel fold and only 65-330
sequence homologs, and that their method also works well on membrane protein
contact prediction even when trained mostly on non-membrane proteins. The
RaptorX-Contact method performed better mainly due to its residual neural
networks and its exploitation of contact occurrence patterns through
simultaneous prediction of all contacts in a single protein. It may still be
possible to further improve contact prediction by studying new deep network
architectures, but current methods fail when the proteins in question have
almost no sequence homologs, and it is unclear whether there is an effective
way to handle such proteins other than waiting for more sequence homologs to
accumulate. Finally, the deep learning methods summarized above also apply to
interfacial contact prediction for protein complexes, but may be less
effective there since, on average, protein complexes have fewer sequence
homologs.
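
For readers unfamiliar with the F1 metric mentioned above: contact
predictions are typically scored by ranking residue pairs by predicted
probability, taking the top-scoring pairs, and computing precision, recall,
and F1 against the true contact map. The helper below is our own simplified
sketch (the official CASP evaluation further restricts to long-range pairs
and top-L/k lists):

```python
import numpy as np

def contact_f1(pred_prob, true_map, top_k):
    """Score the top_k highest-probability residue pairs (i < j) against a
    binary true contact map; returns (precision, recall, F1)."""
    L = true_map.shape[0]
    # Rank all upper-triangular pairs by predicted contact probability.
    pairs = [(pred_prob[i, j], i, j)
             for i in range(L) for j in range(i + 1, L)]
    pairs.sort(reverse=True)
    selected = pairs[:top_k]
    tp = sum(true_map[i, j] for _, i, j in selected)
    total_true = sum(true_map[i, j]
                     for i in range(L) for j in range(i + 1, L))
    precision = tp / top_k
    recall = tp / total_true if total_true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```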

### Signaling

Expand Down
32 changes: 32 additions & 0 deletions bibliography.bib
Original file line number Diff line number Diff line change
Expand Up @@ -969,3 +969,35 @@ @article{ref_151
year = {2016}
}


@article{ref_154,
abstract = {Predicting protein secondary structure is a fundamental problem in protein
structure prediction. Here we present a new supervised generative stochastic
network (GSN) based method to predict local secondary structure with deep
hierarchical representations. GSN is a recently proposed deep learning
technique (Bengio & Thibodeau-Laufer, 2013) to globally train deep generative
model. We present the supervised extension of GSN, which learns a Markov chain
to sample from a conditional distribution, and applied it to protein structure
prediction. To scale the model to full-sized, high-dimensional data, like
protein sequences with hundreds of amino acids, we introduce a convolutional
architecture, which allows efficient learning across multiple layers of
hierarchical representations. Our architecture uniquely focuses on predicting
structured low-level labels informed with both low and high-level
representations learned by the model. In our application this corresponds to
labeling the secondary structure state of each amino-acid residue. We trained
and tested the model on separate sets of non-homologous proteins sharing less
than 30% sequence identity. Our model achieves 66.4% Q8 accuracy on the CB513
dataset, better than the previously reported best performance 64.9% (Wang et
al., 2011) for this challenging secondary structure prediction problem.},
archiveprefix = {arXiv},
author = {Jian Zhou and Olga G. Troyanskaya},
eprint = {1403.1347v1},
file = {1403.1347v1.pdf},
link = {http://arxiv.org/abs/1403.1347v1},
month = {Mar},
primaryclass = {q-bio.QM},
title = {Deep Supervised and Convolutional Generative Stochastic Network for
Protein Secondary Structure Prediction},
year = {2014}
}

