
the first draft for protein structure prediction #191

Merged (14 commits, Apr 7, 2017)
109 changes: 108 additions & 1 deletion sections/04_study.md
@@ -52,7 +52,114 @@ particularly notable in this area?*

### Protein secondary and tertiary structure
Collaborator:

Is this placeholder sub-section title okay?


*Jinbo Xu is writing this*
Proteins play fundamental roles in all biological processes including the
Contributor:

I agree with the purpose of this section's introductory paragraph being to first discuss what is being predicted, but I think we could be much less verbose. I think all we need here is:

  1. One sentence, like you have already, describing what a protein does.
  2. Quickly introduce the problem: There are millions of protein sequences, many of which are redundant and few structures are solved.
  3. Computational methods as one solution: Over several decades, these algorithms have predicted different aspects of secondary and tertiary structure such as...(as they are listed below) (is there a separate review paper we could cite that discusses protein structure algorithms specifically?)

maintenance of cellular integrity, metabolism, transcription/translation, and
cell-cell communication. Complete description of protein structures and
functions is a fundamental step towards understanding biological life and
also highly relevant in the development of therapeutics and drugs. UniProt
currently contains about 94 million protein sequences. Even after removing
redundancy at the 50% sequence identity level, about 20 million sequences remain.
However, fewer than 100,000 proteins have experimentally solved structures
in the Protein Data Bank (PDB). As a result, computational structure
prediction is essential for the majority of
protein sequences. However, predicting protein 3D structures from sequence alone
is very challenging, especially when similar solved structures (called templates)
are not available in PDB. In the past decades, various computational methods have
been developed to predict protein structure from different aspects,
including prediction of secondary structure, torsion angles, solvent accessibility,
inter-residue contact maps, disordered regions, and side-chain packing.

Machine learning is extensively applied to predict protein structures and
some success has been achieved. For example, secondary structure can be
predicted with about 80% 3-state (i.e., Q3) accuracy by the neural network
method PSIPRED [@doi:10.1093/bioinformatics/16.4.404]. Starting from
2012, deep learning has been gradually introduced to protein structure
prediction. The adopted deep learning models include deep belief networks,
long short-term memory (LSTM) networks, deep convolutional neural networks (DCNN),
and deep convolutional neural fields [@doi:10.1007/978-3-319-46227-1_1
@doi:10.1038/srep18962]. Here we focus on deep learning methods for
two representative subproblems: secondary structure prediction and
Collaborator:

Can you provide any intuition on why we want to focus on these subproblems? For example, is it where deep learning has had the biggest advantage over competitors or greatest success in an absolute sense?

Contributor (author) @j3xugit, Jan 27, 2017:

For protein contact prediction, it is true that deep learning has made the biggest progress, since this problem appeared very challenging before 2011. Only in recent years (2011-2016) have we seen big improvements. The first improvement came from structure learning (2011-2012) for co-evolutionary analysis (which mostly works on proteins with lots of sequence homologs), the second, milder improvement came from a simple neural network method, and the third big improvement came from my deep learning algorithm (which works on proteins with only dozens of sequence homologs).

The reason I chose secondary structure prediction is that it is a basic problem and also easy to understand. It is more accessible to pure machine learning people than other subproblems such as disorder prediction and solvent accessibility prediction, and it is almost an essential module of any protein structure prediction package.

Collaborator:

I'm strongly in favor of limiting scope and focusing on protein structure-related problems where 1) we have something to say instead of just enumerating methods and 2) deep learning is contributing something unique, so this sounds great to me. It could be helpful to bring some of the context in your comment into the manuscript. Something like:

Starting from 2012, deep learning has been gradually introduced to multiple aspects of protein structure prediction ranging from X [ref] to Y [ref] to Z [ref]. The adopted deep learning models include deep belief network, LSTM(long short-term memory), deep convolutional neural networks (DCNN) and deep convolutional neural fields[@doi:10.1007/978-3-319-46227-1_1 @doi:10.1038/srep18962]. Here we focus on two representative subproblems where deep learning has been especially impactful: secondary structure prediction and contact map prediction.

I don't necessarily like the phrasing I suggested, but the idea would be to show that we are explicitly narrowing our scope, calling out to a few other deep learning methods in other areas of structure prediction that might be interesting but are out of that scope, and stating why we focused on the sub-problem we did.

Contributor:

> Can you provide any intuition on why we want to focus on these subproblems? For example, is it where deep learning has had the biggest advantage over competitors or greatest success in an absolute sense?

> 1. we have something to say instead of just enumerating methods

Completely agree!

I think adding something similar to what @agitter outlined ("Starting from 2012...", also similar to @j3xugit's comment above) and being concise about how deep learning has indeed remarkably improved performance over time (citing papers with their architectures) may be preferable to enumerating methods in great detail.

I think the next two or three paragraphs could probably be combined if we adopt this strategy. As is, I think this section may be a bit long if we want to squeeze in other subsections within Study.

> 2. deep learning is contributing something unique, so this sounds great to me.

I agree that deep learning is improving performance, but, to play devil's advocate, who cares?

Perhaps it would be more pertinent and in line with what has been written in the section about morphological phenotypes and in the rough outline of the study section if we follow with a discussion about three forward-looking points:

- what knowledge of protein structure leads to
- how biomedical applications may be impacted by this knowledge
- whether deep learning approaches can get us to that point in the future

This could be added as a final short paragraph of the subsection if this is what we feel the main message for the review is.

Contributor:

This being said, I do not want to stall up progress for any reason! I understand that it is often much easier to work with a full manuscript once the tone/message is clarified so if the merge was close, I'd say go for it. (right now I'm trying to get a feel of how progress through PRs is being made!)

contact map prediction. Secondary structure refers to the local conformation of
a sequence segment, while a contact map captures information about the global
conformation. Secondary structure prediction is a basic problem and almost
an essential module of any protein structure prediction package. It has also
been used as a sequence labeling benchmark in the machine learning community.
Contact prediction is much more challenging than secondary structure prediction,
but it has a much larger impact on tertiary structure prediction.
In recent years, contact prediction has made good progress and its accuracy
has been significantly improved [@doi:10.1371/journal.pcbi.1005324
@doi:10.1093/bioinformatics/btu791 @doi:10.1073/pnas.0805923106
@doi:10.1371/journal.pone.0028766].
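To make the contact-map representation concrete, the sketch below derives a binary contact map from known 3D coordinates. This is an illustration only (the methods discussed here *predict* contacts from sequence rather than compute them from solved structures); the 8 Å threshold on representative atoms (typically C-beta) is a common convention, assumed here.

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Binary contact map from one representative atom per residue.

    coords: (L, 3) array of 3D coordinates (e.g., the C-beta atom of each
    residue). Residues i and j are called "in contact" when their distance
    is below `threshold` angstroms (8 A is a common convention).
    """
    diff = coords[:, None, :] - coords[None, :, :]  # (L, L, 3) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # (L, L) distance matrix
    return dist < threshold

# Toy example: three residues spaced 5 A apart along the x-axis.
xyz = np.array([[0.0, 0.0, 0.0],
                [5.0, 0.0, 0.0],
                [10.0, 0.0, 0.0]])
cmap = contact_map(xyz)  # residues 0-1 and 1-2 in contact; 0-2 (10 A) not
```

When evaluating contact predictors, trivially close pairs along the chain (small |i - j|) are usually excluded, since they carry little information about global conformation.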

Protein secondary structure can exhibit three different states (alpha helix,
beta strand, and loop regions) or eight finer-grained states. More methods have been
developed to predict 3-state secondary structure than 8-state. A predictor is
typically evaluated by 3-state (i.e., Q3) or 8-state (i.e., Q8) accuracy.
Qi et al. developed a multi-task deep learning method to simultaneously predict several
local structure properties including secondary structures [@doi:10.1371/journal.pone.0032235].
Spencer, Eickholt and Cheng predicted secondary structure using deep belief networks
[@doi:10.1109/TCBB.2014.2343960]. Heffernan and Zhou et al. developed an iterative
deep learning framework to simultaneously predict secondary structure, backbone torsion
angles and solvent accessibility [@doi:10.1038/srep11476]. However, none of these deep
learning methods achieved significant improvement over PSIPRED [@doi:10.1006/jmbi.1999.3091]
in terms of Q3 accuracy. In 2014, Zhou and Troyanskaya demonstrated that they could
improve Q8 accuracy over conditional neural fields, a shallow learning architecture [@doi:10.1002/pmic.201100196],
by using a deep supervised and convolutional generative stochastic network [@arxiv:1403.1347],
but did not report any results in terms of Q3 accuracy. In 2016 Wang and Xu et al. developed a deep
convolutional neural fields (DeepCNF) model that can significantly improve secondary
structure prediction in terms of both Q3 and Q8 accuracy [@doi:10.1038/srep18962].
DeepCNF is possibly the first method to report a Q3 accuracy of 84-85%, much higher
than the ~80% accuracy maintained by PSIPRED for more than 10 years.
It is also reported that DeepCNF can improve prediction of solvent accessibility
and disordered regions [@doi:10.1007/978-3-319-46227-1_1]. This improvement may be mainly
due to the introduction of convolutional neural fields to capture long-range
sequential information, which is important for beta strand prediction. Nevertheless,
improving secondary structure prediction from 80% to 84-85% is unlikely to
result in a similar amount of improvement in tertiary structure prediction since secondary
structure mainly reflects coarse-grained local conformation of a protein structure.
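The Q3 and Q8 metrics discussed above are simply per-residue classification accuracy over three or eight secondary-structure states. A minimal sketch, using the usual H/E/C 3-state labels:

```python
def q_accuracy(predicted, observed):
    """Per-residue secondary structure accuracy.

    With 3-state labels (H: helix, E: strand, C: coil/loop) this is Q3;
    with the 8 finer-grained DSSP states it is Q8.
    """
    assert len(predicted) == len(observed)
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Toy example: 7 of 8 residues predicted correctly.
q3 = q_accuracy("HHHECCCC", "HHHECCCE")  # 0.875
```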

Protein contact prediction and contact-assisted folding (i.e., folding proteins using
predicted contacts as restraints) represent a promising new direction for ab initio folding
of proteins without good templates in the PDB.

Evolutionary coupling analysis (ECA) is an effective contact prediction method for some
proteins with a very large number (>1000) of sequence homologs [@doi:10.1371/journal.pone.0028766],
but ECA fares poorly for proteins without many sequence homologs. Since (soluble) proteins with
many sequence homologs are likely to have a good template in PDB, to make contact-assisted
folding practically useful for ab initio folding, it is essential to predict accurate contacts
for proteins without many sequence homologs. By combining ECA with a few other protein features,
shallow neural network-based methods such as MetaPSICOV [@doi:10.1093/bioinformatics/btu791] and
CoinDCA-NN [@doi:10.1093/bioinformatics/btv472] have shown some advantage over ECA
for proteins with a small number of sequence homologs, but their accuracy is still not very good.

In recent years, deep learning methods have been explored for contact prediction. For example,
Di Lena et al. introduced a deep spatio-temporal neural network (up to 100 layers) that utilizes both
spatial and temporal features to predict protein contacts [@doi:10.1093/bioinformatics/bts475].
Eickholt and Cheng combined deep belief networks and boosting techniques to predict protein contacts
[@doi:10.1093/bioinformatics/bts598] and trained deep networks by layer-wise unsupervised
learning followed by fine-tuning of the entire network. Skwark and Elofsson et al. developed
an iterative deep learning technique for contact prediction by stacking a series of Random Forests
[@doi:10.1371/journal.pcbi.1003889]. However, blindly tested in the well-known CASP competitions,
these methods did not show any advantage over MetaPSICOV [@doi:10.1093/bioinformatics/btu791], a method
using two cascaded neural networks. Very recently, Wang and Xu et al. proposed a novel deep learning method
RaptorX-Contact [@doi:10.1371/journal.pcbi.1005324] that can significantly improve contact prediction
over MetaPSICOV especially for proteins without many sequence homologs. RaptorX-Contact employs a network
architecture formed by one 1D residual neural network and one 2D residual neural network.
Blindly tested in the latest CASP competition (i.e., CASP12 [@url:http://www.predictioncenter.org/casp12/rrc_avrg_results.cgi]),
RaptorX-Contact ranked first in terms of total F1 score (a widely used performance metric) on
free-modeling targets as well as on the whole set of targets. In the CASP12 test, the group that ranked second
also employed a deep learning method. Even MetaPSICOV, which ranked third in CASP12, employed more
and wider hidden layers than its older version. Wang and Xu et al. have also
demonstrated in another blind test, CAMEO (which can be interpreted as a fully automated
CASP) [@url:http://www.cameo3d.org/] that their predicted contacts can help fold quite a few proteins
with a novel fold and only 65-330 sequence homologs and that their method also works well on membrane
protein contact prediction even when trained mostly on non-membrane proteins. In fact, most of the top 10
contact prediction groups in CASP12 employed some kind of deep learning techniques. The RaptorX-Contact
method performed better mainly due to the introduction of residual neural networks and to exploiting
contact occurrence patterns by simultaneously predicting all the contacts in a single protein.
It is still possible to further improve contact prediction by studying new deep network architectures.
However, current methods fail when the proteins in question have almost no sequence homologs. It is unclear
whether there is an effective way to deal with such proteins other than waiting for more sequence homologs.
Finally, the deep learning methods summarized above also apply to interfacial contact prediction
of a protein complex, but may be less effective since on average protein complexes have fewer sequence homologs.
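A key architectural step in coupling a 1D sequence network with a 2D pairwise network, as in RaptorX-Contact, is lifting per-residue features to per-pair features. The sketch below shows the simplest such "outer concatenation" variant, pairing the feature vectors of residues i and j; the function name and details here are illustrative assumptions, not the published implementation (which reportedly also incorporates the midpoint position (i+j)/2).

```python
import numpy as np

def pairwise_features(seq_features):
    """Lift per-residue (1D) features to per-pair (2D) features.

    seq_features: (L, d) array produced by a 1D network over the sequence.
    Returns an (L, L, 2*d) array whose (i, j) entry concatenates the feature
    vectors of residues i and j; a 2D network then predicts each contact
    probability from this tensor.
    """
    L, d = seq_features.shape
    rows = np.repeat(seq_features[:, None, :], L, axis=1)  # entry (i, j) holds features of i
    cols = np.repeat(seq_features[None, :, :], L, axis=0)  # entry (i, j) holds features of j
    return np.concatenate([rows, cols], axis=-1)

# Toy check: 4 residues with 3 features each.
x = np.arange(12, dtype=float).reshape(4, 3)
pairwise = pairwise_features(x)  # shape (4, 4, 6)
```

The resulting tensor lets 2D convolutions predict each contact from the learned sequence context of both residues at once, which is one way to exploit contact occurrence patterns across the whole protein.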

### Signaling

Expand Down