
the first draft for protein structure prediction #191

Merged
merged 14 commits into greenelab:master on Apr 7, 2017

Conversation

j3xugit
Contributor

@j3xugit j3xugit commented Jan 8, 2017

Any comments are appreciated.

@agitter agitter added the study label Jan 8, 2017
@agitter agitter mentioned this pull request Jan 8, 2017
@cgreene
Member

cgreene commented Jan 18, 2017

Hi @j3xugit - I have a quick request similar to #200. Can you reformat to 80 chars/line? GitHub only allows commenting on lines, so I can only comment at the paragraph level with the way things are set up here. Auto-formatting may make this change very quick. Thanks!

@j3xugit
Contributor Author

j3xugit commented Jan 19, 2017

Does GitHub provide any option for me to reformat the writing to 80 chars/line?
Sorry that I am new to GitHub.

@XieConnect
Contributor

@j3xugit I don't think GitHub helps you do that. You may need to break the lines with your local text editor and re-do the pull request.
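For example, a minimal sketch of one way to do the re-wrapping locally with Python's standard `textwrap` module (the file name is assumed from the commit titles below and the exact path in the repository may differ; this naive approach would not preserve Markdown headings or lists):

```python
import textwrap

def wrap_paragraphs(path, width=80):
    """Naively re-wrap blank-line-separated paragraphs to a fixed width."""
    with open(path) as handle:
        paragraphs = handle.read().split("\n\n")
    wrapped = [textwrap.fill(" ".join(p.split()), width=width)
               for p in paragraphs]
    with open(path, "w") as handle:
        handle.write("\n\n".join(wrapped) + "\n")

wrap_paragraphs("04_study.md")  # file name assumed from the commit titles below
```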

Now each line has <80 chars (including space)
Collaborator

@agitter agitter left a comment

Thanks for your contribution, @j3xugit. I think this is great overall. I have some specific comments, but we should be ready to merge soon. One overall thought is that we may need to define more of the terminology for novices.

I didn't review the technical content since I haven't read many of these papers.

maintenance of cellular integrity, metabolism, transcription/translation, and
cell-cell communication. Complete description of protein structures and
functions is a fundamental step towards understanding biological life and
also highly relevant in the development of therapeutics and drugs. Tons of
Collaborator

Not sure how to tie it in, but there is a potential link to our drug discovery section. No specific change requested.

cell-cell communication. Complete description of protein structures and
functions is a fundamental step towards understanding biological life and
also highly relevant in the development of therapeutics and drugs. Tons of
protein sequences have been generated, but fewer than 100,000 of them
Collaborator

Is there a reasonable estimate on the number of protein sequences we could provide? Just something in the right ballpark from an appropriate database?

Contributor Author

Yep, UniProt has about 94 million protein sequences. Even after removing redundancy at 50% sequence identity, UniProt still has about 20 million protein sequences.

Collaborator

Perfect, please add either of these values before we merge.

have experimentally-solved structures. As a result, computational structure
prediction is essential for a majority number of protein sequences. However,
predicting protein 3D structures from sequence alone is very challenging,
especially when similar templates are not available. In the past decades,
Collaborator

Do you think we need to define or explain templates?

Contributor Author

I will revise it to "especially when similar solved structures (called templates) are not available in the Protein Data Bank (PDB). "

various computational methods have been developed to predict protein
structure from different aspects, including prediction of secondary structure,
torsion angles, solvent accessibility, inter-residue contact map, disorder
regions and side-chain packing.
Collaborator

Perhaps cite a general structure prediction review here?

Contributor Author

Not sure if there is a recent review paper covering all these aspects.

Collaborator

Good point. I spent a few minutes looking and didn't find anything with the right scope. Let's not delay the merge looking for a review to cite, but we could add a TODO comment that it may be helpful to add one during revisions.


Machine learning is extensively applied to predict protein structures and
some success has been achieved. For example, secondary structure can be
predicted with about 80% of Q3 accuracy by a 2-layer neural network
Collaborator

Need to explain or define Q3 accuracy or preview that it will be explained below. This term will be unfamiliar to most readers on first encounter.

Contributor Author

@j3xugit j3xugit Jan 27, 2017

We can replace "Q3" by "3-state (i.e., Q3)".
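For context, Q3 (3-state) accuracy is simply the fraction of residues whose predicted secondary structure label, helix (H), strand (E), or coil (C), matches the observed label. A minimal sketch with toy labels:

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose 3-state label (H, E, or C) is correct."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

# Toy example: an 8-residue protein (H = helix, E = strand, C = coil)
print(q3_accuracy(predicted="HHHHCCEE", observed="HHHCCCEE"))  # 0.875
```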

method to simultaneously predict several local structure properties
including secondary structures [@doi:10.1371/journal.pone.0032235].
Cheng group predicted secondary structure using deep belief networks
[@doi: 10.1109/TCBB.2014.2343960]. Zhou developed an iterative deep
Collaborator

Remove space after doi:

Elofsson group developed an iterative deep learning technique for contact
prediction, in which Random Forests are applied to predict contacts at each
iteration [@doi:10.1371/journal.pcbi.1003889]. However, blindly tested in
the well-known CASP competitions, these methods did not show any
Collaborator

Add a CASP reference?

Contributor Author

will do

CASP) [@url:http://www.cameo3d.org/] that the predicted contacts can
fold quite a few proteins with a novel fold and only 65-330 sequence
homologs. Xu’s method also works well on membrane protein contact
prediction even if trained mostly by non-membrane proteins.
Collaborator

Are there a few sentences to add about where the field may go from here? I wouldn't think that we consider protein structure prediction to be "solved" now. Suggested ideas:

  • Without broadcasting your own group's plans, are there other ways deep learning can still further improve upon the state-of-the-art?
  • Are data, algorithms, or something else the bottleneck?
  • Are we close to or still very far from the level of accuracy needed to use predicted protein structures successfully in downstream applications (e.g. predicting protein-protein or protein-compound interactions).

Contributor Author

I guess it is possible to improve further. For example, we may try different network architectures. Data is the bottleneck. Some proteins just do not have any sequence homologs, which may cause any method to fail.

In terms of application, it really depends. For some proteins, we can produce really good models even by our contact-assisted folding. For example, a few weeks ago in the blind CAMEO test, our contact-based web server predicted 3D models for two membrane proteins of >200 residues with RMSD close to 2 Angstrom. Many proteins can also be well modeled by template-based methods. However, there are still many proteins for which we cannot produce high-resolution models. Nevertheless, for protein-protein interaction prediction, we may not need a very high resolution.

@@ -52,7 +52,104 @@ particularly notable in this area?*

### Protein secondary and tertiary structure
Collaborator

Is this placeholder sub-section title okay?

unsupervised learning followed by fine-tuning of the entire network.
Elofsson group developed an iterative deep learning technique for contact
prediction, in which Random Forests are applied to predict contacts at each
iteration [@doi:10.1371/journal.pcbi.1003889]. However, blindly tested in
Collaborator

This could be intriguing. Where does the deep learning come in if they use RF to predict contacts?

Contributor Author

They trained an RF to do prediction first, then fed the output of the first RF (along with the original input features) to the 2nd RF, and then fed the output of the 2nd RF (along with the original input) to the 3rd RF. They repeated this 4-5 times. This is not the typical deep learning method we talk about, but the authors considered their method to be deep learning.
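As an illustration of that kind of cascade, here is a rough sketch with scikit-learn random forests, where each stage sees the original features concatenated with the previous stage's predicted contact probabilities (this is not the authors' actual implementation or feature set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_cascade(X, y, n_stages=4):
    """Stage k is trained on the original features plus the contact
    probabilities predicted by stage k-1 (stage 0 sees only X)."""
    models, features = [], X
    for _ in range(n_stages):
        rf = RandomForestClassifier(n_estimators=100).fit(features, y)
        models.append(rf)
        prob = rf.predict_proba(features)[:, 1:2]   # P(contact) as a column
        features = np.hstack([X, prob])             # augment the original features
    return models

def predict_cascade(models, X):
    features = X
    for rf in models:
        prob = rf.predict_proba(features)[:, 1:2]
        features = np.hstack([X, prob])
    return prob.ravel()

# Toy usage: random numbers standing in for residue-pair features and labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
y_demo = rng.integers(0, 2, size=200)
cascade = train_cascade(X_demo, y_demo)
contact_prob = predict_cascade(cascade, X_demo)
```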

@j3xugit
Contributor Author

j3xugit commented Jan 27, 2017

Please read my comments; your feedback is really appreciated.

@agitter
Collaborator

agitter commented Jan 28, 2017

@j3xugit Thanks for responding to all of my comments. If you can please incorporate your comments from the discussion above into the text and fix some of the small formatting things (e.g. doi), I'll merge this.

Contributor

@gwaybio gwaybio left a comment

Took a stab at reviewing this PR. In general, I think there is a lot of really great info here! I also think a lot of it can be trimmed while preserving its flavor. I also made a couple comments about our general message and strategy for these sub-sections.

@agitter it looks like you were about ready to merge this so I don't want my comments to stall progress!

@@ -52,7 +52,113 @@ particularly notable in this area?*

### Protein secondary and tertiary structure

*Jinbo Xu is writing this*
Proteins play fundamental roles in all biological processes including the
Contributor

I agree with the purpose of this section's introductory paragraph being to first discuss what is being predicted, but I think we could be much less verbose. I think all we need here is:

  1. One sentence, like you have already, describing what a protein does.
  2. Quickly introduce the problem: There are millions of protein sequences, many of which are redundant and few structures are solved.
  3. Computational methods as one solution: Over several decades, these algorithms have predicted different aspects of secondary and tertiary structure such as...(as they are listed below) (is there a separate review paper we could cite that discusses protein structure algorithms specifically?)

LSTM(long short-term memory), deep convolutional neural networks (DCNN)
and deep convolutional neural fields[@doi:10.1007/978-3-319-46227-1_1
@doi:10.1038/srep18962]. Here we focus on deep learning methods for
two representative subproblems: secondary structure prediction and
Contributor

Can you provide any intuition on why we want to focus on these subproblems? For example, is it where deep learning has had the biggest advantage over competitors or greatest success in an absolute sense?

> 1. we have something to say instead of just enumerating methods

Completely agree!

I think adding something similar to what @agitter outlined ("Starting with 2012..." also similar to @j3xugit's comment above) and being concise about how deep learning has indeed remarkably improved performance over time (and citing papers with their architectures) may be preferable to enumerating methods in as great detail.

I think the next two or three paragraphs could probably be combined if we adopt this strategy. As is, I think this section may be a bit long if we want to squeeze in other subsections within Study.

> 1. deep learning is contributing something unique, so this sounds great to me.

I agree that deep learning is improving performance, but, to play devil's advocate, who cares?

Perhaps it would be more pertinent and in line with what has been written in the section about morphological phenotypes and in the rough outline of the study section if we follow with a discussion about three forward-looking points:

  • what knowledge of protein structure leads to
  • how biomedical applications may be impacted by this knowledge
  • if deep learning approaches can get us to that point in the future

This could be added as a final short paragraph of the subsection if this is what we feel the main message for the review is.

LSTM(long short-term memory), deep convolutional neural networks (DCNN)
and deep convolutional neural fields[@doi:10.1007/978-3-319-46227-1_1
@doi:10.1038/srep18962]. Here we focus on deep learning methods for
two representative subproblems: secondary structure prediction and
Contributor

This being said, I do not want to stall progress for any reason! I understand that it is often much easier to work with a full manuscript once the tone/message is clarified, so if the merge was close, I'd say go for it. (Right now I'm trying to get a feel for how progress through PRs is being made!)

@dhimmel dhimmel force-pushed the master branch 6 times, most recently from bd3cb76 to 9178a88 on February 26, 2017 01:50
@agitter
Collaborator

agitter commented Apr 7, 2017

@gwaygenomics You had left some comments here when we were last working on this topic. We should revisit some of these ideas during editing, but I think we're ready to merge this as a first draft. Do you agree?

Contributor

@gwaybio gwaybio left a comment

Agree

@agitter
Collaborator

agitter commented Apr 7, 2017

We should trigger a Travis CI build before merging to check the references.

@agitter
Collaborator

agitter commented Apr 7, 2017

The integration test failed. I'll have to see why before merging. @dhimmel I can probably figure out the problem myself, but I'm tagging you in case it is immediately obvious to you from the build logs.

@cgreene
Member

cgreene commented Apr 7, 2017 via email

@cgreene
Member

cgreene commented Apr 7, 2017 via email

@gwaybio
Contributor

gwaybio commented Apr 7, 2017

looks like line 137

@agitter
Collaborator

agitter commented Apr 7, 2017

Thanks @cgreene and @gwaygenomics. I was able to fix the references and the CI passes. I'll merge now.

@agitter agitter merged commit b3c72b6 into greenelab:master Apr 7, 2017
dhimmel pushed a commit that referenced this pull request Apr 7, 2017
This build is based on
b3c72b6.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/219705004
https://travis-ci.org/greenelab/deep-review/jobs/219705005

[ci skip]

The full commit message that triggered this build is copied below:

the first draft for protein structure prediction (#191)

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

Now each line has <80 chars (including space)

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Line wrap to trigger CI build

* Fix doi tag

* Fix arxiv reference
@dhimmel
Collaborator

dhimmel commented Apr 7, 2017

Yeah the error message is not good. Will look into making it better
