the first draft for protein structure prediction #191
Does GitHub provide any option for me to reformat the writing to 80 chars/line?
@j3xugit I don't think GitHub helps you do that. You may need to break the lines with your local text editor and redo the pull request.
Now each line has <80 chars (including space)
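For anyone re-wrapping a section locally, here is a minimal sketch using Python's standard `textwrap` module. The function name `rewrap` and the default width of 79 are assumptions for illustration, not part of this repository's tooling. The sketch treats blank lines as paragraph breaks and would merge Markdown headings or list items that share a paragraph, so it suits plain prose only.

```python
import textwrap


def rewrap(text: str, width: int = 79) -> str:
    """Re-wrap each blank-line-separated paragraph to at most `width` chars.

    Long tokens (for example a DOI citation tag) are never split, so a line
    holding a single over-long token can still exceed `width`.
    """
    paragraphs = text.split("\n\n")
    wrapped = [
        textwrap.fill(
            " ".join(paragraph.split()),  # collapse existing line breaks
            width=width,
            break_long_words=False,
            break_on_hyphens=False,
        )
        for paragraph in paragraphs
    ]
    return "\n\n".join(wrapped)
```

Keeping `break_long_words=False` is a deliberate choice here: splitting a citation tag across lines would be worse than one slightly long line.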
Thanks for your contribution, @j3xugit. I think this is great overall. I have some specific comments, but we should be ready to merge soon. One overall thought is that we may need to define more of the terminology for novices.
I didn't review the technical content since I haven't read many of these papers.
sections/04_study.md
maintenance of cellular integrity, metabolism, transcription/translation, and
cell-cell communication. Complete description of protein structures and
functions is a fundamental step towards understanding biological life and
also highly relevant in the development of therapeutics and drugs. Tons of
Not sure how to tie it in, but there is a potential link to our drug discovery section. No specific change requested.
sections/04_study.md
cell-cell communication. Complete description of protein structures and
functions is a fundamental step towards understanding biological life and
also highly relevant in the development of therapeutics and drugs. Tons of
protein sequences have been generated, but fewer than 100,000 of them
Is there a reasonable estimate on the number of protein sequences we could provide? Just something in the right ballpark from an appropriate database?
Yep, UniProt has about 94 million protein sequences. Even if we remove redundancy at 50% sequence identity, UniProt still has about 20 million protein sequences.
Perfect, please add either of these values before we merge.
sections/04_study.md
have experimentally-solved structures. As a result, computational structure
prediction is essential for a majority number of protein sequences. However,
predicting protein 3D structures from sequence alone is very challenging,
especially when similar templates are not available. In the past decades,
Do you think we need to define or explain templates?
I will revise it to "especially when similar solved structures (called templates) are not available in the Protein Data Bank (PDB). "
sections/04_study.md
various computational methods have been developed to predict protein
structure from different aspects, including prediction of secondary structure,
torsion angles, solvent accessibility, inter-residue contact map, disorder
regions and side-chain packing.
Perhaps cite a general structure prediction review here?
Not sure if there is a recent review paper covering all these aspects.
Good point. I spent a few minutes looking and didn't find anything with the right scope. Let's not delay the merge looking for a review to cite, but we could add a TODO comment that it may be helpful to add one during revisions.
sections/04_study.md
Machine learning is extensively applied to predict protein structures and
some success has been achieved. For example, secondary structure can be
predicted with about 80% of Q3 accuracy by a 2-layer neural network
Need to explain or define Q3 accuracy or preview that it will be explained below. This term will be unfamiliar to most readers on first encounter.
We can replace "Q3" by "3-state (i.e., Q3)".
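For readers unfamiliar with the term, Q3 accuracy is the fraction of residues whose three-state secondary structure label (commonly H for helix, E for strand, C for coil) is predicted correctly. A minimal sketch follows; the function name and the string encoding of the labels are assumptions for illustration and do not come from the cited papers.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose 3-state label (H, E or C) matches.

    Both arguments are per-residue label strings of equal length.
    """
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    if not observed:
        raise ValueError("label strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# One of eight residues is mislabelled (position 5: C instead of E).
print(q3_accuracy("HHHECCCC", "HHHEECCC"))  # 0.875
```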
sections/04_study.md
method to simultaneously predict several local structure properties
including secondary structures [@doi:10.1371/journal.pone.0032235].
Cheng group predicted secondary structure using deep belief networks
[@doi: 10.1109/TCBB.2014.2343960]. Zhou developed an iterative deep
Remove space after doi:
sections/04_study.md
Elofsson group developed an iterative deep learning technique for contact
prediction, in which Random Forests are applied to predict contacts at each
iteration [@doi:10.1371/journal.pcbi.1003889]. However, blindly tested in
the well-known CASP competitions, these methods did not show any
Add a CASP reference?
will do
sections/04_study.md
CASP) [@url:http://www.cameo3d.org/] that the predicted contacts can
fold quite a few proteins with a novel fold and only 65-330 sequence
homologs. Xu’s method also works well on membrane protein contact
prediction even if trained mostly by non-membrane proteins.
Are there a few sentences to add about where the field may go from here? I wouldn't think that we consider protein structure prediction to be "solved" now. Suggested ideas:
- Without broadcasting your own group's plans, are there other ways deep learning can still further improve upon the state-of-the-art?
- Are data, algorithms, or something else the bottleneck?
- Are we close to or still very far from the level of accuracy needed to use predicted protein structures successfully in downstream applications (e.g. predicting protein-protein or protein-compound interactions)?
I guess it is possible to improve further. For example, we may try different network architectures. Data is the bottleneck. Some proteins just do not have any sequence homologs, which may cause any method to fail.
In terms of application, it really depends. For some proteins, we can produce really good models even by our contact-assisted folding. For example, a few weeks ago in the blind CAMEO test, our contact-based web server predicted 3D models for two membrane proteins of >200 residues with RMSD close to 2 Angstrom. Many proteins can also be well modeled by template-based methods. However, there are still many proteins for which we cannot produce high-resolution models. Nevertheless, for protein-protein interaction prediction, we may not need a very high resolution.
@@ -52,7 +52,104 @@ particularly notable in this area?*

### Protein secondary and tertiary structure
Is this placeholder sub-section title okay?
sections/04_study.md
unsupervised learning followed by fine-tuning of the entire network.
Elofsson group developed an iterative deep learning technique for contact
prediction, in which Random Forests are applied to predict contacts at each
iteration [@doi:10.1371/journal.pcbi.1003889]. However, blindly tested in
This could be intriguing. Where does the deep learning come in if they use RF to predict contacts?
They first trained an RF to make predictions, then fed the output of the first RF (together with the original input features) to a second RF, and then fed the output of the second RF (and the original input) to a third RF. They repeated this 4-5 times. This is not the typical deep learning method we talk about, but the authors considered their method to be deep learning.
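The iteration described in this comment can be sketched generically. This is an illustration only, not the authors' implementation: the names `train_stacked` and `predict_stacked` and the `trainer` callable are hypothetical, and the learner is left pluggable instead of being a Random Forest. One simplification is labelled in the code: the first stage receives a constant placeholder in place of a previous prediction so that every stage sees the same feature width.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]          # one feature row -> score
Trainer = Callable[[List[List[float]], List[int]], Model]


def train_stacked(X: List[List[float]], y: List[int],
                  trainer: Trainer, n_stages: int = 4) -> List[Model]:
    """Train a chain of models; each stage sees the original features
    plus the previous stage's prediction for the same example."""
    models: List[Model] = []
    previous = [0.0] * len(X)  # assumption: placeholder before stage 1
    for _ in range(n_stages):
        augmented = [list(row) + [p] for row, p in zip(X, previous)]
        model = trainer(augmented, y)
        models.append(model)
        # In-sample predictions are reused here for brevity; a careful
        # setup would use held-out predictions to limit overfitting.
        previous = [model(row) for row in augmented]
    return models


def predict_stacked(models: List[Model], row: Sequence[float]) -> float:
    """Run one example through every stage in order."""
    score = 0.0
    for model in models:
        score = model(list(row) + [score])
    return score
```

With scikit-learn, `trainer` could wrap `RandomForestClassifier.fit` and return a function built on `predict_proba`; that wiring is omitted to keep the sketch dependency-free.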
Please read my comments and your feedback is really appreciated.
@j3xugit Thanks for responding to all of my comments. If you can please incorporate your comments from the discussion above into the text and fix some of the small formatting things (e.g. doi), I'll merge this.
Took a stab at reviewing this PR. In general, I think there is a lot of really great info here! I also think a lot of it can be trimmed while preserving its flavor. I also made a couple comments about our general message and strategy for these sub-sections.
@agitter it looks like you were about ready to merge this so I don't want my comments to stall progress!
@@ -52,7 +52,113 @@ particularly notable in this area?*

### Protein secondary and tertiary structure

*Jinbo Xu is writing this*
Proteins play fundamental roles in all biological processes including the
I agree with the purpose of this section's introductory paragraph being to first discuss what is being predicted, but I think we could be much less verbose. I think all we need here is:
- One sentence, like you have already, describing what a protein does.
- Quickly introduce the problem: There are millions of protein sequences, many of which are redundant and few structures are solved.
- Computational methods as one solution: Over several decades, these algorithms have predicted different aspects of secondary and tertiary structure such as...(as they are listed below) (is there a separate review paper we could cite that discusses protein structure algorithms specifically?)
LSTM(long short-term memory), deep convolutional neural networks (DCNN)
and deep convolutional neural fields[@doi:10.1007/978-3-319-46227-1_1
@doi:10.1038/srep18962]. Here we focus on deep learning methods for
two representative subproblems: secondary structure prediction and
Can you provide any intuition on why we want to focus on these subproblems? For example, is it where deep learning has had the biggest advantage over competitors or greatest success in an absolute sense?
- we have something to say instead of just enumerating methods
Completely agree!
I think adding something similar to what @agitter outlined ("Starting with 2012..." also similar to @j3xugit's comment above) and being concise about how deep learning has indeed remarkably improved performance over time (and citing papers with their architectures) may be preferable to enumerating methods in as great detail.
I think the next two or three paragraphs could probably be combined if we adopt this strategy. As is, I think this section may be a bit long if we want to squeeze in other subsections within Study.
- deep learning is contributing something unique, so this sounds great to me.
I agree that deep learning is improving performance, but, to play devil's advocate, who cares?
Perhaps it would be more pertinent and in line with what has been written in the section about morphological phenotypes and in the rough outline of the study section if we follow with a discussion about three forward-looking points:
- what knowledge of protein structure leads to
- how biomedical applications may be impacted by this knowledge
- if deep learning approaches can get us to that point in the future
This could be added as a final short paragraph of the subsection if this is what we feel the main message for the review is.
This being said, I do not want to stall up progress for any reason! I understand that it is often much easier to work with a full manuscript once the tone/message is clarified so if the merge was close, I'd say go for it. (right now I'm trying to get a feel of how progress through PRs is being made!)
bd3cb76 to 9178a88
@gwaygenomics You had left some comments here when we were last working on this topic. We should revisit some of these ideas during editing, but I think we're ready to merge this as a first draft. Do you agree?
Agree
We should trigger a Travis CI build before merging to check the references.
The integration test failed. I'll have to see why before merging. @dhimmel I can probably figure out the problem myself, but I'm tagging you in case it is immediately obvious to you from the build logs.
From a quick poke at https://travis-ci.org/greenelab/deep-review/builds/219654757 - is there an [@] or something that neglects the required colon?
I think it's this one:
+[@10.1093/bioinformatics/bts598]
looks like line 137
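A quick local check could catch this kind of tag before pushing. The sketch below is not the project's actual build script; the function name and the set of allowed prefixes (`doi`, `url`, `arxiv`) are assumptions inferred from the tags and commit messages in this thread. It flags any `@` tag inside square brackets that lacks a `prefix:` or has whitespace after the colon, and it handles citations that wrap across lines.

```python
import re
from typing import List, Tuple

# Assumption: prefixes seen in this thread; the real build may accept others.
ALLOWED_PREFIXES = ("doi", "url", "arxiv")

BRACKET = re.compile(r"\[(@[^\]]*)\]")  # [^\]] also matches newlines
VALID_TAG = re.compile(r"@(?:%s):\S+" % "|".join(ALLOWED_PREFIXES))


def find_bad_citations(text: str) -> List[Tuple[int, str]]:
    """Return (line number, tag) for every malformed citation tag."""
    problems = []
    for match in BRACKET.finditer(text):
        line_number = text.count("\n", 0, match.start()) + 1
        # A bracket may hold several tags separated by whitespace.
        for tag in re.split(r"\s+(?=@)", match.group(1).strip()):
            if not VALID_TAG.fullmatch(tag):
                problems.append((line_number, tag))
    return problems
```

Run against the draft, a check like this would report both the tag with the missing `doi:` prefix and the earlier tag with a space after `doi:`.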
Thanks @cgreene and @gwaygenomics. I was able to fix the references and the CI passes. I'll merge now.
This build is based on b3c72b6. This commit was created by the following Travis CI build and job: https://travis-ci.org/greenelab/deep-review/builds/219705004 https://travis-ci.org/greenelab/deep-review/jobs/219705005 [ci skip] The full commit message that triggered this build is copied below: the first draft for protein structure prediction (#191) * Update 04_study.md * Update 04_study.md * Update 04_study.md * Update 04_study.md * Update 04_study.md Now each line has <80 chars (including space) * Update 04_study.md * Update 04_study.md * Update 04_study.md * Update 04_study.md * Update 04_study.md * Update 04_study.md * Line wrap to trigger CI build * Fix doi tag * Fix arxiv reference
Yeah, the error message is not good. Will look into making it better.
Any comments are appreciated.