
the first draft for protein structure prediction #191

Merged
merged 14 commits into greenelab:master on Apr 7, 2017

Conversation

j3xugit
Contributor

@j3xugit j3xugit commented Jan 8, 2017

Any comments are appreciated.

@agitter agitter added the study label Jan 8, 2017
@agitter agitter mentioned this pull request Jan 8, 2017
@cgreene
Member

cgreene commented Jan 18, 2017

Hi @j3xugit - I have a quick request similar to #200. Can you reformat to 80 chars/line? GitHub only allows commenting on lines, so I can only comment at the paragraph level with the way things are set up here. Auto-formatting may make this change very quick. Thanks!

@j3xugit
Contributor Author

j3xugit commented Jan 19, 2017

Does GitHub provide any option for me to reformat the writing to 80 chars/line?
Sorry that I am new to GitHub.

@XieConnect
Contributor

@j3xugit I don't think GitHub helps you do that. You may need to break the lines with your local text editor and re-do the pull request.
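For example, a minimal sketch of one way to do the re-wrapping locally with Python's standard `textwrap` module (the file name is assumed from the commit titles below and the exact path in the repository may differ; this naive approach would not preserve Markdown headings or lists):

```python
import textwrap

def wrap_paragraphs(path, width=80):
    """Naively re-wrap blank-line-separated paragraphs to a fixed width."""
    with open(path) as handle:
        paragraphs = handle.read().split("\n\n")
    wrapped = [textwrap.fill(" ".join(p.split()), width=width)
               for p in paragraphs]
    with open(path, "w") as handle:
        handle.write("\n\n".join(wrapped) + "\n")

wrap_paragraphs("04_study.md")  # file name assumed from the commit titles below
```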

Now each line has <80 chars (including space)
Collaborator

@agitter agitter left a comment

Thanks for your contribution, @j3xugit. I think this is great overall. I have some specific comments, but we should be ready to merge soon. One overall thought is that we may need to define more of the terminology for novices.

I didn't review the technical content since I haven't read many of these papers.

maintenance of cellular integrity, metabolism, transcription/translation, and
cell-cell communication. Complete description of protein structures and
functions is a fundamental step towards understanding biological life and
also highly relevant in the development of therapeutics and drugs. Tons of
Collaborator

Not sure how to tie it in, but there is a potential link to our drug discovery section. No specific change requested.

cell-cell communication. Complete description of protein structures and
functions is a fundamental step towards understanding biological life and
also highly relevant in the development of therapeutics and drugs. Tons of
protein sequences have been generated, but fewer than 100,000 of them
Collaborator

Is there a reasonable estimate on the number of protein sequences we could provide? Just something in the right ballpark from an appropriate database?

Contributor Author

Yep, UniProt has about 94 million protein sequences. Even after removing redundancy at 50% sequence identity, UniProt still has about 20 million protein sequences.

Collaborator

Perfect, please add either of these values before we merge.

have experimentally-solved structures. As a result, computational structure
prediction is essential for a majority number of protein sequences. However,
predicting protein 3D structures from sequence alone is very challenging,
especially when similar templates are not available. In the past decades,
Collaborator

Do you think we need to define or explain templates?

Contributor Author

I will revise it to "especially when similar solved structures (called templates) are not available in the Protein Data Bank (PDB). "

various computational methods have been developed to predict protein
structure from different aspects, including prediction of secondary structure,
torsion angles, solvent accessibility, inter-residue contact map, disorder
regions and side-chain packing.
Collaborator

Perhaps cite a general structure prediction review here?

Contributor Author

Not sure if there is a recent review paper covering all these aspects.

Collaborator

Good point. I spent a few minutes looking and didn't find anything with the right scope. Let's not delay the merge looking for a review to cite, but we could add a TODO comment that it may be helpful to add one during revisions.


Machine learning is extensively applied to predict protein structures and
some success has been achieved. For example, secondary structure can be
predicted with about 80% of Q3 accuracy by a 2-layer neural network
Collaborator

Need to explain or define Q3 accuracy or preview that it will be explained below. This term will be unfamiliar to most readers on first encounter.

Contributor Author

@j3xugit j3xugit Jan 27, 2017

We can replace "Q3" by "3-state (i.e., Q3)".
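For context, Q3 (3-state) accuracy is simply the fraction of residues whose predicted secondary structure label, helix (H), strand (E), or coil (C), matches the observed label. A minimal sketch with toy labels:

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose 3-state label (H, E, or C) is correct."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

# Toy example: an 8-residue protein (H = helix, E = strand, C = coil)
print(q3_accuracy(predicted="HHHHCCEE", observed="HHHCCCEE"))  # 0.875
```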

method to simultaneously predict several local structure properties
including secondary structures [@doi:10.1371/journal.pone.0032235].
Cheng group predicted secondary structure using deep belief networks
[@doi: 10.1109/TCBB.2014.2343960]. Zhou developed an iterative deep
Collaborator

Remove space after doi:

Elofsson group developed an iterative deep learning technique for contact
prediction, in which Random Forests are applied to predict contacts at each
iteration [@doi:10.1371/journal.pcbi.1003889]. However, blindly tested in
the well-known CASP competitions, these methods did not show any
Collaborator

Add a CASP reference?

Contributor Author

will do

CASP) [@url:http://www.cameo3d.org/] that the predicted contacts can
fold quite a few proteins with a novel fold and only 65-330 sequence
homologs. Xu’s method also works well on membrane protein contact
prediction even if trained mostly by non-membrane proteins.
Collaborator

Are there a few sentences to add about where the field may go from here? I wouldn't think that we consider protein structure prediction to be "solved" now. Suggested ideas:

  • Without broadcasting your own group's plans, are there other ways deep learning can still further improve upon the state-of-the-art?
  • Are data, algorithms, or something else the bottleneck?
  • Are we close to or still very far from the level of accuracy needed to use predicted protein structures successfully in downstream applications (e.g. predicting protein-protein or protein-compound interactions).

Contributor Author

I guess it is possible to improve further. For example, we may try different network architectures. Data is the bottleneck. Some proteins just do not have any sequence homologs, which may cause any method to fail.

In terms of application, it really depends. For some proteins, we can produce really good models even by our contact-assisted folding. For example, a few weeks ago in the blind CAMEO test, our contact-based web server predicted 3D models for two membrane proteins of >200 residues with RMSD close to 2 Angstrom. Many proteins can also be well modeled by template-based methods. However, there are still many proteins for which we cannot produce high-resolution models. Nevertheless, for protein-protein interaction prediction, we may not need a very high resolution.

@@ -52,7 +52,104 @@ particularly notable in this area?*

### Protein secondary and tertiary structure
Collaborator

Is this placeholder sub-section title okay?

unsupervised learning followed by fine-tuning of the entire network.
Elofsson group developed an iterative deep learning technique for contact
prediction, in which Random Forests are applied to predict contacts at each
iteration [@doi:10.1371/journal.pcbi.1003889]. However, blindly tested in
Collaborator

This could be intriguing. Where does the deep learning come in if they use RF to predict contacts?

Contributor Author

They trained an RF to do prediction first, then fed the output of the first RF (along with the original input features) to the 2nd RF, and then fed the output of the 2nd RF (along with the original input) to the 3rd RF. They repeated this 4-5 times. This is not the typical deep learning method we talk about, but the authors considered their method to be deep learning.
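As an illustration of that kind of cascade, here is a rough sketch with scikit-learn random forests, where each stage sees the original features concatenated with the previous stage's predicted contact probabilities (this is not the authors' actual implementation or feature set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_cascade(X, y, n_stages=4):
    """Stage k is trained on the original features plus the contact
    probabilities predicted by stage k-1 (stage 0 sees only X)."""
    models, features = [], X
    for _ in range(n_stages):
        rf = RandomForestClassifier(n_estimators=100).fit(features, y)
        models.append(rf)
        prob = rf.predict_proba(features)[:, 1:2]   # P(contact) as a column
        features = np.hstack([X, prob])             # augment the original features
    return models

def predict_cascade(models, X):
    features = X
    for rf in models:
        prob = rf.predict_proba(features)[:, 1:2]
        features = np.hstack([X, prob])
    return prob.ravel()

# Toy usage: random numbers standing in for residue-pair features and labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
y_demo = rng.integers(0, 2, size=200)
cascade = train_cascade(X_demo, y_demo)
contact_prob = predict_cascade(cascade, X_demo)
```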

@j3xugit
Contributor Author

j3xugit commented Jan 27, 2017

Please read my comments; your feedback is really appreciated.

@agitter
Collaborator

agitter commented Jan 28, 2017

@j3xugit Thanks for responding to all of my comments. If you can please incorporate your comments from the discussion above into the text and fix some of the small formatting things (e.g. doi), I'll merge this.

Contributor

@gwaybio gwaybio left a comment

Took a stab at reviewing this PR. In general, I think there is a lot of really great info here! I also think a lot of it can be trimmed while preserving its flavor. I also made a couple comments about our general message and strategy for these sub-sections.

@agitter it looks like you were about ready to merge this so I don't want my comments to stall progress!

@@ -52,7 +52,113 @@ particularly notable in this area?*

### Protein secondary and tertiary structure

*Jinbo Xu is writing this*
Proteins play fundamental roles in all biological processes including the
Contributor

I agree with the purpose of this section's introductory paragraph being to first discuss what is being predicted, but I think we could be much less verbose. I think all we need here is:

  1. One sentence, like you have already, describing what a protein does.
  2. Quickly introduce the problem: There are millions of protein sequences, many of which are redundant and few structures are solved.
  3. Computational methods as one solution: Over several decades, these algorithms have predicted different aspects of secondary and tertiary structure such as...(as they are listed below) (is there a separate review paper we could cite that discusses protein structure algorithms specifically?)

LSTM(long short-term memory), deep convolutional neural networks (DCNN)
and deep convolutional neural fields[@doi:10.1007/978-3-319-46227-1_1
@doi:10.1038/srep18962]. Here we focus on deep learning methods for
two representative subproblems: secondary structure prediction and
Contributor

Can you provide any intuition on why we want to focus on these subproblems? For example, is it where deep learning has had the biggest advantage over competitors or greatest success in an absolute sense?

> 1. we have something to say instead of just enumerating methods

Completely agree!

I think adding something similar to what @agitter outlined ("Starting with 2012..." also similar to @j3xugit's comment above) and being concise about how deep learning has indeed remarkably improved performance over time (and citing papers with their architectures) may be preferable to enumerating methods in as great detail.

I think the next two or three paragraphs could probably be combined if we adopt this strategy. As is, I think this section may be a bit long if we want to squeeze in other subsections within Study.

> 1. deep learning is contributing something unique, so this sounds great to me.

I agree that deep learning is improving performance, but, to play devil's advocate, who cares?

Perhaps it would be more pertinent and in line with what has been written in the section about morphological phenotypes and in the rough outline of the study section if we follow with a discussion about three forward-looking points:

  • what knowledge of protein structure leads to
  • how biomedical applications may be impacted by this knowledge
  • if deep learning approaches can get us to that point in the future

This could be added as a final short paragraph of the subsection if this is what we feel the main message for the review is.

LSTM(long short-term memory), deep convolutional neural networks (DCNN)
and deep convolutional neural fields[@doi:10.1007/978-3-319-46227-1_1
@doi:10.1038/srep18962]. Here we focus on deep learning methods for
two representative subproblems: secondary structure prediction and
Contributor

This being said, I do not want to stall progress for any reason! I understand that it is often much easier to work with a full manuscript once the tone/message is clarified, so if the merge was close, I'd say go for it. (Right now I'm trying to get a feel for how progress through PRs is being made!)

@dhimmel dhimmel force-pushed the master branch 6 times, most recently from bd3cb76 to 9178a88 on February 26, 2017 01:50
@agitter
Collaborator

agitter commented Apr 7, 2017

@gwaygenomics You had left some comments here when we were last working on this topic. We should revisit some of these ideas during editing, but I think we're ready to merge this as a first draft. Do you agree?

Contributor

@gwaybio gwaybio left a comment

Agree

@agitter
Collaborator

agitter commented Apr 7, 2017

We should trigger a Travis CI build before merging to check the references.

@agitter
Collaborator

agitter commented Apr 7, 2017

The integration test failed. I'll have to see why before merging. @dhimmel I can probably figure out the problem myself, but I'm tagging you in case it is immediately obvious to you from the build logs.

@cgreene
Member

cgreene commented Apr 7, 2017 via email

@cgreene
Member

cgreene commented Apr 7, 2017 via email

@gwaybio
Contributor

gwaybio commented Apr 7, 2017

looks like line 137

@agitter
Collaborator

agitter commented Apr 7, 2017

Thanks @cgreene and @gwaygenomics. I was able to fix the references and the CI passes. I'll merge now.

@agitter agitter merged commit b3c72b6 into greenelab:master Apr 7, 2017
dhimmel pushed a commit that referenced this pull request Apr 7, 2017
This build is based on
b3c72b6.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/219705004
https://travis-ci.org/greenelab/deep-review/jobs/219705005

[ci skip]

The full commit message that triggered this build is copied below:

the first draft for protein structure prediction (#191)

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

Now each line has <80 chars (including space)

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Update 04_study.md

* Line wrap to trigger CI build

* Fix doi tag

* Fix arxiv reference
@dhimmel
Collaborator

dhimmel commented Apr 7, 2017

Yeah the error message is not good. Will look into making it better
