initial draft of sequencing and variant calling #344

bdo311 · 2017-04-23T19:32:18Z

Thanks in advance for the feedback! I'll probably try to revise some of the wording to improve the flow.

agitter

Another excellent section, thank you. I have two minor comments and one open-ended thought that you can choose to ignore.

agitter · 2017-04-24T21:47:14Z

sections/04_study.md

+and other genetic diseases will require accurate calling of SNP and indels.
+
+Current methods achieve relatively high (>99%) precision at 90% recall for SNPs
+and indel calls from Illumina short-read data, yet this leaves a large number of


Do we have a reference we can use for this performance level?

bdo311 · 2017-04-25

Yep, I'll re-reference the Poplin paper, which benchmarked the previous state-of-the-art algorithm (GATK).
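For a sense of scale behind "this leaves a large number of", here is a minimal arithmetic sketch in Python. The figure of 4 million true variants per genome is an illustrative assumption, not a number from the section or from the Poplin paper, and `expected_errors` is a hypothetical helper name.

```python
def expected_errors(n_true, precision, recall):
    """Return (missed_variants, false_calls) for a caller at a given operating point.

    n_true is the number of real variants in the sample (an assumed input).
    """
    true_positives = recall * n_true
    missed = n_true - true_positives  # false negatives
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    false_calls = true_positives * (1.0 - precision) / precision
    return missed, false_calls


# Assumed illustration: 4 million true variants, 99% precision, 90% recall.
missed, false_calls = expected_errors(4_000_000, precision=0.99, recall=0.90)
print(f"missed: {missed:,.0f}  false calls: {false_calls:,.0f}")
```

Under that assumption, the operating point quoted in the diff corresponds to roughly 400,000 missed variants and about 36,000 false calls per genome.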

agitter · 2017-04-24T21:56:34Z

sections/04_study.md

+features for each candidate variant and fed these vectors into a fully connected
+deep neural network [@tag:Torracinta2016_deep_snp]. Unfortunately, this feature
+set required at least 15 iterations of software development to fine-tune, which
+will likely not be generalizable. Going forward, we foresee that variant calling


Great conclusions, I'm always happy to see us take a stance in the review. To challenge you slightly, what do you think about the tension between natural encodings of the data where domain-specific labeled data may be limited versus conceptually suboptimal encodings (e.g. DNA -> images) where labeled data are rich or pre-trained models are available? Is this image-to-variant calling strategy likely to hold an advantage only in the short term until better training datasets or simulators become available? Should it be more widely adopted elsewhere in genomics?

bdo311 · 2017-04-25

I think the issue is that we don't really know the best representation for the input data. With sequence data, we could either encode it as an RGB image and send it through a CNN already pretrained on ImageNet, or encode it as a one-hot tensor and send it through a CNN that we train and optimize explicitly for genomic data. Knowing which one is better will require more work.

What I was trying to get across in the section was that, in the long run, inputting raw sequence (in its optimal representation) could probably match or beat inputting sequence plus hand-derived features (in their optimal representation) in terms of time spent optimizing.
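To make the two encodings in this thread concrete, here is a minimal Python sketch. It is a toy illustration and not the encoding used by any of the tools cited in the section: the color map `BASE_COLORS` and the helper names `one_hot` and `reads_to_image` are assumptions made up for this example, and a real image-based pipeline would also have to encode things like base quality and strand.

```python
import numpy as np

BASES = "ACGT"
BASE_INDEX = {base: i for i, base in enumerate(BASES)}

# Hypothetical color per base; any fixed, distinct mapping would do.
BASE_COLORS = {
    "A": (255, 0, 0),
    "C": (0, 255, 0),
    "G": (0, 0, 255),
    "T": (255, 255, 0),
}


def one_hot(seq):
    """Encode a sequence as a (len(seq), 4) float array.

    Bases outside ACGT (e.g. N) are left as all-zero rows.
    """
    encoded = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in BASE_INDEX:
            encoded[pos, BASE_INDEX[base]] = 1.0
    return encoded


def reads_to_image(reads):
    """Stack equal-length reads into an (n_reads, read_len, 3) uint8 image.

    Unknown bases are drawn as black pixels.
    """
    image = np.zeros((len(reads), len(reads[0]), 3), dtype=np.uint8)
    for row, read in enumerate(reads):
        for col, base in enumerate(read.upper()):
            image[row, col] = BASE_COLORS.get(base, (0, 0, 0))
    return image


tensor = one_hot("ACGTN")                       # shape (5, 4)
pileup = reads_to_image(["ACGT", "ACTT", "ACGT"])  # shape (3, 4, 3)
```

The one-hot tensor would feed a network trained from scratch on genomic data, while the image array has the input layout that an ImageNet-pretrained CNN expects (after resizing). Comparing the two fairly would mean holding everything else fixed.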

agitter · 2017-04-25

The point about raw sequence versus hand-derived features came across clearly. If we think comparing image-based versus natural encodings will require more work to be done, I'll merge this as-is.

bdo311 · 2017-04-25

I can try to clarify the point about image-based vs natural encodings in a sentence -- just to say that we don't know which will be best, and that more research is needed to test both strategies under otherwise identical conditions.

agitter · 2017-04-25

Okay. You only need to add that if you really want to. I merged already so you would need to open another small pull request.

agitter · 2017-04-24T21:57:38Z

sections/04_study.md

+Illumina data, for instance, will likely not be applicable to PacBio long-read
+data or MinION nanopore data, which have vastly different specificity and
+sensitivity profiles and signal-to-noise characteristics. Recently, Boza et al.
+used bidirectional recurrent neural networks infer the E. coli sequence from


E. coli in italics

This build is based on 053df86.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/225561181
https://travis-ci.org/greenelab/deep-review/jobs/225561182

[ci skip]

The full commit message that triggered this build is copied below:

initial draft of sequencing and variant calling (#344)

* initial draft of seq/variants
* minor changes

initial draft of seq/variants

f17e13f

agitter mentioned this pull request Apr 24, 2017

Current Section Status #188

Closed

agitter self-requested a review April 24, 2017 21:41

agitter requested changes Apr 24, 2017

minor changes

e3d8877

agitter approved these changes Apr 25, 2017

agitter merged commit 053df86 into greenelab:master Apr 25, 2017

bdo311 mentioned this pull request Apr 25, 2017

added sentence about comparing image vs tensor #358

Merged

alxndrkalinin mentioned this pull request Apr 26, 2017

Discussion: transfer learning #347

Merged
