initial draft of sequencing and variant calling #344
Another excellent section, thank you. I have two minor comments and one open-ended thought that you can choose to ignore.
sections/04_study.md
and other genetic diseases will require accurate calling of SNP and indels.

Current methods achieve relatively high (>99%) precision at 90% recall for SNPs
and indel calls from Illumina short-read data, yet this leaves a large number of
Do we have a reference we can use for this performance level?
Yep, I'll re-reference the Poplin paper, which benchmarked the previous state-of-the-art algorithms (GATK).
sections/04_study.md
features for each candidate variant and fed these vectors into a fully connected
deep neural network [@tag:Torracinta2016_deep_snp]. Unfortunately, this feature
set required at least 15 iterations of software development to fine-tune, which
will likely not be generalizable. Going forward, we foresee that variant calling
Great conclusions, I'm always happy to see us take a stance in the review. To challenge you slightly, what do you think about the tension between natural encodings of the data where domain-specific labeled data may be limited versus conceptually suboptimal encodings (e.g. DNA -> images) where labeled data are rich or pre-trained models are available? Is this image-to-variant calling strategy likely to hold an advantage only in the short term until better training datasets or simulators become available? Should it be more widely adopted elsewhere in genomics?
I think it's an issue of not really knowing the best representation with which to input your data. With sequence data, we could either encode it as an RGB image and send it through a CNN already pretrained on ImageNet, or encode it as a one-hot tensor and send it through a CNN that we train and optimize explicitly for genomic data. Knowing which one is better will require more work.
I think what I was trying to get across in the section was that, in the long run, inputting raw sequence (in its optimal representation) will probably be equal or superior to inputting sequence + hand-derived features (in its optimal representation) in terms of time spent optimizing.
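To make the two options concrete, here is a minimal sketch of both encodings for a short DNA string. The function names and the base-to-color mapping are hypothetical choices made for illustration; they are not the scheme used by any published variant caller, and a real pipeline would encode read pileups rather than a single sequence.

```python
# Illustrative only: two ways to turn a DNA string into a CNN-style input.
BASES = "ACGT"

# Hypothetical color per base; this mapping is arbitrary and assumed for the example.
BASE_TO_RGB = {
    "A": (255, 0, 0),
    "C": (0, 255, 0),
    "G": (0, 0, 255),
    "T": (255, 255, 0),
}


def one_hot_encode(seq):
    """Return a len(seq) x 4 matrix; unknown bases such as N become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def rgb_encode(seq):
    """Return a 1 x len(seq) 'image' of (R, G, B) pixels; unknown bases are black."""
    return [[BASE_TO_RGB.get(base, (0, 0, 0)) for base in seq.upper()]]


if __name__ == "__main__":
    print(one_hot_encode("ACGN"))
    print(rgb_encode("ACGN"))
```

The one-hot matrix would feed a network trained from scratch on genomic data, while the pixel grid has the three-channel layout that an ImageNet-pretrained CNN expects. Neither sketch says anything about which input performs better.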
The point about raw sequence versus hand-derived features came across clearly. If we think comparing image-based versus natural encodings will require more work to be done, I'll merge this as-is.
I can try to clarify the point about image-based vs natural encodings in a sentence, just to say that we don't know which will be best and that more research is needed to test both strategies under otherwise identical conditions.
Okay. You only need to add that if you really want to. I merged already so you would need to open another small pull request.
sections/04_study.md
Illumina data, for instance, will likely not be applicable to PacBio long-read
data or MinION nanopore data, which have vastly different specificity and
sensitivity profiles and signal-to-noise characteristics. Recently, Boza et al.
used bidirectional recurrent neural networks infer the E. coli sequence from
E. coli in italics
This build is based on 053df86. This commit was created by the following Travis CI build and job: https://travis-ci.org/greenelab/deep-review/builds/225561181 https://travis-ci.org/greenelab/deep-review/jobs/225561182 [ci skip] The full commit message that triggered this build is copied below: initial draft of sequencing and variant calling (#344) * initial draft of seq/variants * minor changes
Thanks in advance for the feedback! I'll probably try to revise some of the wording to improve the flow.