initial draft of sequencing and variant calling #344

Merged
merged 2 commits into from
Apr 25, 2017

@bdo311 (Contributor) commented Apr 23, 2017

Thanks in advance for the feedback! I'll probably try to revise some of the wording to improve the flow.

@agitter mentioned this pull request Apr 24, 2017
@agitter self-requested a review April 24, 2017 21:41

@agitter (Collaborator) left a comment

Another excellent section, thank you. I have two minor comments and one open-ended thought that you can choose to ignore.

and other genetic diseases will require accurate calling of SNP and indels.

Current methods achieve relatively high (>99%) precision at 90% recall for SNPs
and indel calls from Illumina short-read data, yet this leaves a large number of
Collaborator

Do we have a reference we can use for this performance level?

Contributor Author

Yep, I'll re-reference the Poplin paper, which benchmarked the previous state-of-the-art algorithms (GATK).

features for each candidate variant and fed these vectors into a fully connected
deep neural network [@tag:Torracinta2016_deep_snp]. Unfortunately, this feature
set required at least 15 iterations of software development to fine-tune, which
will likely not be generalizable. Going forward, we foresee that variant calling
Collaborator

Great conclusions, I'm always happy to see us take a stance in the review. To challenge you slightly, what do you think about the tension between natural encodings of the data where domain-specific labeled data may be limited versus conceptually suboptimal encodings (e.g. DNA -> images) where labeled data are rich or pre-trained models are available? Is this image-to-variant calling strategy likely to hold an advantage only in the short term until better training datasets or simulators become available? Should it be more widely adopted elsewhere in genomics?

Contributor Author

I think the issue is that we don't really know the best representation for the input data. With sequence data, we could either encode it as an RGB image and send it through a CNN already pretrained on ImageNet, or encode it as a one-hot tensor and send it through a CNN that we train and optimize explicitly for genomic data. Knowing which one is better will require more work.

What I was trying to get across in the section is that inputting raw sequence (in its optimal representation) could, in the long run, be equal or superior to inputting sequence plus hand-derived features (in their optimal representation) in terms of time spent optimizing.
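
For reference, the one-hot option mentioned above can be sketched roughly as follows. This is only a minimal illustration: the helper name `one_hot_encode`, the A, C, G, T channel order, and the all-zero row for ambiguous bases are assumptions made for the example, not something specified in this thread or in the manuscript.

```python
# Hypothetical helper for illustration; not taken from the pull request.
BASES = "ACGT"  # assumed channel order


def one_hot_encode(sequence):
    """Return one 4-element row per base, with channels in A, C, G, T order.

    Bases outside ACGT (for example N) become an all-zero row, which is one
    convention among several.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(BASES)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


print(one_hot_encode("ACGN"))
# prints [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```

The resulting length-by-4 matrix is the kind of input a CNN trained directly on genomic data would take, in contrast to rendering the same reads as an RGB image for a pretrained network.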

Collaborator

The point about raw sequence versus hand-derived features came across clearly. If we think comparing image-based versus natural encodings will require more work to be done, I'll merge this as-is.

Contributor Author

I can try to clarify the point about image-based vs. natural encodings in a sentence: we don't know which will be best, and more research is needed to test both strategies under otherwise identical conditions.

Collaborator

Okay. You only need to add that if you really want to. I merged already so you would need to open another small pull request.

Illumina data, for instance, will likely not be applicable to PacBio long-read
data or MinION nanopore data, which have vastly different specificity and
sensitivity profiles and signal-to-noise characteristics. Recently, Boza et al.
used bidirectional recurrent neural networks infer the E. coli sequence from
Collaborator

E. coli in italics

@agitter merged commit 053df86 into greenelab:master Apr 25, 2017
dhimmel pushed a commit that referenced this pull request Apr 25, 2017
This build is based on
053df86.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/225561181
https://travis-ci.org/greenelab/deep-review/jobs/225561182

[ci skip]

The full commit message that triggered this build is copied below:

initial draft of sequencing and variant calling (#344)

* initial draft of seq/variants

* minor changes
2 participants