DNA Methylation Deep Review Section # 2 of 3 - Inference, Imputation, and Prediction #954

Merged: 10 commits, Jul 22, 2019

@jlevy44 (Contributor) commented May 9, 2019

Hi there,

Following discussions from #942 and the closing of #947, our team has finished its internal edits and is ready to open PRs. Our plan is for each author to submit their section in the space reserved for DNA methylation; we can then move sections around and merge or stitch together content from our three PRs.

I will be PR'ing the second of three deep review sections. It's focused on inference, imputation, and prediction on methylation using deep learning.

Here is the order of the PR sections that our group will be submitting:

  1. Introductions - @Christensen-Lab @Christensen-Lab-Dartmouth
  2. Inference, Imputation, and Prediction - @jlevy44
  3. Latent Space Construction and Conclusions - @AlexanderTitus

Thanks, and looking forward to the review.

Planning on adding two more sections that expand on the points of the last paragraph. Will need help editing these points and making text more concise, to leave room for remaining two paragraphs. Also looking to adjust some text from the previous gene expression paragraphs and text surrounding latent space prediction.
Pulling recent changes from greenelab

@jlevy44 (Contributor Author) commented May 9, 2019

Just need to tab delimit those citations. I think this was an IDE error (using Atom to edit).

@cgreene (Member) left a comment

This is looking really nice! I just had a couple places where I wanted either some clarification, a bit of synthesis, or a little bit more information. I am happy to take another look quickly once these changes are made.

#### Inference, Imputation, and Prediction

Deep learning approaches are beginning to help address some of the current limitations of feature-by-feature analysis approaches to DNA methylation data, and may help uncover additional important features necessary to understand the biological underpinnings behind different pathological states.
One of the more popular applications is the prediction of the degree of methylation at CpG sites neighboring measured sites.

@cgreene (Member):

Does neighboring mean immediately adjacent or in the region? How big are the windows (in rough terms)?

@jlevy44 (Contributor Author):

It depends. Methods like DeepCpG predict methylation at sites within local windows, integrating sequence features at distances that can be well over 1 kb from the site. Methods like DAPL could potentially integrate longer-range information to impute missing values; because DAPL is a fully connected denoising autoencoder, it has no window-size threshold like DeepCpG does. I think it is reasonable to assume that nearby sites are more useful.

The short answer is within a region of a few kb using methods like DeepCpG.
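
To make the "region of a few kb" idea concrete, here is a minimal sketch of collecting the measured CpG sites that fall inside a window around a target site. It is not DeepCpG's actual preprocessing; the function name `neighbors_in_window`, the 2000 bp default, and the toy coordinates are all assumptions chosen for illustration.

```python
from bisect import bisect_left, bisect_right


def neighbors_in_window(positions, states, target, window_bp=2000):
    """Return (offset, state) pairs for measured CpGs within window_bp of target.

    positions: sorted genomic coordinates of measured CpG sites (hypothetical data)
    states:    methylation values aligned with positions
    window_bp: half-width of the window in base pairs (assumed default)
    """
    lo = bisect_left(positions, target - window_bp)
    hi = bisect_right(positions, target + window_bp)
    return [
        (positions[i] - target, states[i])
        for i in range(lo, hi)
        if positions[i] != target  # the target itself is what we want to predict
    ]


# Toy example: six measured sites on one chromosome.
positions = [100, 900, 1500, 2600, 5200, 9000]
states = [1, 1, 0, 1, 0, 0]
print(neighbors_in_window(positions, states, target=2600))
# → [(-1700, 1), (-1100, 0)]
```

A windowed model only ever sees the pairs returned here, whereas a fully connected autoencoder over all sites has no such cutoff, which is the contrast drawn in the reply above.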

DeepSignal employs a convolutional neural network to construct features from raw electrical Nanopore signals from sites near a methylated base, and combines these with a bidirectional recurrent neural network applied to the DNA sequences of the aligned signals to detect methylation [@tag:Ni2018].
DeepCpG applies a similar approach to scBS-Seq data, using DNA sequence and bidirectional GRUs [@tag:Angermueller2017], and methods like DAPL and DeepMethyl incorporate sequence and topological structure [@tag:Qiu2018] [@tag:Khwaja2017] [@tag:Wang2016_methyl] [@tag:Fu2019].
In addition, gene expression has been used to infer and impute methylation states [@tag:Peng2019] [@tag:Levy-Jurgenson2018], methylation of genes has been predicted from promoter methylation [@tag:Pan2018], and convolutional models have been able to predict methylation status from images [@tag:Momeni2018] [@tag:Korfiatis2017].
While these examples of methylation imputation and inference methods have value, it is imperative to recognize the limitations of imputing cytosine modifications.

@cgreene (Member):

What is the current state of the art performance for imputation, and is it sufficient for downstream analyses (in your view) or is getting to "useful for many downstream analyses" still a work in progress?

@jlevy44 (Contributor Author):

I normally just use MICE, K-NN, and even mean imputation, and have not personally tried deep learning imputation approaches, though I am open to developing and implementing new methodologies. I think many of these methods are geared towards BS-Seq, which can make them harder to adopt for users of 450K and EPIC arrays. It's conceivable that some of these methods could speed up the analysis, and incorporating other modalities may make them more accurate, but obtaining such data could still be a challenge. Making them useful, easy to use, and tractable may also still be a challenge, but standardized and modular workflows that incorporate these methods may make them more easily adoptable and mainstream.
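
For readers unfamiliar with the baselines mentioned above, the following is a minimal, dependency-free sketch of mean and K-NN imputation on a small samples-by-CpGs matrix of beta values. It is illustrative only: the function names, the use of `None` for missing values, and the toy matrix are assumptions, MICE is not shown, and real analyses would normally rely on an established library implementation.

```python
import math


def mean_impute(beta):
    """Replace missing values (None) with the per-CpG (column) mean."""
    out = [row[:] for row in beta]
    for j in range(len(beta[0])):
        observed = [row[j] for row in beta if row[j] is not None]
        col_mean = sum(observed) / len(observed)  # assumes >= 1 observed value per CpG
        for row in out:
            if row[j] is None:
                row[j] = col_mean
    return out


def knn_impute(beta, k=2):
    """Fill each missing value with the mean of that CpG in the k most similar samples."""

    def distance(a, b):
        # Root-mean-square difference over CpGs observed in both samples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    out = [row[:] for row in beta]
    for i, row in enumerate(beta):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples where this CpG was measured
            # (assumes at least one such sample exists).
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(beta)
                if m != i and other[j] is not None
            )[:k]
            out[i][j] = sum(v for _, v in donors) / len(donors)
    return out


# Toy matrix: 4 samples (rows) x 3 CpGs (columns), hypothetical beta values.
beta = [
    [0.10, 0.80, None],
    [0.12, 0.78, 0.40],
    [0.90, 0.20, 0.95],
    [0.11, None, 0.42],
]
print(mean_impute(beta)[0][2])
print(knn_impute(beta, k=2)[0][2])
```

In this toy case the K-NN estimate borrows from the two samples that resemble the incomplete one, while the mean is pulled toward the dissimilar third sample; that difference is the usual motivation for preferring neighbor-based or model-based imputation over a column mean.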

@cgreene (Member):

Can you add maybe one or two sentences at the very end of this paragraph around how these methods compare to what's used in practice and whether or not they are at the stage yet where they can replace current methods? From my read of what you wrote, the answer is no because there are still some bespoke processes to get them working on new data (which is not true of other methods). However, you can see a path to get there. Is that right?

content/04.study.md
Once DNA methylation is measured, deep learning approaches can also be used to perform classification and regression tasks.
For instance, one group employed a deep neural network (DNN) to predict triglyceride concentrations pre- and post-treatment from approximately 450K features (differential DNAm levels) from the Illumina 450K microarray, and used the dropout technique to help the model generalize [@tag:Islam2018] [@tag:Darst2018].
Another study transformed methylation profiles of about ten thousand TCGA samples to differentiate 32 cancer types, using the concatenated maps of several convolutional neural networks to learn the patterns of differentially methylated regions that drove the classifications [@tag:Chatterjee2018].
Finally, a deep autoencoder was proposed for predicting cancer subtypes from DNAm. The system exploited content retrieval mechanisms to further characterize the cell type differentiation of the predicted cancer types, based on methylation of CpG islands [@tag:Khwaja2018].
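
As a brief aside on the dropout technique mentioned in the paragraph above, here is a minimal sketch of inverted dropout on a list of activations. It is a generic illustration rather than the cited study's implementation; the function name, the default rate, and the `rng` parameter are assumptions.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=random):
    """Inverted dropout: while training, zero each unit with probability `rate`
    and rescale the survivors by 1 / (1 - rate) so the expected value is unchanged.
    At inference time the activations pass through untouched."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# With roughly 450K input features and few samples, randomly silencing units
# discourages the network from memorising individual CpGs.
hidden = [0.3, 1.2, 0.0, 0.7]
print(dropout(hidden, rate=0.5, rng=random.Random(0)))
print(dropout(hidden, training=False))
```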

@cgreene (Member):

Can you synthesize this a little bit? How did performance relate to other methods? Was there anything unique/particularly interesting about what was found?

@jlevy44 (Contributor Author):

Sure, I'll add a new commit soon. Thanks for all of these edits and questions @cgreene . I'd also like to find a place to add https://www.biorxiv.org/content/10.1101/692665v1 , though it may be more appropriate in the embedding section (especially with mention of hyperparameter optimization). I'll add it here for now and see if we can move it around soon.

@jlevy44 (Contributor Author):

Let me know if you'd like me to incorporate the above discussions into the text.

@jlevy44 (Contributor Author):

Ok I committed some more text, take a look!

I think the final paragraph could still use some work, though I synthesized some sections.

jlevy44 and others added 3 commits July 19, 2019 22:30
Co-Authored-By: Casey Greene <cgreene@users.noreply.github.com>
Co-Authored-By: Casey Greene <cgreene@users.noreply.github.com>

@cgreene (Member) left a comment

Just a couple more things. Looking good!

#### Inference, Imputation, and Prediction

Deep learning approaches are beginning to help address some of the current limitations of feature-by-feature analysis approaches to DNA methylation data, and may help uncover additional important features necessary to understand the biological underpinnings behind different pathological states.
One of the more popular applications is the prediction of the degree of methylation at CpG sites neighboring measured sites.

@cgreene (Member):

Suggested change:
- One of the more popular applications is the prediction of the degree of methylation at CpG sites neighboring measured sites.
+ One of the more popular applications is imputing the degree of methylation at CpG sites that are within a few thousand base pairs of measured sites.

Is this appropriate? I think it would be helpful to include some idea of what range the methods are trying to predict. Feel free to re-word.

@jlevy44 (Contributor Author):

I think this may be appropriate enough. For the autoencoder (AE) methods, it would be appropriate to add that the imputation draws either on sites within a few thousand sites (one sample) or on similar samples.

@cgreene (Member) commented Jul 20, 2019

Also looks like some refs have spaces instead of tabs:

content/citation-tags.tsv contains rows with missing values:
                                         tag citation
140             Levy2019  doi:10.1101/692665      NaN
262  Tian2019  doi:10.1186/s12864-019-5488-5      NaN
This error can be caused by using spaces rather than tabs to delimit fields.
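
A quick way to catch this before pushing is to scan the file for rows that do not split into exactly two non-empty tab-separated fields. The sketch below assumes the two-column `tag`/`citation` layout shown in the error above; the function name `find_bad_rows` is hypothetical and is not part of the repository's build tooling.

```python
def find_bad_rows(lines, n_fields=2):
    """Return (line_number, text) for rows that do not split into n_fields on tabs."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text:
            continue  # ignore blank lines
        fields = text.split("\t")
        if len(fields) != n_fields or any(not f.strip() for f in fields):
            bad.append((number, text))
    return bad


# One row delimited with spaces (as in the error above) and one with a tab.
sample = [
    "tag\tcitation\n",
    "Levy2019  doi:10.1101/692665\n",
    "Tian2019\tdoi:10.1186/s12864-019-5488-5\n",
]
print(find_bad_rows(sample))
# → [(2, 'Levy2019  doi:10.1101/692665')]
```

To check the real file, the same function can be passed an open file handle, for example `find_bad_rows(open("content/citation-tags.tsv", encoding="utf-8"))`.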

@jlevy44 (Contributor Author) commented Jul 22, 2019

Okay @cgreene I've added more edits. Hopefully this does the trick, but let me know if you have more questions. Thank you for the feedback.

@cgreene merged commit 9f19543 into greenelab:master Jul 22, 2019
@cgreene mentioned this pull request Jul 22, 2019
3 tasks