Proofread methylation sections #971

Merged
merged 6 commits into from
Jul 29, 2019
68 changes: 37 additions & 31 deletions content/04.study.md
@@ -47,48 +47,54 @@ Deep learning applied to gene expression data is still in its infancy, but the f
Many previously untestable hypotheses can now be interrogated as deep learning enables analysis of increasing amounts of data generated by new technologies.
For example, the effects of cellular heterogeneity on basic biology and disease etiology can now be explored by single-cell RNA-seq and high-throughput fluorescence-based imaging, techniques we discuss below that will benefit immensely from deep learning approaches.

### DNA Methylation
### DNA methylation

#### Inference, Imputation, and Prediction
#### Inference, imputation, and prediction

Deep learning approaches are beginning to help address some of the current limitations of feature-by-feature analysis approaches to DNA methylation data, and may help uncover additional important features necessary to understand the biological underpinnings behind different pathological states.
Deep learning approaches are beginning to help address some of the current limitations of feature-by-feature analysis approaches to DNA methylation data and may help uncover additional important features necessary to understand the biological underpinnings behind different pathological states.
One of the more popular applications is imputing the degree of methylation at CpG sites that are within a few thousand base pairs of measured sites or present in similar samples.
DeepSignal employs a convolutional neural network to construct features from raw electrical Nanopore signals from sites near a methylated base, and concatenates uses a bi-directional recurrent neural network on DNA sequences of the aligned signals to detect methylation [@tag:Ni2018].
DeepCpG applies a similar method using scBS-Seq, DNA sequence and Bidirectional GRUs [@tag:Angermueller2017] and methods like DAPL, MRCNN and DeepMethyl incorporate sequence and topological structure [@tag:Qiu2018] [@tag:Tian2019] [@tag:Khwaja2017] [@tag:Wang2016_methyl] [@tag:Fu2019].
In addition to this, Gene expression has been used to infer and impute methylation states [@tag:Peng2019] [@tag:Levy-Jurgenson2018], methylation of genes predicted from promoter methylation [@tag:Pan2018], and convolutional models have been able to predict methylation status from images [@tag:Momeni2018] [@tag:Korfiatis2017].
While these examples of methylation imputation and inference methods have value it is imperative to recognize limitations of imputing cytosine modifications.
Imputing DNA methylation has complexities above and beyond genotype imputation: correlation of DNA methylation marks can depend on cell types and other factors that can vary by sample.
As the number of tissue types and cell types with whole-genome bisulfite sequencing (and oxidative bisulfite sequencing) grows, the accuracy of DNA methylation imputation is expected to increase.
While these methods reduce the computational overhead at comparable performance to other popular methylation imputation methods such as K-Nearest Neighbors, Random Forest, Singular Value Decomposition and Multiple Imputation by Chained Equations, the software implementations will need to become more user-friendly to gain widespread adoption.
DeepSignal employs a CNN to construct features from raw electrical Nanopore signals from sites near a methylated base.
It uses a bidirectional RNN on DNA sequences of the aligned signals to detect methylation [@tag:Ni2018].
DeepCpG applies a similar method using scBS-Seq, DNA sequence, and a bidirectional gated recurrent network [@tag:Angermueller2017].
Methods like DAPL, MRCNN, and DeepMethyl incorporate both sequence and topological structure [@tag:Qiu2018; @tag:Tian2019; @tag:Khwaja2017; @tag:Wang2016_methyl; @tag:Fu2019].
In addition, gene expression has been used to infer and impute methylation states [@tag:Peng2019; @tag:Levy-Jurgenson2018], methylation of genes can be predicted from promoter methylation [@tag:Pan2018], and convolutional models have been able to predict methylation status from images [@tag:Momeni2018; @tag:Korfiatis2017].
While these examples of methylation imputation and inference methods have value, it is imperative to recognize limitations of imputing cytosine modifications.
Imputing DNA methylation has complexities above and beyond genotype imputation.
Correlation of DNA methylation marks can depend on cell types and other factors that vary by sample.
As the number of tissue types and cell types with whole-genome bisulfite sequencing and oxidative bisulfite sequencing grows, the accuracy of DNA methylation imputation is expected to increase.
While these methods reduce the computational overhead at comparable performance to other popular methylation imputation methods such as k-nearest neighbors, random forests, singular value decomposition, and multiple imputation by chained equations, the software implementations will need to become more user-friendly to gain widespread adoption.
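
For readers unfamiliar with the baselines named above, the following is a minimal sketch of k-nearest neighbors imputation of methylation beta values. It is an illustration only, not the implementation used in any of the cited studies: the function name `knn_impute`, the root-mean-square distance over shared CpG sites, and the toy matrix are all assumptions made for this example.

```python
import math


def knn_impute(samples, k=2):
    """Fill missing beta values (None) from the k most similar samples.

    `samples` is a list of equal-length lists, one per sample, holding
    beta values in [0, 1] per CpG site, with None marking a missing site.
    The name and interface are hypothetical, chosen for illustration.
    """

    def distance(a, b):
        # Root-mean-square difference over CpG sites measured in both samples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in samples]  # copy; the input is not modified
    for i, row in enumerate(samples):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other samples in which this CpG site was measured.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(samples)
                if m != i and other[j] is not None
            ]
            # Drop donors that share no measured site with this sample.
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                # Impute with the mean beta value of the k nearest donors.
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data: 4 samples x 4 CpG sites, one missing value in the first sample.
beta = [
    [0.90, 0.80, None, 0.10],
    [0.85, 0.75, 0.20, 0.15],
    [0.10, 0.20, 0.90, 0.80],
    [0.88, 0.82, 0.25, 0.12],
]
filled = knn_impute(beta, k=2)
```

In this toy example the missing site is filled from the two samples whose measured sites most resemble the first sample. The sketch also makes the limitation discussed above concrete: the imputed value is only as good as the assumption that similar samples share methylation patterns at the unmeasured site, which can fail when cell-type composition differs between samples.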

Once DNA methylation is measured, deep learning approaches can also be used to perform classification and regression tasks.
For instance, Deep Neural Networks (DNN) have been employed on DNA methylation data to predict triglyceride concentrations pre- and post-treatment [@tag:Islam2018] [@tag:Darst2018] and differentiate cancer subtypes [@tag:Chatterjee2018] [@tag:Khwaja2018] while outperforming other methods such as Support Vector Machine (SVM).
Modular approaches to methylation prediction, such as MethylNet, have been able to predict age, cellular proportions and cancer subtypes, outperforming SVM and Elastic Net models while remaining concordant with expected biology [@tag:Levy2019].
These approaches aim to make embedding, hyperparameter selection, regression, classification and model interpretation tasks more tractable for epigenetics researchers and machine learning scientists.
For instance, deep neural networks have been employed on DNA methylation data to predict triglyceride concentrations pre- and post-treatment [@tag:Islam2018; @tag:Darst2018] and differentiate cancer subtypes [@tag:Chatterjee2018; @tag:Khwaja2018] better than other methods such as support vector machines (SVMs).
Modular approaches to methylation prediction, such as MethylNet, have been able to predict age, cellular proportions, and cancer subtypes, outperforming SVM and elastic net models while remaining concordant with expected biology [@tag:Levy2019].
These approaches aim to make embedding, hyperparameter selection, regression, classification, and model interpretation tasks more tractable for epigenetics researchers and machine learning scientists.

#### Latent Space Construction
#### Latent space construction

Unsupervised discovery of biologically-significant features is another major area of interest for researchers using DNA methylation data.
A consistent theme of these methods is that they construct a low-dimensional space that semantically encodes biologically important features from methylation profiles.
As with other applications, these low-dimensional representations are thought to capture a set of important, unmeasured sources of biological variability in the data, and that projection into these spaces results in biologically-similar examples being close together.
As with other applications, these low-dimensional representations are thought to capture a set of important, unmeasured sources of biological variability in the data.
Projection into these spaces results in biologically-similar examples being close together.
For this reason, they are often termed latent spaces.
One method used several stacked binary restricted Boltzmann machines (forming a deep neural network) to learn a low-dimensional subspace representation of the methylation profiles of 5000 CpG sites with highest variance across 136 women breast tissue samples, 113 breast cancer samples and 23 non-cancerous samples, and samples in the latent space were clustered (via self-organizing maps) to show that the latent space could differentiate breast cancer samples from non-neoplastic samples.
Furthermore, the latent space was visualized using t-SNE (t-distributed stochastic neighbor embedding) [@arxiv:1808.01359].
Titus et. al. [@doi:10.5220/0006636401400145] adapted a VAE strategy developed by Way et. al. [@doi:10.1142/9789813235533_0008] to methylation data.
One method used several stacked binary RBMs to learn a low-dimensional subspace representation of the methylation profiles of the 5,000 CpG sites with the highest variance across 136 breast tissue samples (113 breast cancer samples and 23 non-cancerous samples).
Samples in the latent space were clustered via self-organizing maps to show that the latent space could differentiate breast cancer samples from non-neoplastic samples.
Furthermore, the latent space was visualized using t-Distributed Stochastic Neighbor Embedding (t-SNE) [@tag:Maaten2008_tsne; @arxiv:1808.01359].
Titus et al. [@doi:10.5220/0006636401400145] adapted a VAE strategy developed by Way et al. [@doi:10.1142/9789813235533_0008] to methylation data.
The VAE was modified to perform dimensionality reduction on 300,000 PAM50-assigned CpG features to 100 latent features in 862 samples.
The authors performed t-SNE visualization, clustering, and classified tumor subtypes from a Breast Cancer dataset from TCGA.
In an subsequent extension of this work [@doi:10.1101/433763], the authors constructed a 100-dimensional latent space of 100k CpG sites across around 1200 samples, and selected latent space dimensions that were the most highly associated with the differentiation between estrogen-response (ER) positive and negative tumor samples in breast cancer patients to determine the extent to which the latent space could predict responses to endocrine therapy.
Certain latent space dimensions differentiated tumors based on their ER status and provided biologically-plausible hypotheses, which suggests that VAE-derived models may have a place in summarizing DNA methylation profiles into composite features that can aid in predicting treatment response.
Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders.
After constructing a latent space representations of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimentions to accturately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
The authors performed t-SNE visualization, clustering, and tumor subtype classification from a TCGA breast cancer dataset.
In a subsequent extension [@doi:10.1101/433763], the authors constructed a 100-dimensional latent space of 100,000 CpG sites across approximately 1,200 samples.
They selected latent space dimensions that were the most highly associated with the differentiation between estrogen receptor (ER) positive and negative tumor samples in breast cancer patients to determine the extent to which the latent space could predict responses to endocrine therapy.
Certain latent space dimensions differentiated tumors based on their ER status and provided biologically-plausible hypotheses, which suggests that VAE-derived models may have a place in summarizing DNA methylation profiles into composite features that can aid in predicting treatment response.
Another study explored the latent features of lung cancer methylation profiles that were extracted using VAEs.
After constructing latent space representations of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
These studies, along with the growing body of work using VAEs and other latent representations of genomic and epigenomic data, demonstrate a suite of tools to explore the unmeasured aspects of biology.
Techniques that produce these representations provide the opportunity to discover important biological features that were previously missed.
The power of unsupervised deep learning models for this task comes from their ability to learn high-dimensional non-linear relationships among data.

Important applications in the future include predicting methylation and pathological states based on methylation profiles uncovered from datasets with more noise, such as solid tissue samples over blood samples.
Unsupervised deep learning approaches such as variational autoencoders, which leverage measured points to produce a generative, low-dimensional representation, may provide a more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes.
While neural-network embeddings can outperform traditional embeddings, it is important to be aware that many of these methods can be highly sensitive to hyperparameter tuning and an evaluation of the impact of hyperparameter tuning should be included [@doi:10.1101/385534].
Important applications in the future include predicting methylation and pathological states based on methylation profiles uncovered from datasets with more noise, such as solid tissue samples.
Unsupervised deep learning approaches such as VAEs may provide a more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes.
In addition, latent representations may assist with biological hypothesis generation and have the ability to stratify patients by predicted risk.
While neural network embeddings can outperform traditional embeddings, it is important to be aware that many of these methods can be highly sensitive to hyperparameter tuning, and an evaluation of the impact of that tuning should be included [@doi:10.1101/385534].
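
As background for the VAE-based latent spaces discussed in this section, the objective that a standard VAE maximizes is the evidence lower bound. The notation below is an assumption introduced for illustration and is not taken from the cited studies: $x$ is a methylation profile, $z$ is its latent vector, $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ is the decoder, and $p(z)$ is a standard normal prior.

$$
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)
$$

When the encoder outputs a diagonal Gaussian with mean $\mu_j$ and variance $\sigma_j^2$ for each of $d$ latent dimensions, the second term has the closed form

$$
D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right) = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
$$

The first term rewards accurate reconstruction of the profile, while the second keeps the latent codes close to the prior. The relative weighting of the two terms is one of the hyperparameters to which such embeddings can be sensitive.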

### Splicing

@@ -135,7 +141,7 @@ Hence, predictive computational models of TF binding are essential to understand
Several machine learning approaches have been developed to learn generative and discriminative models of TF binding from *in vitro* and *in vivo* TF binding datasets that associate collections of synthetic DNA sequences or genomic DNA sequences to binary labels (bound/unbound) or continuous measures of binding.
The most common class of TF binding models in the literature are those that only model the DNA sequence affinity of TFs from *in vitro* and *in vivo* binding data.
The earliest models were based on deriving simple, compact, interpretable sequence motif representations such as position weight matrices (PWMs) and other biophysically inspired models [@tag:Stormo2000_dna; @doi:10.1093/nar/gkp335; @doi:10.1038/nbt.2486].
These models were outperformed by general k-mer based models including support vector machines (SVMs) with string kernels [@doi:10.1371/journal.pcbi.1000916; @tag:Ghandi2014_enhanced].
These models were outperformed by general k-mer based models including SVMs with string kernels [@doi:10.1371/journal.pcbi.1000916; @tag:Ghandi2014_enhanced].

In 2015, Alipanahi et al. developed DeepBind, the first CNN to classify bound DNA sequences based on *in vitro* and *in vivo* assays against random DNA sequences matched for dinucleotide sequence composition [@tag:Alipanahi2015_predicting].
The convolutional layers learn pattern detectors reminiscent of PWMs from a one-hot encoding of the raw input DNA sequences.
@@ -430,8 +436,8 @@ They achieved impressive performance, even for cell types where the subset perce
However, they did not benchmark against random forests, which tend to work better for imbalanced data, and their data was relatively low dimensional.

Neural networks can also learn low-dimensional representations of single-cell gene expression data for visualization, clustering, and other tasks.
Whereas scvis primarily focuses on single-cell visualization as a replacement for t-Distributed Stochastic Neighbor Embedding [@tag:Maaten2008_tsne], the scVI model accounts for zero-inflated expression distributions and can impute zero values that are due to technical effects.
Both scvis [@doi:10.1101/178624] and scVI [@arxiv:1709.02082] are unsupervised approaches based on variational autoencoders (VAEs).
Whereas scvis primarily focuses on single-cell visualization as a replacement for t-SNE [@tag:Maaten2008_tsne], the scVI model accounts for zero-inflated expression distributions and can impute zero values that are due to technical effects.
Beyond VAEs, Lin et al. developed a supervised model to predict cell type [@doi:10.1093/nar/gkx681].
Similar to transfer learning approaches for microscopy images [@doi:10.1101/085118], they demonstrated that the hidden layer representations were informative in general and could be used to identify cellular subpopulations or match new cells to known cell types.
The supervised neural network's representation was better overall at retrieving cell types than alternatives, but all methods struggled to recover certain cell types such as hematopoietic stem cells and inner cell mass cells.