Skip to content

Commit

Permalink
Standardize abbreviations, closes greenelab#345
Browse files Browse the repository at this point in the history
  • Loading branch information
agitter committed May 23, 2017
1 parent 50f76a4 commit bed8212
Show file tree
Hide file tree
Showing 5 changed files with 98 additions and 101 deletions.
4 changes: 2 additions & 2 deletions sections/02_intro.md
Original file line number Diff line number Diff line change
Expand Up @@ -120,15 +120,15 @@ help count mitotic divisions, a feature that is highly correlated with disease
outcome in histological images [@doi:10.1007/978-3-642-40763-5_51]. Despite
these recent advances, a number of challenges exist in this area of research,
most notably the integration of molecular and imaging data with other disparate
types of data such as electronic health records (EHR).
types of data such as electronic health records (EHRs).

#### Fundamental biological study

Deep learning can be applied to answer more fundamental biological questions; it
is especially suited to leveraging large amounts of data from high-throughput
"omics" studies. One classic biological problem where machine learning, and now
deep learning, has been extensively applied is molecular target prediction. For
example, deep recurrent neural networks (RNN) have been used to predict gene
example, deep recurrent neural networks (RNNs) have been used to predict gene
targets of microRNAs [@doi:10.1109/icnn.1994.374637], and CNNs have been applied
to predict protein residue-residue contacts and secondary structure
[@doi:10.1371/journal.pcbi.1005324 @doi:10.1109/tcbb.2014.2343960
Expand Down
15 changes: 7 additions & 8 deletions sections/03_categorize.md
Original file line number Diff line number Diff line change
Expand Up @@ -191,7 +191,7 @@ However, it is both costly and time-consuming to annotate a large-scale
fully-labeled corpus to facilitate data-intensive deep learning models. As an
alternative, Wang et al. [@arxiv:1705.02315] proposed to use weakly labeled
images. To generate weak labels for X-ray images, they applied a series of
Natural Language Processing (NLP) techniques to the associated chest X-ray
natural language processing (NLP) techniques to the associated chest X-ray
radiological reports. Specifically, they first extracted all diseases mentioned
in the reports using a state-of-the-art NLP tool, then applied a newly-developed
negation and uncertainty detection tool (NegBio) to filter negative and
Expand Down Expand Up @@ -617,10 +617,9 @@ events can provide insight into a patient's disease and treatment
discrete combinatorial bucketing. Lasko et al.
[@doi:10.1371/journal.pone.0066341] used autoencoders on longitudinal sequences
of serum urine acid measurements to identify population subtypes. More recently,
deep learning has shown promise working with both sequences (Convolutional
Neural Networks) [@arxiv:1607.07519] and the incorporation of past and current
state (Recurrent Neural Networks, Long Short Term Memory Networks)
[@arxiv:1602.00357]. This may be a particular area of opportunity for deep
neural networks. The ability to recognize relevant sequences of events from a
large number of trajectories requires powerful and flexible feature construction
methods -- an area in which deep neural networks excel.
deep learning has shown promise working with both sequences (CNNs)
[@arxiv:1607.07519] and the incorporation of past and current state (RNNs,
LSTMs) [@arxiv:1602.00357]. This may be a particular area of opportunity for
deep neural networks. The ability to recognize relevant sequences of events from
a large number of trajectories requires powerful and flexible feature
construction methods -- an area in which deep neural networks excel.
81 changes: 40 additions & 41 deletions sections/04_study.md
Original file line number Diff line number Diff line change
Expand Up @@ -125,7 +125,7 @@ potentially illuminate important aspects of splicing behavior. For instance,
tissue-specific splicing mechanisms could be inferred by training networks on
splicing data from different tissues, then searching for common versus
distinctive hidden nodes, a technique employed by Qin et al. for tissue-specific
TF binding predictions [@tag:Qin2017_onehot].
transcription factor (TF) binding predictions [@tag:Qin2017_onehot].

A parallel effort has been to use more data with simpler models. An exhaustive
study using readouts of splicing for millions of synthetic intronic sequences
Expand Down Expand Up @@ -159,7 +159,7 @@ integrate diverse data sources will be required.

### Transcription factors and RNA-binding proteins

Transcription factors (TFs) and RNA-binding proteins are key components in gene
Transcription factors and RNA-binding proteins are key components in gene
regulation and higher-level biological processes. TFs are regulatory proteins
that bind to certain genomic loci and control the rate of mRNA production. While
high-throughput sequencing techniques such as chromatin immunoprecipitation and
Expand Down Expand Up @@ -194,14 +194,14 @@ introduced several new convolutional and recurrent neural network models that
further improved TFBS predictive accuracy. Due to the motif-driven nature of the
TFBS task, most architectures have been convolution-based
[@tag:Zeng2016_convolutional]. While many models for TFBS prediction resemble
computer vision and natural language processing (NLP) tasks, it is important to
note that DNA sequence tasks are fundamentally different. Thus the models should
be adapted from traditional deep learning models in order to account for such
differences. For example, motifs may appear in either strand of a DNA sequence,
resulting in two different forms of the motif (forward and reverse complement)
due to complementary base pairing. To handle this issue, specialized reverse
complement convolutional models share parameters to find motifs in both
directions [@tag:Shrikumar2017_reversecomplement].
computer vision and NLP tasks, it is important to note that DNA sequence tasks
are fundamentally different. Thus the models should be adapted from traditional
deep learning models in order to account for such differences. For example,
motifs may appear in either strand of a DNA sequence, resulting in two different
forms of the motif (forward and reverse complement) due to complementary base
pairing. To handle this issue, specialized reverse complement convolutional
models share parameters to find motifs in both directions
[@tag:Shrikumar2017_reversecomplement].

Despite these advances, several challenges remain. First, because the inputs
(ChIP-seq measurements) are continuous and most current algorithms are designed
Expand Down Expand Up @@ -602,21 +602,21 @@ whole-genome shotgun DNA -- from microbial communities, has revolutionized the
study of micro-scale ecosystems within and around us. In recent years, machine
learning has proved to be a powerful tool for metagenomic analysis. 16S rRNA has
long been used to deconvolve mixtures of microbial genomes, yet this ignores
>99% of the genomic content. Subsequent tools aimed to classify 300-3000 base
pair reads from complex mixtures of microbial genomes based on tetranucleotide
frequencies, which differ across organisms [@tag:Karlin], using supervised
[@tag:McHardy @tag:nbc] or unsupervised methods [@tag:Abe]. Then, researchers
began to use techniques that could estimate relative abundances from an entire
sample faster than classifying individual reads [@tag:Metaphlan @tag:wgsquikr
@tag:lmat @tag:Vervier]. There is also great interest in identifying and
annotating sequence reads [@tag:yok @tag:Soueidan]. However, the focus on
taxonomic and functional annotation is just the first step. Several groups have
proposed methods to determine host or environment phenotypes from the organisms
that are identified [@tag:Guetterman @tag:Knights @tag:Stratnikov @tag:Segata]
or overall sequence composition [@tag:Ding]. Also, researchers have looked into
how feature selection can improve classification [@tag:Liu @tag:Segata], and
techniques have been proposed that are classifier-independent [@tag:Ditzler
@tag:Ditzler2].
more than 99% of the genomic content. Subsequent tools aimed to classify
300-3000 base pair reads from complex mixtures of microbial genomes based on
tetranucleotide frequencies, which differ across organisms [@tag:Karlin], using
supervised [@tag:McHardy @tag:nbc] or unsupervised methods [@tag:Abe]. Then,
researchers began to use techniques that could estimate relative abundances from
an entire sample faster than classifying individual reads [@tag:Metaphlan
@tag:wgsquikr @tag:lmat @tag:Vervier]. There is also great interest in
identifying and annotating sequence reads [@tag:yok @tag:Soueidan]. However, the
focus on taxonomic and functional annotation is just the first step. Several
groups have proposed methods to determine host or environment phenotypes from
the organisms that are identified [@tag:Guetterman @tag:Knights @tag:Stratnikov
@tag:Segata] or overall sequence composition [@tag:Ding]. Also, researchers have
looked into how feature selection can improve classification [@tag:Liu
@tag:Segata], and techniques have been proposed that are classifier-independent
[@tag:Ditzler @tag:Ditzler2].

How have neural networks been of use? Most neural networks are being used for
phylogenetic classification or functional annotation from sequence data where
Expand All @@ -630,27 +630,26 @@ identification [@tag:Hochreiter @tag:Sonderby].

One of the first techniques of *de novo* genome binning used self-organizing
maps, a type of neural network [@tag:Abe]. Essinger et al.
[@tag:Essinger2010_taxonomic] used Adaptive Resonance Theory (ART) to cluster
similar genomic fragments and showed that it had better performance than
k-means. However, other methods based on interpolated Markov models
[@tag:Salzberg] have performed better than these early genome binners. Neural
networks can be slow and therefore have had limited use for reference-based
taxonomic classification, with TAC-ELM [@tag:TAC-ELM] being the only neural
network-based algorithm to taxonomically classify massive amounts of metagenomic
data. An initial study successfully applied neural networks to taxonomic
classification of 16S rRNA genes, with convolutional networks providing about
10% accuracy genus-level improvement over RNNs and random forests [@tag:Mrzelj].
However, this study evaluated only 3000 sequences.
[@tag:Essinger2010_taxonomic] used Adaptive Resonance Theory to cluster similar
genomic fragments and showed that it had better performance than k-means.
However, other methods based on interpolated Markov models [@tag:Salzberg] have
performed better than these early genome binners. Neural networks can be slow
and therefore have had limited use for reference-based taxonomic classification,
with TAC-ELM [@tag:TAC-ELM] being the only neural network-based algorithm to
taxonomically classify massive amounts of metagenomic data. An initial study
successfully applied neural networks to taxonomic classification of 16S rRNA
genes, with convolutional networks providing about 10% accuracy genus-level
improvement over RNNs and random forests [@tag:Mrzelj]. However, this study
evaluated only 3000 sequences.

Neural network uses for classifying phenotype from microbial composition are
just beginning. A standard multi-layer perceptron (MLP) was able to classify
wound severity from microbial species present in the wound
[@doi:10.1016/j.bjid.2015.08.013]. Recently, Ditzler et al. associated soil
samples with pH level using MLPs, deep-belief networks (DBNs), and recurrant
neural networks (RNNs) [@tag:Ditzler3]. Besides classifying samples
appropriately, internal phylogenetic tree nodes inferred by the networks
represented features for low and high pH. Thus, hidden nodes might provide
biological insight as well as new features for future metagenomic sample
samples with pH level using MLPs, DBNs, and RNNs [@tag:Ditzler3]. Besides
classifying samples appropriately, internal phylogenetic tree nodes inferred by
the networks represented features for low and high pH. Thus, hidden nodes might
provide biological insight as well as new features for future metagenomic sample
comparison. Also, an initial study has shown promise of these networks for
diagnosing disease [@tag:Faruqi].

Expand Down
47 changes: 23 additions & 24 deletions sections/05_treat.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,20 +65,20 @@ Discussion).

A common application for deep learning in this domain is the temporal structure
of healthcare records. Many studies [@tag:Lipton2016_missing @tag:Che2016_rnn
@tag:Huddar2016_predicting @tag:Lipton2015_lstm] have used deep recurrent
networks to categorize patients, but most stop short of suggesting clinical
decisions. Nemati et al. [@tag:Nemati2016_rl] used deep reinforcement learning
to optimize a heparin dosing policy for intensive care patients. However,
because the ideal dosing policy is unknown, the model's predictions must be
evaluated on counter-factual data. This represents a common challenge when
bridging the gap between research and clinical practice. Because the
ground-truth is unknown, researchers struggle to evaluate model predictions in
the absence of interventional data, but clinical application is unlikely until
the model has been shown to be effective. The impressive applications of deep
reinforcement learning to other domains [@tag:Silver2016_alphago] have relied on
knowledge of the underlying processes (e.g. the rules of the game). Some models
have been developed for targeted medical problems [@tag:Gultepe2014_sepsis], but
a generalized engine is beyond current capabilities.
@tag:Huddar2016_predicting @tag:Lipton2015_lstm] have used RNNs to categorize
patients, but most stop short of suggesting clinical decisions. Nemati et al.
[@tag:Nemati2016_rl] used deep reinforcement learning to optimize a heparin
dosing policy for intensive care patients. However, because the ideal dosing
policy is unknown, the model's predictions must be evaluated on counter-factual
data. This represents a common challenge when bridging the gap between research
and clinical practice. Because the ground-truth is unknown, researchers
struggle to evaluate model predictions in the absence of interventional data,
but clinical application is unlikely until the model has been shown to be
effective. The impressive applications of deep reinforcement learning to other
domains [@tag:Silver2016_alphago] have relied on knowledge of the underlying
processes (e.g. the rules of the game). Some models have been developed for
targeted medical problems [@tag:Gultepe2014_sepsis], but a generalized engine is
beyond current capabilities.

#### Clinical trials efficiency

Expand Down Expand Up @@ -122,8 +122,8 @@ using both cell line and drug features, opening the door to precision medicine
and drug repositioning opportunities in cancer. More recently, Aliper et al.
[@doi:10.1021/acs.molpharmaceut.6b00248] used gene- and pathway-level drug
perturbation transcriptional profiles from the Library of Network-Based Cellular
Signatures (LINCS) [@doi:10.3389/fgene.2014.00342] to train a fully connected
deep neural network to predict drug therapeutic uses and indications. By using
Signatures [@doi:10.3389/fgene.2014.00342] to train a fully connected deep
neural network to predict drug therapeutic uses and indications. By using
confusion matrices and leveraging misclassification, the authors formulated a
number of interesting hypotheses, including repurposing cardiovascular drugs
such as otenzepad and pinacidil for neurological disorders.
Expand All @@ -138,9 +138,8 @@ Wang et al. [@doi:10.1093/bioinformatics/btt234] trained individual RBMs for
each target in a drug-target interaction network and used these models to
predict novel interactions pointing to new indications for existing drugs. Wen
et al. [@doi:10.1021/acs.jproteome.6b00618] extended this concept to deep
learning by creating a DBN of stacked RBMs called DeepDTIs, which is able to
predict interactions on the basis of chemical structure and protein sequence
features.
learning by creating a DBN called DeepDTIs, which is able to predict
interactions on the basis of chemical structure and protein sequence features.

Drug repositioning appears to be an obvious candidate for deep learning both
because of the large amount of high-dimensional data available and the
Expand Down Expand Up @@ -198,11 +197,11 @@ Merck Molecular Activity Challenge on Kaggle generated substantial excitement
about the potential for high-parameter deep learning approaches. The winning
submission was an ensemble that included a multi-task multi-layer perceptron
network [@tag:Dahl2014_multi_qsar]. The sponsors noted drastic improvements
over a random forest (RF) baseline, remarking "we have seldom seen any method in
the past 10 years that could consistently outperform RF by such a margin"
[@tag:Ma2015_qsar_merck]. Subsequent work (reviewed in more detail by Goh et al.
[@doi:10.1002/jcc.24764]) explored the effects of jointly modeling far more
targets than the Merck challenge [@tag:Unterthiner2014_screening
over a random forest baseline, remarking "we have seldom seen any method in the
past 10 years that could consistently outperform [random forest] by such a
margin" [@tag:Ma2015_qsar_merck]. Subsequent work (reviewed in more detail by
Goh et al. [@doi:10.1002/jcc.24764]) explored the effects of jointly modeling
far more targets than the Merck challenge [@tag:Unterthiner2014_screening
@tag:Ramsundar2015_multitask_drug], with Ramsundar et al.
[@tag:Ramsundar2015_multitask_drug] showing that the benefits of multi-task
networks had not yet saturated even with 259 targets. Although DeepTox
Expand Down
Loading

0 comments on commit bed8212

Please sign in to comment.