add discussion in interpretability section and updates to molecular design section and discussion sections (issues with previous PR fixed) #988

Merged 24 commits on Aug 9, 2020. Changes shown below are from all commits.

Commits:
1cb14cb add discussion in interpretability section and update section on mole… (delton137, Dec 19, 2019)
e7f6ca6 remove build files (cgreene, Feb 10, 2020)
4e53c22 rehash/update my previous commit - single lines and other fixes (delton137, Feb 14, 2020)
721829f rehash/update my previous commit - single lines and other fixes (delton137, Feb 14, 2020)
6f0e609 rehash/update my previous commit - single lines and other fixes (delton137, Feb 14, 2020)
e18d939 rehash/update my previous commit - single lines and other fixes (delton137, Feb 14, 2020)
b77c8a1 Merge branch 'master' of https://github.com/delton137/deep-review (delton137, Feb 14, 2020)
5dcf0da Minor changes for recent version 2.0 updates (#1002) (agitter, Mar 16, 2020)
081fb46 Spelling and wording cleanup (#1014) (cbrueffer, Apr 11, 2020)
77e5ff3 add discussion in interpretability section and update section on mole… (delton137, Dec 19, 2019)
cf85f39 remove build files (cgreene, Feb 10, 2020)
ebb27b1 rehash/update my previous commit - single lines and other fixes (delton137, Feb 14, 2020)
15564c5 rehash/update my previous commit - single lines and other fixes (delton137, Feb 14, 2020)
1de073d rehash/update my previous commit - single lines and other fixes (delton137, Feb 14, 2020)
0089d29 rehash/update my previous commit - single lines and other fixes (delton137, Feb 14, 2020)
9d190e4 merge (delton137, Apr 20, 2020)
fea5c22 Delete citation tags (agitter, Aug 8, 2020)
9a1f92a Convert tags to Markdown format (agitter, Aug 8, 2020)
617ff70 Apply suggestions from code review (delton137, Aug 8, 2020)
8c431ee Update 05.treat.md (delton137, Aug 8, 2020)
d60d129 Remove interpretability changes from this pull request (agitter, Aug 9, 2020)
19e9c80 Citation fixes (agitter, Aug 9, 2020)
7975507 Apply suggestions from code review (delton137, Aug 9, 2020)
a88f0b6 Update content/05.treat.md (agitter, Aug 9, 2020)
README.md (5 changes: 3 additions & 2 deletions)

@@ -17,11 +17,12 @@ The original version of the Deep Review was published in 2018 and should be cite
> Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow P-M, Zietz M, Hoffman MM, Xie W, Rosen GL, Lengerich BJ, Israeli J, Lanchantin J, Woloszynek S, Carpenter AE, Shrikumar A, Xu J, Cofer EM, Lavender CA, Turaga SC, Alexandari AM, Lu Z, Harris DJ, DeCaprio D, Qi Y, Kundaje A, Peng Y, Wiley LK, Segler MHS, Boca SM, Swamidass SJ, Huang A, Gitter A, and Greene CS. 2018. Opportunities and obstacles for deep learning in biology and medicine. _Journal of The Royal Society Interface_ 15(141):20170387. [doi:10.1098/rsif.2017.0387](https://doi.org/10.1098/rsif.2017.0387)


### Current stage: planning Deep Review 2019
### Current stage: planning Deep Review version 2.0

As of writing, we are aiming to publish an update of the deep review each year, with the next such release occurring at the end of 2019.
As of writing, we are aiming to publish an update of the deep review.
We will continue to make project preprints available on bioRxiv or another preprint service and aim to continue publishing the finished reviews in a peer-reviewed venue as well.
Like the initial release, we are planning for an open and collaborative effort.
New contributors are welcome and will be listed as version 2.0 authors.
Please see [issue #810](https://github.com/greenelab/deep-review/issues/810) to contribute to the discussion of future plans, and help decide how to best continue this project.

**Manubot updates:**
Expand Down
build/randomize-authors.py (2 changes: 1 addition & 1 deletion)

@@ -11,7 +11,7 @@

def parse_args():
parser = argparse.ArgumentParser(
description="Randomize metadata.authors. Ovewrites metadata.yaml"
description="Randomize metadata.authors. Overwrites metadata.yaml"
)
parser.add_argument(
"--path", default="content/metadata.yaml", help="path to metadata.yaml"
content/00.front-matter.md (8 changes: 4 additions & 4 deletions)

@@ -8,8 +8,8 @@
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css">
[
[]{.fas .fa-info-circle .fa-lg} **Update Underway**<br>
A published version of this manuscript from 04 April 2018, termed Version 1.0, is available at <https://doi.org/10.1098/rsif.2017.0387>.
A new effort is underway to update the manuscript to a Version 2.0 that is current as of the first half of 2020.
A published version of this manuscript from 04 April 2018, termed version 1.0, is available at <https://doi.org/10.1098/rsif.2017.0387>.
A new effort is underway to update the manuscript to a version 2.0 that is current as of the first half of 2020.
New authors and links to new sections are available in [GitHub Issue #959](https://github.com/greenelab/deep-review/issues/959).
]{.banner .lightred}

@@ -30,7 +30,7 @@ on {{manubot.date}}.

## Authors

### Version 2.0 Authors
### Version 2.0 authors

{% for author in manubot.authors %}
{% if author.v2 -%}
@@ -54,7 +54,7 @@ on {{manubot.date}}.

</small>

### Version 1.0 Authors
### Version 1.0 authors

[![ORCID icon](images/orcid.svg){height="11px" width="11px"}](https://orcid.org/0000-0002-5577-3516)
Travers Ching<sup>1.1,☯</sup>,
content/04.study.md (6 changes: 3 additions & 3 deletions)

@@ -384,7 +384,7 @@ MHCflurry adds placeholder amino acids to transform variable-length peptides to
In training the MHCflurry feed-forward neural network [@doi:10.1101/054775], the authors imputed missing MHC-peptide binding affinities using a Gibbs sampling method, showing that imputation improves performance for data-sets with roughly 100 or fewer training examples.
MHCflurry's imputation method increases its performance on poorly characterized alleles, making it competitive with NetMHCpan for this task.
Kuksa et al. [@doi:10.1093/bioinformatics/btv371] developed a shallow, higher-order neural network (HONN) comprised of both mean and covariance hidden units to capture some of the higher-order dependencies between amino acid locations.
Pretraining this HONN with a semi-restricted Boltzmann machine, the authors found that the performance of the HONN exceeded that of a simple deep neural network, as well as that of NetMHC.
Pre-training this HONN with a semi-restricted Boltzmann machine, the authors found that the performance of the HONN exceeded that of a simple deep neural network, as well as that of NetMHC.

Deep learning's unique flexibility was recently leveraged by Bhattacharya et al. [@doi:10.1101/154757], who used a gated RNN method called MHCnuggets to overcome the difficulty of multiple peptide lengths.
Under this framework, they used smoothed sparse encoding to represent amino acids individually.
@@ -484,7 +484,7 @@ Also, researchers have looked into how feature selection can improve classificat

Most neural networks are used for phylogenetic classification or functional annotation from sequence data where there is ample data for training.
Neural networks have been applied successfully to gene annotation (e.g. Orphelia [@tag:Hoff] and FragGeneScan [@doi:10.1093/nar/gkq747]).
Representations (similar to Word2Vec [@tag:Word2Vec] in natural language processing) for protein family classification have been introduced and classified with a skip-gram neural network [@tag:Asgari].
Representations (similar to word2vec [@tag:word2vec] in natural language processing) for protein family classification have been introduced and classified with a skip-gram neural network [@tag:Asgari].
Recurrent neural networks show good performance for homology and protein family identification [@tag:Hochreiter; @tag:Sonderby].

One of the first techniques of *de novo* genome binning used self-organizing maps, a type of neural network [@tag:Abe].
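
As a minimal sketch of this k-mer "protein words" idea (not the cited implementation), sequences can be split into overlapping 3-mers and fed to a skip-gram model; the snippet assumes gensim 4.x and uses toy sequences, whereas the cited work trains on large sequence corpora:

```python
# Skip-gram embeddings for protein sequences, in the spirit of the cited work.
from gensim.models import Word2Vec

def kmerize(seq, k=3):
    """Split a protein sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy sequences; real corpora contain hundreds of thousands of proteins.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSLNFLDFEQPIAELEAKIDSLTAVSRQDEKLD"]
corpus = [kmerize(s) for s in sequences]

model = Word2Vec(corpus, vector_size=64, window=5, min_count=1, sg=1)  # sg=1: skip-gram
protein_vector = sum(model.wv[kmer] for kmer in corpus[0]) / len(corpus[0])  # mean k-mer embedding
```
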
@@ -558,7 +558,7 @@ Even when they are not directly modeling biological neurons, deep networks have
They have been developed as statistical time series models of neural activity in the brain.
And in contrast to the encoding models described earlier, these models are used for decoding neural activity, for instance in brain machine interfaces [@doi:10.1101/152884].
They have been crucial to the field of connectomics, which is concerned with mapping the connectivity of biological neural networks in the brain.
In connectomics, deep networks are used to segment the shapes of individual neurons and to infer their connectivity from 3D electron microscopic images [@doi:10.1016/j.conb.2010.07.004], and they have been also been used to infer causal connectivity from optical measurement and perturbation of neural activity [@tag:Aitchison2017].
In connectomics, deep networks are used to segment the shapes of individual neurons and to infer their connectivity from 3D electron microscopic images [@doi:10.1016/j.conb.2010.07.004], and they have also been used to infer causal connectivity from optical measurement and perturbation of neural activity [@tag:Aitchison2017].

It is an exciting time for neuroscience.
Recent rapid progress in deep networks continues to inspire new machine learning based models of brain computation [@doi:10.3389/fncom.2016.00094].
content/05.treat.md (51 changes: 35 additions & 16 deletions)

@@ -180,28 +180,47 @@ However, in the long term, atomic convolutions may ultimately overtake grid-base

#### *De novo* drug design

*De novo* drug design attempts to model the typical design-synthesize-test cycle of drug discovery [@doi:10.1002/wcms.49; @doi:10.1021/acs.jmedchem.5b01849].
*De novo* drug design attempts to model the typical design-synthesize-test cycle of drug discovery *in silico* [@doi:10.1002/wcms.49; @doi:10.1021/acs.jmedchem.5b01849].
It explores an estimated 10<sup>60</sup> synthesizable organic molecules with drug-like properties without explicit enumeration [@doi:10.1002/wcms.1104].
To test or score structures, algorithms like those discussed earlier are used.
To score molecules after generation or during optimization, physics-based simulation could be used [@tag:Sumita2018], but machine learning models based on techniques discussed earlier may be preferable [@tag:Gomezb2016_automatic], as they are much more computationally expedient.
Computational efficiency is particularly important during optimization as the "scoring function" may need to be called thousands of times.

To "design" and "synthesize", traditional *de novo* design software relied on classical optimizers such as genetic algorithms.
Unfortunately, this often leads to overfit, "weird" molecules, which are difficult to synthesize in the lab.
Current programs have settled on rule-based virtual chemical reactions to generate molecular structures [@doi:10.1021/acs.jmedchem.5b01849].
Deep learning models that generate realistic, synthesizable molecules have been proposed as an alternative.
In contrast to the classical, symbolic approaches, generative models learned from data would not depend on laboriously encoded expert knowledge.
The challenge of generating molecules has parallels to the generation of syntactically and semantically correct text [@arxiv:1308.0850].

As deep learning models that directly output (molecular) graphs remain under-explored, generative neural networks for drug design typically represent chemicals with the simplified molecular-input line-entry system (SMILES), a standard string-based representation with characters that represent atoms, bonds, and rings [@tag:Segler2017_drug_design].
This allows treating molecules as sequences and leveraging recent progress in recurrent neural networks.
Gómez-Bombarelli et al. designed a SMILES-to-SMILES autoencoder to learn a continuous latent feature space for chemicals [@tag:Gomezb2016_automatic].
In this learned continuous space it was possible to interpolate between continuous representations of chemicals in a manner that is not possible with discrete
(e.g. bit vector or string) features or in symbolic, molecular graph space.
Even more interesting is the prospect of performing gradient-based or Bayesian optimization of molecules within this latent space.
These algorithms use a list of hard-coded rules to perform virtual chemical reactions on molecular structures during each iteration, leading to physically stable and synthesizable molecules [@doi:10.1021/acs.jmedchem.5b01849].
Deep learning models have been proposed as an alternative.
In contrast to the classical approaches, generative models learned from big data would, in theory, not require laboriously encoded expert knowledge to generate realistic, synthesizable molecules.

In the past few years, a large number of techniques for the generative modeling and optimization of molecules with deep learning have been explored, including RNNs, VAEs, GANs, and reinforcement learning---for a review see Elton et al. [@tag:Elton_molecular_design_review] or Vamathevan et al. [@tag:Vamathevan2019].

Building off the large amount of work that has already gone into text generation [@arxiv:1308.0850], many generative neural networks for drug design initially represented chemicals with the simplified molecular-input line-entry system (SMILES), a standard string-based representation with characters that represent atoms, bonds, and rings [@tag:Segler2017_drug_design].
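
As a small illustration of the representation itself (independent of any particular cited model), RDKit can validate and canonicalize a SMILES string, and a character-level tokenization is all a sequence model needs; aspirin serves as a stand-in molecule:

```python
# SMILES as the string interface between molecules and sequence models.
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"           # aspirin
mol = Chem.MolFromSmiles(smiles)            # returns None for invalid SMILES
assert mol is not None
tokens = list(smiles)                       # naive character-level tokenization for an RNN
print(Chem.MolToSmiles(mol), len(tokens))   # canonical SMILES and sequence length
```
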

The first successful demonstration of a deep learning based approach for molecular optimization occurred in 2016 with the development of a SMILES-to-SMILES autoencoder capable of learning a continuous latent feature space for molecules [@tag:Gomezb2016_automatic].
In this learned continuous space it is possible to interpolate between molecular structures in a manner that is not possible with discrete (e.g. bit vector or string) features or in symbolic, molecular graph space.
Even more interesting is that one can perform gradient-based or Bayesian optimization of molecules within this latent space.
The strategy of constructing simple, continuous features before applying supervised learning techniques is reminiscent of autoencoders trained on high-dimensional EHR data [@tag:BeaulieuJones2016_ehr_encode].
A drawback of the SMILES-to-SMILES autoencoder is that not all SMILES strings produced by the autoencoder's decoder correspond to valid chemical structures.
Recently, the Grammar Variational Autoencoder, which takes the SMILES grammar into account and is guaranteed to produce syntactically valid SMILES, has been proposed to alleviate this issue [@arxiv:1703.01925].
The Grammar Variational Autoencoder, which takes the SMILES grammar into account and is guaranteed to produce syntactically valid SMILES, helps alleviate this issue to some extent [@arxiv:1703.01925].
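
A minimal sketch of such latent-space interpolation, where `encode` and `decode` are hypothetical stand-ins for a trained autoencoder's networks and RDKit discards syntactically invalid decodes:

```python
# Interpolate between two molecules along a line in a learned latent space.
import numpy as np
from rdkit import Chem

def interpolate_molecules(smiles_a, smiles_b, encode, decode, steps=10):
    z_a, z_b = encode(smiles_a), encode(smiles_b)
    valid = []
    for t in np.linspace(0.0, 1.0, steps):
        candidate = decode((1 - t) * z_a + t * z_b)    # point on the latent path
        if Chem.MolFromSmiles(candidate) is not None:  # keep only valid SMILES
            valid.append(candidate)
    return valid
```
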

Another approach to *de novo* design is to train character-based RNNs on large collections of molecules, for example, ChEMBL [@doi:10.1093/nar/gkr777], to first obtain a generic generative model for drug-like compounds [@tag:Segler2017_drug_design].
These generative models successfully learn the grammar of compound representations, with 94% [@tag:Olivecrona2017_drug_design] or nearly 98% [@tag:Segler2017_drug_design] of generated SMILES corresponding to valid molecular structures.
The initial RNN is then fine-tuned to generate molecules that are likely to be active against a specific target by either continuing training on a small set of positive examples [@tag:Segler2017_drug_design] or adopting reinforcement learning strategies [@tag:Olivecrona2017_drug_design; @arxiv:1611.02796].
Both the fine-tuning and reinforcement learning approaches can rediscover known, held-out active molecules.
The great flexibility of neural networks, and progress in generative models offers many opportunities for deep architectures in *de novo* design (e.g. the adaptation of GANs for molecules).
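
For illustration, the core of such a character-level SMILES model can be sketched in a few lines; PyTorch is an implementation choice here rather than necessarily what the cited works used, and the vocabulary, sampling loop, and training code are omitted:

```python
import torch.nn as nn

class SmilesRNN(nn.Module):
    """Predict the next SMILES character given the characters so far."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        x, state = self.lstm(self.embed(tokens), state)
        return self.out(x), state  # logits over the next character

# Generation samples characters autoregressively until an end token;
# fine-tuning continues training the same network on a small set of actives.
```
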

Reinforcement learning approaches where operations are performed directly on the molecular graph bypass the need to learn the details of SMILES syntax, allowing the model to focus purely on chemistry.
Additionally, they seem to require less training data and generate more valid molecules, since they are constrained by design to graph operations that satisfy chemical valence rules [@tag:Elton_molecular_design_review].
A reinforcement learning agent developed by Zhou et al. [@doi:10.1038/s41598-019-47148-x] demonstrated superior performance when optimizing the quantitative estimate of drug-likeness (QED) metric and the "penalized logP" metric (logP minus the synthetic accessibility) compared with other deep learning based approaches such as the Junction Tree VAE [@arxiv:1802.04364], Objective-Reinforced Generative Adversarial Network [@arxiv:1705.10843], and Graph Convolutional Policy Network [@arxiv:1806.02473].
As another example, Zhavoronkov et al. used generative tensorial reinforcement learning to discover inhibitors of discoidin domain receptor 1 (DDR1) [@tag:Zhavoronkov2019_drugs].
In contrast to most previous work, six lead candidates discovered using their approach were synthesized and tested in the lab, with 4/6 achieving some degree of binding to DDR1.
One of the molecules was chosen for further testing and showed promising results in a cancer cell line and mouse model [@tag:Zhavoronkov2019_drugs].
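
Both optimization targets can be computed with RDKit; in this sketch "penalized logP" is taken as logP minus the synthetic accessibility score, following the definition above, though some papers also subtract a large-ring penalty:

```python
import os, sys
from rdkit import Chem
from rdkit.Chem import Descriptors, QED, RDConfig

# The SA scorer ships in RDKit's contrib directory rather than the core API.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # synthetic accessibility: ~1 (easy) to ~10 (hard)

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")       # aspirin as an example
qed = QED.qed(mol)                                      # drug-likeness in [0, 1]
penalized_logp = Descriptors.MolLogP(mol) - sascorer.calculateScore(mol)
print(f"QED={qed:.3f}  penalized logP={penalized_logp:.3f}")
```
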

In concluding this section, we want to highlight two areas where work is still needed before AI can bring added value to the existing drug discovery process---novelty and synthesizability.
The work of Zhavoronkov et al. is arguably an important milestone and received much fanfare in the popular press, but Walters and Murcko have presented a more sober assessment, noting that the generated molecule chosen for laboratory testing is very similar to an existing drug that was present in the training data [@doi:10.1038/s41587-020-0418-2].
Small variations on existing molecules are unlikely to be much better and may not be patentable.
One way to tackle this problem is to add novelty and diversity metrics to the reward function of reinforcement learning based algorithms.
Novelty should also be taken into account when comparing different models, and it is thus included in the GuacaMol benchmark (2019) for assessing generative models for molecular design [@doi:10.1021/acs.jcim.8b00839].
The other area which has been pointed to as a key limitation of current approaches is synthesizability [@doi:10.1021/acs.jcim.0c00174; @doi:10.1021/acsmedchemlett.0c00088].
Current heuristics of synthesizability, such as the synthetic accessibility score, are based on a relatively limited domain of chemical data and are too restrictive, so better models of synthesizability should help in this area [@doi:10.1021/acs.jcim.0c00174].

As noted before, genetic algorithms use hard-coded rules based on possible chemical reactions to generate molecular structures and therefore may have less trouble generating synthesizable molecules [@doi:10.1021/acs.jmedchem.5b01849].
We note in passing that Jensen (2018) [@doi:10.1039/C8SC05372C] and Yoshikawa et al. (2019) [@doi:10.1246/cl.180665] have both demonstrated genetic algorithms that are competitive with deep learning approaches.
Progress on overcoming both of these shortcomings is proceeding on many fronts, and we believe the future of deep learning for molecular design is quite bright.
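
To make the comparison concrete, the select-mutate-score loop underlying such genetic algorithms can be sketched as below; the random string edits are only a toy stand-in for the chemically sensible graph mutations used in the cited work, and most candidates they produce are discarded by the validity filter:

```python
import random
from rdkit import Chem

ALPHABET = list("CNOF()=c1n")  # toy mutation alphabet

def mutate(smiles):
    """Replace one random character; real GAs mutate molecular graphs instead."""
    i = random.randrange(len(smiles))
    return smiles[:i] + random.choice(ALPHABET) + smiles[i + 1:]

def evolve(population, score, generations=20, n_offspring=50):
    for _ in range(generations):
        parents = sorted(population, key=score, reverse=True)[:10]  # elitist selection
        children = [mutate(random.choice(parents)) for _ in range(n_offspring)]
        children = [s for s in children if Chem.MolFromSmiles(s) is not None]  # validity filter
        population = parents + children
    return max(population, key=score)
```
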
content/06.discussion.md (4 changes: 2 additions & 2 deletions)

@@ -196,7 +196,7 @@ The contribution scores were then used to identify key phrases from a model trai
#### Latent space manipulation

Interpretation of embedded or latent space features learned through generative unsupervised models can reveal underlying patterns otherwise masked in the original input.
Embedded feature interpretation has been emphasized mostly in image and text based applications [@tag:Radford_dcgan; @tag:Word2Vec], but applications to genomic and biomedical domains are increasing.
Embedded feature interpretation has been emphasized mostly in image and text based applications [@tag:Radford_dcgan; @tag:word2vec], but applications to genomic and biomedical domains are increasing.

For example, Way and Greene trained a VAE on gene expression from The Cancer Genome Atlas (TCGA) [@doi:10.1038/ng.2764] and use latent space arithmetic to rapidly isolate and interpret gene expression features descriptive of high grade serous ovarian cancer subtypes [@tag:WayGreene2017_tybalt].
The most differentiating VAE features were representative of biological processes that are known to distinguish the subtypes.
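
One simple form of this latent space arithmetic, with `encoder` and `decoder` as hypothetical placeholders for the trained VAE networks and expression matrices of shape (samples, genes):

```python
def subtype_difference(encoder, decoder, expr_subtype_a, expr_subtype_b):
    """Decode the latent direction separating two tumor subtypes."""
    z_a = encoder(expr_subtype_a).mean(axis=0)  # mean latent vector, subtype A
    z_b = encoder(expr_subtype_b).mean(axis=0)  # mean latent vector, subtype B
    return decoder(z_a - z_b)                   # gene-space view of the difference
```
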
@@ -270,7 +270,7 @@ There is a risk that a model will easily discriminate synthetic examples but not
Multimodal, multi-task, and transfer learning, discussed in detail below, can also combat data limitations to some degree.
There are also emerging network architectures, such as Diet Networks for high-dimensional SNP data [@tag:Romero2017_diet].
These use multiple networks to drastically reduce the number of free parameters by first flipping the problem and training a network to predict parameters (weights) for each input (SNP) to learn a feature embedding.
This embedding (e.g. from principal component analysis, per class histograms, or a Word2vec [@tag:Word2Vec] generalization) can be learned directly from input data or take advantage of other datasets or domain knowledge.
This embedding (e.g. from principal component analysis, per class histograms, or a word2vec [@tag:word2vec] generalization) can be learned directly from input data or take advantage of other datasets or domain knowledge.
Additionally, in this task the features are the examples, an important advantage when it is typical to have 500 thousand or more SNPs and only a few thousand patients.
Finally, this embedding is of a much lower dimension, allowing for a large reduction in the number of free parameters.
In the example given, the number of free parameters was reduced from 30 million to 50 thousand, a factor of 600.
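
A minimal sketch of the parameter-prediction trick with illustrative shapes: the first layer's weights are emitted by a small auxiliary network applied to each feature's embedding, so the free-parameter count scales with the embedding dimension rather than the number of SNPs.

```python
import torch
import torch.nn as nn

class DietFirstLayer(nn.Module):
    """First layer whose weights are predicted from per-feature embeddings."""
    def __init__(self, feature_embed_dim, hidden_dim):
        super().__init__()
        # Shared auxiliary MLP: applied to every feature's embedding.
        self.weight_predictor = nn.Sequential(
            nn.Linear(feature_embed_dim, 128), nn.ReLU(), nn.Linear(128, hidden_dim)
        )

    def forward(self, x, feature_embeddings):
        # feature_embeddings: (n_features, feature_embed_dim), e.g. one row per SNP
        W = self.weight_predictor(feature_embeddings)  # (n_features, hidden_dim)
        return x @ W                                   # x: (batch, n_features)
```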