Skip to content

gvisona/COSMIC_Embedding

Repository files navigation

Deep Representations of Somatic Mutations in Cancer Samples

This repository describes in detail a methodology for the embedding of somatic mutations profiles in human cancer. With the development of advanced machine learning methods for precision medicine, it is increasingly important to ensure that these novel tools are made availale to researchers whose expertise does not necessarily lie in deep learning and representation learning methods. Deep learning models have the capability to integrate several data sources in a unified framework, and how this data is represented constitutes a significant portion of the experimental design involved.

The present reprository offers a complete implementation of a deep learning model for the embedding of somatic mutations profiles, as well as pre-computed embeddings for a selected set of samples from the COSMIC database, that are readily available.

The following sections describe the approach used to learn representations of cancer samples, the experimental design, and the steps necessary to use or reproduce the results.

Dataset

The following model is trained using the Sanger Catalogue of Somatic Mutations in Cancer (COSMIC) database, which is a curated source of information on somatic mutations in human cancer. We are specifically interested in its vast amount of genome-wide screen data, which is a rich resource for the training of deep learning model.

More specifically, we make use of the "COSMIC Mutation Data (Genome Screens)" dataset that is available in version v92 of the project, which offers information on individual somatic mutations present in cancer samples. To reproduce the results shown in this repository, it is necessary to download and extract this dataset in the root folder of the project.

With the mutation data, we aim to encode each cancer sample as a binary vector where each component signals whether a mutation in the corresponding gene is observed. More accurately, a cancer sample is represented by two such vectors, as we follow the design choices of Palazzo et al. (2019), with which we consider deleterious and non-deleterious mutations separately.

The somatic mutations in the dataset are filtered according to the following quality control criteria:

  • labelled as "Confirmed somatic variant"
  • they must come from a genome-wide screen study
  • mapped onto the GRCh38 reference genome
  • the mutations must correspond to a gene (only coding mutations), i.e. there is an HGNC identifier available
  • we exclude hypermutated samples (>1000 mutation) according to previous works in the literature (Vogelstein et al. (2013))
  • underrepresented genes are removed (gene mutations that appear fewer than 100 times)

The resulting processed dataset includes 5.491.567 somatic mutations in 12.449 unique genes from 26.355 cancer samples.

Model

The model chosen for the embedding of somatic mutation profiles is a bimodal Information-maximizing Variational Autoencoder (InfoVAE), derived from the work of Zhao et al. (2019). The structure of the model, inspired by the work of Palazzo et al. (2019), considers separately deleterious and non-deleterious mutations in the input stage. Contrary to the original work, we split the mutations into these two categories based on the FATHMM prediction (Shabib et al. (2015)) included in the COSMIC dataset.

Bimodal MMD-VAE model

The InfoVAE model relies om Maximum Mean Discrepancy to regularize the latent representations learned by the autoencoder model, which reportedly improves on the quality of the embeddings compared to standard VAEs.

Additionally, we implemented a beta VAE model that is closer to the original method of Palazzo et al. (2019)

Bimodal VAE model

We used a Beta Variational Autoencoder with warmup (Sønderby et al. (2016)), as this configuration allows the model to focus on the reconstruction of the data for the first few epochs while gradually introducting the variational objective.

Phenotype Prediction Evaluation

To evaluate how well the learned embeddings describe the phenotype of the cancer samples, we adopted a similar approach to the work of Amar et al. (2017).

Using information from Disease Ontology, we constructed a hierarchy of cancer subtypes, based on their site of origin and histology, which results in 56 categories. These categories were chosen so that there would be a significant enough number of samples both in the positive and negative class for each of them. If the overlap between a parent and child category was too large, we considered only the parent class.

Hierarchy of cancer subtypes

Each cancer sample is assigned a positive (1) or negative (0) class for each category, while respecting the hierarchical structure. Some of the categories considered (e.g. adenocarcinoma) consist of purely histological subtypes, while most other categories are related to the primary site of origin of the cancer sample. Of the 26.355 samples, 25.645 were assigned at least one category, while the remaining samples were excluded from this analysis.

The embeddings analysed were obtained through the use of an InfoVAE model with latent dimension of 50, thus achieving a compression factor of 498 (12449+12449 -> 50).

We first used a UMAP projection to visualize and observe the patterns shown by the embeddings according to their categories. The plots, available in the folder "embeddings/plots_mmd_vae", show an interesting distribution of the various cancer subtypes in the embedding manifold. Interestingly, it appears that there is a concentration of "liquid tumours" in the right-hand side of the UMAP representation, while gastrointestinal and thoracic cancers are grouped mostly in the left branch.

UMAP embeddings

To evaluate how well the embeddings capture the phenotypic information of the samples, we trained a Support Vector Machine (SVM) classifier for each of the 56 categories to evaluate the prediction accuracy on a validation set. Given the imbalance of categories when using the full dataset, for each cancer subtype, where possible, a set of negative samples equal in size to the positive samples was randomly selected. This smaller subtype-specific dataset was split into training and test sets, and used for the classifier evaluations, by applying a 5-fold cross-validation scheme. The results reported in the following table represent the median value of the ROC-AUC for the SVM classifiers across folds.

As evidenced by the results, the compressed embeddings capture a lot of the information contained in the deleterious mutation profile, while often remaining inferior to the non-deleterious mutations profile for site identification. This suggests that, while tissue-specific mutation pattern might be more effective for site classification, it might not be the determining factor in analysing the characteristics of cancer samples. This is in-line with the paradigm shift in cancer study that prioritizes biomarker information rather than tissue-specific information. These results show that the learned embedding improves classification performance in 24 out of 56 categories, while mantaining a significant amount of information from the original mutational profile in the other cases. Interestingly, the performance of the embeddings varies significantly between cancer subtypes. Some cancer types (e.g. gastrointestinal system cancer) show a marked improvement when using the embeddings, while others (e.g. atrocytoma) show a decrease in performance.

Cancer Subtype N. Pos. Samples Embeddings AUC Xdel AUC Xnd AUC
hepatocellular_carcinoma 849 0.73 0.69 0.65
liver_cancer 1315 0.67 0.67 0.68
colon_adenocarcinoma 575 0.78 0.77 0.76
colon_carcinoma 582 0.78 0.75 0.76
colon_cancer 633 0.79 0.75 0.76
colorectal_cancer 633 0.79 0.8 0.75
intestinal_cancer 1814 0.81 0.77 0.74
esophageal_carcinoma 969 0.8 0.79 0.76
stomach_cancer 906 0.81 0.73 0.75
biliary_tract_cancer 613 0.62 0.61 0.65
gastrointestinal_system_cancer 6255 0.8 0.67 0.66
integumentary_system_cancer 668 0.63 0.77 0.55
diffuse_large_B_cell_lymphoma 256 0.68 0.81 0.75
lymphoma 760 0.74 0.74 0.74
acute_myeloid_leukaemia 942 0.84 0.87 0.84
T_cell_acute_lymphoblastic_leukemia 359 0.81 0.83 0.76
B_cell_acute_lymphoblastic_leukemia 251 0.78 0.75 0.65
lymphoid_leukemia 778 0.75 0.81 0.7
chronic_lymphocytic_leukemia 798 0.83 0.6 0.56
leukemia 3059 0.81 0.73 0.74
immune_system_cancer 3847 0.78 0.75 0.74
breast_ductal_carcinoma 122 0.63 0.77 0.69
breast_cancer 1819 0.62 0.69 0.66
lung_adenocarcinoma 778 0.57 0.63 0.65
lung_squamous_cell_carcinoma 506 0.8 0.79 0.84
lung_cancer 1639 0.6 0.65 0.65
respiratory_system_cancer 2157 0.69 0.63 0.66
thoracic_cancer 3976 0.63 0.63 0.64
ovarian_cancer 643 0.65 0.67 0.65
cervical_cancer 253 0.69 0.77 0.79
endometrial_cancer 399 0.71 0.82 0.68
female_reproductive_organ_cancer 1297 0.67 0.71 0.64
prostate_adenocarcinoma 817 0.64 0.63 0.65
male_reproductive_organ_cancer 1135 0.65 0.66 0.63
reproductive_organ_cancer 2432 0.62 0.63 0.6
pancreatic_ductal_carcinoma 737 0.78 0.87 0.57
pancreatic_cancer 1052 0.72 0.81 0.61
thyroid_gland_cancer 943 0.82 0.81 0.55
adrenal_gland_cancer 318 0.88 0.81 0.76
endocrine_gland_cancer 2403 0.73 0.69 0.63
clear_cell_renal_cell_carcinoma 1161 0.75 0.8 0.71
kidney_cancer 1938 0.72 0.72 0.68
urinary_bladder_cancer 501 0.75 0.83 0.84
urinary_system_cancer 2439 0.63 0.71 0.68
brain_glioma 1451 0.64 0.79 0.66
brain_cancer 1820 0.66 0.71 0.61
central_nervous_system_cancer 2171 0.67 0.73 0.62
nervous_system_cancer 2703 0.71 0.69 0.67
connective_tissue_cancer 505 0.69 0.53 0.58
squamous_cell_carcinoma 2273 0.81 0.69 0.66
adenocarcinoma 4079 0.73 0.65 0.66
carcinoma 16377 0.79 0.75 0.71
astrocytoma 972 0.59 0.74 0.65
malignant_glioma 1656 0.63 0.75 0.64
cancer 25233 0.56 0.54 0.54
benign_neoplasm 416 0.55 0.63 0.64

Overall, the performance shown and the almost 500-fold compression of the data suggest that the use of deep representations of somatic mutation profiles is beneficial for downstream analyses.

Reproducing the Results

To reproduce the results of this project, download and extract the "COSMIC Mutation Data (Genome Screens)" dataset (version v92) in the root folder of the project, and then execute the following files in order:

  1. COSMIC Data Processing.ipynb notebook
  2. mmd_train_model.py
  3. embed_cancer_samples.py
  4. plot_embeddings.py

The files require a python 3 setup with the packages listed in requirements.txt

Embeddings Availability

Pre-computed embeddings are available in the embeddings/mmd_vae_embeddings.csv and embeddings/vae_embeddings.csv files for the InfoVAE and BetaVAE respectively, which contain the 50-dimensional representations for the 25.645 samples for which a cancer subtype was assigned.

Files Description

  • COSMIC Data Processing.ipynb: this jupyter notebook contains all the proedures to process the raw Mutation Data dataset into the formats required by the VAE model
  • mmd_train_model.py and train_model.py: training scripts for the BetaVAE and InfoVAE models respectively
  • mmd_vae_model.py and vae_model.py: pytorch implementations of the Dataset and Modules for the BetaVAE and InfoVAE.
  • embed_cancer_samples.py: script to embed the full dataset using a pre-trained model (resulting from mmd_train_model.py or train_model.py)
  • plot_embeddings.py: script to project the embeddings using UMAP and visualize the cancer subtypes on the manifold
  • embeddings/mmd_vae_embeddings.csv and embeddings/vae_embeddings.csv: 50-dimensional embeddings resulting from the analysis, indexed by the COSMIC Sample ID, and made directly available

Funding

This project has received funding from the European Union's Framework Programme for Research and Innovation Horizon 2020 (2014-2020) under the Marie Skłodowska-Curie Grant Agreement No. 813533-MSCA-ITN-2018

References

  • Amar, D., Izraeli, S., & Shamir, R. (2017). Utilizing somatic mutation data from numerous studies for cancer research: proof of concept and applications. Oncogene, 36(24), 3375-3383
  • Palazzo, M., Beauseroy, P., & Yankilevich, P. (2019). A pan-cancer somatic mutation embedding using autoencoders. BMC bioinformatics, 20(1), 655
  • Shihab, H. A., Rogers, M. F., Gough, J., Mort, M., Cooper, D. N., Day, I. N., ... & Campbell, C. (2015). An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics, 31(10), 1536-1543.
  • Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). Ladder variational autoencoders. In Advances in neural information processing systems (pp. 3738-3746).
  • Vogelstein, B., Papadopoulos, N., Velculescu, V. E., Zhou, S., Diaz, L. A., and Kinzler, K. W. (2013). Cancer genome landscapes
  • Zhao, S., Song, J., & Ermon, S. (2019, July). Infovae: Balancing learning and inference in variational autoencoders. In Proceedings of the aaai conference on artificial intelligence (Vol. 33, pp. 5885-5892).