updating readme [skip ci]

ersilia-os · Nov 9, 2024 · 909974a
1 parent 3a0c836
commit 909974a

Showing 15 changed files with 12,082 additions and 63 deletions.
diff --git a/README.md b/README.md
@@ -1,50 +1,43 @@
-# Ersilia Model Contribution
+# Chemical space 2D projections against DrugBank
 
-This README contains the instructions to incorporate a model. Please follow along to bring your model into the Ersilia Model Hub. After successful incorporation of the model, this README will be automatically updated to reflect model specific details.
+This tool performs PCA, UMAP and tSNE projections taking the DrugBank chemical space as a reference. The Ersilia Compound Embeddings are used as descriptors. Four PCA components and two components each for UMAP and tSNE are returned.
 
-## Folder Structure
+## Identifiers
 
-Generally, two important pieces make up a model that goes into the hub: the model checkpoints, and the code to load and make predictions with that model. With that in mind, the model folder is organised as follows:
+* EOS model ID: `eos9gg2`
+* Slug: `chemical-space-projections-drugbank`
 
-```
-└── model
-    ├── checkpoints
-    │   └── README.md
-    └── framework
-        ├── README.md
-        ├── code
-        │   └── main.py
-        ├── examples
-        │   ├── run_input.csv
-        │   └── run_output.csv
-        └── run.sh
-```
+## Characteristics
 
-- `model/checkpoints` contains checkpoint files required by the model
-- `model/framework` contains the driver code to load the model and run inferences from it. There are two files of interest here: `main.py`, and `run.sh`. The `main.py` file will contain the driver code to load model checkpoints and call its prediction API, while `run.sh` serves two purposes, it runs the code in the `main.py` file and also tells Ersilia that this model server will have a `run` API.
+* Input: `Compound`
+* Input Shape: `Single`
+* Task: `Representation`
+* Output: `Descriptor`
+* Output Type: `Float`
+* Output Shape: `List`
+* Interpretation: Coordinates of 2D projections, namely PCA, UMAP and tSNE.
 
-## Specifying Dependencies
+## References
 
-To specify dependencies for this model, use the `install.yml` file to populate all the necessary dependencies required by the model to successfully run. This dependency configuration file has two top level keys:
+* [Publication](https://academic.oup.com/nar/article/52/D1/D1265/7416367)
+* [Source Code](https://github.com/ersilia-os/compound-embedding)
+* Ersilia contributor: [miquelduranfrigola](https://github.com/miquelduranfrigola)
 
-- `python` which expects a string value denoting a python version (eg `"3.10"`)
-- `commands` which expects a list of values, each of which is a list on its own, denoting the dependencies required by the model. Currently, dependencies `pip` and `conda` are supported. 
-- `pip` dependencies are expected to be three element lists in the format `["pip", "library", "version"]`
-- `conda` dependencies are expected to be four element lists in the format `["conda", "library", "version", "channel"]`, where channel is the conda channel to install the required library.
+## Ersilia model URLs
+* [GitHub](https://github.com/ersilia-os/eos9gg2)
 
-The installation parser will raise an exception if dependencies are not specified in the aforementioned format.
+## Citation
 
-**Note**: Please note that we realise that this form of dependency specification is restrictive. We are [working](https://github.com/ersilia-os/ersilia-pack/issues/21) on extending how Ersilia Pack handles dependency specification, for example, to handle VCS and URL based dependencies. 
+If you use this model, please cite the [original authors](https://academic.oup.com/nar/article/52/D1/D1265/7416367) of the model and the [Ersilia Model Hub](https://github.com/ersilia-os/ersilia/blob/master/CITATION.cff).
 
+## License
 
-## Specifying Model Metadata
+This package is licensed under a GPL-3.0 license. The model contained within this package is licensed under a GPL-3.0-or-later license.
 
-Model metadata should be specified within metadata.yml. An explanation of what these metadata fields correspond to can be found [here.](https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/incorporate-models/model-template#the-metadata.json-file)
+Notice: Ersilia grants access to these models 'as is', as provided by the original authors. Please refer to the original code repository and/or publication if you use the model in your research.
 
-## Specifying Model APIs
+## About Us
 
-A bash script within the `model/framework` directory is interepreted by Ersilia as an API for the model. For example, `run.sh` corresponds to a model `run` API, and similarly, a `fit.sh` would correspond to a model `fit` API. However arbitrary file names for bash script files are not allowed, and the acceptable names are one of the following: [`run`, `fit`]. 
+The [Ersilia Open Source Initiative](https://ersilia.io) is a Non Profit Organization ([1192266](https://register-of-charities.charitycommission.gov.uk/charity-search/-/charity-details/5170657/full-print)) whose mission is to equip labs, universities and clinics in low- and middle-income countries (LMIC) with AI/ML tools for infectious disease research.
 
-## Adding Example Input and Output
-
-It is always helpful to provide an example input and output while contributing a model to ease the verification of the model's working. To ensure all models always have an example, Ersilia checks for example CSV files in the `model/framework/examples` directory. In particular, Ersilia looks for `input.csv`, and `output.csv` files in this folder. These files are used to generate the necessary API end points for building a model server and therefore must always be provided.
+[Help us](https://www.ersilia.io/donate) achieve our mission!
diff --git a/install.yml b/install.yml
@@ -1,4 +1,7 @@
 python: "3.10"
 commands:
-    - ["pip", "rdkit-pypi", "2022.3.1b1"]
-    - ["conda", "pandas", "1.3.5", "default"]
+    - ["pip", "scikit-learn", "1.5.2"]
+    - ["pip", "umap-learn", "0.5.7"]
+    - ["pip", "eosce", "0.2.0"]
+    - ["pip", "openTSNE", "1.0.2"]
+    - ["pip", "numpy", "1.26.4"]
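The `commands` entries above follow the list format that the previous README documented: three elements for `pip` (`["pip", "library", "version"]`) and four for `conda` (`["conda", "library", "version", "channel"]`). The sketch below shows one way such entries could be validated and turned into install commands. `to_install_command` is a hypothetical helper written for illustration; it is not Ersilia Pack's actual parser, and the exact command strings it emits are an assumption.

```python
def to_install_command(entry):
    """Turn one `commands` entry from install.yml into a shell command string.

    Hypothetical helper: it only mirrors the documented list formats.
    """
    if not entry:
        raise ValueError("empty dependency entry")
    manager = entry[0]
    if manager == "pip":
        # Expected shape: ["pip", "library", "version"]
        if len(entry) != 3:
            raise ValueError(f"pip entries need 3 elements, got {entry!r}")
        _, library, version = entry
        return f"pip install {library}=={version}"
    if manager == "conda":
        # Expected shape: ["conda", "library", "version", "channel"]
        if len(entry) != 4:
            raise ValueError(f"conda entries need 4 elements, got {entry!r}")
        _, library, version, channel = entry
        return f"conda install -y -c {channel} {library}={version}"
    raise ValueError(f"unsupported package manager: {manager!r}")


# The five pip dependencies added by this commit.
commands = [
    ["pip", "scikit-learn", "1.5.2"],
    ["pip", "umap-learn", "0.5.7"],
    ["pip", "eosce", "0.2.0"],
    ["pip", "openTSNE", "1.0.2"],
    ["pip", "numpy", "1.26.4"],
]
install_lines = [to_install_command(c) for c in commands]
```

Raising on a malformed entry matches the old README's statement that the installation parser rejects dependencies that do not follow the format.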
diff --git a/metadata.yml b/metadata.yml
@@ -2,19 +2,20 @@ Identifier: eos9gg2
 Slug: chemical-space-projections-drugbank
 Status: In progress
 Title: Chemical space 2D projections against DrugBank
-Description: This tool performs PCA and UMAP projections taking the DrugBank chemical
-  space as a reference. The Ersilia Compound Embeddings as used as descriptors. Four
-  PCA components and two UMAP components are returned.
-Mode: ''
-Task: []
-Input: []
-Input Shape: ''
-Output: []
-Output Type: []
-Output Shape: ''
-Interpretation: ''
+Description: This tool performs PCA, UMAP and tSNE projections taking the DrugBank
+  chemical space as a reference. The Ersilia Compound Embeddings are used as descriptors.
+  Four PCA components and two components each for UMAP and tSNE are returned.
+Mode: In-house
+Task: Representation
+Input: Compound
+Input Shape: Single
+Output: Descriptor
+Output Type: Float
+Output Shape: List
+Interpretation: Coordinates of 2D projections, namely PCA, UMAP and tSNE.
 Tag:
 - Embedding
 Publication: https://academic.oup.com/nar/article/52/D1/D1265/7416367
 Source Code: https://github.com/ersilia-os/compound-embedding
 License: GPL-3.0-or-later
+S3: https://ersilia-models-zipped.s3.eu-central-1.amazonaws.com/eos9gg2.zip
diff --git a/mock.txt b/mock.txt
diff --git a/model/checkpoints/pca.joblib b/model/checkpoints/pca.joblib
diff --git a/model/checkpoints/pca_scaler.joblib b/model/checkpoints/pca_scaler.joblib
diff --git a/model/checkpoints/tsne.joblib b/model/checkpoints/tsne.joblib
diff --git a/model/checkpoints/tsne_scaler.joblib b/model/checkpoints/tsne_scaler.joblib
diff --git a/model/checkpoints/umap.joblib b/model/checkpoints/umap.joblib
diff --git a/model/checkpoints/umap_scaler.joblib b/model/checkpoints/umap_scaler.joblib
diff --git a/model/framework/code/main.py b/model/framework/code/main.py
@@ -1,9 +1,26 @@
 # imports
 import os
-import csv
 import sys
-from rdkit import Chem
-from rdkit.Chem.Descriptors import MolWt
+import csv
+import joblib
+from eosce.models import ErsiliaCompoundEmbeddings
+
+ROOT = os.path.abspath(os.path.dirname(__file__))
+
+# loading methods
+embedder = ErsiliaCompoundEmbeddings()
+f = os.path.join(ROOT, "..", "..", "checkpoints", "pca.joblib")
+pca = joblib.load(f)
+f = os.path.join(ROOT, "..", "..", "checkpoints", "umap.joblib")
+umap = joblib.load(f)
+f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne.joblib")
+tsne = joblib.load(f)
+f = os.path.join(ROOT, "..", "..", "checkpoints", "pca_scaler.joblib")
+pca_scaler = joblib.load(f)
+f = os.path.join(ROOT, "..", "..", "checkpoints", "umap_scaler.joblib")
+umap_scaler = joblib.load(f)
+f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne_scaler.joblib")
+tsne_scaler = joblib.load(f)
 
 # parse arguments
 input_file = sys.argv[1]
@@ -12,19 +29,43 @@
 # current file directory
 root = os.path.dirname(os.path.abspath(__file__))
 
-# my model
-def my_model(smiles_list):
-    return [MolWt(Chem.MolFromSmiles(smi)) for smi in smiles_list]
-
-
 # read SMILES from .csv file, assuming one column with header
 with open(input_file, "r") as f:
     reader = csv.reader(f)
     next(reader)  # skip header
     smiles_list = [r[0] for r in reader]
 
-# run model
-outputs = my_model(smiles_list)
+print(len(smiles_list))
+
+# calculate embeddings
+X = embedder.transform(smiles_list)
+print(X.shape)
+
+# make projections
+X_pca = pca.transform(X)
+X_umap = umap.transform(X_pca)
+X_tsne = tsne.transform(X_pca)
+
+# scale projections
+X_pca = pca_scaler.transform(X_pca[:, :4])
+X_umap = umap_scaler.transform(X_umap)
+X_tsne = tsne_scaler.transform(X_tsne)
+
+# assemble dataset
+columns = [
+    "pca-1",
+    "pca-2",
+    "pca-3",
+    "pca-4",
+    "umap-1",
+    "umap-2",
+    "tsne-1",
+    "tsne-2"
+]
+
+outputs = []
+for i in range(len(smiles_list)):
+    outputs += [list(X_pca[i]) + list(X_umap[i]) + list(X_tsne[i])]
 
 # check input and output have the same length
 input_len = len(smiles_list)
@@ -34,6 +75,6 @@ def my_model(smiles_list):
 # write output in a .csv file
 with open(output_file, "w") as f:
     writer = csv.writer(f)
-    writer.writerow(["value"])  # header
+    writer.writerow(columns)  # header
     for o in outputs:
-        writer.writerow([o])
+        writer.writerow(o)
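The updated `main.py` builds one output row per input compound by concatenating four PCA, two UMAP and two tSNE coordinates, then writes the rows under the eight-column header. The sketch below isolates that assembly step using plain lists in place of the NumPy arrays. `assemble_rows` and `write_csv` are hypothetical names, and the coordinate values are made-up placeholders, not model output.

```python
import csv
import io

COLUMNS = [
    "pca-1", "pca-2", "pca-3", "pca-4",
    "umap-1", "umap-2",
    "tsne-1", "tsne-2",
]


def assemble_rows(x_pca, x_umap, x_tsne):
    """Concatenate per-compound coordinates into one flat row each."""
    if not (len(x_pca) == len(x_umap) == len(x_tsne)):
        raise ValueError("all projections must cover the same compounds")
    rows = [list(p) + list(u) + list(t) for p, u, t in zip(x_pca, x_umap, x_tsne)]
    for row in rows:
        # Each row must line up with the header, otherwise the CSV is misaligned.
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} values, got {len(row)}")
    return rows


def write_csv(rows, handle):
    """Write the header followed by one line per compound."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# Placeholder coordinates for two compounds (illustrative values only).
x_pca = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
x_umap = [[0.0, 1.0], [-1.0, 0.5]]
x_tsne = [[0.25, -0.25], [0.75, -0.75]]

rows = assemble_rows(x_pca, x_umap, x_tsne)
buffer = io.StringIO(newline="")
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

Checking the row width against the header is an extra guard that the committed script does not perform; it only compares the number of inputs and outputs.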
diff --git a/model/framework/examples/input.csv b/model/framework/examples/input.csv
@@ -1 +1,7 @@
-smiles
+smiles
+Cc1cccc(NC(=O)CN2CCC(c3ccccn3)CC2)c1
+Clc1cccc(-c2nnnn2Cc2cccnc2)c1Cl
+CNC(=O)Nc1ccc2c(c1)CC[C@@]21OC(=O)N(CC(=O)N(Cc2ccc(F)cc2)[C@@H](C)C(F)(F)F)C1=O
+Cc1[nH]nc2ccc(-c3cncc(OC[C@@H](N)Cc4ccccc4)c3)cc12
+NCCCCCCCCCCNS(=O)(=O)c1cccc2c(Cl)cccc12
+N#Cc1c(O)c2c(-c3ccc(-c4ccccc4O)cc3)csc2[nH]c1=O
diff --git a/model/framework/examples/output.csv b/model/framework/examples/output.csv
@@ -0,0 +1,7 @@
+pca-1,pca-2,pca-3,pca-4,umap-1,umap-2,tsne-1,tsne-2
+0.93429697,-0.013587162,0.07135894,0.5486406,0.27600825,0.44487536,0.3588257270613559,-0.4519206357833884
+0.9365769,-0.07189764,0.077680945,0.3353931,-0.010398388,0.4344536,-0.08191483717869656,-0.3160242774148638
+0.8938919,-0.10100889,0.16663171,0.5110347,-0.028684974,0.30495894,-0.10819866698245109,-0.262778039363111
+0.9165611,-0.13018623,0.10664324,0.22172563,-0.038452625,0.5460043,-0.13562216426727128,-0.4091770175597786
+0.9100312,-0.059196636,0.09250181,0.32043228,0.21895015,0.4505087,-0.0009358663260258882,-0.6254777418597811
+0.8995627,0.001626417,0.15434188,0.33493558,0.1880995,0.103464246,-0.04592879320774234,0.011337310416822297
diff --git a/model/framework/fit/00_fit.py b/model/framework/fit/00_fit.py
@@ -0,0 +1,62 @@
+import csv
+import os
+from sklearn.decomposition import PCA
+from sklearn.preprocessing import MinMaxScaler
+from umap import UMAP
+from openTSNE import TSNE
+from eosce.models import ErsiliaCompoundEmbeddings
+import joblib
+
+print("Done with imports")
+
+ROOT = os.path.abspath(os.path.dirname(__file__))
+
+with open(os.path.join(ROOT, "data", "drugbank_inchikeys.csv"), "r") as f:
+    reader = csv.reader(f)
+    next(reader)
+    smiles_list = [row[0] for row in reader]
+
+print("Number of SMILES strings:", len(smiles_list))
+
+print("Calculating Ersilia Compound Embeddings...")
+embedder = ErsiliaCompoundEmbeddings()
+X = embedder.transform(smiles_list)
+
+print("Calculating PCA...")
+pca = PCA(n_components=100)
+pca.fit(X)
+X_pca = pca.transform(X)
+
+print("Calculating UMAP...")
+umap = UMAP(n_components=2)
+umap.fit(X_pca)
+X_umap = umap.transform(X_pca)
+
+print("Calculating tSNE")
+tsne = TSNE(n_components=2)
+tsne = tsne.fit(X_pca)
+X_tsne = tsne.transform(X_pca)
+
+print("Saving reducers")
+f = os.path.join(ROOT, "..", "..", "checkpoints", "pca.joblib")
+joblib.dump(pca, f)
+f = os.path.join(ROOT, "..", "..", "checkpoints", "umap.joblib")
+joblib.dump(umap, f)
+f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne.joblib")
+joblib.dump(tsne, f)
+
+print("Scalers")
+pca_scaler = MinMaxScaler(feature_range=(-1, 1))
+umap_scaler = MinMaxScaler(feature_range=(-1, 1))
+tsne_scaler = MinMaxScaler(feature_range=(-1, 1))
+pca_scaler.fit(X_pca[:, :4])
+umap_scaler.fit(X_umap)
+tsne_scaler.fit(X_tsne)
+
+print("Saving scalers")
+f = os.path.join(ROOT, "..", "..", "checkpoints", "pca_scaler.joblib")
+joblib.dump(pca_scaler, f)
+f = os.path.join(ROOT, "..", "..", "checkpoints", "umap_scaler.joblib")
+joblib.dump(umap_scaler, f)
+f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne_scaler.joblib")
+joblib.dump(tsne_scaler, f)
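The scalers saved above are fit on the DrugBank projections with `feature_range=(-1, 1)`, and `main.py` later applies them to new compounds. The sketch below reproduces the per-column min-max arithmetic in plain Python to show what that implies for a query compound. `fit_min_max` and `scale` are hypothetical names, the numbers are illustrative, and the handling of a constant column is an assumption of this sketch rather than a description of scikit-learn internals.

```python
def fit_min_max(reference):
    """Record the per-column minimum and maximum of the reference points."""
    columns = list(zip(*reference))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(point, mins, maxs, low=-1.0, high=1.0):
    """Map each coordinate linearly so the reference range lands on [low, high]."""
    scaled = []
    for value, col_min, col_max in zip(point, mins, maxs):
        span = col_max - col_min
        if span == 0:
            span = 1.0  # assumption: avoid division by zero for a constant column
        scaled.append(low + (value - col_min) / span * (high - low))
    return scaled


# Illustrative two-dimensional reference set standing in for DrugBank projections.
reference = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
mins, maxs = fit_min_max(reference)

inside = scale([2.0, 20.0], mins, maxs)   # centre of the reference range
outside = scale([6.0, 10.0], mins, maxs)  # first coordinate exceeds the reference maximum
```

Because the minimum and maximum come only from the reference set, a query compound that projects beyond the DrugBank range gets coordinates outside [-1, 1]. scikit-learn's `MinMaxScaler` behaves the same way unless `clip=True` is set.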