updating readme [skip ci]
miquelduranfrigola authored and ersilia-bot committed Nov 9, 2024
1 parent 3a0c836 commit 909974a
Showing 15 changed files with 12,082 additions and 63 deletions.
61 changes: 27 additions & 34 deletions README.md
@@ -1,50 +1,43 @@
# Ersilia Model Contribution
# Chemical space 2D projections against DrugBank

This README contains the instructions to incorporate a model. Please follow along to bring your model into the Ersilia Model Hub. After successful incorporation of the model, this README will be automatically updated to reflect model specific details.
This tool performs PCA, UMAP and tSNE projections taking the DrugBank chemical space as a reference. The Ersilia Compound Embeddings are used as descriptors. Four PCA components and two UMAP and tSNE components are returned.

## Folder Structure
## Identifiers

Generally, two important pieces make up a model that goes into the hub: the model checkpoints, and the code to load and make predictions with that model. With that in mind, the model folder is organised as follows:
* EOS model ID: `eos9gg2`
* Slug: `chemical-space-projections-drugbank`

```
└── model
    ├── checkpoints
    │   └── README.md
    └── framework
        ├── README.md
        ├── code
        │   └── main.py
        ├── examples
        │   ├── run_input.csv
        │   └── run_output.csv
        └── run.sh
```
## Characteristics

- `model/checkpoints` contains checkpoint files required by the model
- `model/framework` contains the driver code to load the model and run inferences from it. There are two files of interest here: `main.py` and `run.sh`. The `main.py` file contains the driver code to load the model checkpoints and call the model's prediction API. The `run.sh` file serves two purposes: it runs the code in `main.py`, and it tells Ersilia that this model server will have a `run` API.
* Input: `Compound`
* Input Shape: `Single`
* Task: `Representation`
* Output: `Descriptor`
* Output Type: `Float`
* Output Shape: `List`
* Interpretation: Coordinates of 2D projections, namely PCA, UMAP and tSNE.

## Specifying Dependencies
## References

To specify dependencies for this model, use the `install.yml` file to populate all the necessary dependencies required by the model to successfully run. This dependency configuration file has two top level keys:
* [Publication](https://academic.oup.com/nar/article/52/D1/D1265/7416367)
* [Source Code](https://github.com/ersilia-os/compound-embedding)
* Ersilia contributor: [miquelduranfrigola](https://github.com/miquelduranfrigola)

- `python` which expects a string value denoting a python version (eg `"3.10"`)
- `commands` which expects a list of values, each of which is a list on its own, denoting the dependencies required by the model. Currently, dependencies `pip` and `conda` are supported.
- `pip` dependencies are expected to be three element lists in the format `["pip", "library", "version"]`
- `conda` dependencies are expected to be four element lists in the format `["conda", "library", "version", "channel"]`, where channel is the conda channel to install the required library.
## Ersilia model URLs
* [GitHub](https://github.com/ersilia-os/eos9gg2)

The installation parser will raise an exception if dependencies are not specified in the aforementioned format.
## Citation

**Note**: We realise that this form of dependency specification is restrictive. We are [working](https://github.com/ersilia-os/ersilia-pack/issues/21) on extending how Ersilia Pack handles dependency specification, for example, to handle VCS and URL based dependencies.
If you use this model, please cite the [original authors](https://academic.oup.com/nar/article/52/D1/D1265/7416367) of the model and the [Ersilia Model Hub](https://github.com/ersilia-os/ersilia/blob/master/CITATION.cff).

## License

## Specifying Model Metadata
This package is licensed under a GPL-3.0 license. The model contained within this package is licensed under a GPL-3.0-or-later license.

Model metadata should be specified within metadata.yml. An explanation of what these metadata fields correspond to can be found [here.](https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/incorporate-models/model-template#the-metadata.json-file)
Notice: Ersilia grants access to these models 'as is' provided by the original authors, please refer to the original code repository and/or publication if you use the model in your research.

## Specifying Model APIs
## About Us

A bash script within the `model/framework` directory is interpreted by Ersilia as an API for the model. For example, `run.sh` corresponds to a model `run` API and, similarly, `fit.sh` would correspond to a model `fit` API. However, arbitrary file names for bash scripts are not allowed; the acceptable names are `run` and `fit`.
The [Ersilia Open Source Initiative](https://ersilia.io) is a Non Profit Organization ([1192266](https://register-of-charities.charitycommission.gov.uk/charity-search/-/charity-details/5170657/full-print)) with the mission to equip labs, universities and clinics in LMIC with AI/ML tools for infectious disease research.

## Adding Example Input and Output

It is always helpful to provide an example input and output while contributing a model to ease the verification of the model's working. To ensure all models always have an example, Ersilia checks for example CSV files in the `model/framework/examples` directory. In particular, Ersilia looks for `input.csv`, and `output.csv` files in this folder. These files are used to generate the necessary API end points for building a model server and therefore must always be provided.
[Help us](https://www.ersilia.io/donate) achieve our mission!
7 changes: 5 additions & 2 deletions install.yml
@@ -1,4 +1,7 @@
python: "3.10"
commands:
- ["pip", "rdkit-pypi", "2022.3.1b1"]
- ["conda", "pandas", "1.3.5", "default"]
- ["pip", "scikit-learn", "1.5.2"]
- ["pip", "umap-learn", "0.5.7"]
- ["pip", "eosce", "0.2.0"]
- ["pip", "openTSNE", "1.0.2"]
- ["pip", "numpy", "1.26.4"]
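The entry format described in the README (three-element `pip` lists, four-element `conda` lists) can be illustrated with a small sketch. This is not Ersilia Pack's actual parser, which is not part of this commit: `to_install_command` is a hypothetical helper, and the exact shell commands it produces are an assumption made for illustration.

```python
def to_install_command(entry):
    """Turn one install.yml entry into a shell command string.

    Hypothetical helper for illustration; not the Ersilia Pack implementation.
    """
    if not isinstance(entry, list) or not entry:
        raise ValueError(f"Malformed entry: {entry!r}")
    kind = entry[0]
    if kind == "pip" and len(entry) == 3:
        _, library, version = entry
        return f"pip install {library}=={version}"
    if kind == "conda" and len(entry) == 4:
        _, library, version, channel = entry
        return f"conda install -y -c {channel} {library}={version}"
    # Anything else does not match the two documented formats.
    raise ValueError(f"Malformed entry: {entry!r}")


# The pip entries listed in install.yml in this commit.
commands = [
    ["pip", "scikit-learn", "1.5.2"],
    ["pip", "umap-learn", "0.5.7"],
    ["pip", "eosce", "0.2.0"],
    ["pip", "openTSNE", "1.0.2"],
    ["pip", "numpy", "1.26.4"],
]

install_lines = [to_install_command(c) for c in commands]
for line in install_lines:
    print(line)
```

Rejecting entries of the wrong length mirrors the README's statement that the installation parser raises an exception for entries that do not follow the documented format.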
23 changes: 12 additions & 11 deletions metadata.yml
@@ -2,19 +2,20 @@ Identifier: eos9gg2
Slug: chemical-space-projections-drugbank
Status: In progress
Title: Chemical space 2D projections against DrugBank
Description: This tool performs PCA and UMAP projections taking the DrugBank chemical
  space as a reference. The Ersilia Compound Embeddings are used as descriptors. Four
  PCA components and two UMAP components are returned.
Mode: ''
Task: []
Input: []
Input Shape: ''
Output: []
Output Type: []
Output Shape: ''
Interpretation: ''
Description: This tool performs PCA, UMAP and tSNE projections taking the DrugBank
  chemical space as a reference. The Ersilia Compound Embeddings are used as descriptors.
  Four PCA components and two UMAP and tSNE components are returned.
Mode: In-house
Task: Representation
Input: Compound
Input Shape: Single
Output: Descriptor
Output Type: Float
Output Shape: List
Interpretation: Coordinates of 2D projections, namely PCA, UMAP and tSNE.
Tag:
- Embedding
Publication: https://academic.oup.com/nar/article/52/D1/D1265/7416367
Source Code: https://github.com/ersilia-os/compound-embedding
License: GPL-3.0-or-later
S3: https://ersilia-models-zipped.s3.eu-central-1.amazonaws.com/eos9gg2.zip
3 changes: 0 additions & 3 deletions mock.txt

This file was deleted.

Binary file added model/checkpoints/pca.joblib
Binary file not shown.
Binary file added model/checkpoints/pca_scaler.joblib
Binary file not shown.
Binary file added model/checkpoints/tsne.joblib
Binary file not shown.
Binary file added model/checkpoints/tsne_scaler.joblib
Binary file not shown.
Binary file added model/checkpoints/umap.joblib
Binary file not shown.
Binary file added model/checkpoints/umap_scaler.joblib
Binary file not shown.
65 changes: 53 additions & 12 deletions model/framework/code/main.py
@@ -1,9 +1,26 @@
# imports
import os
import csv
import sys
from rdkit import Chem
from rdkit.Chem.Descriptors import MolWt
import csv
import joblib
from eosce.models import ErsiliaCompoundEmbeddings

ROOT = os.path.abspath(os.path.dirname(__file__))

# loading methods
embedder = ErsiliaCompoundEmbeddings()
f = os.path.join(ROOT, "..", "..", "checkpoints", "pca.joblib")
pca = joblib.load(f)
f = os.path.join(ROOT, "..", "..", "checkpoints", "umap.joblib")
umap = joblib.load(f)
f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne.joblib")
tsne = joblib.load(f)
f = os.path.join(ROOT, "..", "..", "checkpoints", "pca_scaler.joblib")
pca_scaler = joblib.load(f)
f = os.path.join(ROOT, "..", "..", "checkpoints", "umap_scaler.joblib")
umap_scaler = joblib.load(f)
f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne_scaler.joblib")
tsne_scaler = joblib.load(f)

# parse arguments
input_file = sys.argv[1]
@@ -12,19 +29,43 @@
# current file directory
root = os.path.dirname(os.path.abspath(__file__))

# my model
def my_model(smiles_list):
    return [MolWt(Chem.MolFromSmiles(smi)) for smi in smiles_list]


# read SMILES from .csv file, assuming one column with header
with open(input_file, "r") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    smiles_list = [r[0] for r in reader]

# run model
outputs = my_model(smiles_list)
print(len(smiles_list))

# calculate embeddings
X = embedder.transform(smiles_list)
print(X.shape)

# make projections
X_pca = pca.transform(X)
X_umap = umap.transform(X_pca)
X_tsne = tsne.transform(X_pca)

# scale projections
X_pca = pca_scaler.transform(X_pca[:, :4])
X_umap = umap_scaler.transform(X_umap)
X_tsne = tsne_scaler.transform(X_tsne)

# assemble dataset
columns = [
    "pca-1",
    "pca-2",
    "pca-3",
    "pca-4",
    "umap-1",
    "umap-2",
    "tsne-1",
    "tsne-2"
]

outputs = []
for i in range(len(smiles_list)):
    outputs += [list(X_pca[i]) + list(X_umap[i]) + list(X_tsne[i])]

# check that input and output have the same length
input_len = len(smiles_list)
@@ -34,6 +75,6 @@ def my_model(smiles_list):
# write output in a .csv file
with open(output_file, "w") as f:
    writer = csv.writer(f)
    writer.writerow(["value"])  # header
    writer.writerow(columns)  # header
    for o in outputs:
        writer.writerow([o])
        writer.writerow(o)
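The last hunk changes `writer.writerow([o])` to `writer.writerow(o)`. The sketch below shows why this matters once each output is a list of eight numbers rather than a single value. The numeric values are made up and `to_csv` is a hypothetical helper; only the column names and the row assembly come from the diff.

```python
import csv
import io

columns = ["pca-1", "pca-2", "pca-3", "pca-4", "umap-1", "umap-2", "tsne-1", "tsne-2"]

# Made-up values standing in for the scaled projections of two compounds.
X_pca = [[0.9, -0.1, 0.1, 0.5], [0.8, 0.0, 0.2, 0.3]]
X_umap = [[0.2, 0.4], [-0.1, 0.5]]
X_tsne = [[0.3, -0.4], [0.0, -0.6]]

# Same row assembly as in main.py: concatenate the three projections per compound.
outputs = [list(p) + list(u) + list(t) for p, u, t in zip(X_pca, X_umap, X_tsne)]


def to_csv(rows, flat):
    """Write rows to an in-memory CSV, either as writerow(o) or writerow([o])."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow(row if flat else [row])
    return buffer.getvalue()


flat_csv = to_csv(outputs, flat=True)     # one cell per projection coordinate
nested_csv = to_csv(outputs, flat=False)  # whole list stringified into one cell

print(flat_csv)
print(nested_csv)
```

With `writerow([o])`, the whole list is converted to its string form and lands in a single quoted cell, so the row no longer lines up with the eight-column header.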
8 changes: 7 additions & 1 deletion model/framework/examples/input.csv
@@ -1 +1,7 @@
smiles
smiles
Cc1cccc(NC(=O)CN2CCC(c3ccccn3)CC2)c1
Clc1cccc(-c2nnnn2Cc2cccnc2)c1Cl
CNC(=O)Nc1ccc2c(c1)CC[C@@]21OC(=O)N(CC(=O)N(Cc2ccc(F)cc2)[C@@H](C)C(F)(F)F)C1=O
Cc1[nH]nc2ccc(-c3cncc(OC[C@@H](N)Cc4ccccc4)c3)cc12
NCCCCCCCCCCNS(=O)(=O)c1cccc2c(Cl)cccc12
N#Cc1c(O)c2c(-c3ccc(-c4ccccc4O)cc3)csc2[nH]c1=O
7 changes: 7 additions & 0 deletions model/framework/examples/output.csv
@@ -0,0 +1,7 @@
pca-1,pca-2,pca-3,pca-4,umap-1,umap-2,tsne-1,tsne-2
0.93429697,-0.013587162,0.07135894,0.5486406,0.27600825,0.44487536,0.3588257270613559,-0.4519206357833884
0.9365769,-0.07189764,0.077680945,0.3353931,-0.010398388,0.4344536,-0.08191483717869656,-0.3160242774148638
0.8938919,-0.10100889,0.16663171,0.5110347,-0.028684974,0.30495894,-0.10819866698245109,-0.262778039363111
0.9165611,-0.13018623,0.10664324,0.22172563,-0.038452625,0.5460043,-0.13562216426727128,-0.4091770175597786
0.9100312,-0.059196636,0.09250181,0.32043228,0.21895015,0.4505087,-0.0009358663260258882,-0.6254777418597811
0.8995627,0.001626417,0.15434188,0.33493558,0.1880995,0.103464246,-0.04592879320774234,0.011337310416822297
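A consumer of this output might want the eight columns grouped back by projection method. The sketch below does this for the first example row using only the standard library. `group_projections` is a hypothetical helper, and the grouping rule (split the column name at the last hyphen) is an assumption based on the header shown above.

```python
import csv
import io

# Header and first data row copied from the example output.csv in this commit.
example = (
    "pca-1,pca-2,pca-3,pca-4,umap-1,umap-2,tsne-1,tsne-2\n"
    "0.93429697,-0.013587162,0.07135894,0.5486406,0.27600825,0.44487536,"
    "0.3588257270613559,-0.4519206357833884\n"
)


def group_projections(row):
    """Group the columns of one output row by projection method (hypothetical helper)."""
    grouped = {}
    for name, value in row.items():
        method, _ = name.rsplit("-", 1)  # "pca-1" -> "pca"
        grouped.setdefault(method, []).append(float(value))
    return grouped


rows = [group_projections(r) for r in csv.DictReader(io.StringIO(example))]
print(rows[0])
```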
62 changes: 62 additions & 0 deletions model/framework/fit/00_fit.py
@@ -0,0 +1,62 @@
import csv
import os
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from umap import UMAP
from openTSNE import TSNE
from eosce.models import ErsiliaCompoundEmbeddings
import joblib

print("Done with imports")

ROOT = os.path.abspath(os.path.dirname(__file__))

with open(os.path.join(ROOT, "data", "drugbank_inchikeys.csv"), "r") as f:
    reader = csv.reader(f)
    next(reader)
    smiles_list = [row[0] for row in reader]

print("Number of SMILES strings:", len(smiles_list))

print("Calculating Ersilia Compound Embeddings...")
embedder = ErsiliaCompoundEmbeddings()
X = embedder.transform(smiles_list)

print("Calculating PCA...")
pca = PCA(n_components=100)
pca.fit(X)
X_pca = pca.transform(X)

print("Calculating UMAP...")
umap = UMAP(n_components=2)
umap.fit(X_pca)
X_umap = umap.transform(X_pca)

print("Calculating tSNE")
tsne = TSNE(n_components=2)
tsne = tsne.fit(X_pca)
X_tsne = tsne.transform(X_pca)

print("Saving reducers")
f = os.path.join(ROOT, "..", "..", "checkpoints", "pca.joblib")
joblib.dump(pca, f)
f = os.path.join(ROOT, "..", "..", "checkpoints", "umap.joblib")
joblib.dump(umap, f)
f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne.joblib")
joblib.dump(tsne, f)

print("Scalers")
pca_scaler = MinMaxScaler(feature_range=(-1, 1))
umap_scaler = MinMaxScaler(feature_range=(-1, 1))
tsne_scaler = MinMaxScaler(feature_range=(-1, 1))
pca_scaler.fit(X_pca[:, :4])
umap_scaler.fit(X_umap)
tsne_scaler.fit(X_tsne)

print("Saving scalers")
f = os.path.join(ROOT, "..", "..", "checkpoints", "pca_scaler.joblib")
joblib.dump(pca_scaler, f)
f = os.path.join(ROOT, "..", "..", "checkpoints", "umap_scaler.joblib")
joblib.dump(umap_scaler, f)
f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne_scaler.joblib")
joblib.dump(tsne_scaler, f)
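The scalers above are fitted on the DrugBank projections with `feature_range=(-1, 1)`. The pure-Python sketch below reproduces the min-max formula, `(x - min) / (max - min) * (hi - lo) + lo`, on made-up numbers. It illustrates one consequence: a compound whose projection falls outside the reference range is mapped outside [-1, 1], since scikit-learn's `MinMaxScaler` does not clip by default. `fit_min_max` is a hypothetical stand-in, not the scikit-learn class, and it assumes no reference column is constant.

```python
def fit_min_max(reference, feature_range=(-1.0, 1.0)):
    """Fit per-column min-max scaling on reference rows and return a transform function.

    Hypothetical stand-in for MinMaxScaler; assumes max > min in every column.
    """
    lo, hi = feature_range
    columns = list(zip(*reference))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]

    def transform(rows):
        return [
            [(v - mn) / (mx - mn) * (hi - lo) + lo for v, mn, mx in zip(row, mins, maxs)]
            for row in rows
        ]

    return transform


# Made-up 2D reference projection (three points) standing in for DrugBank.
reference = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
scale = fit_min_max(reference)

scaled_reference = scale(reference)      # reference points span the full target range
scaled_new = scale([[12.0, 20.0]])       # first coordinate lies beyond the reference maximum

print(scaled_reference)
print(scaled_new)
```

Values outside [-1, 1] in the model output are therefore not necessarily errors; they may indicate compounds projected beyond the region covered by the reference set.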