generated from ersilia-os/eos-template
Commit 909974a (1 parent: 3a0c836)
Showing 15 changed files with 12,082 additions and 63 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,50 +1,43 @@ | ||
# Ersilia Model Contribution | ||
# Chemical space 2D projections against DrugBank | ||
|
||
This README contains the instructions to incorporate a model. Please follow along to bring your model into the Ersilia Model Hub. After successful incorporation of the model, this README will be automatically updated to reflect model specific details. | ||
This tool performs PCA, UMAP and tSNE projections taking the DrugBank chemical space as a reference. The Ersilia Compound Embeddings are used as descriptors. Four PCA components, two UMAP components and two tSNE components are returned. | ||
|
||
## Folder Structure | ||
## Identifiers | ||
|
||
Generally, two important pieces make up a model that goes into the hub: the model checkpoints, and the code to load and make predictions with that model. With that in mind, the model folder is organised as follows: | ||
* EOS model ID: `eos9gg2` | ||
* Slug: `chemical-space-projections-drugbank` | ||
|
||
``` | ||
└── model | ||
├── checkpoints | ||
│ └── README.md | ||
└── framework | ||
├── README.md | ||
├── code | ||
│ └── main.py | ||
├── examples | ||
│ ├── run_input.csv | ||
│ └── run_output.csv | ||
└── run.sh | ||
``` | ||
## Characteristics | ||
|
||
- `model/checkpoints` contains checkpoint files required by the model | ||
- `model/framework` contains the driver code to load the model and run inferences from it. There are two files of interest here: `main.py` and `run.sh`. The `main.py` file contains the driver code to load model checkpoints and call its prediction API, while `run.sh` serves two purposes: it runs the code in `main.py`, and it tells Ersilia that this model server will have a `run` API. | ||
* Input: `Compound` | ||
* Input Shape: `Single` | ||
* Task: `Representation` | ||
* Output: `Descriptor` | ||
* Output Type: `Float` | ||
* Output Shape: `List` | ||
* Interpretation: Coordinates of 2D projections, namely PCA, UMAP and tSNE. | ||
|
||
## Specifying Dependencies | ||
## References | ||
|
||
To specify dependencies for this model, use the `install.yml` file to populate all the necessary dependencies required by the model to successfully run. This dependency configuration file has two top level keys: | ||
* [Publication](https://academic.oup.com/nar/article/52/D1/D1265/7416367) | ||
* [Source Code](https://github.com/ersilia-os/compound-embedding) | ||
* Ersilia contributor: [miquelduranfrigola](https://github.com/miquelduranfrigola) | ||
|
||
- `python` which expects a string value denoting a python version (eg `"3.10"`) | ||
- `commands` which expects a list of values, each of which is a list on its own, denoting the dependencies required by the model. Currently, dependencies `pip` and `conda` are supported. | ||
- `pip` dependencies are expected to be three element lists in the format `["pip", "library", "version"]` | ||
- `conda` dependencies are expected to be four element lists in the format `["conda", "library", "version", "channel"]`, where channel is the conda channel to install the required library. | ||
## Ersilia model URLs | ||
* [GitHub](https://github.com/ersilia-os/eos9gg2) | ||
|
||
The installation parser will raise an exception if dependencies are not specified in the aforementioned format. | ||
## Citation | ||
|
||
**Note**: Please note that we realise that this form of dependency specification is restrictive. We are [working](https://github.com/ersilia-os/ersilia-pack/issues/21) on extending how Ersilia Pack handles dependency specification, for example, to handle VCS and URL based dependencies. | ||
If you use this model, please cite the [original authors](https://academic.oup.com/nar/article/52/D1/D1265/7416367) of the model and the [Ersilia Model Hub](https://github.com/ersilia-os/ersilia/blob/master/CITATION.cff). | ||
|
||
## License | ||
|
||
## Specifying Model Metadata | ||
This package is licensed under a GPL-3.0 license. The model contained within this package is licensed under a GPL-3.0-or-later license. | ||
|
||
Model metadata should be specified within metadata.yml. An explanation of what these metadata fields correspond to can be found [here.](https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/incorporate-models/model-template#the-metadata.json-file) | ||
Notice: Ersilia grants access to these models 'as is' provided by the original authors, please refer to the original code repository and/or publication if you use the model in your research. | ||
|
||
## Specifying Model APIs | ||
## About Us | ||
|
||
A bash script within the `model/framework` directory is interpreted by Ersilia as an API for the model. For example, `run.sh` corresponds to a model `run` API, and similarly, a `fit.sh` would correspond to a model `fit` API. However, arbitrary names for bash script files are not allowed; the acceptable names are one of the following: [`run`, `fit`]. | ||
The [Ersilia Open Source Initiative](https://ersilia.io) is a Non Profit Organization ([1192266](https://register-of-charities.charitycommission.gov.uk/charity-search/-/charity-details/5170657/full-print)) with the mission to equip labs, universities and clinics in LMIC with AI/ML tools for infectious disease research. | ||
|
||
## Adding Example Input and Output | ||
|
||
It is always helpful to provide an example input and output while contributing a model to ease the verification of the model's working. To ensure all models always have an example, Ersilia checks for example CSV files in the `model/framework/examples` directory. In particular, Ersilia looks for `input.csv`, and `output.csv` files in this folder. These files are used to generate the necessary API end points for building a model server and therefore must always be provided. | ||
[Help us](https://www.ersilia.io/donate) achieve our mission! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,7 @@ | ||
python: "3.10" | ||
commands: | ||
- ["pip", "rdkit-pypi", "2022.3.1b1"] | ||
- ["conda", "pandas", "1.3.5", "default"] | ||
- ["pip", "scikit-learn", "1.5.2"] | ||
- ["pip", "umap-learn", "0.5.7"] | ||
- ["pip", "eosce", "0.2.0"] | ||
- ["pip", "openTSNE", "1.0.2"] | ||
- ["pip", "numpy", "1.26.4"] |
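The template README describes the expected shape of each `commands` entry: three elements for `pip` and four for `conda`. As a rough illustration of that rule, the sketch below checks a list of entries against it. The function name `validate_commands` and the error messages are hypothetical; this is not the Ersilia Pack parser, only a minimal stand-in that assumes the format stated in the README.

```python
from typing import Sequence

# Installer name -> expected number of elements, as described in the README.
EXPECTED_LENGTH = {"pip": 3, "conda": 4}


def validate_commands(commands: Sequence[Sequence[str]]) -> None:
    """Raise ValueError if an entry deviates from the documented shape.

    Hypothetical helper for illustration only.
    """
    for entry in commands:
        if not entry or entry[0] not in EXPECTED_LENGTH:
            raise ValueError(f"Unsupported installer in entry: {entry!r}")
        expected = EXPECTED_LENGTH[entry[0]]
        if len(entry) != expected:
            raise ValueError(
                f"{entry[0]} entries need {expected} elements, got {entry!r}"
            )
        if not all(isinstance(item, str) and item for item in entry):
            raise ValueError(f"All elements must be non-empty strings: {entry!r}")


# Entries copied from the updated install.yml in this commit.
commands = [
    ["pip", "scikit-learn", "1.5.2"],
    ["pip", "umap-learn", "0.5.7"],
    ["pip", "eosce", "0.2.0"],
    ["pip", "openTSNE", "1.0.2"],
    ["pip", "numpy", "1.26.4"],
]
validate_commands(commands)  # passes silently when every entry is well formed
```

A `conda` entry would carry the channel as its fourth element, for example `["conda", "pandas", "1.3.5", "default"]` from the previous version of the file.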
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,7 @@ | ||
smiles | ||
smiles | ||
Cc1cccc(NC(=O)CN2CCC(c3ccccn3)CC2)c1 | ||
Clc1cccc(-c2nnnn2Cc2cccnc2)c1Cl | ||
CNC(=O)Nc1ccc2c(c1)CC[C@@]21OC(=O)N(CC(=O)N(Cc2ccc(F)cc2)[C@@H](C)C(F)(F)F)C1=O | ||
Cc1[nH]nc2ccc(-c3cncc(OC[C@@H](N)Cc4ccccc4)c3)cc12 | ||
NCCCCCCCCCCNS(=O)(=O)c1cccc2c(Cl)cccc12 | ||
N#Cc1c(O)c2c(-c3ccc(-c4ccccc4O)cc3)csc2[nH]c1=O |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
pca-1,pca-2,pca-3,pca-4,umap-1,umap-2,tsne-1,tsne-2 | ||
0.93429697,-0.013587162,0.07135894,0.5486406,0.27600825,0.44487536,0.3588257270613559,-0.4519206357833884 | ||
0.9365769,-0.07189764,0.077680945,0.3353931,-0.010398388,0.4344536,-0.08191483717869656,-0.3160242774148638 | ||
0.8938919,-0.10100889,0.16663171,0.5110347,-0.028684974,0.30495894,-0.10819866698245109,-0.262778039363111 | ||
0.9165611,-0.13018623,0.10664324,0.22172563,-0.038452625,0.5460043,-0.13562216426727128,-0.4091770175597786 | ||
0.9100312,-0.059196636,0.09250181,0.32043228,0.21895015,0.4505087,-0.0009358663260258882,-0.6254777418597811 | ||
0.8995627,0.001626417,0.15434188,0.33493558,0.1880995,0.103464246,-0.04592879320774234,0.011337310416822297 |
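Each output row holds eight floats: four PCA coordinates followed by two UMAP and two tSNE coordinates. The sketch below shows one way a consumer might split such a row into named groups using only the standard library. The function name `split_projections` is hypothetical, and the column order is assumed to match the header shown in the example output.

```python
import csv
import io

# Header of the example output file in this commit.
EXPECTED_HEADER = [
    "pca-1", "pca-2", "pca-3", "pca-4",
    "umap-1", "umap-2", "tsne-1", "tsne-2",
]


def split_projections(csv_text):
    """Parse output CSV text into one dict of coordinate lists per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"Unexpected header: {reader.fieldnames!r}")
    rows = []
    for record in reader:
        values = {name: float(text) for name, text in record.items()}
        rows.append({
            "pca": [values[f"pca-{i}"] for i in range(1, 5)],
            "umap": [values[f"umap-{i}"] for i in range(1, 3)],
            "tsne": [values[f"tsne-{i}"] for i in range(1, 3)],
        })
    return rows


# First data row of the example output, reproduced verbatim.
example = "\n".join([
    ",".join(EXPECTED_HEADER),
    "0.93429697,-0.013587162,0.07135894,0.5486406,"
    "0.27600825,0.44487536,0.3588257270613559,-0.4519206357833884",
])
projections = split_projections(example)
print(projections[0]["umap"])
```

Grouping by method keeps plotting code simple: the `umap` and `tsne` lists can be used directly as x/y pairs, while `pca` exposes all four components.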
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
import csv | ||
import os | ||
from sklearn.decomposition import PCA | ||
from sklearn.preprocessing import MinMaxScaler | ||
from umap import UMAP | ||
from openTSNE import TSNE | ||
from eosce.models import ErsiliaCompoundEmbeddings | ||
import joblib | ||
|
||
print("Done with imports") | ||
|
||
ROOT = os.path.abspath(os.path.dirname(__file__)) | ||
|
||
with open(os.path.join(ROOT, "data", "drugbank_inchikeys.csv"), "r") as f: | ||
    reader = csv.reader(f) | ||
    next(reader) | ||
    smiles_list = [row[0] for row in reader] | ||
|
||
print("Number of SMILES strings:", len(smiles_list)) | ||
|
||
print("Calculating Ersilia Compound Embeddings...") | ||
embedder = ErsiliaCompoundEmbeddings() | ||
X = embedder.transform(smiles_list) | ||
|
||
print("Calculating PCA...") | ||
pca = PCA(n_components=100) | ||
pca.fit(X) | ||
X_pca = pca.transform(X) | ||
|
||
print("Calculating UMAP...") | ||
umap = UMAP(n_components=2) | ||
umap.fit(X_pca) | ||
X_umap = umap.transform(X_pca) | ||
|
||
print("Calculating tSNE") | ||
tsne = TSNE(n_components=2) | ||
tsne = tsne.fit(X_pca) | ||
X_tsne = tsne.transform(X_pca) | ||
|
||
print("Saving reducers") | ||
f = os.path.join(ROOT, "..", "..", "checkpoints", "pca.joblib") | ||
joblib.dump(pca, f) | ||
f = os.path.join(ROOT, "..", "..", "checkpoints", "umap.joblib") | ||
joblib.dump(umap, f) | ||
f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne.joblib") | ||
joblib.dump(tsne, f) | ||
|
||
print("Scalers") | ||
pca_scaler = MinMaxScaler(feature_range=(-1, 1)) | ||
umap_scaler = MinMaxScaler(feature_range=(-1, 1)) | ||
tsne_scaler = MinMaxScaler(feature_range=(-1, 1)) | ||
pca_scaler.fit(X_pca[:, :4]) | ||
umap_scaler.fit(X_umap) | ||
tsne_scaler.fit(X_tsne) | ||
|
||
print("Saving scalers") | ||
f = os.path.join(ROOT, "..", "..", "checkpoints", "pca_scaler.joblib") | ||
joblib.dump(pca_scaler, f) | ||
f = os.path.join(ROOT, "..", "..", "checkpoints", "umap_scaler.joblib") | ||
joblib.dump(umap_scaler, f) | ||
f = os.path.join(ROOT, "..", "..", "checkpoints", "tsne_scaler.joblib") | ||
joblib.dump(tsne_scaler, f) |
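The script fits each `MinMaxScaler` with `feature_range=(-1, 1)` on the DrugBank reference coordinates, so the reference set spans exactly -1 to 1 on every axis. The pure-Python sketch below illustrates the arithmetic behind that scaling. It is not scikit-learn's implementation; the helper names `fit_min_max` and `scale` are hypothetical, and mapping a constant column to the lower bound is an assumption made for this illustration.

```python
def fit_min_max(rows):
    """Return a (min, max) pair per column, learned from reference rows."""
    columns = list(zip(*rows))
    return [(min(column), max(column)) for column in columns]


def scale(rows, bounds, low=-1.0, high=1.0):
    """Map each column linearly so its reference min -> low and max -> high."""
    scaled = []
    for row in rows:
        out = []
        for value, (col_min, col_max) in zip(row, bounds):
            span = col_max - col_min
            # Assumption: a constant column maps to the lower bound.
            unit = 0.0 if span == 0 else (value - col_min) / span
            out.append(unit * (high - low) + low)
        scaled.append(out)
    return scaled


# Toy reference coordinates standing in for the DrugBank projections.
reference = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
bounds = fit_min_max(reference)
scaled_reference = scale(reference, bounds)

# A new point outside the reference range is not clipped.
scaled_new = scale([[15.0, 10.0]], bounds)
print(scaled_reference, scaled_new)
```

Because the bounds come only from the reference set, a query compound that projects outside the DrugBank range lands outside [-1, 1], which matches `MinMaxScaler`'s default of not clipping.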