This library provides bioactivity-aware chemical embeddings for small molecules.
The lite directory contains the eosce library, intended for end users.
The compound-embedding library is intended for development use.
git clone https://github.com/ersilia-os/compound-embedding.git
cd compound-embedding/lite
conda create -n eosce python=3.8
conda activate eosce
pip install -e .
Or, if you have a GPU:
pip install -e .[gpu]
from eosce import ErsiliaCompoundEmbeddings
model = ErsiliaCompoundEmbeddings()
embeddings = model.transform(["CCOC(=O)C1=CC2=CC(OC)=CC=C2O1"])
# Optionally, if you want grid embeddings
grid_embeddings = model.transform(["CCOC(=O)C1=CC2=CC(OC)=CC=C2O1"], grid=True)
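The returned embeddings can be used with ordinary vector operations. As a minimal sketch (using placeholder NumPy vectors as stand-ins for real eosce output, since the exact array type returned by `transform` is an assumption here), cosine similarity between two embeddings can be computed like this:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 1024-dimensional vectors standing in for eosce embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=1024)
emb_b = rng.normal(size=1024)

print(cosine_similarity(emb_a, emb_a))  # identical vectors -> 1.0
print(cosine_similarity(emb_a, emb_b))
```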
For a single SMILES string:
eosce embed "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1"
or, to save the results in a .csv file:
eosce embed "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1" -o output.csv
For multiple SMILES strings, pass an input file with a single column listing the SMILES. An example is provided in lite/data
eosce embed -i data/input.csv -o data/output.csv
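Such an input file holds one SMILES per row in a single column. As a hedged sketch (the exact header requirements of the CLI are an assumption; see the example file in lite/data for the authoritative format), a suitable file can be written with Python's csv module:

```python
import csv

smiles_list = [
    "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1",
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
]

# Write a single-column CSV, one SMILES per row.
with open("input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for smi in smiles_list:
        writer.writerow([smi])
```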
For grid embeddings:
eosce embed --grid "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1" -o output.csv
Get help by running
eosce embed --help
git clone https://github.com/ersilia-os/compound-embedding.git
cd compound-embedding
conda config --set channel_priority flexible
conda env create -f env.yaml
conda activate crux
bash install_grover.sh
We generated a unique dataset that contains GROVER descriptors, Mordred descriptors and assay labels for the molecules present in the FS-Mol dataset. This dataset is then used to train a ProtoNet with Euclidean distance as the metric.
We then used the trained ProtoNet to generate 2 million embeddings for the molecules in the reference library generated using ChEMBL.
Finally, we used the training dataset generated in Phase 2 to train a simple and fast neural network that maps ECFPs to the embeddings generated by the ProtoNet.
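The ProtoNet step can be illustrated with a minimal NumPy sketch (the dimensions and data below are illustrative toys, not the actual training configuration): each class prototype is the mean embedding of that class's support set, and a query is assigned to the class whose prototype is nearest in Euclidean distance.

```python
import numpy as np

def prototypes(support_x, support_y):
    """Mean embedding per class (the class 'prototype')."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query_x, classes, protos):
    """Assign each query to the class with the nearest prototype (Euclidean)."""
    dists = np.linalg.norm(query_x[:, None, :] - protos[None, :, :], axis=-1)
    return classes[np.argmin(dists, axis=1)]

# Toy support set: two well-separated classes in a 4-dimensional space.
rng = np.random.default_rng(42)
x0 = rng.normal(loc=0.0, size=(10, 4))
x1 = rng.normal(loc=5.0, size=(10, 4))
support_x = np.vstack([x0, x1])
support_y = np.array([0] * 10 + [1] * 10)

classes, protos = prototypes(support_x, support_y)
queries = np.array([[0.1, 0.0, 0.0, 0.1], [5.0, 4.9, 5.1, 5.0]])
print(classify(queries, classes, protos))  # -> [0 1]
```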
The FS-Mol dataset is available as a download, FS-Mol Data, split into train, valid and test folders.
Tasks are stored as individual compressed JSONLines files, with each line corresponding to a single datapoint for the task.
Each datapoint is stored as a JSON dictionary, following a fixed structure:
{
"SMILES": "SMILES_STRING",
"Property": "ACTIVITY BOOL LABEL",
"Assay_ID": "CHEMBL ID",
"RegressionProperty": "ACTIVITY VALUE",
"LogRegressionProperty": "LOG ACTIVITY VALUE",
"Relation": "ASSUMED RELATION OF MEASURED VALUE TO TRUE VALUE",
"AssayType": "TYPE OF ASSAY",
"fingerprints": [...],
"descriptors": [...],
"graph": {
"adjacency_lists": [
[... SINGLE BONDS AS PAIRS ...],
[... DOUBLE BONDS AS PAIRS ...],
[... TRIPLE BONDS AS PAIRS ...]
],
"node_types": [...ATOM TYPES...],
"node_features": [...NODE FEATURES...],
}
}
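One line of a task file can therefore be parsed with the standard json module (the values below are illustrative placeholders following the structure above, not real FS-Mol data; in practice the compressed files would be opened with gzip.open first):

```python
import json

# One JSONL line following the schema above (illustrative placeholder values).
line = json.dumps({
    "SMILES": "CCO",
    "Property": "1.0",
    "Assay_ID": "CHEMBL12345",
    "RegressionProperty": "6.2",
    "LogRegressionProperty": "0.79",
    "Relation": "=",
    "AssayType": "B",
    "fingerprints": [0, 1, 0],
    "descriptors": [1.2, 3.4],
    "graph": {
        "adjacency_lists": [[[0, 1]], [], []],  # single, double, triple bonds
        "node_types": ["C", "C", "O"],
        "node_features": [[0.0], [0.0], [1.0]],
    },
})

datapoint = json.loads(line)
print(datapoint["SMILES"], datapoint["Assay_ID"])
```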
crux gen grover --inp "path/to/fs-mol/data" --out "path/to/save/output"
crux gen mordred --inp "path/to/fs-mol-merged-grover/data" --out "path/to/save/output"
crux qc --inp "path/to/fs-mol/data" --out "path/to/fs-mol-merged-grover-mordred/data"
crux train protonet \
--save_dir path/to/save/trained_model \
--data_dir "path/to/fs-mol-merged-grover-mordred/data" \
--num_train_steps 10000
cp path/to/save/trained_model/FSMOL_protonet_{run identifier}/fully_trained.pt ./src/compound_embedding/
The reference library can be downloaded from here. Move it to the package root, as we did in the last step.
mpiexec -n 4 python gen_efp_train.py
This will create an efp_training.hdf5 file in the directory where the command is executed.
crux train efp --save_dir /path/to/save/checkpoints --data_file /path/to/efp_training.hdf5
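The idea of this final distillation step, learning a map from fingerprints to ProtoNet embeddings, can be sketched with a linear least-squares fit as a simplified stand-in for the actual neural network (all dimensions and data below are synthetic, not taken from efp_training.hdf5):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 binary fingerprint vectors (dim 64) and target
# embeddings (dim 8) produced by a hidden linear map plus small noise.
fps = rng.integers(0, 2, size=(200, 64)).astype(float)
true_w = rng.normal(size=(64, 8))
embeddings = fps @ true_w + 0.01 * rng.normal(size=(200, 8))

# Fit the fingerprint -> embedding map by least squares.
w, *_ = np.linalg.lstsq(fps, embeddings, rcond=None)

# The learned map should reproduce the target embeddings closely.
pred = fps @ w
print(np.abs(pred - embeddings).mean())
```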
This repository is open-sourced under the GPL-3 License. Please cite us if you use it.
The Ersilia Open Source Initiative is a non-profit organization (1192266) whose mission is to equip labs, universities and clinics in LMICs with AI/ML tools for infectious disease research.