Molbotomy

Description

Molbotomy is a easy-to-use toolkit to clean, process, and split molecular data.

Functionality

SpringCleaning: Clean up a set of molecules
scaffold_split, random_split, stratified_random_split: Splits molecules into a train and test set
check_splits: Do a sanity check on the train and test split and compute some statistics (and/or distances)
MolecularDistanceMatrix: Computes a pairwise distance matrix of molecules

Usage

Cleaning data:

from molbotomy import SpringCleaning

smiles = ['NC(=O)c1ccc(N2CCCN(Cc3ccc(F)cc3)CC2)nc1', 'CC(=O)NCC1(C)CCC(c2ccccc2)(N(C)C)CC1', ...]

C = SpringCleaning(neutralize=True,
                   check_for_uncommon_atoms=True,
                   desalt=True,
                   remove_solvent=True,
                   sanitize=True,
                   ...)

smiles_clean = C.clean(smiles)

C.summary()

> Parsed 10000 molecules of which 9964 successfully.
  Failed to clean 36 molecules: {'unfamiliar token': 25, 'fragmented SMILES': 11}

Splitting data:

from molbotomy import scaffold_split, tools

smiles = ['NC(=O)c1ccc(N2CCCN(Cc3ccc(F)cc3)CC2)nc1', 'CC(=O)NCC1(C)CCC(c2ccccc2)(N(C)C)CC1', ...]
mols = tools.smiles_to_mols(smiles, sanitize=True)

train_idx, test_idx = scaffold_split(mols, ratio=0.2)

Evaluating splits:

from molbotomy import check_splits

train_smiles = ['NC(=O)c1ccc(N2CCCN(Cc3ccc(F)cc3)CC2)nc1', 'CC(=O)NCC1(C)CCC(c2ccccc2)(N(C)C)CC1', ...]
test_smiles = ['CC(C)C[C@]1(c2cccc(O)c2)CCN(CC2CC2)C1', 'NC(=O)c1ccc(Oc2ccc(CN3CCCC3c3ccccc3)cc2)nc1', ...]

check_splits(train_smiles, test_smiles)

> Data leakage:
      Found 0 intersecting SMILES between the train and test set.
      Found 0 intersecting Bemis-Murcko scaffolds between the train and test set.
      Found 0 intersecting stereoisomers between the train and test set.
  Duplicates:
      Found 0 duplicate SMILES in the train set.
      Found 0 duplicate SMILES in the test set.
  Stereoisomers:
      Found 0 Stereoisomer SMILES in the train set.
      Found 0 Stereoisomer SMILES in the test set.

Installation

pip install molbotomy

This codebase uses Python 3.9 and depends on:

RDKit (2023.3.2)

License

All code is under MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
molbotomy		molbotomy
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Molbotomy

Description

Functionality

Usage

Installation

License

About

Releases

Packages

Languages

License

derekvantilborg/molbotomy

Folders and files

Latest commit

History

Repository files navigation

Molbotomy

Description

Functionality

Usage

Installation

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages