by Axel Pahl
A set of tools to use with the open-source cheminformatics toolkit RDKit in the Jupyter Notebook.
Written for Python 3; only tested on Linux (Ubuntu 16.04) with the conda install of RDKit.
The Mol_List class is a subclass of the Python list that holds RDKit molecule objects and gives direct access to much of the RDKit functionality. It is meant to be used in the Jupyter Notebook and includes, among others:
- display of the Mol_List as HTML table, nested table or grid
- display of a summary including number of records and min, max, mean, median for numeric properties
- display of correlations between the Mol_List's properties (uses np.corrcoef; this gives a quick overview of which properties correlate with each other)
- methods for sorting, searching (by property or substructure) and filtering the Mol_List
- methods for renaming, reordering and calculating properties
- direct plotting of properties as publication-grade Highcharts or Bokeh plots with structure tooltips (!).
- the plotting functionalities reside in their own module and can also be used for plotting Pandas dataframes and Python dicts.
- further development will focus on Bokeh because of the more pythonic interface
- jsme: Display Peter Ertl's Javascript Molecule Editor to enter a molecule directly in the IPython notebook (how cool is that??).
The module tries to find a local version of JSME in <notebook_dir>/lib/ and, if none is found, loads a web version of the editor. I use a central lib/ folder and create symlinks to it in all notebook folders where I want to use these libraries.
...plus many others.
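The correlation overview mentioned in the list above can be illustrated with plain NumPy. This is a minimal sketch, not the Mol_List implementation: the property names and values are made up for illustration, and only `np.corrcoef` itself is taken from the description.

```python
import numpy as np

# Hypothetical numeric properties for five compounds (illustrative values only).
props = {
    "MolWt": [180.2, 250.3, 310.4, 150.1, 420.5],
    "HeavyAtoms": [13, 18, 22, 11, 30],
    "LogP": [1.2, 0.4, 3.1, 2.2, 1.9],
}

names = list(props)
# np.corrcoef treats each row as one variable and each column as one observation.
data = np.array([props[name] for name in names], dtype=float)
corr = np.corrcoef(data)

# Print the upper triangle: one line per pair of properties.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]:>10} vs {names[j]:<10} r = {corr[i, j]:+.2f}")
```

The result is a symmetric matrix with ones on the diagonal, so only the upper triangle needs to be shown.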
A Pipelining Workflow using Python Generators, mainly for RDKit and large compound sets. The use of generators allows working with arbitrarily large data sets while keeping memory usage low at any given time.
Example use:
>>> from rdkit_ipynb_tools import pipeline as p
>>> s = p.Summary()
>>> rd = p.start_csv_reader("test_data_b64.csv.gz", summary=s)
>>> b64 = p.pipe_mol_from_b64(rd, summary=s)
>>> filt = p.pipe_mol_filter(b64, "[H]c2c([H])c1ncoc1c([H])c2C(N)=O", summary=s)
>>> p.stop_sdf_writer(filt, "test.sdf", summary=s)
or, using the pipe function:
>>> s = p.Summary()
>>> rd = p.start_sdf_reader("test.sdf", summary=s)
>>> p.pipe(rd,
...     p.pipe_keep_largest_fragment,
...     (p.pipe_neutralize_mol, {"summary": s}),
...     (p.pipe_keep_props, ["Ordernumber", "NP_Score"]),
...     (p.stop_csv_writer, "test.csv", {"summary": s})
... )
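To make the step convention of the pipe call above more concrete, here is a minimal sketch of how such a chaining helper could work. It is an assumption for illustration, not the module's actual implementation: `chain` and all component functions below are hypothetical stand-ins. A step is either a bare callable or a tuple of the form (function, positional arguments..., optional dict of keyword arguments).

```python
def chain(stream, *steps):
    """Pass `stream` through each step in order (hypothetical helper)."""
    for step in steps:
        if callable(step):
            # Bare function: it only takes the incoming stream.
            stream = step(stream)
            continue
        func, *rest = step
        # A trailing dict is treated as keyword arguments.
        kwargs = rest.pop() if rest and isinstance(rest[-1], dict) else {}
        stream = func(stream, *rest, **kwargs)
    return stream


# Stand-in components that follow the start_ / pipe_ / stop_ naming scheme.
def start_numbers(n):
    for i in range(n):
        yield {"id": i, "value": i * i}


def pipe_min_value(stream, threshold):
    for rec in stream:
        if rec["value"] >= threshold:
            yield rec


def pipe_keep_keys(stream, keys):
    for rec in stream:
        yield {k: rec[k] for k in keys if k in rec}


def stop_collect(stream, limit=None):
    result = []
    for rec in stream:
        result.append(rec)
        if limit is not None and len(result) >= limit:
            break
    return result


result = chain(
    start_numbers(10),
    (pipe_min_value, 10),
    (pipe_keep_keys, ["id"]),
    (stop_collect, {"limit": 3}),
)
print(result)
```

Nothing is computed until the final stop component starts iterating; each record then travels through all steps one at a time.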
The progress of the pipeline is displayed as an HTML table in the Notebook and can also be followed in a separate terminal with: watch -n 2 cat pipeline.log
| Starting | Running | Stopping |
|---|---|---|
| start_cache_reader | pipe_calc_props | stop_cache_writer |
| start_csv_reader | pipe_custom_filter | stop_count_records |
| start_mol_csv_reader | pipe_custom_man | stop_csv_writer |
| start_sdf_reader | pipe_do_nothing | stop_df_from_stream |
| start_stream_from_dict | pipe_has_prop_filter | stop_dict_from_stream |
| start_stream_from_mol_list | pipe_id_filter | stop_mol_list_from_stream |
| | pipe_inspect_stream | stop_sdf_writer |
| | pipe_join_data_from_file | |
| | pipe_keep_largest_fragment | |
| | pipe_keep_props | |
| | pipe_merge_data | |
| | pipe_mol_filter | |
| | pipe_mol_from_b64 | |
| | pipe_mol_from_smiles | |
| | pipe_mol_to_b64 | |
| | pipe_mol_to_smiles | |
| | pipe_neutralize_mol | |
| | pipe_remove_props | |
| | pipe_rename_prop | |
| | pipe_sim_filter | |
| | pipe_sleep | |
Limitation: unlike in other pipelining tools, the pipeline cannot be branched, because a Python generator can only be consumed once.
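The limitation follows directly from how generators behave, as this small sketch shows. The two components are hypothetical stand-ins written for this example (they are not part of the pipeline module), and the records are made-up data.

```python
def start_range_reader(n):
    # Stand-in for a reader: yields one record dict at a time.
    for i in range(n):
        yield {"Compound_Id": i, "MolWt": 100.0 + 25.0 * i}


def pipe_max_weight(stream, limit):
    # Stand-in filter: passes on records up to a maximum weight.
    for rec in stream:
        if rec["MolWt"] <= limit:
            yield rec


stream = pipe_max_weight(start_range_reader(10), 200.0)

first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # the generator is exhausted by now

print(len(first_pass), len(second_pass))
```

A second consumer attached to the same stream would therefore see no records, which is why every pipeline has to be a single linear chain.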
Fully usable, documentation needs to be written. Please refer to the docstrings until then.
New, WIP, not usable. Has been moved to the scaffolds branch.
Much of the functionality is shown in the tools tutorial notebook. SAR functionality is shown in the SAR tutorial notebook. The SAR module is new and Work in Progress.
The module documentation can be built with Sphinx using the make_doc.sh script.
The recommended way to use this project is via conda.
- Python 3
- RDKit
- Jupyter Notebook
- ipywidgets
- cairo (via conda or pip) and cairocffi (only via pip) to get decent-looking structures
- Bokeh for high-quality data plots with structure tooltips
After installing the requirements, clone this repo. The rdkit_ipynb_tools can then be used by including the project's base directory (rdkit_ipynb_tools) in Python's import path (I actually prefer this to using setuptools, because a simple git pull will get you the newest version).
This can be achieved by one of the following:
- If you use conda (recommended), use conda develop. This works similarly to the next option.
- Put a file with the extension .pth, e.g. my_packages.pth, into one of the site-packages directories of your Python installation and put the path to the base directory of this project (rdkit_ipynb_tools) into it. (I have the path to a dedicated folder on my machine included in such a .pth file and link all my development projects to that folder. This way, I only need to create the .pth file once.)
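The .pth mechanism can be tried out safely without touching a real installation. The sketch below uses two temporary directories as stand-ins for a site-packages directory and the project's base directory (both are assumptions for the demo) and lets the standard library's `site.addsitedir` process the .pth file.

```python
import os
import site
import sys
import tempfile

# Temporary stand-ins for a site-packages directory and a project base directory.
fake_site_dir = tempfile.mkdtemp()
project_dir = os.path.abspath(tempfile.mkdtemp())

# A .pth file holds one directory path per line.
with open(os.path.join(fake_site_dir, "my_packages.pth"), "w") as f:
    f.write(project_dir + "\n")

# site.addsitedir adds the directory to sys.path and processes its .pth files,
# as Python does for the real site-packages directories at startup.
site.addsitedir(fake_site_dir)

print(project_dir in sys.path)
```

For a real setup, the .pth file would go into one of the directories reported by `site.getsitepackages()`, and the line inside it would be the path to the cloned repo.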
Processing data from 200k compounds takes 10-15 sec on my notebook.
Substructure searches take longer.
For performance reasons, I use b64encode and pickle strings of mol objects to store the molecule structures in text format (see also Greg's blog post for faster structure generation):
b64encode(pickle.dumps(mol)).decode()
For me, that has proven to be the fastest method when dealing with flat text files and is also the reason why there are pipe_mol_to_b64 and pipe_mol_from_b64 components in the pipeline module.
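The round trip behind this encoding can be sketched with the standard library alone. The helper names `obj_to_b64` and `obj_from_b64` are hypothetical, and a plain dict stands in for the molecule so that the example runs without RDKit; RDKit mol objects can be pickled, so the same two calls would apply to them.

```python
import pickle
from base64 import b64decode, b64encode


def obj_to_b64(obj):
    """Serialize a picklable object to a single-line text string."""
    return b64encode(pickle.dumps(obj)).decode()


def obj_from_b64(text):
    """Inverse of obj_to_b64. Only unpickle data from trusted sources."""
    return pickle.loads(b64decode(text))


# Stand-in for an RDKit mol object (assumption made for this demo).
record = {"Compound_Id": 1, "atoms": ["C", "C", "O"]}

encoded = obj_to_b64(record)
restored = obj_from_b64(encoded)

print(type(encoded).__name__, restored == record)
```

Because the encoded string contains no line breaks or delimiters such as tabs and commas, it fits into a single field of a flat text file.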
- When you use a local copy of the Javascript Molecule Editor as described above and use Bokeh for plotting, you can work completely offline in your Notebook.
- make pipelines more user-friendly
- complete the scaffolds module
- add functionality as needed / requested
(probably not in this order)