thsa edited this page Oct 31, 2024 · 3 revisions

General Concepts

OpenChemLib is a robust, feature-rich Java-based framework for developing server and client applications in the field of cheminformatics. Its development started at the end of the last century at Actelion Ltd., where it provided the cheminformatics functionality for in-house applications including an electronic chemical notebook, a compound registration system, a chemicals inventory, and DataWarrior, a data visualization tool with chemical intelligence. Both OpenChemLib and DataWarrior are open-source projects today. DataWarrior showcases much of the OpenChemLib functionality and is used by about 100'000 users worldwide (as of the end of 2024).

Unlike other cheminformatics frameworks, OpenChemLib also contains user-interface components that allow the development of desktop applications. These include an editor for chemical structures and reactions. Through the openchemlib-js project this editor is also available in JavaScript, which makes OpenChemLib usable for web-based front ends as well.

Since OpenChemLib is written in Java, its functionality is modular and object-oriented. The most important class is the StereoMolecule. This and other important classes are described below:

StereoMolecule

A StereoMolecule contains the connection table of a chemical entity, which is either a molecule or a substructure fragment. In the first case all open valences are implicitly meant to be filled with hydrogen atoms. In the second case implicit hydrogen atoms don't exist, but query features may be assigned to atoms and/or bonds. For instance, these may require an atom to be aromatic or a bond to be either a single or a double bond. The methods isFragment() and setFragment(boolean) query or set this state of a StereoMolecule.

A StereoMolecule is derived from an ExtendedMolecule, which in turn inherits from a Molecule. Most of the StereoMolecule's functionality is derived from one of these classes, but one should always instantiate a StereoMolecule. There is hardly a reason to ever instantiate one of its parent classes. The Molecule class contains functionality regarding the primary information about the molecule, which is a list of all atoms, their coordinates and types, all bonds, which atoms they connect, their types, and atom and bond associated query features, etc. Methods to add an atom, to change a bond or to remove a query feature are located in the Molecule class.

The ExtendedMolecule contains functionality that goes beyond the primary information. It is responsible for calculating atom neighbors from the connection table and for perceiving rings and aromaticity. It also sorts the atom table such that non-hydrogen atoms come first and all explicit hydrogen atoms are at the end of the atom table. It knows the number of non-hydrogen neighbors of every atom. It also calculates atom- and bond-related properties such as the number of pi-electrons at atoms. It contains functionality related to the molecular graph, e.g. to find the shortest connection between two atoms, to locate and copy a substituent, or to count all disconnected fragments.

The StereoMolecule itself adds complete stereo perception to the derived functionality. It knows about stereo centers, relations between them, recognizes meso fragments and, thus, knows which atoms or bonds are equivalent and which are not.

Most important when working with molecules is the concept of helper arrays. These arrays keep track of the information calculated from the primary Molecule data (atom and bond neighbors, rings, aromaticity, stereo centers and bonds, etc.). For the sake of performance, these helper arrays are not automatically updated with every small change of the primary molecule, e.g. when changing a bond order. However, the molecule knows about the validity of its helper data. If you need valid, calculated, low-level information, as with getAtomRingSize(), then you need to ensure the validity of the molecule's ring-related helper data first. This is done by calling ensureHelperArrays(Molecule.cHelperRings). Altogether, there are four levels of helper validity, of which every higher level includes the lower ones: cHelperNeighbours, cHelperRings, cHelperParities, cHelperCIP. Note also that high-level methods like getPath() validate the helper arrays internally.
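The validity-level idea can be illustrated with a small standalone sketch. It does not use OpenChemLib: the class ToyMolecule, its HELPER_* constants, the ring count based on the cyclomatic number, and the recalculations counter are all hypothetical and only mimic the pattern described above, in which edits mark helper data as stale and ensureHelperArrays() recomputes just what is missing. The toy getters throw when helper data is stale; that is a choice of this sketch to make misuse visible, not a statement about how OpenChemLib behaves.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical toy model of lazily validated helper data; not OpenChemLib code. */
class ToyMolecule {
    // Validity levels as bit masks; the higher level includes the lower one.
    static final int HELPER_NONE = 0;
    static final int HELPER_NEIGHBOURS = 1;
    static final int HELPER_RINGS = HELPER_NEIGHBOURS | 2;

    // Primary data: atom count and bond list.
    private int atomCount = 0;
    private final List<int[]> bonds = new ArrayList<>();

    // Helper data, derived from the primary data on demand.
    private int validHelpers = HELPER_NONE;
    private List<List<Integer>> neighbours;
    private int ringCount;
    int recalculations = 0;   // counts how often helper data was actually rebuilt

    int addAtom() {
        validHelpers = HELPER_NONE;   // cheap edit: only mark helpers as stale
        return atomCount++;
    }

    void addBond(int atom1, int atom2) {
        bonds.add(new int[] { atom1, atom2 });
        validHelpers = HELPER_NONE;
    }

    /** Recomputes only those helper levels that are required and not yet valid. */
    void ensureHelperArrays(int required) {
        if ((required & ~validHelpers) == 0) return;   // everything requested is valid
        recalculations++;
        if ((validHelpers & HELPER_NEIGHBOURS) == 0) calcNeighbours();
        if ((required & HELPER_RINGS) == HELPER_RINGS) calcRings();
        validHelpers |= required;
    }

    /** Low-level getter: does not validate helper data itself. */
    int getConnAtoms(int atom) {
        requireValid(HELPER_NEIGHBOURS);
        return neighbours.get(atom).size();
    }

    int getRingCount() {
        requireValid(HELPER_RINGS);
        return ringCount;
    }

    private void requireValid(int level) {
        if ((level & ~validHelpers) != 0)
            throw new IllegalStateException("call ensureHelperArrays() first");
    }

    private void calcNeighbours() {
        neighbours = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) neighbours.add(new ArrayList<>());
        for (int[] bond : bonds) {
            neighbours.get(bond[0]).add(bond[1]);
            neighbours.get(bond[1]).add(bond[0]);
        }
    }

    // Ring count as cyclomatic number: bonds - atoms + connected components.
    // Uses the neighbour lists, which is why the ring level includes the neighbour level.
    private void calcRings() {
        boolean[] seen = new boolean[atomCount];
        int components = 0;
        for (int start = 0; start < atomCount; start++) {
            if (seen[start]) continue;
            components++;
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(start);
            seen[start] = true;
            while (!stack.isEmpty()) {
                int atom = stack.pop();
                for (int n : neighbours.get(atom)) {
                    if (!seen[n]) { seen[n] = true; stack.push(n); }
                }
            }
        }
        ringCount = bonds.size() - atomCount + components;
    }
}
```

In this sketch, requesting HELPER_RINGS also validates the neighbour data, and a second call with an already valid level returns immediately without recalculating anything.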

Most algorithms that work with molecules neglect hydrogen atoms. Typically, hydrogen atoms are implicit anyway, which means they were never drawn in an editor or specified in an input file. If, however, explicit hydrogen atoms exist, for instance when connected with an up- or down-bond to define a stereo center, then they are located at the end of the atom table and also at the end of any atom's neighbor tables. Whenever a molecule is displayed, explicit hydrogens are shown, but since other algorithms typically neglect them, multiple methods exist that ask for the number of atoms: getAtoms() and getConnAtoms(int) return the number of non-hydrogen atoms of the molecule and non-hydrogen neighbor atoms of a specified atom; getAllAtoms() and getAllConnAtoms(int) include explicit hydrogen atoms.
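The consequence of this ordering for loops over atoms can be shown with a minimal standalone sketch. The class ToyAtomTable is hypothetical and is not part of OpenChemLib; it only borrows the method names getAtoms() and getAllAtoms() from the text above and assumes that every atom with atomic number 1 counts as an explicit hydrogen.

```java
/** Hypothetical toy atom table that keeps explicit hydrogens at the end; not OpenChemLib code. */
class ToyAtomTable {
    private final int[] atomicNo;        // non-hydrogen atoms first, explicit hydrogens last
    private final int heavyAtomCount;

    /** Takes atomic numbers in drawing order and reorders them, keeping relative order. */
    ToyAtomTable(int... drawnOrder) {
        atomicNo = new int[drawnOrder.length];
        int next = 0;
        for (int z : drawnOrder) {
            if (z != 1) atomicNo[next++] = z;   // first pass: non-hydrogen atoms
        }
        heavyAtomCount = next;
        for (int z : drawnOrder) {
            if (z == 1) atomicNo[next++] = z;   // second pass: explicit hydrogens
        }
    }

    /** Number of non-hydrogen atoms. */
    int getAtoms() { return heavyAtomCount; }

    /** Number of all atoms including explicit hydrogens. */
    int getAllAtoms() { return atomicNo.length; }

    int getAtomicNo(int atom) { return atomicNo[atom]; }
}
```

With such a layout, a loop that runs from 0 to getAtoms() visits the non-hydrogen atoms only, while a loop up to getAllAtoms() also reaches the explicit hydrogens, without either loop needing a per-atom hydrogen check.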

Canonizer

Whenever StereoMolecules need to be stored in text files or databases, some kind of encoding into a text string is needed. Ideally, encodings are canonical such that the same molecule always leads to the same text string, even if it has been drawn in a different way, e.g. with a different order of atoms or a different location of the double bonds in an aromatic ring. De facto standards for representing a molecule as text are the Molfile format published by MDL, SMILES strings by Daylight, and InChI by IUPAC and NIST. Molfiles are not canonical and they take plenty of space. SMILES don't support the concept of enhanced stereo representation, and canonical versions are not standardized. InChIs don't cover query features, and the rules for the canonicalization are not documented. Therefore, OpenChemLib uses its own string representation, called ID-Code. Conceptually, ID-Codes are similar to MDL's SEMA encoding used in MACCS-, REACCS-, and IsisBase databases. They are very compact, canonical, include Enhanced Stereo Representation, encode both molecules and substructures, and can be parsed to reconstruct the original chemical entity from the string.

A Canonizer object, when instantiated with a passed StereoMolecule, runs a complete symmetry perception of the molecule (or substructure fragment), from which it derives a unique order of atoms that is then used to encode the molecule in a reproducible way. The generated ID-Code, since it is canonical, cannot contain atom coordinates. However, when recreating molecules from an ID-Code, the original atom coordinates, 2D or 3D, are often needed. Therefore, the Canonizer is able to encode atom coordinates as a second text string. Use getIDCode() and getEncodedCoordinates() from a freshly instantiated Canonizer to generate both strings.
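What "canonical" means here can be demonstrated on a very small scale. The following standalone sketch is not the Canonizer's algorithm and does not produce ID-Codes: the hypothetical class ToyCanonizer simply tries every atom ordering and keeps the lexicographically smallest encoding, which is only feasible for a handful of atoms. The encoding format (atom labels, then sorted bonds) is likewise an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical brute-force canonical encoder for tiny graphs; not OpenChemLib code. */
class ToyCanonizer {
    /**
     * atoms: element labels in drawing order; bonds: {atom1, atom2, order} triples.
     * Returns the smallest encoding over all atom orderings, so the result
     * does not depend on the order in which atoms were drawn.
     */
    static String canonicalCode(String[] atoms, int[][] bonds) {
        String[] best = { null };
        permute(0, new int[atoms.length], new boolean[atoms.length], atoms, bonds, best);
        return best[0];
    }

    // Enumerates all assignments old atom index -> new atom index.
    private static void permute(int oldAtom, int[] newIndex, boolean[] used,
                                String[] atoms, int[][] bonds, String[] best) {
        if (oldAtom == atoms.length) {
            String code = encode(newIndex, atoms, bonds);
            if (best[0] == null || code.compareTo(best[0]) < 0) best[0] = code;
            return;
        }
        for (int i = 0; i < atoms.length; i++) {
            if (!used[i]) {
                used[i] = true;
                newIndex[oldAtom] = i;
                permute(oldAtom + 1, newIndex, used, atoms, bonds, best);
                used[i] = false;
            }
        }
    }

    // Writes atoms in the new order, followed by the renumbered, sorted bonds.
    private static String encode(int[] newIndex, String[] atoms, int[][] bonds) {
        String[] ordered = new String[atoms.length];
        for (int old = 0; old < atoms.length; old++) ordered[newIndex[old]] = atoms[old];
        List<String> bondCodes = new ArrayList<>();
        for (int[] bond : bonds) {
            int a1 = Math.min(newIndex[bond[0]], newIndex[bond[1]]);
            int a2 = Math.max(newIndex[bond[0]], newIndex[bond[1]]);
            bondCodes.add(a1 + "-" + a2 + ":" + bond[2]);
        }
        Collections.sort(bondCodes);
        return String.join(",", ordered) + "|" + String.join(",", bondCodes);
    }
}
```

Encoding ethanol once as C-C-O and once as O-C-C yields the same string, while dimethyl ether (C-O-C) yields a different one. A production canonizer reaches the same goal without enumerating permutations, and additionally has to handle stereo features and query features.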

IDCodeParserWithoutCoordinateInvention

This class is used to construct a StereoMolecule from an ID-Code. If encoded coordinates are supplied in addition to the ID-Code, then the original molecule is reconstructed including its original atom coordinates. Otherwise, the molecule doesn't have valid atom coordinates after parsing.

IDCodeParser

This class is derived from the IDCodeParserWithoutCoordinateInvention. In addition to its parent class it can generate 2-dimensional atom coordinates, if they are not given. For this purpose it employs the CoordinateInventor.