IUPAC SMILES+ Specification [Work in progress draft - incomplete and not approved yet]

IUPAC SMILES+ Contributors: Vincent F. Scalfani (Chair), Evan Bolton, Helen Cooke, Chris Grulke, John Irwin, Oliver Koepler, Greg Landrum, José L. Medina-Franco, Miguel Quirós Olozábal, Susan Richardson, and Issaku Yamada.

v0.1, 2023: Work in progress draft - incomplete and not approved yet
IUPAC SMILES+ Project No. 2019-002-024
Copyright © 2021, IUPAC
Content is available under GNU Free Documentation License 1.2

This IUPAC SMILES+ Specification [working draft] document is a modified derivative of the OpenSMILES Specification. We have endeavored to maintain all prior author names, contributor names, copyright notices, and revision history.

OpenSMILES Specification
Craig A. James
v1.0,2016-05-15: Current specification

Copyright © 2007-2016, Craig A. James
Content is available under GNU Free Documentation License 1.2

OpenSMILES Contributors: Richard Apodaca, Noel O’Boyle, Andrew Dalke, John van Drie, Peter Ertl, Geoff Hutchison, Craig A. James, Greg Landrum, Chris Morley, Egon Willighagen, Hans De Winter, Tim Vandermeersch, John May

Introduction

"…​ we cannot improve the language of any science, without, at the same time improving the science itself; neither can we, on the other hand, improve a science, without improving the language or nomenclature which belongs to it …​"

Purpose and Motivation

This document formally defines an open specification version of the SMILES language, a typographical line notation for specifying chemical structure representation. This specification builds upon the Blue Obelisk OpenSMILES work, and is currently maintained (work in progress draft) within an IUPAC Task Group: IUPAC SMILES+ Project No. 2019-002-024. We welcome all contributions and comments from the entire community.

SMILES was originally developed as a proprietary specification by Daylight Chemical Information Systems. Since the introduction of SMILES in the late 1980s, it has become widely accepted as a de facto standard for exchange of molecular structures. Many independent SMILES software packages have been written in C, C++, Java, Python, LISP, and probably even FORTRAN.

Daylight’s SMILES Theory Manual has long been the "gold standard" for the SMILES language, but as a proprietary specification, it limits the universal adoption of SMILES, and has no formal mechanism for community contributions and ongoing maintenance. We salute Daylight for their past contributions, and the excellent SMILES documentation they provided free of charge for the past three decades.

In the early 2000s, the Blue Obelisk community created a new, non-proprietary specification for the SMILES language, namely, OpenSMILES. Our efforts seek to build upon the work of OpenSMILES by further clarifying ambiguities in the SMILES language, creating a formal mechanism for adoption of new extensions (i.e., the '+' in IUPAC SMILES+), and providing ongoing maintenance of the specification through IUPAC.

Interoperability of reading the SMILES language is the priority of the IUPAC SMILES+ specification. In order to remove ambiguity, we needed to make some distinct syntax choices and, for example, not allow flexibility where a parser may correct or ignore errors in the SMILES syntax. With that in mind, we believe our choices are conservative and have not removed any important functionality of the SMILES language.

Audience

This document is intended for developers designing or improving a SMILES parser or writer. Readers are expected to be acquainted with SMILES. Due to the formality of this document, it is not a good tutorial for those trying to learn SMILES. This document is written with precision as the primary goal; readability is secondary.

What is a Molecule? The Valence Model of Chemistry

Before defining the SMILES language, it is important to state the physical model on which it is based: the valence model of chemistry, which uses a mathematician’s graph to represent a molecule. In a chemical graph, the nodes are atoms, and the edges are semi-rigid bonds that can be single, double, or triple according to the rules of valence bond theory.

This simple mental model has little resemblance to the underlying quantum-mechanical reality of electrons, protons and neutrons, yet it has proved to be a remarkably useful approximation of how atoms behave in close proximity to one another. However, the valence model is an imperfect representation of molecular structure, and the SMILES language inherits these imperfections. Chemical bonds are often tautomeric, aromatic or otherwise fractional rather than neat integer multiples. Delocalized bonds, multi-centered bonds, hydrogen bonds and various other inter-atom forces that are well characterized by a quantum-mechanics description simply do not fit into the valence model.

"If you can build a molecule from a modeling kit, you can name it."

McLeod and Peters' quip captures the deficiencies of SMILES well: if you can not build a molecule from a modeling kit, the deficiencies of SMILES and other connection-table formats become apparent.

Formal Grammar

Syntax versus Semantics

This SMILES specification is divided into two distinct parts: a syntactic specification that describes how the atoms, bonds, parentheses, digits and so forth are represented, and a semantic specification that describes how those symbols are interpreted as a sensible molecule. For example, the syntax specifies how ring closures are written, but the semantics require that they come in pairs. Likewise, the syntax specifies how atomic elements are written, but the semantics determine whether a particular ring system is actually aromatic.

For this specification, the syntax and semantics are explained separately; in practice, the syntax and semantics are usually mixed together in the code that implements a SMILES parser. This chapter is only concerned with syntax.

Grammar

Section Formal Grammar

ATOMS

Atoms

atom ::= bracket_atom | aliphatic_organic | aromatic_organic | '*'

ORGANIC SUBSET ATOMS

Organic Subset

aliphatic_organic ::= 'B' | 'C' | 'N' | 'O' | 'S' | 'P' | 'F' | 'Cl' | 'Br' | 'I'

aromatic_organic ::= 'b' | 'c' | 'n' | 'o' | 's' | 'p'

BRACKET ATOMS

Bracket Atoms

bracket_atom ::= '[' isotope? symbol chiral? hcount? charge? class? ']'

symbol ::= element_symbols | element_numbers | aromatic_symbols | element_undefined

isotope ::= digit | digit_nonzero digit | digit_nonzero digit digit

element_symbols ::= 'H' | 'He' | 'Li' | 'Be' | 'B' | 'C' | 'N' | 'O' | 'F' | 'Ne' | 'Na' | 'Mg' | 'Al' | 'Si' | 'P' | 'S' | 'Cl' | 'Ar' | 'K' | 'Ca' | 'Sc' | 'Ti' | 'V' | 'Cr' | 'Mn' | 'Fe' | 'Co' | 'Ni' | 'Cu' | 'Zn' | 'Ga' | 'Ge' | 'As' | 'Se' | 'Br' | 'Kr' | 'Rb' | 'Sr' | 'Y' | 'Zr' | 'Nb' | 'Mo' | 'Tc' | 'Ru' | 'Rh' | 'Pd' | 'Ag' | 'Cd' | 'In' | 'Sn' | 'Sb' | 'Te' | 'I' | 'Xe' | 'Cs' | 'Ba' | 'La' | 'Ce' | 'Pr' | 'Nd' | 'Pm' | 'Sm' | 'Eu' | 'Gd' | 'Tb' | 'Dy' | 'Ho' | 'Er' | 'Tm' | 'Yb' | 'Lu' | 'Hf' | 'Ta' | 'W' | 'Re' | 'Os' | 'Ir' | 'Pt' | 'Au' | 'Hg' | 'Tl' | 'Pb' | 'Bi' | 'Po' | 'At' | 'Rn' | 'Fr' | 'Ra' |'Ac' | 'Th' | 'Pa' | 'U' | 'Np' | 'Pu' | 'Am' | 'Cm' | 'Bk' | 'Cf' | 'Es' | 'Fm' | 'Md' | 'No' | 'Lr' | 'Rf' | 'Db' | 'Sg' | 'Bh' | 'Hs' | 'Mt' | 'Ds' | 'Rg' | 'Cn' | 'Nh' | 'Fl' | 'Mc' | 'Lv' | 'Ts' | 'Og'

element_numbers ::= '#1' |'#2' | '#3' | …​ |'#118'

aromatic_symbols ::= 'b' | 'c' | 'n' | 'o' | 'p' | 's' | 'se' | 'te' | 'as'

element_undefined ::= '*' | '#0'

CHIRALITY

Chirality

chiral ::= '@' | '@@' | '@TH1' | '@TH2' | '@AL1' | '@AL2' | '@SP1' | '@SP2' | '@SP3' | '@TB1' | '@TB2' | '@TB3' | …​ | '@TB20' | '@OH1' | '@OH2' | '@OH3' | …​ | '@OH30' | '@TB' DIGIT DIGIT | '@OH' DIGIT DIGIT

HYDROGENS

Hydrogens

hcount ::= 'H' | 'H' digit | 'H' digit_nonzero digit

CHARGES

Charge

charge ::= '-' | '-' digit | '-' digit_nonzero digit | '+' | '+' digit | '+' digit_nonzero digit | '--' | … | '---------------' | '++' | … | '+++++++++++++++'

ATOM CLASS

Atom Class

class ::= ':' digit | ':' digit_nonzero digit | ':' digit_nonzero digit digit | ':' digit_nonzero digit digit digit

BONDS AND CHAINS

Bonds

bond ::= '-' | '=' | '#' | '$' | ':' | '/' | '\'

ringbond ::= bond? digit_nonzero | bond? '%' digit_nonzero digit | bond? '%' '(' digit_nonzero digit digit ')'

branched_atom ::= atom ringbond* branch*

branch ::= '(' chain ')' | '(' bond chain ')' | '(' dot chain ')'

chain ::= branched_atom | chain branched_atom | chain bond branched_atom | chain dot branched_atom

dot ::= '.'

DIGITS

digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

digit_nonzero ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

SMILES STRINGS

smiles ::= terminator | chain terminator

terminator ::= SPACE | TAB | LINEFEED | CARRIAGE_RETURN | END_OF_STRING
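As an illustration of how the bracket_atom production above might be turned into code, the following is a minimal sketch using a single regular expression. The names `BRACKET_ATOM` and `split_bracket_atom` are hypothetical and not part of this specification. The sketch checks syntax only and is deliberately loose in places: it does not check symbols against the element list, the upper bound of '#' atomic numbers, or the exact range of chirality class numbers.

```python
import re

# Regex sketch of: '[' isotope? symbol chiral? hcount? charge? class? ']'
# Assumption of this sketch: the symbol pattern accepts any capitalised
# one- or two-letter token rather than validating the 118 element symbols.
BRACKET_ATOM = re.compile(
    r"\["
    r"(?P<isotope>[1-9]\d{1,2}|\d)?"                      # n, nn or nnn, no leading zero
    r"(?P<symbol>#\d{1,3}|\*|se|te|as|[bcnops]|[A-Z][a-z]?)"
    r"(?P<chiral>@@|@(?:TH|AL|SP|TB|OH)[1-9]\d?|@)?"
    r"(?P<hcount>H(?:[1-9]\d|\d)?)?"                      # H, Hn or Hnn
    r"(?P<charge>[+-](?:[1-9]\d|\d)|\+{1,15}|-{1,15})?"   # sign then digits, or repeated sign
    r"(?P<atom_class>:(?:[1-9]\d{1,3}|\d))?"              # up to four digits
    r"\]"
)


def split_bracket_atom(text):
    """Return the raw fields of a bracket atom, or None if the syntax is rejected."""
    match = BRACKET_ATOM.fullmatch(text)
    return match.groupdict() if match else None


print(split_bracket_atom("[13CH4]"))
print(split_bracket_atom("[Ag+01]"))  # leading zero in the charge is rejected
```

The function returns the raw substrings only; interpreting them (for example, treating a bare 'H' as a hydrogen count of one) is a semantic step left to the caller.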

Reading SMILES

Atoms

Atomic Symbol

An atom is represented by its atomic symbol, enclosed in square brackets, []. The first character of the symbol is uppercase and the second (if any) is lowercase, except that for aromatic atoms (see Aromaticity), the first character is lowercase. Alternatively, an atom can also be represented using the symbol '#' followed by its atomic number enclosed in square brackets, []. There are 118 valid atomic symbols as defined by IUPAC.

The symbols '*' and '#0' are accepted as valid atomic symbols, and represent a "wildcard" or unknown atom. Importantly, '#0' must always be written within brackets (see Wildcard Atoms), while '*' is considered part of the Organic Subset and does not have this requirement.

Examples:

|===
|SMILES |Atomic Symbol

|[U] |Uranium
|[Pb] |Lead
|[He] |Helium
|[*] |Unknown atom
|[#0] |Unknown atom
|[#6] |Carbon
|===

Hydrogens

Hydrogens inside of brackets are specified as Hn or Hnn. In the case of a single digit, Hn, n is a digit number from 0 to 9. When two-digit numbers are used, leading zeros are not permitted, so Hnn is valid in the range of 10 through 99. If Hn is not specified, it is identical to H0. If H is specified without a number, it is identical to H1. For example, [C] and [CH0] are identical, and [CH] and [CH1] are identical.

Hydrogens that are specified in brackets with this notation have undefined isotope, no chirality, no other bound hydrogen, neutral charge, and an undefined atom class.

Examples:

|===
|SMILES |Name |Comments

|[CH4] |methane |
|[ClH] |hydrochloric acid |H1 implied
|[ClH1] |hydrochloric acid |
|===

A hydrogen atom can not have a hydrogen count, for example, [HH1] is invalid. Hydrogens connected to other hydrogens must be represented as explicit atoms in square brackets. For example molecular hydrogen must be written as [H][H].

Charge

Charge is specified by +n, +nn, -n, or -nn. In the case of a single digit, n is a digit from 0 to 9. When two-digit numbers are used, leading zeros are not permitted, so nn is valid in the range of 10 through 99. For example, a charge specification such as [Ag+01] is invalid.

If a charge is specified without a number, a 1 is implied. If the number is 0, the charge is interpreted as no charge. That is, [Ag+0] and [Ag-0] are equivalent to [Ag]. A 0 charge should be avoided and is not recommended as a best practice.

Repeated symbols such as '--' and '++' are valid and interpreted as charges of -2 and +2, respectively. Symbols can repeat up to 15 times. However, this form is not recommended as a best practice and should be avoided.

Examples:

|===
|SMILES |Name |Comments

|[Cl-] |chloride anion |-1 charge, H0 implied
|[OH1-] |hydroxide anion |-1 charge, H1
|[OH-1] |hydroxide anion |-1 charge, H1
|[Cu+2] |copper cation |+2 charge, H0 implied
|[Cu++] |copper cation |+2 charge, H0 implied
|[AlH4-] |alumanuide |-1 charge, H4
|[NH2-] |azanide |-1 charge, H2
|===

The charge symbol '-' or '+' must come before the numeric digit. That is, [Mg+2] is valid, while [Mg2+] is invalid. Moreover, multiple charge specifications such as [Li+1-2] or [Li+-] are invalid.
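The charge rules in this section can be sketched as a small interpreter for the charge field of a bracket atom. This is only an illustration under stated assumptions: the function name `parse_charge` is hypothetical, and the input is assumed to be the charge substring alone (for example '+2' taken from [Cu+2]).

```python
import re


def parse_charge(field):
    """Interpret a charge field such as '+', '-2' or '++' as an integer."""
    if field == "":
        return 0  # no charge specified
    if field[0] not in "+-":
        raise ValueError(f"charge must start with '+' or '-': {field!r}")
    sign = 1 if field[0] == "+" else -1
    rest = field[1:]
    if rest == "":
        return sign  # a bare sign implies a magnitude of 1
    if re.fullmatch(r"[0-9]|[1-9][0-9]", rest):
        return sign * int(rest)  # n or nn, no leading zero
    if set(rest) == {field[0]} and len(field) <= 15:
        return sign * len(field)  # repeated sign, up to 15 symbols
    raise ValueError(f"invalid charge specification: {field!r}")


print(parse_charge("+2"), parse_charge("++"), parse_charge("-"), parse_charge("-0"))
```

Note that the repeated-sign form is accepted but, as stated above, it is not recommended as a best practice.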

Isotopes

Isotopic specification is placed inside the square brackets for an atom preceding the atomic symbol; for example:

|===
|SMILES |Name

|[13CH4] |methane
|[2H+] |deuterium ion
|[238U] |Uranium 238 atom
|===

An isotope is interpreted as a digit number specified as n, nn, or nnn. In the case of a single digit, n is a digit number from 0 to 9. When two-digit or three-digit numbers are used, leading zeros are not permitted, so nn is valid in the range of 10 through 99, and nnn in the range of 100 through 999. As a result, an isotopic specification leading with a 0 and followed by digits such as [098Ru] is invalid. The isotope 98Ru is written as [98Ru].

A 0 isotope specification is equivalent to undefined, and the atom is assumed to have the naturally-occurring isotopic ratios. For example, [0S] is equivalent to [S].

There is no requirement that the isotope is a genuine isotope of the element. Thus, [36Cl] is allowed even though 35Cl and 37Cl are the actual known stable isotopes of chlorine.

Organic Subset

Most elements require specifying the hydrogen count with either an atom property H-count or with the use of explicit hydrogen, for example:

|===
|SMILES |Name |Molecular Formula

|[Ru] |Ruthenium |Ru
|[RuH2] |Ruthenium Dihydride |H2Ru
|[H][Ru][H] |Ruthenium Dihydride |H2Ru
|===

However, a special subset of elements called the "organic subset" of 'B', 'C', 'N', 'O', 'P', 'S', 'F', 'Cl', 'Br', 'I', and '*' (the "wildcard" atom) can be written using only the atomic symbol (that is, without the square brackets, H-count, etc.). An atom that is specified this way has the following properties:

  • implicit hydrogens are added such that valence of the atom is in the lowest normal state for that element

  • the atom’s charge is zero

  • the atom has no isotopic specification

  • the atom has no chiral specification

The implicit hydrogen count is determined by summing the bond orders of the bonds connected to the atom. If that sum is equal to a known valence for the element or is greater than any known valence then the implicit hydrogen count is 0. Otherwise the implicit hydrogen count is the difference between that sum and the next highest known valence.

Atom Valence and Hydrogens

The "normal valence" for the organic subset elements is as follows:

|===
|Element |B |C |N |O |F, Cl, Br, I |P |S |*

|Neutral Valence |3 |4 |3 |2 |1 |3,5 |2,4,6 |unspecified
|Notes | | |b | | |a |a |
|===

a: In IUPAC SMILES+, any atom that is above its lowest valence state will not have implicit hydrogens.

b: In IUPAC SMILES+, 5 valent Nitrogen is not permitted.
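The implicit hydrogen rules above can be sketched as follows. The table `NORMAL_VALENCES` and the function `implicit_hydrogens` are hypothetical names used for illustration. The flag `lowest_valence_only` is an assumption of this sketch: when true it applies note a (no implicit hydrogens above the lowest valence state), and when false it applies the general "next highest known valence" rule described earlier.

```python
# Normal valences of the organic subset, taken from the table above.
# '*' has an unspecified valence and never receives implicit hydrogens.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3,), "O": (2,),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
    "P": (3, 5), "S": (2, 4, 6), "*": (),
}


def implicit_hydrogens(element, bond_order_sum, lowest_valence_only=True):
    """Implicit hydrogen count for an organic-subset atom written without brackets."""
    valences = NORMAL_VALENCES[element]
    if not valences:
        return 0
    if lowest_valence_only:
        # Note a: an atom above its lowest valence state gets no implicit hydrogens.
        return max(valences[0] - bond_order_sum, 0)
    for valence in valences:
        if bond_order_sum <= valence:
            return valence - bond_order_sum  # fill up to the next known valence
    return 0  # bond order sum exceeds every known valence


print(implicit_hydrogens("C", 0))  # methane carbon
print(implicit_hydrogens("S", 1))  # sulfur in CS
print(implicit_hydrogens("S", 4))  # sulfur in CS(=O)C
```

The two modes differ only for elements with more than one normal valence; for example, a phosphorus atom with a bond order sum of 4 gets no implicit hydrogen under note a but one under the general rule.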

Examples of valence:

|===
|SMILES |Molecular Formula |Notes

|C |CH4 |Add four H for neutral valence
|N |NH3 |Add three H for neutral valence
|Cl |HCl |Add one H for neutral valence
|C[N+](=O)[O-] |CH3NO2 |Nitrogen is valence 4
|C1=CC=C(C=C1)P(C2=CC=CC=C2)C3=CC=CC=C3 |(C6H5)3P |Phosphorus is valence 3
|C1=CC=C(C=C1)[P+](=O)C2=CC=CC=C2 |(C6H5)2OP+ |Phosphorus is valence 4
|CS |CH3SH |Add one H to Sulfur for neutral valence
|CS(=O)C |(CH3)2SO |Sulfur is valence 4
|===

Hydrogens in a SMILES can be represented in three different ways:

|===
|Method |SMILES |Name |Comments

|implicit hydrogen |C |methane |h-count deduced from normal valence (4)
|atom property |[CH4] |methane |h-count specified for heavy atom
|explicit hydrogen |[H]C([H])([H])[H] |methane |hydrogens represented as normal atoms
|===

All three forms are equivalent. However, some situations require that one form must be used:

  • Implicit hydrogen count may only be used for elements of the organic elements subset.

  • Any atom that is specified with square brackets must have its attached hydrogens explicitly represented, either as a hydrogen count or as normal atoms.

A hydrogen that meets one of the following criteria must be represented as an explicit atom:

  • hydrogens with charge ([H+])

  • a hydrogen connected to another hydrogen (such as molecular hydrogen, [H][H])

  • hydrogens with more than one bond (bridging hydrogens)

  • Deuterium [2H] and tritium [3H]

It is permissible to use a mixture of an atom h-count and explicit hydrogen. In such a case, the atom’s hydrogen count is the sum of the atomic h-count property and the number of attached hydrogens. For example:

|===
|SMILES |Name

|[CH4] |methane
|[H][CH2][H] |methane
|[2H][CH2]C |deuteroethane
|===

As a best practice, when possible, use the atom property designation instead of explicit hydrogen; that is, [CH4] is preferred over [CH3][H].

Note: The remaining atom properties, chirality and ring-closures, are discussed in later sections.

The Wildcard '*' Atomic Symbol and '#0' Atomic Number Symbol

The '*' and '#0' atoms represent an atom whose atomic number is unknown or unspecified. The '*' atom can occur either inside or outside of square brackets, [], as the '*' atom is considered part of the special Organic Subset elements. In contrast, the '#0' atom can only occur inside of square brackets. When a '*' or '#0' atom occurs inside of square brackets, the unknown atom can have its isotope, chirality, hydrogen count and charge specified. When a '*' atom occurs outside of brackets, it has no assumed isotope, a mass of zero, unspecified chirality, a hydrogen count of zero, and a charge of zero.

The '*' and '#0' atoms do not have any specific electronic properties or valence. When these symbols occur inside of square brackets, they take on the valence implied by their bonds, hydrogens and/or charge.

When the '*' atom is specified outside of square brackets, it takes on the valence implied by its bonds.

An '*' or '#0' atom can be part of an aromatic ring. When deducing the aromaticity of a ring system, the ring system is considered aromatic if there is an element which could replace the '*' or '#0' atom and make the ring system meet the aromaticity rules (see Aromaticity, below).

|===
|SMILES |Comments

|CCC(*)C |any atom at index number 3
|CCC([#0])C |any atom at index number 3
|c1cc[#0]c1 |five membered arene with any atom
|c1cc*c1 |five membered arene with any atom
|CCC[#0-] |any atom with charge -1 at index number 3
|[73*] |any atom with isotope value 73
|===

Note: the term index in the table comments above is referring to the SMILES string, counting from left to right and starting at 0.

Atom Class

An "atom class" is an arbitrary integer that has no chemical meaning. It is used by applications to mark atoms in ways that are meaningful only to the application. Multiple atoms may be labeled with the same atom class.

The atom class is specified as a digit number n, nn, nnn, or nnnn. In the case of a single digit, n is a digit number from 0 to 9. When two-digit, three-digit, or four-digit numbers are used, leading zeros are not permitted, so nn is valid in the range of 10 through 99, nnn in the range of 100 through 999, and nnnn in the range of 1000 to 9999. So for example, [NH4+:5] is valid, while [NH4+:005] is invalid.

If the atom class is not specified, then the atom class is interpreted as zero.

The atom class is specified after all other properties in square brackets. For example:

|===
|SMILES |Name

|[CH4:2] |methane, atom’s class is 2
|===

Bonds

Atoms that are adjacent in a SMILES string are assumed to be joined by a single or aromatic bond (see Aromaticity). For example:

|===
|SMILES |Name

|CC |ethane
|CCO |ethanol
|NCCCC |n-butylamine
|CCCCN |n-butylamine
|===

Double, triple and quadruple bonds are represented by '=', '#', and '$' respectively:

|===
|SMILES |Name

|C=C |ethene
|C#N |hydrogen cyanide
|CC#CC |2-butyne
|CCC=O |propanal
|[Re-](Cl)(Cl)(Cl)(Cl)$[Re-](Cl)(Cl)(Cl)Cl |octachlorodirhenate (III)
|===

Misplaced bonds such as CC= or duplicate bonds such as C==C are invalid.

A single bond can be explicitly represented with '-', but it is rarely necessary.

|===
|SMILES |Comments

|C-C |same as: CC
|C-C-O |same as: CCO
|C-C=C-C |same as: CC=CC
|===

The remaining bond symbols, ':', '/', and '\', are discussed in later sections.

Branches

An atom with three or more bonds is called a branched atom, and is represented using parentheses.

|===
|SMILES |Name

|CCC(CC)CO |2-ethyl-1-butanol
|===

Branches can be nested or "stacked" to any depth:

|===
|SMILES |Name

|CC(C)C(=O)C(C)C |2,4-dimethyl-3-pentanone
|OCC(CCC)C(C(C)C)CCC |3‐isopropyl‐2‐propylhexan‐1‐ol
|OS(=O)(=S)O |thiosulfate
|===

The SMILES branch/chain rules allow nested parenthetical expressions (branches) to an arbitrary depth. For example, the following SMILES, though peculiar, is legal:

SMILES Formula

C(C(C(C(C(C(C(C(C(C(C(C(C(C(C(C(C(C(C(C(C))))))))))))))))))))C

C22H46

In IUPAC SMILES+ syntax, a SMILES can not start with a branch. Duplicate branches and branching without atom(s) inside are also invalid. Some invalid examples include: C((C))O, (N1CCCC1), C(1CC1), C(), and C1CC(1). Note that the form (CO)N is never allowed, since it is not clear which atom the nitrogen should connect to.
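The branch restrictions above can be checked with a single left-to-right scan. The following is a minimal sketch, not a full parser; the function name `branches_valid` is hypothetical, and the check looks only at parentheses and the characters next to them, skipping the contents of bracket atoms.

```python
def branches_valid(smiles):
    """Check the branch restrictions: no leading, duplicate or empty branches."""
    depth = 0
    in_bracket = False
    prev = None  # previous character seen
    for ch in smiles:
        if in_bracket:
            in_bracket = ch != "]"  # skip everything inside a bracket atom
        elif ch == "[":
            in_bracket = True
        elif ch == "(":
            if prev is None or prev == "(":
                return False  # SMILES starts with a branch, or duplicate branch
            depth += 1
        elif ch == ")":
            if prev == "(" or depth == 0:
                return False  # empty branch, or unbalanced parenthesis
            depth -= 1
        elif prev == "(" and ch in "0123456789%":
            return False  # a branch must not begin with a ring-closure number
        prev = ch
    return depth == 0


for smi in ["CCC(CC)CO", "OS(=O)(=S)O", "C((C))O", "(N1CCCC1)", "C(1CC1)", "C()", "C1CC(1)"]:
    print(smi, branches_valid(smi))
```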

Rings

In a SMILES string such as C1CCCCC1, the first occurrence of a ring-closure number (an "rnum") creates an "open bond" to the atom that precedes it. When that same rnum is encountered later in the string, a bond is made between the two atoms, which typically forms a cyclic structure.

|===
|SMILES |Name

|C1CCCCC1 |cyclohexane
|N1CC2CCCCC2CC1 |perhydroisoquinoline
|===

If a bond symbol is present between the atom and rnum, it can be present on either or both bonded atoms. However, if it appears on both bonded atoms, the two bond symbols must be the same. If there are conflicting ring closure bonds defined such as C=1CCCCC#1, C-1CCCCC=1, or c=1ccccc:1, the SMILES are invalid. Two defined bonds must match. In contrast, if only one bond is defined as in C=1CCCCC1 or C1CCCC=1, the SMILES are valid and interpreted with the defined bond at ring closure. The implicit bond is ignored.

|===
|SMILES |Name

|C=1CCCCC=1 |cyclohexene
|C=1CCCCC1 |cyclohexene (preferred form)
|C1CCCCC=1 |cyclohexene
|C-1CCCCC=1 |invalid
|===

Ring closures must be matched pairs in a SMILES string, for example, C1CCC or C1CCCCC2 are not valid SMILES.

It is permissible to re-use ring-closure numbers, but is not recommended as a best practice. Once a particular number has been encountered twice, that number is available again for subsequent ring closures.

|===
|SMILES |Name |Comment

|C1CCCCC1C1CCCCC1 |bicyclohexyl |
|C1CCCCC1C2CCCCC2 |bicyclohexyl |recommended form
|===

Two-digit ring numbers are permitted, but must be preceded by the percent '%' symbol, such as C%25CCCCC%25 for cyclohexane.

Three-digit ring numbers must use parentheses, %(nnn), and start with 100. This is to avoid the ambiguity of, for example, C%123 being interpreted as one ring closure (%123) or two (%12 and 3). When parentheses are used, the ring closure is interpreted with one rnum specification, so cyclohexane can be represented as C%(123)CCCCC%(123).

Note that the ring number zero is invalid (e.g., C0CCCCC0) in IUPAC SMILES+. Ring numbers must start at n=1. In addition, when multiple digits are used in the case of %nn or %(nnn), a leading number of 0 is invalid, such as C%01CCCCC%01 or C%00CCCCC%00. For notation using %nn, start with a nn of 10, and for notation in the form %(nnn), start with nnn of 100.
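For a SMILES writer, the three ring-number notations above reduce to a choice based on the size of the number. A minimal sketch follows; the helper name `format_rnum` is hypothetical.

```python
def format_rnum(n):
    """Write ring-closure number n in the notation required for its size."""
    if 1 <= n <= 9:
        return str(n)      # single digit
    if 10 <= n <= 99:
        return f"%{n}"     # two digits, '%' prefix
    if 100 <= n <= 999:
        return f"%({n})"   # three digits, parenthesised
    raise ValueError(f"ring number out of range: {n}")


print(format_rnum(1), format_rnum(25), format_rnum(123))
```

Because the number is formatted from an integer, leading zeros can not occur, and zero itself is rejected.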

A single atom can have several ring-closure numbers, such as this spiro atom:

|===
|SMILES |Name

|C12(CCCCC1)CCCCC2 |spiro[5.5]undecane
|C3%12(CCCCC3)CCCCC%12 |spiro[5.5]undecane
|===

Two atoms can not be joined by more than one bond, and an atom can not be bonded to itself. For example, the following are not allowed:

|===
|SMILES |Comments

|C12CCCCC12 |illegal, two bonds between one pair of atoms
|C12C2CCC1 |illegal, two bonds between one pair of atoms
|C11 |illegal, atom bonded to itself
|===
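The ring-closure rules in this section can be pulled together in a small scanner. The sketch below is an illustration under stated assumptions, not a complete SMILES parser: the names `TOKEN` and `ring_closures` are hypothetical, bracket atoms are treated as opaque tokens, and only the ring-related checks are made (matched pairs, agreeing bond symbols, no ring number zero or leading zeros, no self-bonds, no duplicate bonds).

```python
import re

TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Cl|Br|[BCNOSPFI]|[bcnosp]|\*)"
    r"|(?P<bond>[-=#$:/\\])"
    r"|(?P<ring>%\((?P<r3>[1-9]\d\d)\)|%(?P<r2>[1-9]\d)|(?P<r1>[1-9]))"
    r"|(?P<punct>[().])"
)


def ring_closures(smiles):
    """Return (atom_a, atom_b, bond_symbol) for each ring-closure bond."""
    open_rings, bonded, closures, stack = {}, set(), [], []
    prev, count, pending, dot, pos = None, 0, None, False, 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:  # e.g. ring number 0 or a leading zero after '%'
            raise ValueError(f"unexpected character at position {pos}")
        pos = m.end()
        if m.group("atom"):
            if prev is not None and not dot:
                bonded.add((prev, count))  # ordinary chain bond
            prev, count, pending, dot = count, count + 1, None, False
        elif m.group("bond"):
            pending = m.group("bond")
        elif m.group("ring"):
            n = int(m.group("r3") or m.group("r2") or m.group("r1"))
            if prev is None or dot:
                raise ValueError("ring number must directly follow an atom")
            if n in open_rings:
                start, bond = open_rings.pop(n)
                if bond and pending and bond != pending:
                    raise ValueError(f"conflicting bond symbols on ring {n}")
                if start == prev:
                    raise ValueError("atom bonded to itself")
                if (start, prev) in bonded:
                    raise ValueError("two bonds between one pair of atoms")
                bonded.add((start, prev))
                closures.append((start, prev, bond or pending))
            else:
                open_rings[n] = (prev, pending)
            pending = None
        elif m.group("punct") == "(":
            stack.append(prev)
        elif m.group("punct") == ")":
            if not stack:
                raise ValueError("unbalanced parenthesis")
            prev = stack.pop()
        else:
            dot = True  # '.' means the next atom is not bonded to prev
    if open_rings:
        raise ValueError(f"unmatched ring numbers: {sorted(open_rings)}")
    return closures


print(ring_closures("C=1CCCCC1"))
print(ring_closures("C12(CCCCC1)CCCCC2"))
```

Atom indices count atoms from left to right starting at 0. A ring number is released as soon as it is closed, so re-used numbers such as those in C1CCCCC1C1CCCCC1 are handled without special treatment.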

Aromaticity

The Meaning of "Aromaticity" in SMILES

"Aromaticity" in SMILES is primarily for cheminformatics purposes. In a cheminformatics system, we would like to have a single representation for each molecule. The Kekule form masks the inherent uniformity of the bonds in an aromatic ring. SMILES uses a simplified definition of aromaticity that facilitates substructure and exact-structure searches, as well as Normalization and Canonicalization of SMILES.

The definition of "aromaticity" in SMILES is not intended to imply anything about the physical or chemical properties of a substance. In many or most cases, the SMILES definition of aromaticity will match the chemist’s notion of what is aromatic, but in some cases it will not.

Kekule and Aromatic Representations

Aromaticity can be represented in one of two ways in a SMILES.

  • In the Kekule form, using alternating single and double bonds, with uppercase symbols for the atoms.

  • In the aromatic form, using aromatic atomic symbols that begin with a lowercase letter, such as 'c' for aromatic carbon. Aromatic bond symbols, ':', can be optionally used as well.

Aromatic Atom and Bond Rules

Rule #1: Avoid sharing aromatic SMILES. It is strongly recommended to Kekulize all SMILES prior to exchanging data, as doing so removes ambiguity in the interpretation of an aromaticity data model. There is no standard aromaticity model or consistent encoding and decoding approach.

Rule #2: A lowercase aromatic symbol is defined as an atom in the sp2 configuration in a ring system. For example:

|===
|Aromatic SMILES |Kekule SMILES |Name

|c1ccccc1 |C1=CC=CC=C1 |benzene
|c1ccc2CCCc2c1 |C1=CC=CC(CCC2)=C12 |indane
|c1occc1 |C1OC=CC=1 |furan
|c1ccc1 |C1=CC=C1 |cyclobutadiene
|===

Rule #3: The allowed aromatic atoms are 'b' | 'c' | 'n' | 'o' | 'p' | 's' | 'se' | 'te' | 'as'

Rule #4: The aromatic-bond symbol ':' can be used between aromatic atoms, although it is rarely or never necessary to specify a chemical structure (e.g., both c1ccccc1 and c:1:c:c:c:c:c:1 are acceptable)

Rule #5: Aromatic bonds must be connected to aromatic atoms. As such, it is invalid syntax to define an aromatic bond between non-aromatic atoms (e.g., C:1:C:C:C:C:C:1).

Rule #6: Aromatic atoms must have at least two and no more than three aromatic bonds. As such, the following example notations are invalid: c-1-c-c-c-c-c-1, c-1=c-c=c-c=c-1.

Rule #7: Aromatic atoms must be within rings. Aromatic atoms outside of rings are invalid. For example, notation such as CCc or CccccC is invalid.

Rule #8: Aromaticity must be continuous; that is, notation such as c1ccCCc1 is invalid as the aromaticity is “broken”.

Rule #9: A bond between two aromatic atoms is assumed to be aromatic unless it is explicitly represented as a single bond '-', which is required when there is a single bond between two aromatic atoms or a single bond that is part of an aliphatic ring. In these cases, a single bond must be explicitly represented in aromatic SMILES. See Noel O'Boyle and John Mayfield's ACS Talk on Kekulization and Aromaticity. For example:

|===
|SMILES |Name

|c1ccccc1-c2ccccc2 |biphenyl
|C1Cc2ccccc2-c2ccccc12 |9,10-dihydrophenanthrene
|c1ccc2c(c1)-c1ccccc1-2 |biphenylene
|===

Rule #10: Unsubstituted aromatic heteroatoms with a non-zero hydrogen count must indicate the hydrogen count explicitly. For example, imidazole and 2-methylbenzimidazole are written as c1c[nH]cn1 and Cc1nc2ccccc2[nH]1, respectively (e.g., https://doi.org/10.1186/s13321-015-0061-y).

Reading Aromatic SMILES

The concept of aromaticity in SMILES has led to considerable discussion over the past two decades (see Aromaticity Discussion References). In short, ambiguities and differences in aromaticity perception, via kekulization algorithms, have led to inconsistent SMILES interpretation across toolkits. A common area of disagreement is the hydrogen count and double bond count (absence or loss of H2, or in other words, addition or subtraction of a double bond). As the main goal of IUPAC SMILES+ is to promote consistent data exchange of SMILES, exchanging aromatic SMILES is strongly discouraged (due to inconsistencies of interpretation between toolkit versions and between different toolkits). However, if it is necessary to read aromatic SMILES, the aromaticity as defined in the SMILES atoms and bonds should be preserved. By preserving the defined aromaticity, no special chemical perception is necessary:

  1. An atom specified in lower case is aromatic and unspecified bonds between two aromatic atoms are aromatic.

  2. A bond specified with ':' is aromatic.

  3. To determine the hydrogen count for aromatic atoms, follow the below steps (adapted from Noel O'Boyle and John Mayfield's ACS Talk on Kekulization and Aromaticity):

    1. Non zero heteroatom hydrogen count must be explicitly specified using brackets (Rule 10). As such, if the aromatic atom is within square brackets, the hydrogen count is explicit (e.g., for [nH], the hydrogen count is 1)

    2. If an aromatic carbon atom does not contain an explicit hydrogen count (no brackets) and has 2-3 aromatic bonds (meets Rule #6), then the standard valence rules apply. The implicit hydrogen count is determined by summing the bond orders of the connected bonds to the carbon atom (as all aromatic heteroatom hydrogen counts must be explicit). Aromatic bonds are equivalent to single bond orders for the purposes of hydrogen counting. If that sum is equal to a known valence for the element or is greater than any known valence then the implicit hydrogen count is 0. Otherwise the implicit hydrogen count is the difference between that sum and the next highest known valence, minus 1 (e.g., for c1ccccc1, benzene, each 'c' is 4 (valence) - 2 (bond order) - 1 (aromatic rule) = 1 (hydrogen)).

If the hydrogens can not be assigned based on steps 1-3 above without invalid valencies occurring (e.g., to fulfill a particular aromaticity model), then the SMILES is invalid.
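The hydrogen-counting arithmetic for unbracketed aromatic carbon described above can be written out as a short function. This is a sketch under stated assumptions: the name `aromatic_carbon_hcount` and its two parameters are hypothetical, aromatic bonds count as order 1, and the only valence considered for carbon is 4.

```python
def aromatic_carbon_hcount(aromatic_bonds, other_bond_order_sum=0):
    """Implicit hydrogens on an unbracketed aromatic carbon ('c')."""
    if not 2 <= aromatic_bonds <= 3:
        raise ValueError("an aromatic atom needs two or three aromatic bonds")
    bond_order_sum = aromatic_bonds + other_bond_order_sum
    # valence (4) - bond order sum - 1 (aromatic rule), never below zero
    return max(4 - bond_order_sum - 1, 0)


print(aromatic_carbon_hcount(2))     # a benzene carbon
print(aromatic_carbon_hcount(2, 1))  # ring carbon carrying one substituent
print(aromatic_carbon_hcount(3))     # ring-fusion carbon
```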

Aromaticity Models

Users may choose to apply an aromaticity algorithm depending on the specific SMILES input and desired downstream cheminformatics computations. See Aromaticity Perception references for example aromaticity models. However, the use of aromatic SMILES should be temporary and for internal use only within the specific toolkit and version; that is, aromatic SMILES may be required for certain cheminformatics tasks, but upon sharing data, they should be kekulized.

Disconnected Structures

The dot '.' symbol (also called a "dot bond") is legal most places where a bond symbol would occur, but indicates that the atoms are not bonded. The most common use of the dot-bond symbol is to represent disconnected and ionic compounds.

|===
|SMILES |Name

|[Na+].[Cl-] |sodium chloride
|Oc1ccccc1.NCCO |phenol, 2-amino ethanol
|[NH4+].[NH4+].[O-]S(=O)(=O)[S-] |diammonium thiosulfate
|===

The dot can appear most places that a bond symbol is allowed, for example, the phenol example above can also be written:

|===
|SMILES |Name

|c1cc(O.NCCO)ccc1 |phenol, 2-amino ethanol
|Oc1cc(.NCCO)ccc1 |phenol, 2-amino ethanol
|===

The second example above is an odd, but legal, use of parentheses and the dot bond, since the syntax allows a dot most places a regular bond could appear (the exception is that a dot can not appear before a ring-closure digit).

Although dot-bonds are commonly used to represent compounds with disconnected parts, a dot-bond does not in itself mean that there are disconnected parts in the compound. See the following section regarding ring digits for some examples that illustrate this.

The dot bond can not be used in front of a ring-closure digit. For example, C.1CCCCC.1 is invalid. Duplicate dot bonds such as [Na+]..[Cl-] are invalid. Further, disconnections must occur between exactly two components, and as a result, leading or trailing dots are invalid (e.g., .CCO or CCO.).
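The dot restrictions listed above are simple enough to check on the raw string. The following is a minimal sketch; the function name `dots_valid` is hypothetical, and it covers only the four cases named in the paragraph above, not every placement the grammar excludes.

```python
import re


def dots_valid(smiles):
    """Check the dot-bond restrictions on a SMILES string."""
    if smiles.startswith(".") or smiles.endswith("."):
        return False  # leading or trailing dot
    if ".." in smiles:
        return False  # duplicate dot bonds
    # A dot directly in front of a ring-closure number ('1', '%12', '%(123)')
    return re.search(r"\.[0-9%]", smiles) is None


for smi in ["[Na+].[Cl-]", "Oc1cc(.NCCO)ccc1", "C.1CCCCC.1", "[Na+]..[Cl-]", ".CCO", "CCO."]:
    print(smi, dots_valid(smi))
```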

Other Uses of Ring Numbers and Dot Bond

A ring-number specification ("rnum") is most commonly used to specify a ring-closure bond, but when used with the '.' dot-bond symbol, it can also specify a non-ring bond. Two rnums in a SMILES mean that the two atoms that precede the rnums are bonded. A dot-bond '.' means that the atoms to which it is adjacent in the SMILES string are not bonded to each other. By combining these two constructs, one can "piece together" fragments of SMILES into a whole molecule. The following SMILES illustrate this:

SMILES/Depiction Fragment SMILES Name

CC

C1.C1

ethane

CCC

C1.C12.C2

propane

1 bromo 2 3 dichlorobenzene

c1c2c3c4cc1.Br2.Cl3.Cl4

1-bromo-2,3-dichlorobenzene

This feature of SMILES provides a convenient method of enumerating the molecules of a combinatorial library using string concatenation.
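As an illustration of enumeration by string concatenation, the sketch below joins a scaffold fragment with one substituent fragment per open rnum. The function `enumerate_library` and the substituent lists are hypothetical choices made for this example; the scaffold is the benzene fragment from the table above.

```python
from itertools import product


def enumerate_library(scaffold, substituents):
    """Yield one SMILES per combination of substituents.

    scaffold     -- SMILES fragment with one open rnum per attachment point
    substituents -- dict mapping an rnum (as text) to a list of fragment SMILES;
                    the rnum is appended to the fragment, so it attaches to the
                    last atom written in that fragment
    """
    rnums = sorted(substituents)
    for choice in product(*(substituents[r] for r in rnums)):
        pieces = [scaffold] + [frag + r for frag, r in zip(choice, rnums)]
        yield ".".join(pieces)


library = list(enumerate_library(
    "c1c2c3c4cc1",
    {"2": ["Br", "F"], "3": ["Cl"], "4": ["Cl", "O"]},
))
print(library[0])  # c1c2c3c4cc1.Br2.Cl3.Cl4
```

The output of such a generator is valid but un-normalized SMILES; a SMILES writer would normally be used afterwards to produce a conventional form.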

Stereochemistry

Scope of Stereochemistry in SMILES

A SMILES string can specify the cis/trans configuration around a double bond, and can specify the chiral configuration of specific atoms in a molecule.

SMILES strings do not represent all types of stereochemistry. Examples of stereochemistry that can not be encoded into a SMILES include:

  • Gross conformational left or right handedness such as helices

  • Mechanical interference, such as nominally rotatable bonds whose rotation is blocked by neighboring groups

  • Gross conformational stereochemistry such as the shape of a protein after folding

Tetrahedral Centers

SMILES uses an atom-centered chirality specification, in which the atom’s left-to-right order in the SMILES string itself is used as the basis for the chirality marking.

Tetrahedral Chirality

look from N towards C (chiral center)

list the neighbors anticlockwise

tetrahedral

N[C@](Br)(O)C

…​or clockwise

N[C@@](Br)(C)O

For the structure above, starting with the nitrogen atom, one "looks" toward the chiral center. The remaining three neighbor atoms are written by listing them in anticlockwise order using the '@' chiral property on the atom, or in clockwise order using the '@@' chiral property, as illustrated above. The '@' symbol is a "visual mnemonic" in that the spiral around the character goes in the anticlockwise direction, and means "anticlockwise" in the SMILES string (thus, '@@' can be thought of as anti-anti-clockwise).

A chiral center can be written starting anywhere in the SMILES string, and the choice of whether to list the remaining neighbors in clockwise or anticlockwise order is also arbitrary. The following SMILES are all equivalent and all specify the exact same chiral center illustrated above:

Equivalent SMILES

N[C@](Br)(O)C

Br[C@](O)(N)C

O[C@](Br)(C)N

Br[C@](C)(O)N

C[C@](Br)(N)O

Br[C@](N)(C)O

C[C@@](Br)(O)N

Br[C@@](N)(O)C

[C@@](C)(Br)(O)N

[C@@](Br)(N)(O)C
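The equivalence of these forms can be checked with a permutation-parity argument: exchanging any two neighbors of a tetrahedral center inverts its handedness, so an even reordering of the neighbors keeps the same mark and an odd reordering swaps '@' and '@@'. The sketch below applies this to the neighbor orders read off the SMILES above. The helper names are hypothetical, and the neighbor tuples were transcribed by hand rather than parsed from the strings.

```python
REFERENCE = ("N", "Br", "O", "C")  # neighbor order of N[C@](Br)(O)C


def is_even_permutation(order, reference=REFERENCE):
    """True if `order` is an even permutation of `reference` (inversion count)."""
    idx = [reference.index(atom) for atom in order]
    inversions = sum(
        1
        for i in range(len(idx))
        for j in range(i + 1, len(idx))
        if idx[i] > idx[j]
    )
    return inversions % 2 == 0


def same_center(order, mark):
    """True if neighbors written in `order` with `mark` match N[C@](Br)(O)C."""
    return is_even_permutation(order) == (mark == "@")


# (neighbor order as written, chirality mark) for each SMILES listed above
equivalent_forms = [
    (("N", "Br", "O", "C"), "@"),   # N[C@](Br)(O)C
    (("Br", "O", "N", "C"), "@"),   # Br[C@](O)(N)C
    (("O", "Br", "C", "N"), "@"),   # O[C@](Br)(C)N
    (("Br", "C", "O", "N"), "@"),   # Br[C@](C)(O)N
    (("C", "Br", "N", "O"), "@"),   # C[C@](Br)(N)O
    (("Br", "N", "C", "O"), "@"),   # Br[C@](N)(C)O
    (("C", "Br", "O", "N"), "@@"),  # C[C@@](Br)(O)N and [C@@](C)(Br)(O)N
    (("Br", "N", "O", "C"), "@@"),  # Br[C@@](N)(O)C and [C@@](Br)(N)(O)C
]
print(all(same_center(order, mark) for order, mark in equivalent_forms))  # True
```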

One exception to the atom order is when these atoms are bonded to the chiral center via a ring-closure bond. In these cases, it is the order of the bonds to these atoms (that is, the position of the ring-closure digit) that should be considered. The two SMILES below are equivalent:

Equivalent SMILES

FC1C[C@](Br)(Cl)CCC1

[C@]1(Br)(Cl)CCCC(F)C1

If one of the neighbor atoms is a hydrogen and is represented as an atomic property of the chiral center (rather than explicitly as [H]), then it is considered to be the first atom in the clockwise or anticlockwise accounting. For example, if we replaced the bromine in the illustration above with a hydrogen atom, its SMILES would be:

Implicit Hydrogen

N[C@H](O)C

Cis/Trans configuration of Double Bonds

The configuration of atoms around double bonds is specified by the bond symbols '/' and '\'. These symbols always come in pairs, and indicate cis or trans with a visual "same side" or "opposite side" concept. That is:

Depiction SMILES Name

trans difluoroethene

F/C=C/F

trans-difluoroethene (both SMILES are equivalent)

F\C=C\F

cis difluoroethene

F\C=C/F

cis-difluoroethene (both SMILES are equivalent)

F/C=C\F

The "visual interpretation" of the '/' and '\' symbol is that they are thought of as bonds that "point" above or below the alkene bond. That is, F/C=C/Br means "The F is below the first carbon, and the Br is above the second carbon," leading to the interpretation of a trans configuration.

This notation can be confusing when parentheses follow one of the alkene carbons:

SMILES Name

F/C=C/F

trans-difluoroethene

C(\F)=C/F

F\C=C/F

cis-difluoroethene

C(/F)=C/F

The "visual interpretation" of the "up-ness" or "down-ness" of each single bond is relative to the carbon atom, not the double bond, so the sense of the symbol changes when the fluorine atom is moved from the left to the right side of the alkene carbon atom in the SMILES string.
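The rule can be made concrete with a small sketch. It assumes the reading given above: in "A/B" atom B is above atom A, and in "A\B" atom B is below atom A, so the position of a substituent relative to its alkene carbon depends on whether the substituent is written before or after that carbon. The function names are hypothetical, and the sketch handles only one marked substituent on each alkene carbon.

```python
def side_of_substituent(symbol, substituent_first):
    """Return 'up' or 'down' for a substituent relative to its alkene carbon.

    symbol            -- '/' or '\\' as written between substituent and carbon
    substituent_first -- True if the substituent precedes the carbon (F/C),
                         False if it follows it (C/F or C(/F))
    """
    second_atom_is_up = symbol == "/"
    if substituent_first:
        # The carbon is the second atom, so the substituent has the opposite sense.
        return "down" if second_atom_is_up else "up"
    return "up" if second_atom_is_up else "down"


def double_bond_config(left, right):
    """Classify a double bond from one (symbol, substituent_first) pair per end."""
    same_side = side_of_substituent(*left) == side_of_substituent(*right)
    return "cis" if same_side else "trans"


print(double_bond_config(("/", True), ("/", False)))    # F/C=C/F    -> trans
print(double_bond_config(("\\", False), ("/", False)))  # C(\F)=C/F  -> trans
print(double_bond_config(("/", False), ("/", False)))   # C(/F)=C/F  -> cis
```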

Note: This point was not well documented in earlier SMILES specifications, and several SMILES interpreters are known to interpret the '/' and '\' symbols incorrectly.

The following types of up/down syntax are considered invalid in IUPAC SMILES+:

  • conflicting up/down specifications

  • mismatched or incomplete cis/trans specification

  • duplicate up/down specifications sometimes used for escaping characters in computer processing

SMILES Comment

C/C(\F)=C/F

Invalid SMILES: Both the methyl and fluorine are "down" relative to the first alkene carbon

C/C=C

Invalid SMILES: Mismatched or incomplete cis/trans bonds

C/C=CC

Invalid SMILES: Mismatched or incomplete cis/trans bonds

CC/=C/C

Invalid SMILES: Mismatched or incomplete cis/trans bonds

C(/Br)=C\\I

Invalid SMILES: duplicate up/down

C(\\\\F)=C/F

Invalid SMILES: duplicate up/down

It is permissible, but not required, that every atom attached to a double bond be marked. As long as at least two neighbor atoms, one on each end of the double bond, are marked, the "up-ness" or "down-ness" of the unmarked neighbors can be deduced.

SMILES Comment

F/C(CC)=C/F

trans-difluoro configuration, position of ethyl is implied

Extended cis and trans configurations can be specified for allene-like systems (cumulated double bonds) with an odd number of double bonds:

SMILES Name

F/C=C=C=C/F

trans-difluorobutatriene

F/C=C=C=C\F

cis-difluorobutatriene

Tetrahedral Allene-like Systems

Extended tetrahedral configurations can be specified for allene-like systems (cumulated double bonds) with an even number of double bonds. The normal tetrahedral rules using '@' and '@@' apply, but the "neighbor" atoms to which the chirality refers are at the ends of the allene system. For example:

Depiction SMILES

tetrahedral allene

NC(Br)=[C@]=C(O)C

To determine the correct clockwise or anticlockwise specification, the allene is conceptually "collapsed" into a single tetrahedral chiral center, and the resulting chirality is marked as a property of the center atom of the extended allene system.

Square Planar Centers

There are three tags to represent square planar stereochemistry: @SP1, @SP2 and @SP3. Since there is no way to determine to what chirality class an atom belongs based on the SMILES alone, the SP class is not the default class for tetravalent stereocenters. Therefore, the shorthand notations ('@', '@@') are not equivalent to @SP1 and @SP2. That is, the full specification must be given (@SP followed by 1, 2 or 3). The square planar class also differs from the other chiral primitives in that it does not use the notion of (anti-)clockwise. Instead, each primitive represents a shape that is formed by drawing a line starting from the atom that is first in the SMILES pattern to the next until the end atom is reached. This may result in 3 possible shapes, which are referred to by a character with identical shape: 'U' for @SP1, '4' for @SP2 and 'Z' for @SP3. The graphical form of these shapes is illustrated in the image below.

SPshapes

Background:

Also note that each shape starts and ends at specific positions. Both U and 4 start and end on atoms that are successors or predecessors of one another when arranging the atoms in the plane in anti-clockwise or clockwise order. The start and end atoms for the Z shape are never adjacent in such an ordering. For each shape there are 4 possible ways to start (and end) drawing the line. Also, for all the drawn lines, the start and end point can be exchanged. Thus 3 shapes, 4 ways to start/end and 2 ways to order the atoms for a shape result in 3 * 4 * 2 or 24 combinations. This is the same as the number of permutations that can be made with 4 numbers (i.e. P(n) = n!, so P(4) = 24). This allows canonical SMILES writers to use any ordering to output the atoms.
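The count of 24 can be reproduced by brute force. The sketch below numbers the four corners of the square 0 to 3 in cyclic order and classifies every path through them by its steps. It assumes that U is three steps along sides, that Z is side-diagonal-side (so its start and end are never adjacent), and that 4 is diagonal-side-diagonal; this mapping is an interpretation of the text and the OH winding rules, not something taken from the image.

```python
from collections import Counter
from itertools import permutations

SHAPES = {
    ("side", "side", "side"): "U",
    ("side", "diagonal", "side"): "Z",
    ("diagonal", "side", "diagonal"): "4",
}


def classify(path):
    """Classify a path through square corners 0..3 (numbered in cyclic order)."""
    steps = tuple(
        "side" if (b - a) % 4 in (1, 3) else "diagonal"
        for a, b in zip(path, path[1:])
    )
    return SHAPES[steps]


counts = Counter(classify(path) for path in permutations(range(4)))
print(counts["U"], counts["Z"], counts["4"])  # 8 8 8
```

Each shape accounts for 4 * 2 = 8 of the 24 orderings, which is why three primitives are sufficient.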

Trigonal Bipyramidal Centers

The chiral atom’s neighbors are labeled a, b, c, d, and e in the order that they are parsed. For example, for S[As@@](F)(Cl)(Br)N, S corresponds to a, F to b, Cl to c, Br to d and N to e. This order is the unit permutation, represented as the ordered set (a, b, c, d, e). In the simplest case, @TB1, when viewing from a towards e, (b, c, d) are ordered anti-clockwise ('@'). Likewise, @TB2 is specified as viewing from a towards e, with (b, c, d) ordered clockwise ('@@'). The remaining TB primitives permute the axis as indicated in the table below. As a final example, for @TB6 the viewing axis is from a towards c and (b, d, e) are ordered clockwise ('@@').

Viewing Axis          TB Number   Order
From     Towards
a        e            TB1         @
a        e            TB2         @@
a        d            TB3         @
a        d            TB4         @@
a        c            TB5         @
a        c            TB6         @@
a        b            TB7         @
a        b            TB8         @@
b        e            TB9         @
b        e            TB11        @@
b        d            TB10        @
b        d            TB12        @@
b        c            TB13        @
b        c            TB14        @@
c        e            TB15        @
c        e            TB20        @@
c        d            TB16        @
c        d            TB19        @@
d        e            TB17        @
d        e            TB18        @@

The following SMILES are all equivalent:

Equivalent SMILES

S[As@TB1](F)(Cl)(Br)N

S[As@TB2](Br)(Cl)(F)N

S[As@TB5](F)(N)(Cl)Br

F[As@TB10](S)(Cl)(N)Br

F[As@TB15](Cl)(S)(Br)N

Br[As@TB20](Cl)(S)(F)N
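One way to confirm that these forms are equivalent is to reduce each one to a common description: the viewing axis, plus the three remaining neighbors listed anti-clockwise. The sketch below does this using the axis and order values from the table above. The function `tb_canonical` and its normalization (axis written from the smaller label, plane atoms rotated to start at the smallest label) are assumptions of this example, and the neighbor tuples are transcribed by hand rather than parsed.

```python
# (from index, towards index, order) for each TB number, copied from the table above
TB_TABLE = {
    1: (0, 4, "@"),   2: (0, 4, "@@"),
    3: (0, 3, "@"),   4: (0, 3, "@@"),
    5: (0, 2, "@"),   6: (0, 2, "@@"),
    7: (0, 1, "@"),   8: (0, 1, "@@"),
    9: (1, 4, "@"),   11: (1, 4, "@@"),
    10: (1, 3, "@"),  12: (1, 3, "@@"),
    13: (1, 2, "@"),  14: (1, 2, "@@"),
    15: (2, 4, "@"),  20: (2, 4, "@@"),
    16: (2, 3, "@"),  19: (2, 3, "@@"),
    17: (3, 4, "@"),  18: (3, 4, "@@"),
}


def tb_canonical(neighbors, tb_number):
    """Reduce a TB specification to (axis from, axis to, anti-clockwise plane atoms)."""
    start, end, order = TB_TABLE[tb_number]
    axis_from, axis_to = neighbors[start], neighbors[end]
    plane = [n for i, n in enumerate(neighbors) if i not in (start, end)]
    if order == "@@":
        plane.reverse()            # clockwise read backwards is anti-clockwise
    if axis_from > axis_to:
        # Viewing from the other end of the axis reverses the apparent rotation.
        axis_from, axis_to = axis_to, axis_from
        plane.reverse()
    k = plane.index(min(plane))    # rotate so the smallest label comes first
    return axis_from, axis_to, tuple(plane[k:] + plane[:k])


examples = [
    (("S", "F", "Cl", "Br", "N"), 1),    # S[As@TB1](F)(Cl)(Br)N
    (("S", "Br", "Cl", "F", "N"), 2),    # S[As@TB2](Br)(Cl)(F)N
    (("S", "F", "N", "Cl", "Br"), 5),    # S[As@TB5](F)(N)(Cl)Br
    (("F", "S", "Cl", "N", "Br"), 10),   # F[As@TB10](S)(Cl)(N)Br
    (("F", "Cl", "S", "Br", "N"), 15),   # F[As@TB15](Cl)(S)(Br)N
    (("Br", "Cl", "S", "F", "N"), 20),   # Br[As@TB20](Cl)(S)(F)N
]
print({tb_canonical(neighbors, n) for neighbors, n in examples})
```

All six entries reduce to the same tuple, so the printed set has a single element.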

Background:

Trigonal bipyramidal chirality is considerably more complex than any of the previous classes since the chiral atom has an extra neighbor. This increases the number of ways to order the neighbors in a SMILES string from 24 to 120. Every order of the atoms should be representable by a SMILES string, and the 20 TB primitives suffice for this. In the trigonal bipyramidal geometry, 3 atoms lie in a plane and the remaining 2 atoms are perpendicular to this plane, on opposite sides of it, forming an axis. Anti-clockwise and clockwise refer to the order of the 3 plane atoms when viewing along the axis in the specified direction. Unlike tetrahedral geometry, reordering the 3 atoms does not require that the axis be changed. Given an order of the axis atoms, the 3 plane atoms are ordered either anti-clockwise or clockwise. Although there are P(3) = 3! or 6 possible permutations of 3 atoms, exchanging a pair inverts the parity, and the 6 permutations are therefore divided into two groups (@, @@) containing 3 permutations each. Because there are now two atoms that determine the viewing direction along the axis, these atoms too can be in any of the 5 positions in a permutation. Given the atoms as the set {a, b, c, d, e}, there are 5 * 4 = 20 ordered ways to choose the two axis atoms. However, reversing the viewing direction is equivalent to exchanging @ and @@, so the use of the @ and @@ symbols halves this to C(5, 2) = 10 axes. These 10 axes are the ordered sets (a, e), (a, d), (a, c), (a, b), (b, e), (b, d), (b, c), (c, e), (c, d) and (d, e). Each of these pairs corresponds to two TB primitives, one for each order (@ and @@).

Octahedral Centers

For 6 atoms, the unit permutation is (a, b, c, d, e, f). @OH1 means that when viewing from a towards f, (b, c, d, e) are ordered anti-clockwise ('@'). @OH2 uses the same axis but the 4 intermediate atoms are ordered clockwise ('@@'). The interpretation of the 28 remaining numbers is more complex. The concept of shapes (see square planar stereochemistry) to describe the orientation of 4 atoms in a plane is reused. However, this time these shapes also have a clockwise or anti-clockwise winding. For the U shape, this is trivial since it means that the 4 atoms are listed clockwise or anti-clockwise. For the Z shape, the connection between the first two atoms determines the winding. Finally, for the 4 shape, the connection between the second and third atom determines the winding. The table below lists the shapes, axes and orders.

Shape    Viewing Axis          OH Number   Order
         From     Towards
U        a        f            OH1         @
U        a        f            OH2         @@
U        a        e            OH3         @
U        a        e            OH16        @@
U        a        d            OH6         @
U        a        d            OH18        @@
U        a        c            OH19        @
U        a        c            OH24        @@
U        a        b            OH25        @
U        a        b            OH30        @@
Z        a        f            OH4         @
Z        a        f            OH14        @@
Z        a        e            OH5         @
Z        a        e            OH15        @@
Z        a        d            OH7         @
Z        a        d            OH17        @@
Z        a        c            OH20        @
Z        a        c            OH23        @@
Z        a        b            OH26        @
Z        a        b            OH29        @@
4        a        f            OH10        @
4        a        f            OH8         @@
4        a        e            OH11        @
4        a        e            OH9         @@
4        a        d            OH13        @
4        a        d            OH12        @@
4        a        c            OH22        @
4        a        c            OH21        @@
4        a        b            OH28        @
4        a        b            OH27        @@

The following SMILES are all equivalent:

Equivalent SMILES

C[Co@](F)(Cl)(Br)(I)S

F[Co@@](S)(I)(C)(Cl)Br

S[Co@OH5](F)(I)(Cl)(C)Br

Br[Co@OH9](C)(S)(Cl)(F)I

Br[Co@OH12](Cl)(I)(F)(S)C

Cl[Co@OH15](C)(Br)(F)(I)S

Cl[Co@OH19](C)(I)(F)(S)Br

I[Co@OH27](Cl)(Br)(F)(S)C
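The equivalence of these eight forms can be checked by placing the six neighbors on the axes of a coordinate system and comparing the resulting geometries. The sketch below is one possible approach under stated assumptions: the shape rules are read as U = atoms listed around the square, Z = first two atoms adjacent with the last two exchanged, and 4 = second and third atoms adjacent, as described in the text above. The function `oh_canonical`, the helper `cross`, and the normalization (view from the smallest label) are choices made for this example, and the neighbor tuples are transcribed by hand.

```python
# shape -> {index of the "towards" atom: (OH number for @, OH number for @@)}
_ROWS = {
    "U": {5: (1, 2), 4: (3, 16), 3: (6, 18), 2: (19, 24), 1: (25, 30)},
    "Z": {5: (4, 14), 4: (5, 15), 3: (7, 17), 2: (20, 23), 1: (26, 29)},
    "4": {5: (10, 8), 4: (11, 9), 3: (13, 12), 2: (22, 21), 1: (28, 27)},
}
OH_TABLE = {}
for _shape, _axes in _ROWS.items():
    for _towards, (_anti, _clock) in _axes.items():
        OH_TABLE[_anti] = (_shape, _towards, "@")
        OH_TABLE[_clock] = (_shape, _towards, "@@")


def cross(u, v):
    """Cross product of two 3-vectors given as tuples."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def oh_canonical(neighbors, oh_number):
    """Reduce an OH specification to (axis from, axis to, anti-clockwise ring)."""
    shape, towards, order = OH_TABLE[oh_number]
    axis_from, axis_to = neighbors[0], neighbors[towards]
    p, q, r, s = [n for i, n in enumerate(neighbors) if i not in (0, towards)]
    # Convert the written shape into a cyclic order around the square.
    ring = {"U": [p, q, r, s], "Z": [p, q, s, r], "4": [q, r, p, s]}[shape]
    if order == "@@":
        ring.reverse()  # now anti-clockwise when viewed from axis_from
    coords = {axis_from: (0, 0, 1), axis_to: (0, 0, -1),
              ring[0]: (1, 0, 0), ring[1]: (0, 1, 0),
              ring[2]: (-1, 0, 0), ring[3]: (0, -1, 0)}
    atom_at = {vec: atom for atom, vec in coords.items()}
    # Re-express the geometry as seen from the smallest label.
    viewer = min(coords)
    v = coords[viewer]
    opposite = atom_at[(-v[0], -v[1], -v[2])]
    seq = [min(a for a in coords if a not in (viewer, opposite))]
    for _ in range(3):
        # Rotating by +90 degrees about the viewer's axis is the anti-clockwise step.
        seq.append(atom_at[cross(v, coords[seq[-1]])])
    return viewer, opposite, tuple(seq)


examples = [
    (("C", "F", "Cl", "Br", "I", "S"), 1),    # C[Co@](F)(Cl)(Br)(I)S
    (("F", "S", "I", "C", "Cl", "Br"), 2),    # F[Co@@](S)(I)(C)(Cl)Br
    (("S", "F", "I", "Cl", "C", "Br"), 5),    # S[Co@OH5](F)(I)(Cl)(C)Br
    (("Br", "C", "S", "Cl", "F", "I"), 9),    # Br[Co@OH9](C)(S)(Cl)(F)I
    (("Br", "Cl", "I", "F", "S", "C"), 12),   # Br[Co@OH12](Cl)(I)(F)(S)C
    (("Cl", "C", "Br", "F", "I", "S"), 15),   # Cl[Co@OH15](C)(Br)(F)(I)S
    (("Cl", "C", "I", "F", "S", "Br"), 19),   # Cl[Co@OH19](C)(I)(F)(S)Br
    (("I", "Cl", "Br", "F", "S", "C"), 27),   # I[Co@OH27](Cl)(Br)(F)(S)C
]
print({oh_canonical(neighbors, n) for neighbors, n in examples})
```

Under these assumptions all eight entries reduce to the same tuple.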

Background:

Octahedral stereochemistry is even more complicated since there is yet another extra neighboring atom. This raises the number of permutations to P(6) = 720. There are three axes that can be chosen, and the orientation of the remaining 4 atoms has to be described. To describe these 4 atoms, P(4) = 24 permutations are used together with a shape. An axis always starts from the first neighbor atom and can end at any of the other neighbor atoms, giving rise to 5 axes. As a result, each OH number encodes the axis positions, a shape and an order. Since any of the 3 axes can be placed in these positions, the start/end can be exchanged and each shape can start from any of the 4 atoms, each number represents 3 * 2 * 4 = 24 of the 720 permutations. Finally, 24 * 30 = 720, so all permutations can be used to write a canonical SMILES.

Partial Stereochemistry

SMILES allows partial stereochemical specifications. It is permissible for some chiral centers or double bonds to have stereochemical markings in the SMILES, while others in the same SMILES string do not. For example:

SMILES Comment

F/C=C/C/C=C\C

completely specified

F/C=C/CC=CC

partially specified

N1[C@H](Cl)[C@@H](Cl)C(Cl)CC1

partially specified

Other Chiral Configurations

The SMILES language supports a number of atom-centered chiral configurations:

SMILES Configuration

TH

Tetrahedral

AL

Allenal

SP

Square Planar

TB

Trigonal Bipyramidal

OH

Octahedral

The shorthand notations '@' and '@@' correspond to anti-clockwise and clockwise tetrahedral chirality, and are the same as '@TH1' and '@TH2', respectively. Likewise, in an allenal configuration, the shorthand notations '@' and '@@' correspond to '@AL1' and '@AL2', respectively.

Very few SMILES systems actually implement the rules for SP, TB or OH chirality.

Parsing Termination

A SMILES string is terminated by a whitespace terminator character (space, tab, newline, carriage-return), or by the end of the string. As a result, any leading space in a SMILES string is considered invalid in IUPAC SMILES+ (e.g. ' CCC').

Other data or information, such as a name, properties, registration number, etc., may follow the SMILES on a line after the whitespace character. SMILES parsers will ignore this data, although applications that use the SMILES parser will often make use of it.
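A minimal sketch of this termination rule is shown below. The function name `split_smiles` is hypothetical, and raising `ValueError` for a leading whitespace character is a design choice of this example, not something the specification prescribes.

```python
import re

# The four terminator characters named above: space, tab, newline, carriage-return.
_TERMINATOR = re.compile(r"[ \t\n\r]")


def split_smiles(text):
    """Split text into (SMILES, trailing data) at the first terminator character."""
    if text[:1] in (" ", "\t", "\n", "\r"):
        raise ValueError("leading whitespace is invalid in IUPAC SMILES+")
    parts = _TERMINATOR.split(text, maxsplit=1)
    smiles = parts[0]
    trailing = parts[1] if len(parts) > 1 else ""
    return smiles, trailing


print(split_smiles("CCO ethanol 64-17-5"))  # ('CCO', 'ethanol 64-17-5')
```

The trailing data is returned untouched so that the calling application can decide how to interpret it.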

Programming Practices

There is no formal limit to the length of a SMILES string; SMILES of over 1 million characters have been assembled for various purposes. There is no requirement that a SMILES parser must be able to parse these exceptionally long SMILES, but as a guideline, all implementations of SMILES parsers should, at a minimum, accept and correctly parse SMILES strings of 100,000 characters. If a SMILES parser encounters a string that is too long to parse, it should generate a relevant error message.

A SMILES parser should accept at least four digits for the atom class, and the values 0 to 9999.

There is no formal limit to the number of rings a molecule can contain. There are only 1000 ring-closure numbers (0, 1-999), but since numbers can be reused, a molecule can potentially have more than 1000 rings. SMILES parsers should accept and correctly parse molecules with at least 1000 rings; it is preferable to place no limits on the number of rings a molecule can contain.

Branches (parentheses) can be nested to an arbitrary depth. Some SMILES strings in standard databases contain over 30 levels of branches, and much deeper nesting is possible. A general purpose parser must handle at least 100 levels; it is preferable to place no limits on nesting depth for parentheses.

There is no formal limit on the number of bonds an atom can have. SMILES parsers should allow at least ten bonds to each atom; it is preferable to place no limits on the number of bonds to each atom.

There is no limit to the number of "dot-disconnected" fragments in a SMILES. A SMILES of 100,000 atoms could in principle contain no bonds at all; SMILES parsers should place no limits on the number of fragments allowed (except that it is limited to the number of atoms the parser can manage).

Programmers are strongly encouraged to provide detailed and clear error messages. If possible, the error message should show exactly which character or "phrase" of the SMILES string triggered the error message.
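As an illustration of an error message that points at the offending character, the sketch below checks branch nesting only. The names `check_branches` and `_point_at` are hypothetical, `max_depth=None` means no limit is imposed, and the sketch relies on parentheses never occurring inside bracket atoms.

```python
def _point_at(smiles, pos, message):
    """Build an error message with a caret under the offending character."""
    return f"{message} at position {pos}:\n{smiles}\n{' ' * pos}^"


def check_branches(smiles, max_depth=None):
    """Return the deepest branch nesting level, or raise ValueError with a pointer."""
    depth = 0
    deepest = 0
    open_positions = []
    for pos, ch in enumerate(smiles):
        if ch == "(":
            depth += 1
            open_positions.append(pos)
            deepest = max(deepest, depth)
            if max_depth is not None and depth > max_depth:
                raise ValueError(_point_at(smiles, pos, "branch nesting too deep"))
        elif ch == ")":
            if depth == 0:
                raise ValueError(_point_at(smiles, pos, "unmatched ')'"))
            depth -= 1
            open_positions.pop()
    if open_positions:
        raise ValueError(_point_at(smiles, open_positions[-1], "unclosed '('"))
    return deepest


print(check_branches("CC(C)CCCCCC"))  # 1
```

The same reporting style can be reused for other limits, such as string length or the number of bonds on an atom.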

Writing SMILES: Normalizations

What is Normalization?

A wide variety of SMILES strings are acceptable as input. For example, all of the following represent ethanol:

SMILES Name

CCO

ethanol

OCC

ethanol

C(O)C

ethanol

[CH3][CH2][OH]

ethanol

[H][C]([H])([H])C([H])([H])[O][H]

ethanol

However, it is desirable to write SMILES in more standard forms; the first two forms above are preferred by most chemists, and require fewer bytes to store on a computer. Several levels of normalization of SMILES are recommended for systems that generate SMILES strings. Although these are not mandatory in any sense, they should be considered guidelines for software engineers creating SMILES systems.

No Normalization

The simplest "normalization" is no normalization. SMILES can be written in any form whatsoever, as long as they meet the rules for SMILES. Some examples of systems that might produce un-normalized SMILES are:

  • A system that enumerates combinatorial libraries using the rnum/dot-bond technique discussed above. SMILES produced by such a system will typically be a series of partial SMILES that are concatenated with dots into a complete molecule.

  • Simple pass-through "filters" that don’t have a full SMILES writer, but merely copy the input SMILES to the output. An example might be a molecular modeling program that reads SMILES to generate logP values, but has no capability to convert its molecular data structures back to a SMILES; instead it just copies its input SMILES to its output.

Standard Form

The "standard form" of a SMILES is designed to produce a compact SMILES, and one that is human readable (for smaller molecules).

In addition, a normalized SMILES has the important property that it matches itself as a SMARTS string. This is a very important feature of normalized SMILES in cheminformatics systems.

In IUPAC SMILES+, there is a strict atom property order within bracket atoms. The valid order from left to right is isotope, atom symbol/number, chirality, hydrogen count, charge, then atom class.

Note: In the examples below, some of the "Wrong" SMILES may be valid as per the IUPAC SMILES+ specification, but are "wrong" in the sense that they are not the preferred form for standard normalization.

Atoms

Correct Wrong Normalization Rule

CC

[CH3][CH3]

Write atoms in the "organic subset" as bare atomic symbols whenever possible.

[CH3-]

[CH3-1]

If the charge is +1 or -1, leave off the digit.

C[13CH](C)C

C[13CH1](C)C

If the hydrogen count is 1, leave off the digit.

[CH3-]

[C-H3]

Always write the atom properties in the order: Chirality, hydrogen-count, charge.

C[C@H](Br)Cl

C[CH@](Br)Cl

[CH3-]

[H][C-]([H])[H]

Represent hydrogens as a property of the heavy atom rather than as explicit atoms, unless other rules (e.g. [2H]) require that the hydrogen be explicit.
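Two of the atom rules above are purely textual and can be sketched with regular expressions: dropping the digit from a +1 or -1 charge, and dropping the digit from a hydrogen count of 1. This is a partial illustration under stated assumptions, not a SMILES writer; the function name `simplify_bracket_atoms` is hypothetical, the other rules in the table need a real parser, and the lookbehinds exist only to leave chirality tags such as @TH1 and @OH1 untouched.

```python
import re

_BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
_UNIT_CHARGE = re.compile(r"([+-])1(?!\d)")          # +1 / -1, but not +10 or -12
_ONE_HYDROGEN = re.compile(r"(?<!@T)(?<!@O)H1(?!\d)")  # H1, but not the H in @TH1 / @OH1


def _simplify(match):
    atom = match.group(0)
    atom = _UNIT_CHARGE.sub(r"\1", atom)   # [CH3-1]  -> [CH3-]
    atom = _ONE_HYDROGEN.sub("H", atom)    # [13CH1]  -> [13CH]
    return atom


def simplify_bracket_atoms(smiles):
    """Apply the two textual rules to every bracket atom in a SMILES string."""
    return _BRACKET_ATOM.sub(_simplify, smiles)


print(simplify_bracket_atoms("[CH3-1]"))       # [CH3-]
print(simplify_bracket_atoms("C[13CH1](C)C"))  # C[13CH](C)C
```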

Bonds

Correct Wrong Normalization Rule

CC

C-C

Only write '-' (single bond) when it is between two aromatic atoms. Never write the ':' (aromatic bond) symbol. Bonds are single or aromatic by default (as appropriate).

c1ccccc1

c:1:c:c:c:c:c:1

c1ccccc1-c2ccccc2

c1ccccc1c2ccccc2

Cycles

Correct Wrong Normalization Rule

c1ccccc1C2CCCC2

c1ccccc1C1CCCC1

Don’t reuse ring-closure digits.

c1ccccc1C2CCCC2

c0ccccc0C1CCCC1

Begin ring numbering with 1, not zero (or any other number)

CC1=CCCCC1

CC=1CCCCC=1

Avoid making a ring-closure on a double or triple bond. For the ring-closure digits, choose a single bond whenever possible.

C1CC2CCCCC2CC1

C12(CCCCC1)CCCCC2

Avoid starting a ring system on an atom that is in two or more rings, such that two ring-closure bonds will be on the same atom.

C1CCCCC1

C%10CCCCC%10

Use the simpler single-digit form for rnums less than 10.

Starting Atom and Branches

Correct Wrong Normalization Rule

OCc1ccccc1

c1cc(CO)ccc1

Start on a terminal atom if possible.

CC(C)CCCCCC

CC(CCCCCC)C

Try to make "side chains" short; pick the longest chains as the "main branch" of the SMILES.

OCCC

CCCO

Start on a heteroatom if possible.

CC

C1.C1

Only use dots for disconnected components.

Aromaticity

Correct Wrong Normalization Rule

c1ccccc1

C1=CC=CC=C1

Write the aromatic form in preference to the Kekule form.

Chirality

Correct Wrong Normalization Rule

BrC(Br)C

Br[C@H](Br)C

Remove chiral markings for atoms that are not chiral.

FC(F)=CF

F/C(/F)=C/F

Remove cis/trans markings for double bonds that are not cis or trans.

Canonical SMILES

A Canonical SMILES is one that follows the Standard Form above, and additionally, always writes the atoms and bonds of any particular molecule in the exact same order, regardless of the source of the molecule or its history in the computer. Here are a few examples of Canonical versus non-Canonical SMILES:

Canonical SMILES Non-canonical Name

OCC

CCO

ethanol

C(C)O

Oc1ccccc1

c1ccccc1O

phenol

c1(O)ccccc1

c1(ccccc1)O

The primary use of Canonical SMILES is in cheminformatics systems. A molecule’s structure, when expressed as a canonical SMILES, will always yield the same SMILES string, which allows a chemical database system to:

  • Create a unique name (the SMILES) for each molecule in the system

  • Consolidate data about one molecule from a variety of sources into a single record

  • Given a molecule, find its record in the database

Canonical SMILES should not be considered a universal, global identifier (such as a permanent name that spans the WWW). Two systems that produce a canonical SMILES may use different rules in their code, or the same system may be improved or have bugs fixed as time passes, thus changing the SMILES it produces. A Canonical SMILES is primarily useful in a single database, or a system of related databases or information, in which all molecules were created using a single canonicalizer.

It is an unfortunately common misconception that a Canonical SMILES does not (or can not) contain stereochemistry/isotopes or alternatively that all SMILES must be canonical.

In general the properties encoded in a SMILES can be chosen by a program to suit a particular purpose. You may have the option to independently include or omit stereochemistry, isotopes, or atom map/class in a generated Canonical SMILES. When referencing a particular SMILES, confusion can be avoided by including the toolkit, version, and options used.

The rules (algorithms) by which the canonical ordering of the atoms in a SMILES are generated are quite complex, and beyond the scope of this document. There are many chemistry and mathematical graph-theory papers describing the canonical labeling of a graph, and writing a canonical SMILES string. See the Appendix for further information.

Those considering Canonical SMILES for a database system should also investigate InChI, a canonical naming system for chemicals that is an approved IUPAC naming convention.

SMILES Files

A SMILES file consists of zero or more SMILES strings, one per line, optionally followed by at least one whitespace character (space or tab), and other data. There can be no leading whitespace before the SMILES string on a line. The optional whitespace character and data that follows it are not part of the SMILES specification, and interpretation of this data is up to applications that use the SMILES file. Each line of the file is terminated by either a single LF character, or by a CR/LF pair of characters (commonly called the "Unix" and "Windows" line terminators, respectively). A SMILES parser must accept either line terminator. A blank line in the SMILES file, or a line that begins with a whitespace character, should be completely ignored by a SMILES parser.
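The file rules above can be sketched as a small reader. The function name `read_smiles_file` and the choice to return (SMILES, data) pairs are assumptions of this example; the SMILES strings themselves are not validated here.

```python
import io
import re


def read_smiles_file(handle):
    """Return a list of (SMILES, trailing data) pairs from an open text handle."""
    records = []
    for raw in handle:
        line = raw.rstrip("\r\n")          # accept both LF and CR/LF terminators
        if not line or line[0] in " \t":
            continue                        # blank or whitespace-led lines are ignored
        parts = re.split(r"[ \t]", line, maxsplit=1)
        smiles = parts[0]
        data = parts[1].lstrip(" \t") if len(parts) > 1 else ""
        records.append((smiles, data))
    return records


sample = "CCO ethanol\r\nOCC\n\n \tignored\nc1ccccc1\tbenzene 71-43-2\n"
records = read_smiles_file(io.StringIO(sample))
print(records)
```

Passing `newline=""` when opening a real file keeps the CR characters visible to the reader, although the `rstrip` call above copes with either setting.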

Appendix

Appendix 1 - Extensions

References

Tools

Cheminformatics Toolkits

There are a variety of commercial and open source cheminformatics toolkits available that support the SMILES format:

Molecular Editors

Many modern molecular editors can read and write SMILES:

Some Key Scientific Papers

  • Anderson, E.; Veith, G.D.; Weininger, D. SMILES: A Line Notation and Computerized Interpreter for Chemical Structures. U.S. Environmental Protection Agency, Washington, D.C., EPA/600/M-87/021 (NTIS PB88130034), 1987.

  • Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.

  • Weininger, D.; Weininger, A.; Weininger, J.L. SMILES 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Comput. Sci. 1989, 29, 97-101.

  • Weininger, D. SMILES 3. Depict. Graphical Depiction of Chemical Structures. J. Chem. Inf. Comput. Sci. 1990, 30, 237-243.

  • Morgan, H.L. The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107-113.

  • Downs, G.M.; Gillet, V.J.; Holliday, J.D.; Lynch, M.F. Review of Ring Perception Algorithms for Chemical Graphs. J. Chem. Inf. Comput. Sci. 1989, 29, 172-187.

  • Balducci, R.; Pearlman, R.S. Efficient exact solution of the ring perception problem. J. Chem. Inf. Comput. Sci. 1994, 34, 822-831.

Revision History

OpenSMILES Specification Document

Revision Date Description Name

1.0

2007-11-13

Draft

Craig A. James

1.0

2012-09-29

Reformatting

Tim Vandermeersch

1.0

2012-09-29

Corrections

Andrew Dalke & Tim Vandermeersch

1.0

2012-11-17

SP, TB and OH stereochemistry

Tim Vandermeersch

1.0

2013-09-06

Corrections

Richard Apodaca

1.0

2013-09-17

Corrections

John May

IUPAC SMILES+ Specification Document

Revision Date Description Name

1.0

2019-04-15

Fixed asciidoc formatting and created derivative of OpenSMILES Document, IUPAC SMILES+ Specification Working Draft

Vincent F. Scalfani

1.0

2020-08-13

Minor typo corrections

Andrius Merkys

1.0

2020-08-13

Added escape symbols to prevent copyright symbol rendering in SMILES and fixed broken image link

Vincent F. Scalfani

1.0

2020-09-24

Moved proposed extensions to a separate document

Vincent F. Scalfani

1.0

2020-09-27

Updated links and references

Vincent F. Scalfani

1.0

2021-05-14

Updated purpose and motivation to reflect IUPAC task group efforts; Added aromatic te; Added support for element symbols through Og, element numbers through [#118], and [#0] as undefined; Clarified Hydrogen Hn, where n is a single digit number 0-9; General formatting for consistency; Clarified atom charge property syntax and added support for repeated symbols up to 15 and a 0 charge; Isotope changes: A 0 isotope is now undefined, leading 0 (e.g., 02) is invalid, and removed D,T symbols; Clarified Wildcard atom section with #0 and more examples.; Removed lowercase as sp2 outside of rings; Reworked and simplified SMILES flavors section into the Canonical SMILES section; Atom class change: no leading 0 allowed; Incorporated Nonstandard forms of SMILES section into main text. These are now considered invalid.; Ring rnum change: no leading 0 allowed and clarified when a conflict arises; Clarified that escaped up/down configurations are invalid; Added that leading spaces in SMILES are invalid; Specified a strict bracket atom order

Vincent F. Scalfani

1.0

2021-05-14

Fixed a few typos and added a missing chemical drawing

Vincent F. Scalfani

1.0

2021-09-28

Fixed ring bond typos GitHub Issue #19

Vincent F. Scalfani

1.0

2021-09-28

Clarified branching GitHub Issue #11

Vincent F. Scalfani

1.0

2021-09-29

Adjusted consistency and support of atom property digits GitHub Issue #14

Vincent F. Scalfani

1.0

2023-02-24

Changed ring number 0 as invalid GitHub Issue #27

Vincent F. Scalfani

1.0

2023-10-20

Major revision of aromaticity section GitHub Pull Request #30

Vincent F. Scalfani

1.0

2023-10-20

Revised Atom Valence and Hydrogens section

Vincent F. Scalfani

1.0

2023-10-20

Removed support of %nnn and added a recommendation to not reuse rnum

Vincent F. Scalfani

1.0

2023-10-31

Clarified normal Atom Valence table

Vincent F. Scalfani