Simplified molecular-input line-entry system

The simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.
The original SMILES specification was initiated in the 1980s. It has since been modified and extended. In 2007, an open standard called OpenSMILES was developed in the open-source chemistry community. Other linear notations include the Wiswesser line notation, ROSDAL, and SYBYL Line Notation.

History

The original SMILES specification was initiated by David Weininger at the USEPA Mid-Continent Ecology Division Laboratory in Duluth in the 1980s. Acknowledged for their parts in the early development were "Gilman Veith and Rose Russo and Albert Leo and Corwin Hansch for supporting the work, and Arthur Weininger and Jeremy Scofield for assistance in programming the system." The Environmental Protection Agency funded the initial project to develop SMILES.
It has since been modified and extended by others, most notably by Daylight Chemical Information Systems. In 2007, an open standard called "OpenSMILES" was developed by the Blue Obelisk open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation, ROSDAL and SLN.
In July 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI; it also has a wide base of software support with extensive theoretical backing.

Terminology

The term SMILES refers to a line notation for encoding molecular structures and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms "canonical" and "isomeric" can lead to some confusion when applied to SMILES. The terms describe different attributes of SMILES strings and are not mutually exclusive.
Typically, a number of equally valid SMILES strings can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to generate the same SMILES string for a given molecule; of the many possible strings, these algorithms choose only one. This SMILES is unique for each structure, although dependent on the canonicalization algorithm used to generate it, and is termed the canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure; an algorithm then examines that structure and produces a unique SMILES string. Various algorithms for generating canonical SMILES have been developed and include those by Daylight Chemical Information Systems, OpenEye Scientific Software, MEDIT, Chemical Computing Group, MolSoft LLC, and the Chemistry Development Kit. A common application of canonical SMILES is indexing and ensuring uniqueness of molecules in a database.
The original paper that described the CANGEN algorithm claimed to generate unique SMILES strings for graphs representing molecules, but the algorithm fails for a number of simple cases and cannot be considered a correct method for representing a graph canonically. There is currently no systematic comparison across commercial software to test if such flaws exist in those packages.
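The idea of choosing one string among many equally valid ones can be illustrated with a brute-force toy. The sketch below is not CANGEN or any published algorithm: it assumes an acyclic molecule with single bonds and organic-subset atoms only, enumerates every SMILES that a depth-first traversal can produce, and picks the lexicographically smallest. The function names `all_smiles` and `canonical` are hypothetical.

```python
from itertools import permutations

def all_smiles(atoms, bonds):
    """Enumerate every SMILES of an acyclic, single-bonded molecule.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def emit(atom, parent):
        children = [b for b in neighbours[atom] if b != parent]
        if not children:
            return {atoms[atom]}
        results = set()
        for order in permutations(children):      # every branch order
            partial = {atoms[atom]}
            for k, child in enumerate(order):
                last = k == len(order) - 1        # last child needs no parentheses
                partial = {p + (s if last else "(" + s + ")")
                           for p in partial for s in emit(child, atom)}
            results |= partial
        return results

    found = set()
    for start in range(len(atoms)):               # every starting atom
        found |= emit(start, None)
    return found

def canonical(atoms, bonds):
    """Toy canonical form: the smallest string in plain string ordering."""
    return min(all_smiles(atoms, bonds))

# Ethanol with two different atom numberings gives the same result.
print(sorted(all_smiles(["C", "C", "O"], [(0, 1), (1, 2)])))
print(canonical(["O", "C", "C"], [(0, 1), (1, 2)]))
```

Because the minimum is taken over all traversals, the result does not depend on how the atoms were numbered, which is the property a database index needs. The cost grows factorially with branching, so this is only usable for very small molecules.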
SMILES notation allows the specification of configuration at tetrahedral centers, and double bond geometry. These are structural features that cannot be specified by connectivity alone, and therefore SMILES which encode this information are termed isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term isomeric SMILES is also applied to SMILES in which isomers are specified.

Graph-based definition

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
The resultant SMILES form depends on the choices:

of the bonds chosen to break cycles,
of the starting atom used for the depth-first traversal, and
of the order in which branches are listed when encountered.
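The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the molecule is connected, hydrogens are already removed, all atoms are in the organic subset, and every bond is single. The function name `write_smiles` is hypothetical.

```python
def write_smiles(atoms, bonds, start=0):
    """Write a SMILES string by depth-first traversal of a molecular graph.

    atoms: element symbols from the organic subset, hydrogens removed.
    bonds: (i, j) index pairs; every bond is treated as a single bond.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Pass 1: a depth-first search finds the bonds that close cycles.
    seen, closures = set(), []

    def find_closures(atom, parent):
        seen.add(atom)
        for other in neighbours[atom]:
            if other == parent:
                continue
            if other not in seen:
                find_closures(other, atom)
            elif frozenset((atom, other)) not in closures:
                closures.append(frozenset((atom, other)))

    find_closures(start, None)

    # Pass 2: walk the remaining spanning tree and print the atoms.
    def emit(atom, parent):
        text = atoms[atom]
        for label, closure in enumerate(closures, 1):
            if atom in closure:                   # numeric ring-closure suffix
                text += str(label) if label < 10 else "%" + str(label)
        children = [b for b in neighbours[atom]
                    if b != parent and frozenset((atom, b)) not in closures]
        for child in children[:-1]:               # all but the last are branches
            text += "(" + emit(child, atom) + ")"
        if children:
            text += emit(children[-1], atom)
        return text

    return emit(start, None)

ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
print(write_smiles(*ethanol, start=0))   # CCO
print(write_smiles(*ethanol, start=2))   # OCC
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(write_smiles(["C"] * 6, ring))     # C1CCCCC1
```

Changing `start`, or the order in which bonds are listed, changes the output string without changing the molecule, which is exactly the set of choices listed above.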

Description

Atoms

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. Brackets may be omitted in the common case of atoms which:

are in the "organic subset" of B, C, N, O, P, S, F, Cl, Br, or I, and
have no formal charge, and
have the number of hydrogens attached implied by the SMILES valence model, and
are the normal isotopes, and
are not chiral centers.

All other elements must be enclosed in brackets, and have charges and hydrogens shown explicitly. For instance, the SMILES for water may be written as either O or [OH2]. Hydrogen may also be written as a separate atom; water may also be written as [H]O[H].
When brackets are used, the symbol H is added if the atom in brackets is bonded to one or more hydrogen atoms, followed by the number of hydrogen atoms if greater than 1, then by the sign + for a positive charge or by - for a negative charge. For example, [NH4+] for ammonium. If there is more than one charge, the sign is normally followed by a digit giving the number of charges; however, it is also possible to repeat the sign as many times as the ion has charges: one may write either [Ti+4] or [Ti++++] for titanium(IV), Ti⁴⁺. Thus, the hydroxide anion is represented by [OH-], the hydronium cation is [OH3+] and the cobalt(III) cation is either [Co+3] or [Co+++].
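The fields of a bracket atom such as [NH4+], [Ti++++] or [OH-] can be pulled apart with a regular expression. The sketch below covers only the fields discussed in this article (isotope, symbol, chirality mark, hydrogen count, charge); the name `parse_bracket_atom` and the returned dictionary layout are assumptions made for illustration, and the pattern does not check that the symbol is a real element.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"                  # optional isotopic mass, e.g. 14
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2})"   # element symbol; lower case = aromatic
    r"(?P<chirality>@@?)?"                  # optional @ or @@
    r"(?P<h>H\d*)?"                         # optional H, H2, H3, ...
    r"(?P<charge>\+\d+|\++|-\d+|-+)?"       # +, ++, +2, -, --, -2, ...
    r"\]"
)

def parse_bracket_atom(text):
    """Split one bracket atom into its fields; raise ValueError if malformed."""
    m = BRACKET_ATOM.fullmatch(text)
    if m is None:
        raise ValueError("not a bracket atom: %r" % text)
    h = m.group("h")
    hcount = 0 if h is None else int(h[1:] or 1)      # a bare H means one hydrogen
    c = m.group("charge") or ""
    if c:
        sign = 1 if c[0] == "+" else -1
        charge = sign * (int(c[1:]) if c[1:].isdigit() else len(c))
    else:
        charge = 0
    isotope = m.group("isotope")
    return {
        "isotope": int(isotope) if isotope else None,
        "symbol": m.group("symbol"),
        "chirality": m.group("chirality"),
        "hcount": hcount,
        "charge": charge,
    }

print(parse_bracket_atom("[NH4+]"))
print(parse_bracket_atom("[Ti++++]")["charge"], parse_bracket_atom("[Ti+4]")["charge"])
```

Both ways of writing a multiple charge, a digit after the sign or a repeated sign, end up as the same integer.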

Bonds

A bond is represented using one of the symbols . - = # $ : / \.
Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. Although single bonds may be written as -, this is usually omitted. For example, the SMILES for ethanol may be written as C-C-O, CC-O or C-CO, but is usually written CCO.
Double, triple, and quadruple bonds are represented by the symbols =, #, and $ respectively, as illustrated by the SMILES O=C=O (carbon dioxide), C#N (hydrogen cyanide) and [Ga+]$[As-] (gallium arsenide).
An additional type of bond is a "non-bond", indicated with ., to indicate that two parts are not bonded together. For example, aqueous sodium chloride may be written as [Na+].[Cl-] to show the dissociation.
An aromatic "one and a half" bond may be indicated with :; see below.
Single bonds adjacent to double bonds may be represented using / or \ to indicate stereochemical configuration; see below.
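The bond rules can be made concrete with a short reader for the simplest case. The sketch below assumes an unbranched, acyclic SMILES made of aliphatic organic-subset atoms or bracket atoms, and ignores the directional symbols / and \. The name `chain_bonds` and the numeric value 1.5 for the aromatic bond are illustrative choices.

```python
import re

BOND_ORDER = {"-": 1, "=": 2, "#": 3, "$": 4, ":": 1.5}
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[-=#$:.]")

def chain_bonds(smiles):
    """List (atom, atom, order) for each bond in an unbranched SMILES chain."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("only unbranched, acyclic, aliphatic SMILES are handled")
    bonds, previous, pending = [], None, None
    for token in tokens:
        if token == ".":                 # non-bond: the next atom is not attached
            previous, pending = None, None
        elif token in BOND_ORDER:        # explicit bond symbol
            pending = BOND_ORDER[token]
        else:                            # an atom; adjacency implies a single bond
            if previous is not None:
                bonds.append((previous, token, pending or 1))
            previous, pending = token, None
    return bonds

print(chain_bonds("CCO") == chain_bonds("C-C-O"))   # True
print(chain_bonds("O=C=O"))
print(chain_bonds("[Na+].[Cl-]"))                   # []
```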

Rings

Ring structures are written by breaking each ring at an arbitrary point to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms.
For example, cyclohexane and dioxane may be written as C1CCCCC1 and O1CCOCC1 respectively. For a second ring, the label will be 2. For example, decalin may be written as C1CCCC2C1CCCC2.
SMILES does not require that ring numbers be used in any particular order, and permits ring number zero, although this is rarely used. Also, it is permitted to reuse ring numbers after the first ring has closed, although this usually makes formulae harder to read. For example, bicyclohexyl is usually written as C1CCCCC1C2CCCCC2, but it may also be written as C0CCCCC0C0CCCCC0.
Multiple digits after a single atom indicate multiple ring-closing bonds. For example, an alternative SMILES notation for decalin is C1CCCC2CCCCC12, where the final carbon participates in both ring-closing bonds 1 and 2. If two-digit ring numbers are required, the label is preceded by %, so C%12 is a single ring-closing bond of ring 12.
Either or both of the digits may be preceded by a bond type to indicate the type of the ring-closing bond. For example, cyclopropene is usually written C1=CC1, but if the double bond is chosen as the ring-closing bond, it may be written as C=1CC1, C1CC=1, or C=1CC=1. C=1CC-1 is illegal, as it explicitly specifies conflicting types for the ring-closing bond.
Ring-closing bonds may not be used to denote multiple bonds. For example, C1C1 is not a valid alternative to C=C for ethylene. However, they may be used with non-bonds; C1.C2.C12 is a peculiar but legal alternative way to write propane, more commonly written CCC.
Choosing a ring-break point adjacent to attached groups can lead to a simpler SMILES form by avoiding branches. For example, cyclohexane-1,2-diol is most simply written as OC1CCCCC1O; choosing a different ring-break location produces a branched structure that requires parentheses to write.
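The ring-closure rules (label reuse, several labels on one atom, the % prefix, and bond symbols on either end) can be checked with a small scanner. This is a sketch under assumptions: it only pairs up ring-closure labels and does not build the full molecule, so it cannot detect a closure that duplicates an existing bond such as C1C1. The name `ring_closures` is hypothetical.

```python
import re

TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Cl|Br|[BCNOPSFIbcnops])"
    r"|(?P<bond>[-=#$:/\\])"
    r"|(?P<ring>%\d\d|\d)"
    r"|(?P<other>[().])"
)

def ring_closures(smiles):
    """Return (atom_i, atom_j, bond_symbol_or_None) for each ring-closing bond."""
    matches = list(TOKEN.finditer(smiles))
    if sum(len(m.group()) for m in matches) != len(smiles):
        raise ValueError("unrecognised character in SMILES")
    open_rings = {}            # label -> (atom index, bond symbol at opening)
    closures = []
    atom, bond = -1, None
    for m in matches:
        kind, text = m.lastgroup, m.group()
        if kind == "atom":
            atom, bond = atom + 1, None
        elif kind == "bond":
            bond = text
        elif kind == "ring":
            label = int(text.lstrip("%"))
            if label in open_rings:                      # second use closes the ring
                first, first_bond = open_rings.pop(label)
                if bond and first_bond and bond != first_bond:
                    raise ValueError("conflicting bond types on ring %d" % label)
                closures.append((first, atom, first_bond or bond))
            else:                                        # first use opens it
                open_rings[label] = (atom, bond)
            bond = None
    if open_rings:
        raise ValueError("unclosed ring labels: %s" % sorted(open_rings))
    return closures

print(ring_closures("C1CCCC2CCCCC12"))   # [(0, 9, None), (4, 9, None)]
print(ring_closures("C0CCCCC0C0CCCCC0") == ring_closures("C1CCCCC1C2CCCCC2"))  # True
```

Popping a label from the table as soon as it closes is what makes label reuse work: the next occurrence of the same digit simply opens a new ring.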

Aromaticity

Aromatic rings such as benzene may be written in one of three forms:

In Kekulé form with alternating single and double bonds, e.g. C1=CC=CC=C1,
Using the aromatic bond symbol :, e.g. C1:C:C:C:C:C1, or
Most commonly, by writing the constituent B, C, N, O, P and S atoms in lower-case forms b, c, n, o, p and s, respectively.

In the latter case, bonds between two aromatic atoms are assumed to be aromatic bonds. Thus, benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1.
Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH]; thus imidazole is written in SMILES notation as n1c[nH]cc1.
When aromatic atoms are singly bonded to each other, such as in biphenyl, a single bond must be shown explicitly: c1ccccc1-c2ccccc2. This is one of the few cases where the single bond symbol - is required.
The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity.

Branching

Branches are described with parentheses, as in CCC(=O)O for propionic acid and FC(F)F for fluoroform. The first atom within the parentheses, and the first atom after the parenthesized group, are both bonded to the same branch point atom. The bond symbol must appear inside the parentheses; outside (e.g. CCC=(O)O) is invalid.
Substituted rings can be written with the branching point in the ring as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
Branches may be written in any order. For example, bromochlorodifluoromethane may be written as FC(Br)(Cl)F, BrC(F)(F)Cl, C(F)(Cl)(F)Br, or the like. Generally, a SMILES form is easiest to read if the simpler branch comes first, with the final, unparenthesized portion being the most complex. The only caveats to such rearrangements are:

If ring numbers are reused, they are paired according to their order of appearance in the SMILES string. Some adjustments may be required to preserve the correct pairing.
If stereochemistry is specified, adjustments must be made; see below.

The one form of branch which does not require parentheses is the ring-closing bond. Choosing ring-closing bonds appropriately can reduce the number of parentheses required. For example, toluene is normally written as Cc1ccccc1 or c1ccccc1C, avoiding the parentheses required if written as c1cc(C)ccc1 or c1cc(ccc1)C.
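Parentheses map naturally onto a stack: an opening parenthesis remembers the branch-point atom, and a closing one returns to it. The sketch below is a minimal reader written on that assumption. It handles organic-subset and bracket atoms, bond symbols, branches and ring closures, but performs no validation, and the crude comparison in `neighbourhoods` is enough for these small examples without being a substitute for canonicalization. Both function names are hypothetical.

```python
import re

TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFIbcnops]|%\d\d|\d|[-=#$:/\\().]")
BOND_SYMBOLS = set("-=#$:/\\")

def parse(smiles):
    """Return (atoms, bonds): atom symbols and (i, j, bond_symbol) triples."""
    atoms, bonds = [], []
    stack = []               # branch-point atoms to return to on ")"
    rings = {}               # open ring label -> (atom index, bond symbol)
    prev, bond = None, ""
    for token in TOKEN.findall(smiles):
        if token == "(":
            stack.append(prev)
        elif token == ")":
            prev = stack.pop()
        elif token == ".":
            prev = None
        elif token in BOND_SYMBOLS:
            bond = token
        elif token[0] == "%" or token.isdigit():
            label = int(token.lstrip("%"))
            if label in rings:
                other, other_bond = rings.pop(label)
                bonds.append((other, prev, other_bond or bond))
            else:
                rings[label] = (prev, bond)
            bond = ""
        else:                                   # an atom
            atoms.append(token)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, bond))
            prev, bond = len(atoms) - 1, ""
    return atoms, bonds

def neighbourhoods(smiles):
    """Sorted list of (atom symbol, sorted neighbour symbols)."""
    atoms, bonds = parse(smiles)
    around = [[] for _ in atoms]
    for i, j, _ in bonds:
        around[i].append(atoms[j])
        around[j].append(atoms[i])
    return sorted((atoms[k], sorted(around[k])) for k in range(len(atoms)))

print(parse("CCC(=O)O"))
print(neighbourhoods("FC(Br)(Cl)F") == neighbourhoods("C(F)(Cl)(F)Br"))   # True
print(neighbourhoods("Cc1ccccc1") == neighbourhoods("c1cc(ccc1)C"))       # True
```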

Stereochemistry

SMILES permits, but does not require, specification of stereoisomers.
Configuration around double bonds is specified using the characters / and \ to show directional single bonds adjacent to a double bond. For example, F/C=C/F is one representation of trans-1,2-difluoroethylene, in which the fluorine atoms are on opposite sides of the double bond, whereas F/C=C\F is one possible representation of cis-1,2-difluoroethylene, in which the fluorines are on the same side of the double bond.
Bond direction symbols always come in groups of at least two, of which the first is arbitrary. That is, F\C=C\F is the same as F/C=C/F. When alternating single-double bonds are present, the groups are larger than two, with the middle directional symbols being adjacent to two double bonds. For example, the common (E,E) form of 2,4-hexadiene is written C/C=C/C=C/C.
Figure: beta-carotene, with the eleven double bonds highlighted.
As a more complex example, beta-carotene has a very long backbone of alternating single and double bonds, which may be written CC1(C)CCC/C(C)=C1/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C)/C=C/C2=C(C)/CCCC2(C)C.
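For the simplest pattern, one substituent written before the double bond and one after it, the reading rule reduces to comparing the two direction symbols: equal symbols mean opposite sides, different symbols mean the same side. The sketch below handles only strings of the form X/C=C/Y and is an illustration rather than a general stereo perception routine; the name `double_bond_geometry` is hypothetical.

```python
import re

SIMPLE_ALKENE = re.compile(r"[A-Za-z]+([/\\])C=C([/\\])[A-Za-z]+")

def double_bond_geometry(smiles):
    """Classify a SMILES of the form X/C=C/Y as 'trans' or 'cis'."""
    m = SIMPLE_ALKENE.fullmatch(smiles)
    if m is None:
        raise ValueError("expected a SMILES of the form X/C=C/Y")
    # Same direction symbol on both sides: substituents on opposite sides.
    return "trans" if m.group(1) == m.group(2) else "cis"

print(double_bond_geometry("F/C=C/F"))     # trans
print(double_bond_geometry("F\\C=C\\F"))   # trans (same molecule as above)
print(double_bond_geometry("F/C=C\\F"))    # cis
```

Only the relationship between the two symbols carries information, which is why flipping both of them leaves the molecule unchanged.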
Configuration at tetrahedral carbon is specified by @ or @@. Consider the four bonds in the order in which they appear, left to right, in the SMILES form. Looking toward the central carbon from the perspective of the first bond, the other three are either clockwise or counter-clockwise. These cases are indicated with @@ and @, respectively.
For example, consider the amino acid alanine. One of its SMILES forms is NC(C)C(=O)O, more fully written as N[CH](C)C(=O)O. L-Alanine, the more common enantiomer, is written as N[C@@H](C)C(=O)O. Looking from the nitrogen–carbon bond, the hydrogen (H), methyl (C), and carboxylate (C(=O)O) groups appear clockwise. D-Alanine can be written as N[C@H](C)C(=O)O.
While the order in which branches are specified in SMILES is normally unimportant, in this case it matters; swapping any two groups requires reversing the chirality indicator. If the branches are reversed so alanine is written as NC(C(=O)O)C, then the configuration also reverses; L-alanine is written as N[C@H](C(=O)O)C. Other ways of writing it include C[C@H](N)C(=O)O, OC(=O)[C@@H](N)C and OC(=O)[C@H](C)N.
Normally, the first of the four bonds appears to the left of the carbon atom, but if the SMILES is written beginning with the chiral carbon, such as C(C)(N)C(=O)O, then all four are to the right, but the first to appear (the bond to the implicit hydrogen in this case) is used as the reference to order the following three: L-alanine may also be written [C@@H](C)(N)C(=O)O.
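The rule that swapping any two groups reverses the chirality indicator is a statement about permutation parity: an even reordering of the four neighbours keeps the symbol, an odd one flips it. The sketch below applies that rule to L-alanine, N[C@@H](C)C(=O)O, whose neighbours appear in the order N, H, CH3, COOH. The function names and the neighbour labels are hypothetical.

```python
def permutation_parity(old, new):
    """Return 0 if `new` is an even permutation of `old`, 1 if it is odd."""
    order = [old.index(item) for item in new]
    swaps = 0
    for i in range(len(order)):
        while order[i] != i:              # sort by swapping, counting the swaps
            j = order[i]
            order[i], order[j] = order[j], order[i]
            swaps += 1
    return swaps % 2

def reorder_chirality(symbol, old, new):
    """Chirality symbol (@ or @@) after rewriting the neighbours in a new order."""
    if permutation_parity(old, new) == 0:
        return symbol
    return "@" if symbol == "@@" else "@@"

l_alanine = ["N", "H", "CH3", "COOH"]     # neighbour order in N[C@@H](C)C(=O)O

# N[C@H](C(=O)O)C: the two branches are swapped, so the symbol flips.
print(reorder_chirality("@@", l_alanine, ["N", "H", "COOH", "CH3"]))   # @
# OC(=O)[C@@H](N)C: two swaps, so the symbol is unchanged.
print(reorder_chirality("@@", l_alanine, ["COOH", "H", "N", "CH3"]))   # @@
# [C@@H](C)(N)C(=O)O: a cyclic shift of three neighbours is even.
print(reorder_chirality("@@", l_alanine, ["H", "CH3", "N", "COOH"]))   # @@
```

In a bracket atom the implicit hydrogen counts as the neighbour immediately after the preceding atom, or as the first neighbour when the chiral atom starts the string, which is how the orders above were read off.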
The SMILES specification includes elaborations on the @ symbol to indicate stereochemistry around more complex chiral centers, such as trigonal bipyramidal molecular geometry.

Isotopes

Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.

Examples

To illustrate a molecule with more than 9 rings, consider cephalostatin-1, a steroidic 13-ringed pyrazine with the empirical formula C₅₄H₇₄N₂O₁₀ isolated from the Indian Ocean hemichordate Cephalodiscus gilchristi:
Starting with the left-most methyl group in the figure:
Note that % appears in front of the index of ring closure labels above 9; see above.

Other examples of SMILES

The SMILES notation is described extensively in the SMILES theory manual provided by Daylight Chemical Information Systems and a number of illustrative examples are presented. Daylight's depict utility provides users with the means to check their own examples of SMILES and is a valuable educational tool.

Extensions

SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for subgraph isomorphism.
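The point that searching happens on graphs rather than on strings can be shown with a brute-force matcher. The sketch below assumes both query and target have already been converted to atom lists and bond lists, treats "*" as a wildcard atom, and ignores bond types. It tries every assignment of query atoms to target atoms, which is only practical for tiny inputs; real toolkits use far more efficient algorithms. The name `find_matches` is hypothetical.

```python
from itertools import permutations

def find_matches(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return every mapping of query atoms onto target atoms that preserves bonds."""
    target_edges = {frozenset(bond) for bond in target_bonds}
    matches = []
    for mapping in permutations(range(len(target_atoms)), len(query_atoms)):
        atoms_ok = all(q in ("*", target_atoms[t])
                       for q, t in zip(query_atoms, mapping))
        bonds_ok = all(frozenset((mapping[i], mapping[j])) in target_edges
                       for i, j in query_bonds)
        if atoms_ok and bonds_ok:
            matches.append(mapping)
    return matches

# Target: ethanol, written CCO (atoms 0, 1, 2).
ethanol_atoms, ethanol_bonds = ["C", "C", "O"], [(0, 1), (1, 2)]

print(find_matches(["C", "O"], [(0, 1)], ethanol_atoms, ethanol_bonds))  # [(1, 2)]
print(find_matches(["*", "O"], [(0, 1)], ethanol_atoms, ethanol_bonds))  # [(1, 2)]
print(find_matches(["C", "C"], [(0, 1)], ethanol_atoms, ethanol_bonds))  # [(0, 1), (1, 0)]
```

The same match is found whether ethanol was entered as CCO, OCC or C(O)C, because the comparison never looks at the original strings.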
SMIRKS, a superset of "reaction SMILES" and a subset of "reaction SMARTS", is a line notation for specifying reaction transforms. The general syntax for the reaction extensions is REACTANT>AGENT>PRODUCT, where any of the fields can either be left blank or filled with multiple molecules delimited with a dot (.), and other descriptions dependent on the base language. Atoms can additionally be identified with a number for mapping, for example [C:1].
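The REACTANT>AGENT>PRODUCT layout can be split into its fields with plain string operations. The sketch below assumes that each dot separates whole molecules, which ignores the corner case of ring-closure labels spanning a dot. The function name `split_reaction` and the example reactions are illustrative.

```python
def split_reaction(reaction):
    """Split REACTANT>AGENT>PRODUCT into three lists of molecule strings."""
    fields = reaction.split(">")
    if len(fields) != 3:
        raise ValueError("expected exactly two '>' separators")
    reactants, agents, products = (
        [molecule for molecule in field.split(".") if molecule]
        for field in fields
    )
    return {"reactants": reactants, "agents": agents, "products": products}

# Acid-catalysed esterification, written as an illustrative reaction SMILES.
print(split_reaction("CC(=O)O.OCC>[H+]>CC(=O)OCC.O"))
# A blank agent field is allowed.
print(split_reaction("C=C.[H][H]>>CC"))
```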

Conversion

SMILES can be converted back to two-dimensional representations using structure diagram generation algorithms. This conversion is not always unambiguous. Conversion to three-dimensional representation is achieved by energy-minimization approaches. There are many downloadable and web-based conversion utilities.

SMILES related software utilities

NIH online services
* – resolves or generates SMILES from chemical names, CAS Registry Numbers, InChI/InChIKey and many other chemical structure file formats
* for 2D Plots of Chemical Structures
* – online molecule editor
* – JSME online molecule editor that generates SMILES/SMARTS
ChemAxon utilities, mostly Java-based, some with free personal use
* – translates a SMILES formula into graphics with Marvin, hosted by UC Irvine
* – chemical editor/viewer and SMILES generator/converter
* – desktop application for storing/generating/converting/visualizing/searching SMILES structures, particularly batch processing
* other tools and third-party integration
OELib and descendants
* – a FOSS molecule editor which can read and write SMILES; has Gtk+ and HTML5 frontends
– 3D coordinate generation
– an unofficial InChI website featuring on-line converter from InChI and SMILES to molecular drawings, based on OASA
– A free program for 3D coordinate generation and conformational analysis.
– an open-source cross-platform cheminformatics library with a plugin for IUPAC-compliant molecule and reaction 2D structural formula rendering.
Bioclipse – a free and open source workbench for the life sciences
Scilligence utilities
* – A .NET cheminformatics toolkit to read/write SMILES, generate 2D coordinates from SMILES, and convert SMILES to and from other chemical file formats.
* – A cross-platform JavaScript chemical structure editor to generate SMILES and SMARTS.
