Simplified molecular input line entry specification



Infobox file format
name = smiles
extension = .smi
genre = chemical file format

The simplified molecular input line entry specification or SMILES is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

The original SMILES specification was developed by Arthur Weininger and David Weininger in the late 1980s. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems Inc. In 2007, an open standard called [http://www.opensmiles.org "OpenSMILES"] was developed by the [http://www.blueobelisk.org Blue Obelisk] open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).

In August 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI; it also has a wide base of software support with extensive theoretical (e.g., graph theory) backing.

Terminology

The term SMILES refers to a line notation for encoding molecular structures, and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms Canonical and Isomeric can lead to some confusion when applied to SMILES: they describe different attributes of SMILES strings and are not mutually exclusive.

Typically, a number of equally valid SMILES can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to ensure that the same SMILES is generated for a molecule regardless of the order of atoms in the structure. This SMILES is unique for each structure, although dependent on the canonicalisation algorithm used to generate it, and is termed the Canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure; they do not simply manipulate strings, as is sometimes thought. Algorithms for generating Canonical SMILES have been developed at [http://www.daylight.com Daylight Chemical Information Systems], [http://www.eyesopen.com OpenEye Scientific Software] and [http://www.chemcomp.com Chemical Computing Group]. A common application of Canonical SMILES is indexing molecules in a database and ensuring their uniqueness.
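The equivalence of CCO, OCC and C(O)C can be illustrated with a minimal sketch. It is not the canonicalisation algorithm of any of the vendors named above: the hypothetical helpers parse_simple and canonical_key assume a SMILES made only of single-letter atoms, single bonds and branches, and the canonical form is found by brute force over all atom orderings, which is only practical for very small molecules.

```python
from itertools import permutations

def parse_simple(smiles):
    """Parse a SMILES containing only single-letter atoms, single bonds and branches.

    Returns (atoms, bonds): a list of element symbols and a set of index pairs.
    Rings, bracket atoms and bond symbols are deliberately not supported.
    """
    atoms, bonds, stack, prev = [], set(), [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)        # remember the branch point
        elif ch == ")":
            prev = stack.pop()        # return to the branch point
        else:
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.add(frozenset((prev, idx)))
            prev = idx
    return atoms, bonds

def canonical_key(atoms, bonds):
    """Return the smallest relabelled (atoms, bonds) pair over all atom orderings."""
    best = None
    for perm in permutations(range(len(atoms))):
        # perm[old] is the new index given to atom `old`
        new_atoms = [None] * len(atoms)
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(tuple(sorted(perm[i] for i in bond)) for bond in bonds)
        key = (tuple(new_atoms), tuple(new_bonds))
        if best is None or key < best:
            best = key
    return best

keys = {canonical_key(*parse_simple(s)) for s in ("CCO", "OCC", "C(O)C")}
print(len(keys))  # all three strings describe the same graph
```

The comparison is made on the parsed graph, not on the strings themselves, which mirrors the point made above about internal representations.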

SMILES notation allows the specification of configuration at tetrahedral centers, and double bond geometry. These are structural features that cannot be specified by connectivity alone and SMILES which encode this information are termed Isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term Isomeric SMILES is also applied to SMILES in which isotopes are specified.

Graph-based definition

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
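The procedure described above can be sketched as follows. This is an illustrative reading of the description, not a reference implementation: the function name write_smiles and its input format are assumptions, hydrogens are taken to be already removed, every bond is treated as single, and ring-closure labels are limited to single digits.

```python
def write_smiles(symbols, bonds, start=0):
    """Write a SMILES string by depth-first traversal of a hydrogen-suppressed graph.

    symbols: list of element symbols, one per atom.
    bonds:   iterable of (i, j) index pairs, all treated as single bonds.
    """
    neighbours = {i: [] for i in range(len(symbols))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    children = {i: [] for i in neighbours}   # spanning-tree edges
    closures = {i: [] for i in neighbours}   # ring-closure labels per atom
    visited, ring_bonds = set(), set()

    def explore(atom, parent):
        """First pass: split bonds into tree bonds and ring-closure bonds."""
        visited.add(atom)
        for nb in neighbours[atom]:
            if nb == parent:
                continue
            bond = frozenset((atom, nb))
            if nb not in visited:
                children[atom].append(nb)
                explore(nb, atom)
            elif bond not in ring_bonds:
                ring_bonds.add(bond)             # this bond is "broken"
                label = str(len(ring_bonds))     # numeric suffix for both ends
                closures[atom].append(label)
                closures[nb].append(label)

    def emit(atom):
        """Second pass: print symbols, labels and parenthesised branches."""
        text = symbols[atom] + "".join(closures[atom])
        kids = children[atom]
        for kid in kids[:-1]:
            text += "(" + emit(kid) + ")"        # every branch but the last
        if kids:
            text += emit(kids[-1])               # last branch continues the chain
        return text

    explore(start, None)
    return emit(start)

ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(write_smiles(["C"] * 6, ring))                          # C1CCCCC1
print(write_smiles(["C", "C", "O"], [(0, 1), (1, 2)], 1))     # C(C)O
```

Starting the traversal from a different atom gives a different but equally valid string, which is why a separate canonicalisation step is needed to obtain a unique SMILES.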

Examples

Atoms

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. The hydroxide anion is [OH-] . Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance the SMILES for water is simply O.
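The implicit hydrogen rule for the organic subset can be sketched as a lookup of normal valences. The table below follows the values commonly given for the organic subset (B 3; C 4; N 3 or 5; O 2; P 3 or 5; S 2, 4 or 6; halogens 1); the function name implicit_hydrogens is an assumption made for illustration, and aromatic atoms are not handled.

```python
# Normal valences assumed for atoms of the organic subset written without brackets.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(symbol, bond_order_sum):
    """Return the implicit hydrogen count for an unbracketed aliphatic atom.

    The lowest normal valence that accommodates the explicit bonds is used;
    if the explicit bonds exceed every normal valence, no hydrogens are added.
    """
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0

print(implicit_hydrogens("O", 0))  # 2, so the SMILES "O" is water
print(implicit_hydrogens("C", 1))  # 3, a terminal methyl carbon as in CCO
```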

Bonds

Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES. For example, the SMILES for ethanol can be written as CCO. Ring closure labels are used to indicate connectivity between non-adjacent atoms in the SMILES, which for cyclohexane and 1,4-dioxane can be written as C1CCCCC1 and O1CCOCC1 respectively. Double and triple bonds are represented by the symbols '=' and '#' respectively, as illustrated by the SMILES O=C=O (carbon dioxide) and C#N (hydrogen cyanide).

Aromaticity

Aromatic C, O, S and N atoms are written in lower case as 'c', 'o', 's' and 'n' respectively. Benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1. Bonds between aromatic atoms are, by default, aromatic, although these can be specified explicitly using the ':' symbol. Aromatic atoms can be singly bonded to each other, and biphenyl can be represented by c1ccccc1-c2ccccc2. Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH], and imidazole is written in SMILES notation as n1c[nH]cc1.

The [http://www.daylight.com Daylight] and [http://www.eyesopen.com OpenEye] algorithms for generating canonical SMILES differ in their treatment of aromaticity.

Branching

Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N ([http://www.daylight.com/daycgi/depict?434f6328633129636363633143234e see depiction]) and COc(cc1)ccc1C#N ([http://www.daylight.com/daycgi/depict?434f6328636331296363633143234e see depiction]), which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
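The symbols introduced so far (atoms, bond symbols, ring-closure labels and branch parentheses) can be separated with a small tokenizer. The following is a sketch only: the token names and the tokenize function are hypothetical, bracket atoms are kept as opaque tokens, and no check is made that parentheses or ring closures are balanced.

```python
import re

# One alternative per kind of symbol; "Cl" and "Br" must be tried before "C" and "B".
TOKEN = re.compile(r"""
    (?P<bracket>\[[^\]]+\])              |  # bracket atom such as [Au] or [nH]
    (?P<atom>Cl|Br|[BCNOPSFI]|[bcnops])  |  # organic-subset or aromatic atom
    (?P<bond>[-=\#:/\\])                 |  # explicit bond symbol
    (?P<ring>%\d\d|\d)                   |  # ring-closure label
    (?P<branch>[()])                     |  # branch open or close
    (?P<dot>\.)                             # disconnected components
""", re.VERBOSE)

def tokenize(smiles):
    """Split a SMILES string into (kind, text) tokens, left to right."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print([text for _, text in tokenize("CCC(=O)O")])
# ['C', 'C', 'C', '(', '=', 'O', ')', 'O']
```

A full parser would then turn these tokens into a graph, using a stack for branch points in the way the parentheses suggest.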

Stereochemistry

Configuration around double bonds is specified using the characters "/" and "\". For example, F/C=C/F ([http://www.daylight.com/daycgi/depict?462f433d432f46 see depiction]) is one representation of "trans"-difluoroethene, in which the fluorine atoms are on opposite sides of the double bond, whereas F/C=C\F ([http://www.daylight.com/daycgi/depict?462f433d435c46 see depiction]) is one possible representation of "cis"-difluoroethene, in which the fluorine atoms are on the same side of the double bond, as shown in the figure.
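For the simple written form X/C=C/Y, the rule can be stated compactly: if the two directional symbols "/" and "\" are the same, the substituents are on opposite sides (trans); if they differ, the substituents are on the same side (cis). The sketch below applies that rule only to strings of exactly this shape; the function name classify_double_bond is an assumption, and branched or ring-containing forms are out of scope.

```python
import re

# Matches only the linear form  X<dir>C=C<dir>Y  with plain letter substituents.
SIMPLE_ALKENE = re.compile(r"^([A-Za-z]+)([/\\])C=C([/\\])([A-Za-z]+)$")

def classify_double_bond(smiles):
    """Return 'trans' or 'cis' for a SMILES of the form X/C=C/Y or X/C=C\\Y."""
    match = SIMPLE_ALKENE.match(smiles)
    if match is None:
        raise ValueError("only the simple form X/C=C/Y is handled by this sketch")
    first, second = match.group(2), match.group(3)
    return "trans" if first == second else "cis"

print(classify_double_bond("F/C=C/F"))    # trans
print(classify_double_bond("F/C=C\\F"))   # cis
```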

Configuration at tetrahedral carbon is specified by @ or @@. L-Alanine, the more common enantiomer of the amino acid alanine, can be written as N[C@@H](C)C(=O)O ([http://www.daylight.com/daycgi/depict?4e5b434040485d28432943283d4f294f see depiction]). The @@ specifier indicates that, when viewed from nitrogen along the bond to the chiral center, the sequence of substituents hydrogen (H), methyl (C) and carboxylate (C(=O)O) appears clockwise. D-Alanine can be written as N[C@H](C)C(=O)O ([http://www.daylight.com/daycgi/depict?4e5b4340485d28432943283d4f294f see depiction]). The order of the substituents in the SMILES string is very important, and D-alanine can also be encoded as N[C@@H](C(=O)O)C ([http://www.daylight.com/daycgi/depict?4e5b434040485d2843283d4f294f2943 see depiction]).
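The dependence on substituent order follows from permutation parity: exchanging any two neighbours of a tetrahedral center while keeping the same molecule requires switching between @ and @@. A minimal sketch, in which the function names and the neighbour labels ("N", "H", "CH3", "COOH") are hypothetical shorthand for the four substituents of alanine:

```python
def permutation_parity(old_order, new_order):
    """Return 0 if new_order is an even permutation of old_order, 1 if odd."""
    positions = [old_order.index(item) for item in new_order]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return inversions % 2

def reorder_chirality(symbol, old_order, new_order):
    """Return the @/@@ symbol that keeps the same configuration after reordering."""
    if sorted(old_order) != sorted(new_order):
        raise ValueError("both orders must list the same neighbours")
    if permutation_parity(old_order, new_order) == 0:
        return symbol
    return "@" if symbol == "@@" else "@@"

# D-alanine written as N[C@H](C)C(=O)O lists the neighbours N, H, CH3, COOH.
written = ("N", "H", "CH3", "COOH")
swapped = ("N", "H", "COOH", "CH3")
print(reorder_chirality("@", written, swapped))  # @@, as in N[C@@H](C(=O)O)C
```

The labels must be distinct for the parity calculation to be meaningful, which is always the case at a genuine stereocenter.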

Isotopes

Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.
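The contents of a bracket atom (isotope, element symbol, chirality, hydrogen count and charge) appear in a fixed order and can be picked apart with a regular expression. The sketch below covers only the fields used in the examples of this article; the function name parse_bracket_atom and the returned dictionary keys are assumptions, and features such as repeated charge signs or atom classes are not handled.

```python
import re

BRACKET_ATOM = re.compile(
    r"\["
    r"(?P<isotope>\d+)?"                        # integer isotopic mass, e.g. 14
    r"(?P<symbol>[A-Z][a-z]?|se|as|[bcnops])"   # element, upper or aromatic lower case
    r"(?P<chirality>@@?)?"                      # @ or @@
    r"(?P<hcount>H\d?)?"                        # attached hydrogens, e.g. H or H2
    r"(?P<charge>[+-]\d*)?"                     # formal charge, e.g. - or +2
    r"\]"
)

def parse_bracket_atom(text):
    """Parse one bracket atom such as [14c], [2H], [OH-] or [C@@H]."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a bracket atom this sketch understands: {text!r}")
    isotope = match.group("isotope")
    hcount = match.group("hcount")
    charge = match.group("charge")
    if charge is None:
        charge_value = 0
    else:
        magnitude = int(charge[1:] or 1)
        charge_value = magnitude if charge[0] == "+" else -magnitude
    return {
        "isotope": int(isotope) if isotope else None,
        "symbol": match.group("symbol"),
        "chirality": match.group("chirality"),
        "hydrogens": 0 if hcount is None else int(hcount[1:] or 1),
        "charge": charge_value,
    }

print(parse_bracket_atom("[14c]"))
print(parse_bracket_atom("[OH-]"))
```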

Other examples of SMILES

The SMILES notation is described extensively in the [http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html SMILES theory manual] provided by [http://www.daylight.com/ Daylight Chemical Information Systems] and a number of illustrative examples are presented. Daylight's [http://www.daylight.com/daycgi/depict depict utility] provides users with the means to check their own examples of SMILES and is a valuable educational tool.

Extensions

SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations, which are then searched for subgraph isomorphism. [http://www.daylight.com/dayhtml/doc/theory/theory.smirks.html SMIRKS] is a line notation for specifying reaction transforms.
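The idea of searching graphs, not strings, can be sketched with a brute-force subgraph match. This is not how production toolkits implement substructure search, and it does not parse SMARTS: the query and target are given directly as hypothetical atom lists and bond lists, "*" stands for a wildcard atom, and all bonds are treated alike.

```python
from itertools import permutations

def find_matches(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return every mapping of query atoms onto distinct target atoms.

    A mapping is accepted when each query atom matches its target atom
    ("*" matches anything) and each query bond has a corresponding target bond.
    """
    target_bond_set = {frozenset(bond) for bond in target_bonds}
    matches = []
    for mapping in permutations(range(len(target_atoms)), len(query_atoms)):
        atoms_ok = all(
            q == "*" or q == target_atoms[mapping[i]]
            for i, q in enumerate(query_atoms)
        )
        bonds_ok = all(
            frozenset((mapping[i], mapping[j])) in target_bond_set
            for i, j in query_bonds
        )
        if atoms_ok and bonds_ok:
            matches.append(mapping)
    return matches

# Ethanol, CCO, as a graph: atoms 0 and 1 are carbon, atom 2 is oxygen.
ethanol_atoms, ethanol_bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
print(find_matches(["C", "O"], [(0, 1)], ethanol_atoms, ethanol_bonds))  # [(1, 2)]
print(find_matches(["*", "O"], [(0, 1)], ethanol_atoms, ethanol_bonds))  # [(1, 2)]
```

Because the match is made on atom indices, the result is the same whichever of the equivalent SMILES strings was used to build the target graph.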

Conversion

SMILES can be converted back to 2-dimensional representations using Structure Diagram Generation algorithms (Helson, 1999). This conversion is not always unambiguous. Conversion to 3-dimensional representation is achieved by energy minimization approaches. There are many downloadable and web-based conversion utilities.

See also

* Smiles arbitrary target specification (SMARTS), a language for specification of substructural queries
* SYBYL Line Notation (another line notation)
* Molecular Query Language - query language allowing also numerical properties, e.g. physicochemical values or distances
* Chemistry Development Kit (2D layout and conversion)
* International Chemical Identifier (InChI), the free and open alternative to SMILES by the IUPAC.
* OpenBabel, JOELib, OELib (conversion)

References

* Anderson, E.; Veith, G.D.; Weininger, D. (1987) SMILES: A line notation and computerized interpreter for chemical structures. Report No. EPA/600/M-87/021. U.S. EPA, Environmental Research Laboratory-Duluth, Duluth, MN 55804.
* Weininger, D. (1988), SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, [http://dx.doi.org/10.1021/ci00057a005 "J. Chem. Inf. Comput. Sci." 28, 31-36.]
* Weininger, D.; Weininger, A.; Weininger, J.L. (1989) SMILES. 2. Algorithm for generation of unique SMILES notation [http://dx.doi.org/10.1021/ci00062a008 "J. Chem. Inf. Comput. Sci." 29, 97-101.]
* Helson, H.E. (1999) Structure Diagram Generation In Rev. Comput. Chem. edited by Lipkowitz, K. B. and Boyd, D. B. Wiley-VCH, New York, pages 313-398.

External links

Specifications

* [http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html "SMILES - A Simplified Chemical Language"]
* [http://www.opensmiles.org The OpenSMILES home page]
* [http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html "SMARTS - SMILES Extension"]
* [http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html Daylight SMILES tutorial]
* [http://www.dalkescientific.com/writings/diary/archive/2004/01/05/tokens.html Parsing SMILES]

SMILES related software utilities

* [http://www.chembiogrid.org/cheminfo/smi23d/ smi23d] – 3D Coordinate Generation
* [http://www.daylight.com/daycgi/depict Daylight Depict]
* [http://cactus.nci.nih.gov/services/gifcreator/ CACTVS at NCI]
* [http://pubchem.ncbi.nlm.nih.gov/edit/index.html PubChem] – online molecule editor
* [http://www.molinspiration.com/jme/index.html JME molecule editor]
* [http://www.acdlabs.com/download/chemsk.html ACD/ChemSketch]
* [http://www.chemaxon.com/product/live_examples.html ChemAxon/Marvin] – online chemical editor/viewer and SMILES generator/converter
* [http://www.chemaxon.com/product/ijc.html ChemAxon/Instant JChem] – desktop application for storing/generating/converting/visualizing/searching SMILES structures, particularly batch processing; personal edition free
* [http://www.hungry.com/~alves/smormoed/ Smormo-Ed] – a molecule editor for Linux which can read and write SMILES
* [http://inchi.info/ InChI.info] – an unofficial InChI website featuring on-line converter from InChI and SMILES to molecular drawings


Wikimedia Foundation. 2010.
