Chemical file format


This article discusses some common molecular file formats, including usage and converting between them.

Distinguishing formats

Chemical information is usually provided as files or streams and many formats have been created, with varying degrees of documentation. The format is indicated in three ways

Chemical Markup Language

Chemical Markup Language (CML) is an open standard for representing molecular and other chemical data. The open source project includes an XML Schema, source code for parsing and working with CML data, and an active community. The articles Tools for Working with Chemical Markup Language and XML for Chemistry and Biosciences discuss CML in more detail. CML data files are accepted by many tools, including JChemPaint, Jmol, XDrawChem and MarvinView.
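Because CML is XML, a generic XML parser is enough to pull out atoms and bonds. The following is a minimal sketch using only the Python standard library; the water fragment and its coordinates are illustrative assumptions, not data taken from any of the tools named above.

```python
import xml.etree.ElementTree as ET

# Hypothetical CML fragment for water; the coordinates are illustrative only.
CML_TEXT = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.000"/>
    <atom id="a2" elementType="H" x3="0.757" y3="0.586" z3="0.000"/>
    <atom id="a3" elementType="H" x3="-0.757" y3="0.586" z3="0.000"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""

# CML elements live in this XML namespace.
NS = {"cml": "http://www.xml-cml.org/schema"}


def read_cml(text):
    """Return ({atom id: element symbol}, [(atom id, atom id), ...])."""
    root = ET.fromstring(text)
    atoms = {a.get("id"): a.get("elementType")
             for a in root.iterfind(".//cml:atom", NS)}
    bonds = [tuple(b.get("atomRefs2").split())
             for b in root.iterfind(".//cml:bond", NS)]
    return atoms, bonds


atoms, bonds = read_cml(CML_TEXT)
```

A dedicated CML library would also validate against the schema; this sketch only reads the atom and bond arrays.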

Protein Data Bank Format

The Protein Data Bank Format is commonly used for proteins but it can be used for other types of molecules as well. It was originally designed as, and continues to be, a fixed-column-width format and thus officially has a built-in maximum number of atoms, of residues, and of chains; this resulted in splitting very large structures such as ribosomes into multiple files. However, many tools can read files that exceed those limits. For example, the E. coli 70S ribosome was represented as 4 PDB files in 2009: , , 3I1O and 3I1P. In 2014 they were consolidated into a single file, .
Some PDB files contain an optional section describing atom connectivity as well as position. Because these files are sometimes used to describe macromolecular assemblies or molecules represented in explicit solvent, they can grow very large and are often compressed. Some tools, such as Jmol and KiNG, can read PDB files in gzipped format. The wwPDB maintains the specifications of the PDB file format and its XML alternative, PDBML. There was a fairly major change in PDB format specification in August 2007, and a remediation of many file problems in the existing database. The typical file extension for a PDB file is .pdb, although some older files use .ent or .brk. Some molecular modeling tools write nonstandard PDB-style files that adapt the basic format to their own needs.
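The fixed-column layout can be illustrated by slicing an ATOM record at the column positions given in the wwPDB specification. This is a sketch rather than a complete reader: the function name and the sample atom are assumptions made for illustration, and only a few fields are extracted.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM record by its fixed columns (1-based in the spec)."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "element": line[76:78].strip(),   # columns 77-78
    }


# Hypothetical sample record, assembled field by field so the widths stay visible.
sample = (
    "ATOM  " + f"{1:>5d}" + " " + " N  " + " " + "ALA" + " " + "A"
    + f"{1:>4d}" + "    "
    + f"{11.104:8.3f}{6.134:8.3f}{-6.504:8.3f}"
    + f"{1.00:6.2f}{0.00:6.2f}" + " " * 10 + " N"
)
atom = parse_atom_record(sample)

# The serial number has only five columns, which caps strictly conforming
# files at 99,999 atoms.
MAX_SERIAL = 10**5 - 1
```

The five-column serial field and the single-column chain identifier are where the built-in limits on atoms and chains come from.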

GROMACS format

The GROMACS file format family was created for use with the molecular simulation software package GROMACS. It closely resembles the PDB format but was designed for storing output from molecular dynamics simulations, so it allows for additional numerical precision and optionally retains information about particle velocity as well as position at a given point in the simulation trajectory. It does not allow for the storage of connectivity information, which in GROMACS is obtained from separate molecule and system topology files. The typical file extension for a GROMACS file is .gro.
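As an illustration of the optional velocity columns, the sketch below parses a single atom line of a .gro file. It assumes the default column widths (four 5-character fields, then positions in 8.3 format and velocities in 8.4 format); the function name and the sample values are hypothetical.

```python
def parse_gro_atom(line):
    """Parse one atom line of a .gro file, assuming the default column widths."""
    atom = {
        "res_num": int(line[0:5]),
        "res_name": line[5:10].strip(),
        "atom_name": line[10:15].strip(),
        "atom_num": int(line[15:20]),
        # Three position columns, 8 characters each.
        "position": tuple(float(line[20 + 8 * i:28 + 8 * i]) for i in range(3)),
        "velocity": None,
    }
    # Velocities are optional: three more 8-character columns when present.
    if len(line.rstrip()) >= 68:
        atom["velocity"] = tuple(
            float(line[44 + 8 * i:52 + 8 * i]) for i in range(3)
        )
    return atom


# Hypothetical sample lines, built with the same widths the parser expects.
base = f"{1:5d}{'WATER':<5s}{'OW1':>5s}{1:5d}{0.126:8.3f}{1.624:8.3f}{1.679:8.3f}"
with_velocity = base + f"{0.1227:8.4f}{-0.0580:8.4f}{0.0434:8.4f}"

static_atom = parse_gro_atom(base)
moving_atom = parse_gro_atom(with_velocity)
```

Connectivity does not appear anywhere in the line, consistent with it being stored in the separate topology files.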

CHARMM format

The CHARMM molecular dynamics package can read and write a number of standard chemical and biochemical file formats; however, the CARD and PSF formats are largely unique to CHARMM. The CARD format is fixed-column-width, resembles the PDB format, and is used exclusively for storing atomic coordinates. The PSF file contains atomic connectivity information and is required before beginning a simulation. The typical file extensions are .crd and .psf, respectively.

GSD format

The General Simulation Data (GSD) file format was created for efficient reading and writing of generic particle simulations, primarily, but not exclusively, those from HOOMD-blue. The package also contains a Python module that reads and writes HOOMD schema GSD files with an easy-to-use syntax.

Ghemical file format

The Ghemical software can use OpenBabel to import and export a number of file formats. However, by default, it uses the GPR format. This file is composed of several parts, separated by a tag.
The proposed MIME type for this format is application/x-ghemical.

SYBYL Line Notation

SYBYL Line Notation (SLN) is a chemical line notation. Based on SMILES, it incorporates a complete syntax for specifying relative stereochemistry. SLN has a rich query syntax that allows for the specification of Markush structure queries. The syntax also supports the specification of combinatorial libraries of CD.
Example SLNs

| Description | SLN String |
| --- | --- |
| Benzene | CH:CH:CH:CH:CH:CH:@1 |
| Alanine | NH2CHCOH |
| Query showing R sidechain | R1C:C:C:C:C:C:@1 |
| Query for amide/sulfamide | NHC=M1 |

SMILES

The Simplified Molecular Input Line Entry Specification is a line notation for molecules. SMILES strings include connectivity but do not include 2D or 3D coordinates.
Hydrogen atoms are usually left implicit rather than written out. Other atoms are represented by their element symbols B, C, N, O, F, P, S, Cl, Br, and I. The symbol "=" represents double bonds and "#" represents triple bonds. Branching is indicated by parentheses. Rings are indicated by pairs of matching digits.
Some examples are:

| Name | Formula | SMILES String |
| --- | --- | --- |
| Methane | CH4 | C |
| Ethanol | C2H6O | CCO |
| Benzene | C6H6 | C1=CC=CC=C1 or c1ccccc1 |
| Ethylene | C2H4 | C=C |
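The notation rules above can be made concrete with a small tokenizer. This is a sketch that covers only the subset described here (the listed element symbols, lowercase aromatic atoms, "=" and "#" bonds, parentheses for branches, and single ring-closure digits); it is not a full SMILES parser, and the function name is an assumption for illustration.

```python
import re

# Two-letter symbols come first so that "Cl" is not read as "C" followed by "l".
TOKEN = re.compile(r"Cl|Br|[BCNOFPSI]|[bcnops]|[=#]|[()]|\d")


def summarize_smiles(smiles):
    """Count heavy atoms, multiple bonds and rings in a simple SMILES string."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")

    open_rings = set()
    rings = 0
    depth = 0
    for token in tokens:
        if token.isdigit():
            # A digit opens a ring the first time and closes it the second time.
            if token in open_rings:
                open_rings.remove(token)
                rings += 1
            else:
                open_rings.add(token)
        elif token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
    if open_rings or depth != 0:
        raise ValueError("unclosed ring or branch")

    return {
        "heavy_atoms": sum(1 for t in tokens if t[0].isalpha()),
        "double_bonds": tokens.count("="),
        "triple_bonds": tokens.count("#"),
        "rings": rings,
    }


ethanol = summarize_smiles("CCO")
kekule_benzene = summarize_smiles("C1=CC=CC=C1")
aromatic_benzene = summarize_smiles("c1ccccc1")
```

Both benzene spellings give six heavy atoms and one ring; only the Kekulé form reports explicit double bonds, because the aromatic form encodes them through lowercase atoms.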

XYZ

The XYZ file format is a simple format that usually gives the number of atoms on the first line and a comment on the second, followed by one line per atom with the atomic symbol and Cartesian coordinates.
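A reader for this layout takes only a few lines. The sketch below assumes the common variant in which the first line holds just the atom count; the function name and the water coordinates are illustrative.

```python
def parse_xyz(text):
    """Return (comment, [(symbol, x, y, z), ...]) from XYZ-formatted text."""
    lines = text.strip("\n").splitlines()
    n_atoms = int(lines[0].split()[0])  # first line: number of atoms
    comment = lines[1]                  # second line: free-form comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("atom count does not match the header")
    return comment, atoms


# Hypothetical water geometry, used only to exercise the parser.
XYZ_TEXT = """3
water (illustrative coordinates)
O  0.000  0.000  0.117
H  0.000  0.757 -0.471
H  0.000 -0.757 -0.471
"""

comment, atoms = parse_xyz(XYZ_TEXT)
```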

MDL number

The MDL number contains a unique identification number for each reaction and variation. The format is RXXXnnnnnnnn. R indicates a reaction, XXX indicates which database contains the reaction record. The numeric portion, nnnnnnnn, is an 8-digit number.
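The RXXXnnnnnnnn pattern can be checked with a regular expression. In this sketch the database code is assumed to consist of three uppercase letters or digits, which the description above does not state explicitly, and the example identifier is hypothetical.

```python
import re

# Assumption: the three-character database code is uppercase letters or digits.
MDL_NUMBER = re.compile(r"R(?P<database>[A-Z0-9]{3})(?P<number>\d{8})")


def parse_mdl_number(text):
    """Return (database code, numeric part) or None if the text does not match."""
    match = MDL_NUMBER.fullmatch(text)
    if match is None:
        return None
    return match.group("database"), int(match.group("number"))


valid = parse_mdl_number("RXYZ00000042")   # hypothetical identifier
invalid = parse_mdl_number("RXYZ42")       # numeric part too short
```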

Other common formats

Among the most widely used industry standards are the chemical table file formats, such as Structure Data Format (SDF) files. They are text files that adhere to a strict format for representing multiple chemical structure records and associated data fields. The format was originally developed and published by Molecular Design Limited (MDL). MOL is another file format from MDL. It is documented in Chapter 4 of CTfile Formats.
PubChem also has XML and ASN.1 file formats, which are export options from the PubChem online database. Both are text based.
A large number of other formats are listed in the table below.

Converting between formats

Open Babel and JOELib are freely available open source tools specifically designed for converting between file formats. Their chemical expert systems support large atom type conversion tables.
For example, to convert the file epinephrine.sdf in SDF to CML use the command
The resulting file is epinephrine.cml.
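One way to script such a conversion is to call a command-line converter from Python. The sketch below assumes that Open Babel's `obabel` executable is installed and uses its `-O` option to name the output file; the helper function is hypothetical, and this is not necessarily the command the text above refers to.

```python
import shutil
import subprocess
from pathlib import Path


def build_obabel_command(source, target_ext):
    """Build an obabel argument list converting `source` to `target_ext`."""
    src = Path(source)
    dst = src.with_suffix("." + target_ext)
    # obabel infers both formats from the file extensions here.
    return ["obabel", str(src), "-O", str(dst)], dst


command, output_path = build_obabel_command("epinephrine.sdf", "cml")

# Only run the conversion when the tool and the input file are both present.
if shutil.which("obabel") and Path("epinephrine.sdf").exists():
    subprocess.run(command, check=True)
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.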
A number of tools intended for viewing and editing molecular structures are able to read in files in a number of formats and write them out in other formats. The tools JChemPaint, XDrawChem, Chime, Jmol, Mol2mol and Discovery Studio fit into this category.

The Chemical MIME Project

"Chemical MIME" is a de facto approach for adding MIME types to chemical streams.

This project started in January 1994, and was first announced during the Chemistry workshop at the First WWW International Conference, held at CERN in May 1994.... The first version of an Internet draft was published during May–October 1994, and the second revised version during April–September 1995. A paper presented to the CPEP at the IUPAC meeting in August 1996 is available for discussion.

In 1998 the work was formally published in the JCIM.
| File Extension | MIME Type | Proper Name | Description |
| --- | --- | --- | --- |
| alc | chemical/x-alchemy | Alchemy Format | |
| csf | chemical/x-cache-csf | CAChe MolStruct CSF | |
| cbin, cascii, ctab | chemical/x-cactvs-binary | CACTVS format | |
| cdx | chemical/x-cdx | ChemDraw eXchange file | |
| cer | chemical/x-cerius | MSI Cerius II format | |
| c3d | chemical/x-chem3d | Chem3D Format | |
| chm | chemical/x-chemdraw | ChemDraw file | |
| cif | chemical/x-cif | Crystallographic Information File, Crystallographic Information Framework | Promulgated by the International Union of Crystallography |
| cmdf | chemical/x-cmdf | CrystalMaker Data format | |
| cml | chemical/x-cml | Chemical Markup Language | XML based Chemical Markup Language. |
| cpa | chemical/x-compass | Compass program of the Takahashi | |
| bsd | chemical/x-crossfire | Crossfire file | |
| csm, csml | chemical/x-csml | Chemical Style Markup Language | |
| ctx | chemical/x-ctx | Gasteiger group CTX file format | |
| cxf, cef | chemical/x-cxf | Chemical eXchange Format | |
| emb, embl | chemical/x-embl-dl-nucleotide | EMBL Nucleotide Format | |
| spc | chemical/x-galactic-spc | SPC format for spectral and chromatographic data | |
| inp, gam, gamin | chemical/x-gamess-input | GAMESS Input format | |
| fch, fchk | chemical/x-gaussian-checkpoint | Gaussian Checkpoint Format | |
| cub | chemical/x-gaussian-cube | Gaussian Cube Format | |
| gau, gjc, gjf, com | chemical/x-gaussian-input | Gaussian Input Format | |
| gcg | chemical/x-gcg8-sequence | Protein Sequence Format | |
| gen | chemical/x-genbank | ToGenBank Format | |
| istr, ist | chemical/x-isostar | IsoStar Library of Intermolecular Interactions | |
| jdx, dx | chemical/x-jcamp-dx | JCAMP Spectroscopic Data Exchange Format | |
| kin | chemical/x-kinemage | Kinetic Images; Kinemage | |
| mcm | chemical/x-macmolecule | MacMolecule File Format | |
| mmd, mmod | chemical/x-macromodel-input | MacroModel Molecular Mechanics | |
| mol | chemical/x-mdl-molfile | MDL Molfile | |
| smiles, smi | chemical/x-daylight-smiles | Simplified molecular input line entry specification | A line notation for molecules. |
| sdf | chemical/x-mdl-sdfile | Structure-Data File | |
| el | chemical/x-sketchel | SketchEl Molecule | |
| ds | chemical/x-datasheet | SketchEl XML DataSheet | |
| inchi | chemical/x-inchi | The IUPAC International Chemical Identifier | |
| jsd, jsdraw | chemical/x-jsdraw | JSDraw native file format | |
| helm, ihelm | chemical/x-helm | Pistoia Alliance HELM string | A line notation for biological molecules |
| xhelm | chemical/x-xhelm | Pistoia Alliance XHELM XML file | XML based HELM including monomer definitions |
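A program can teach its own MIME lookup about a few of these types. The sketch below uses Python's standard `mimetypes` module with a private registry; the choice of five extensions and the helper name are assumptions for illustration, while the type strings are copied from the table.

```python
import mimetypes

# A handful of extension-to-type pairs taken from the table above.
CHEMICAL_TYPES = {
    ".mol": "chemical/x-mdl-molfile",
    ".sdf": "chemical/x-mdl-sdfile",
    ".cml": "chemical/x-cml",
    ".cif": "chemical/x-cif",
    ".smi": "chemical/x-daylight-smiles",
}

# A private registry avoids changing the process-wide MIME table.
registry = mimetypes.MimeTypes()
for extension, mime_type in CHEMICAL_TYPES.items():
    registry.add_type(mime_type, extension)


def chemical_mime(filename):
    """Return (MIME type, content encoding) for a chemical file name."""
    return registry.guess_type(filename)


plain = chemical_mime("epinephrine.cml")
compressed = chemical_mime("caffeine.sdf.gz")
```

Compressed files are reported with their underlying chemical type plus a separate `gzip` encoding, which matches how gzipped structure files are usually served.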

Support

For Linux/Unix, configuration files are available as a "chemical-mime-data" package in .deb, RPM and tar.gz formats to register chemical MIME types on a web server. Programs can then register as a viewer, editor or processor for these formats so that full support for chemical MIME types is available.