Project Details IUPAC - International Chemical Identifier

Project No.:
2000-025-1-800
Start Date:
01 January 2001
End Date:
15 April 2005

Objective

The objective of the IUPAC Chemical Identifier Project is to establish a unique label, the IUPAC Chemical Identifier, which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources, thus enabling easier linking of diverse data compilations.

Remarks:
– Initiated by the ad hoc Committee on Chemical Identity and Nomenclature Systems.
– In July 2004, the Identifier was renamed INChI (formerly IChI) to acknowledge the development work at NIST.
– In November 2004, the Identifier was renamed IUPAC International Chemical Identifier (InChI), to allow trademark, copyright and licensing issues to be resolved.

Description

Develop a set of algorithms for the standard representation of chemical structures that will be readily accessible to chemists in all countries at no cost. The standard chemical representation could be used as input into existing and newly developed computer programs to generate an IUPAC name and a unique IUPAC identifier.


The aim of the Chemical Identifier project is to establish a unique label, the IUPAC Chemical Identifier (IChI), which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources thus enabling easier linking of diverse data and information compilations.

IChI will not require the establishment of a registry system. Unlike the CAS Registry System, it will not depend on the existence of a database of unique substance records to establish the next number for any new chemical substance being assigned an IChI. It will use a yet-to-be-defined set of IUPAC structure conventions, together with rules for normalization and canonicalization of the structure representation, to establish the unique label. It will thus enable automatic conversion of a graphical representation of a chemical substance into the unique IChI label. This conversion can be performed anywhere in the world and could be built into desktop chemical structure drawing packages (such as ChemDraw, ISIS/Draw, etc.) and online chemical structure drawing applets (such as ACD/Draw).

IUPAC would define the process flow leading from input structural information to the creation of the Identifier in three steps: definition of chemical structure input requirements, algorithms for generating a unique set of atom labels (canonicalization), and algorithms for conversion of these labels to the Identifier (serialization). Structure input and conversion to the structural format required by the IChI generator would be done by vendor developed software.
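The canonicalization and serialization steps can be illustrated with a deliberately naive sketch. It is not the project's algorithm: the structure format (a list of element symbols plus a list of bonded index pairs), the function names `serialize` and `canonical_label`, and the output string format are all assumptions made for this example. It simply tries every possible atom numbering and keeps the smallest serialized string, so the same label comes out whatever order the atoms were drawn in and no registry lookup is needed.

```python
from itertools import permutations


def serialize(symbols, bonds):
    """Turn an already-numbered structure into a string (toy format, not IChI)."""
    atom_part = ".".join(symbols)
    bond_part = ",".join(f"{i}-{j}" for i, j in sorted(bonds))
    return f"{atom_part}|{bond_part}"


def canonical_label(symbols, bonds):
    """Brute-force canonicalization: try every numbering, keep the smallest string."""
    n = len(symbols)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] gives the new index of that atom.
        new_symbols = [None] * n
        for old, new in enumerate(perm):
            new_symbols[new] = symbols[old]
        new_bonds = [tuple(sorted((perm[i], perm[j]))) for i, j in bonds]
        candidate = serialize(new_symbols, new_bonds)
        if best is None or candidate < best:
            best = candidate
    return best


# Heavy-atom skeleton of ethanol drawn in two different atom orders,
# and dimethyl ether (same atoms, different connectivity).
ethanol_a = canonical_label(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_label(["O", "C", "C"], [(0, 1), (1, 2)])
ether = canonical_label(["C", "O", "C"], [(0, 1), (1, 2)])
print(ethanol_a == ethanol_b)  # True
print(ethanol_a == ether)      # False
```

Trying all numberings grows factorially with the number of atoms, so this only works for very small structures; a practical generator needs a far more efficient canonical numbering procedure.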

The process would be reversible, so that the Identifier output could be used to recreate structural input information. The Identifier would thus serve as the computer equivalent of the IUPAC name for a molecule. This would facilitate searching the Internet and labeling information in electronic documents with the name of the chemical substance in question.

A. D. McNaught
7 February 2001

Progress

> project announcement published in Chem. Int. 23(3) 2001, p.85

Our initial work has focussed on the development of algorithms for converting an input organic chemical structure to a unique (canonical) form. This, in effect, involves the unique numbering of each atom, with equivalent atoms being assigned identical numbers. “Serializing” the result to create a string is the final, straightforward step in creating an identifier.
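The idea of giving equivalent atoms identical numbers can be sketched with a simple iterative refinement, in the spirit of the well-known Morgan-style approaches from the literature. This is an illustration under stated assumptions, not the procedure used in the project: the function name `refine_classes` and the input format (element symbols plus bonded index pairs) are invented here, and the refinement only guarantees that atoms it cannot tell apart share a number.

```python
def refine_classes(symbols, bonds):
    """Return one class number per atom; indistinguishable atoms share a number."""
    n = len(symbols)
    neighbours = [[] for _ in range(n)]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def rank(keys):
        # Replace each key by its 1-based rank among the sorted distinct keys.
        order = {key: r for r, key in enumerate(sorted(set(keys)), start=1)}
        return [order[key] for key in keys]

    # Start from element symbol and number of neighbours.
    classes = rank([(symbols[i], len(neighbours[i])) for i in range(n)])
    while True:
        # Split classes further using the classes of each atom's neighbours.
        keys = [
            (classes[i], tuple(sorted(classes[j] for j in neighbours[i])))
            for i in range(n)
        ]
        refined = rank(keys)
        if refined == classes:  # nothing split any further: stable
            return classes
        classes = refined


# Carbon skeleton of propane: the two terminal carbons are equivalent.
print(refine_classes(["C", "C", "C"], [(0, 1), (1, 2)]))  # [1, 2, 1]
```

A refinement of this kind can leave genuinely different atoms in the same class for some highly regular structures, which is one reason a robust canonical numbering needs additional tie-breaking steps.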

As discussed in the Cambridge IUPAC meeting to consider the feasibility of the project in August 2000, most of the ideas employed in this work have been reported in the technical literature. The principal task of this project has been to identify and implement a workable, robust set of procedures that will provide effective IChI processing for a large proportion of organic chemical structures in common use.

At the Cambridge meeting it was agreed to develop a “layered” approach, where different levels of structural information are separately represented in the identifier. Work has consequently proceeded by step-by-step building of the individual layers. Since the order of application of the layers could affect the final labeling, this process is somewhat more complex than might initially appear.

The layers under development are:

  • constitutional – expresses pure connectivity of the atoms
  • stereochemical – includes conventional C-atom sp2 and sp3 stereochemistry
  • isotopic – enables isotopes to be distinguished
  • tautomeric – implements simple forms of rapid H-migration isomerization
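One way to picture the layered design is a record with one field per layer, where only the constitutional layer is mandatory. The sketch below is purely illustrative: the class name `LayeredIdentifier`, the layer tags (`c`, `s`, `i`, `t`), the `/` separator and the placeholder layer values are all hypothetical and do not reproduce the identifier's actual syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayeredIdentifier:
    """Toy container for separately represented layers (hypothetical format)."""
    constitutional: str
    stereochemical: Optional[str] = None
    isotopic: Optional[str] = None
    tautomeric: Optional[str] = None

    def serialize(self) -> str:
        # Emit only the layers that are present, in a fixed order.
        parts = [
            ("c", self.constitutional),
            ("s", self.stereochemical),
            ("i", self.isotopic),
            ("t", self.tautomeric),
        ]
        return "/".join(f"{tag}={value}" for tag, value in parts if value is not None)

    def same_constitution(self, other: "LayeredIdentifier") -> bool:
        # Compare connectivity only, ignoring every other layer.
        return self.constitutional == other.constitutional


# Two stereoisomers with placeholder layer values: same connectivity, different labels.
z_isomer = LayeredIdentifier("conn-A", stereochemical="Z")
e_isomer = LayeredIdentifier("conn-A", stereochemical="E")
print(z_isomer.serialize())
print(z_isomer.same_constitution(e_isomer), z_isomer == e_isomer)
```

Keeping the layers separate means two records can be matched at the level of detail a user cares about, for example on connectivity alone when stereochemistry is unknown or irrelevant.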

Initial implementation and testing of this work have been completed, with the exception of the following two items:

  • Representing stereochemistry in systems with moving (mesomeric) bonds and electrons.
  • Representing H-migration tautomerism in systems containing 5-membered rings.

The first of these items does not seem to have been addressed adequately in the literature, although appropriate processing algorithms have been found in mathematical journals.

We hope to complete these remaining tasks within two months and then to implement the IChI processor as a standalone program that can automatically process standard “MOL-files”. When this is available, assistance will be sought to further test, and possibly refine, the IChI name generation process.

Depending on results of these tests and discussions, it will be decided whether improvements or additional features are desirable, and, if so, whether these need to be followed by another round of testing. For instance, it needs to be determined whether the first version should allow a canonical representation of partially-specified stereochemical structures.

Finally, as discussed in the Cambridge meeting, there are no plans to include the following structural representations in the first version:

  • non C-atom sp2 and sp3 stereochemistry
  • ring-chain tautomerism (or any other variety not involving simple H-migration)
  • non-covalent bonds

 

March 2002 update
The first beta-test version of the program is now available. It runs as a conventional Windows application under 32-bit Microsoft Windows operating systems. Neither the underlying algorithms nor the program have been perfected – this distribution is intended primarily to allow others to participate in the further development.

This program treats only covalently bonded compounds and uses Mol files (and SD files) as input. Along with the executable programs, the distribution package contains documentation and example structure files.

The package can be obtained from Steve Stein by e-mail to steve.stein@nist.gov. Unless requested otherwise, the package will be delivered as a ‘zip’ file in an e-mail attachment to the return address.

A demonstration of Identifier generation within a (Windows) structure-drawing program, working in conjunction with the beta test program, can be obtained from Alan McNaught by e-mail to mcnaughta@rsc.org.

There was a discussion of the project at the “CAS/IUPAC Conference on Chemical Identifiers and XML for Chemistry” on 1 July 2002 in Columbus, Ohio. On the preceding day (30 June) at the same location the Project Group met to review progress and consider comments received.

 

July 2002 update
At the Task Group meeting in Columbus, OH, on 30 June 2002, Steve Stein reviewed the progress made by NIST in developing the test version of the IUPAC Chemical Identifier. The test version handles simple organic molecules. To date, in all of the testing (almost 70 copies have been distributed) there are no known examples of chemicals that the program does not handle. A number of suggestions (described below) were made regarding testing and output. The overall view was that the project is progressing considerably faster than expected.
> Download report – pdf file (118KB)

A lecture by Steve Stein on the project was given the following day at the CAS/IUPAC Conference on Chemical Identifiers and XML for Chemistry.

 

November 2003 update
A combined meeting for two related IUPAC projects, the XML Data Dictionary Project (#2002-022-1-024) and this Chemical Identifier Project (#2000-025-1-800), was held at the National Institute of Standards and Technology (NIST, Gaithersburg, Maryland, US) on November 12-14, 2003.

A report on that meeting is published in Chem. Int. July-Aug 2004.
A full account of the meeting is available at <www.warr.com/inchi.pdf>.

 

July 2004 update
A new test version of the IUPAC-NIST Chemical Identifier (INChI) is now available. It replaces the previous test version issued last November. All features planned for inclusion in the final release have now been implemented and the final format for the Identifier has been proposed. The new name of the Identifier (formerly IUPAC Chemical Identifier, IChI) acknowledges the development work at NIST. The test program accepts input in the form of MOL files (or SD files) and CML files. An Application Program Interface (API) for communicating with external programs is under development.

A single INChI is generated for a single input structure, which can contain multiple components. Identifiers can be created for organic compounds with Z/E and sp3 stereochemistry, tautomers, and isotopes as well as salts, organometallic compounds and protonated forms of a compound.

Test programs (for Microsoft Windows), documentation and sample structure files are available upon request from Steve Stein <steve.stein@nist.gov>. The project team very much welcomes comments concerning the INChI and will be glad to assist in its testing or implementation.

 

November 2004 update
To allow trademark, copyright and licensing issues to be resolved before distribution of version 1.0, the name of the Identifier was changed to IUPAC International Chemical Identifier (InChI).

April 2005 - project completed
Version 1 of IUPAC’s International Chemical Identifier (InChI) has now been released; software, documentation, source code and licensing conditions are available from the IUPAC website at www.iupac.org/inchi.

Promotion and extension continue through project 2004-039-1-800.

> see release (Chem. Int. July 2005)
> FAQ (prepared by Nick Day of the Unilever Centre for Molecular Informatics, Cambridge University; https://wwmm.ch.cam.ac.uk/inchifaq/)


Clipping
> That INChI Feeling, Reactive Reports, Sep 2004 (issue 40)
> Unique labels for compounds, C&EN, 26 Nov 2002
> That ICHI feeling …, The Alchemist, 24 Apr 2002
> What’s in a Name?, The Alchemist, 21 Mar 2002