What on Earth is InChI?

1. IUPAC and InChI - Providing a New Common Language for Chemistry

The IUPAC Chemical Identifier or InChI (pronounced “en-chee”) is a world-wide, computer-based standard for chemical structure representation created by the collaboration of chemists around the world under the auspices of IUPAC. The InChI format and algorithm are non-proprietary and the software is open source, with ongoing development done by the chemistry community.

The result of this world-wide collaborative effort has been the development, maintenance, and expansion of the capabilities of the open source, freely available, nonproprietary International Chemical Identifier (InChI), first by NIST and now by the InChI Trust (1), a not‑for‑profit UK charity which is supported by contributions from member organizations.

Over 100 chemical information specialists and computational chemists comprise an internet forum to examine and test the software before each public release; this optimal quality control by a world‑wide user community has led to improvements to and new releases of the software. The reliance on input from many volunteers from around the world enables the project to be staffed by a part-time project director and a programmer.

InChI is now common enough in chemical structure handling within databases, publications, and cheminformatics software that it can be considered ubiquitous in these areas. A small, random sample of those who have added InChIs and InChIKeys to their data and information includes:

PubChem – 93 million structures
European Bioinformatics Institute – UniChem – 151 million structures
Royal Society of Chemistry – ChemSpider – 60 million structures
National Cancer Institute – Chemical Structure Lookup Service – 74 million structures
ChemNavigator – iResearch Library – 371 million structures
Elsevier – Reaxys – 29 million structures
ACS/CAS – Chemical Abstracts Service – over 100 million structures

The InChI and InChIKey have become essential tools for scientists worldwide, providing a new Common Language for Chemistry. These tools allow chemists and computers to communicate more effectively, thereby accelerating the pace of scientific research.

2. How does InChI contribute to scientific innovation?

For the past 100 years IUPAC has been the gold standard for standardizing chemical nomenclature primarily through publishing the IUPAC Color Books (2). Over time, chemists have created a Tower of Babel of chemical names. Due to the creation of vast amounts of information in electronic form in the past 30+ years, it has become, more often than not, extremely challenging to locate information and data. Having different names for the same chemical makes it very difficult to find all the necessary information chemists need for their work. For example, in a search of PubChem (3) a common drug like Valium (diazepam) has at least 291 different names and Lipitor (atorvastatin) has some 143 names. And in PubChem benzene has 498 Depositor‑Supplied Synonyms.

The IUPAC InChI reduces complexity and provides a unique character string to link the many islands of public and private chemical information available on the internet, i.e., publishing, patents, chemical inventory, chemical trade, regulation and safety.

Open Access (4), Open Data (5), and Open Standards (6) are areas that are expanding rapidly and are facilitating faster and more effective research discovery. Collaborative, interoperable, and global dissemination standards are essential in a more connected world. The need to ensure that chemical data are fully annotated to allow computer-facilitated processing, use, and re-use is critical for the advancement of scientific research.

The availability of this new standard has enabled a wide variety of applications, such as:

Finding compounds in the chemical/patent/general literature via text‑based search engines
Communication between databases
Merging data collections developed using different systems/protocols
Ordering chemicals from suppliers, maintaining chemical inventory or any broad‑based local chemical collection
Detecting duplicates in collections which are due to different drawing styles for the same structure
Passing the identity of a substance to a colleague for use in any of the above

As Wendy Warr has written (7), such activities and uses are taking place on an almost daily basis, and hundreds of scientific papers in the literature report the use of InChI for merging databases. Tony Williams, the creator of the ChemSpider project, wrote (8):

“It is definitely true to declare that without InChI as an enabling technology ChemSpider is unlikely to have progressed at the pace it did, it would be a lot less functional in terms of its capabilities to connect to chemistry on the internet and would not hold its present prominence as one of the primary community resources for chemists online.”

The ability to use InChI to validate and match collections of compounds is proving successful for managing millions of chemicals in large public databases, such as ChemSpider and PubChem, and in proprietary databases, such as Reaxys and SciFinder. This value could also be realized for large collections of chemical data internal to an organization, such as chemical inventories, with potential cost-saving implications for chemical stocks as well as supporting reporting requirements. Use of InChI in chemical records and supporting documentation would potentially facilitate generation of reports, linking between reports wherever chemical names are listed, and communication of critical information downstream, from research to waste disposal.

Accuracy of chemical identities in inventory records is critical for safety communication and planning. As described by Leah McEwen and Ralph Stuart (9, 10), recent highly damaging events in chemical laboratories and classrooms have led to increasing focus on chemical risk assessment in research organizations. Pertinent data for assessing and managing risks are scattered across many agency and industry resources, nationally and internationally, and in many different formats. Reporting requirements vary by sector and region, and cause difficulties for exchanging and evaluating data. Diverse schemes for identifying chemicals in published information are not always resolvable, and mixed substances are often only indexed by the primary component. InChI would support matching and collation of internal records with disparate external information sources, such as hazard classifications, standard operating procedures and emergency response guidelines.

Many articles of trade exist as defined or partially defined mixtures, ranging from simple solutions to consumer product formulations. In practice, all chemicals contain some level of impurity, and composition affects chemical reactivity, whether unintentionally or by design. A solvent may present a greater hazard than the solute, and communicating information about composition is critical for conducting safe, effective, and scientifically meaningful chemistry and other laboratory functions. Most chemical indexing does a poor job of describing mixtures, and a project is underway for using InChI to identify multiple components. “MInChI” will provide parsable composition data that could be used to track incompatible solvents or known impurities that might disrupt a chemical synthesis, improve storage and waste management, help flag potential hazards in containers or reaction schema, and support analysis of composition properties (11).

InChI is a valuable complement and addition to other compound identifiers (e.g., systematic and trivial names, registry numbers, and various versions of SMILES (12)) in a database and it is not expected nor meant to be a replacement for any identifier already in use. With the implementation of the ISO identification of medicinal products (IDMP) and the related ISO 11238 standards, the addition of an InChI will allow for an easier, effective, and more complete search for information on a particular chemical, be it a drug, a pollutant, or a chemical for other commercial and/or noncommercial use.

The value and benefit of InChI (and InChIKey – see later) to IUPAC and to the general chemical and scientific community is reflected in the support and use by many organizations, some of which support the InChI Trust project financially, as well as by users of the InChI algorithm. These supporters include the National Institutes of Health – National Library of Medicine (NIH/NLM), US Food and Drug Administration, National Institute of Standards and Technology, Royal Society of Chemistry, American Chemical Society – Chemical Abstracts Service, MilliporeSigma, Elsevier RELX, Wiley, Springer Nature, Taylor & Francis, Bio-Rad, OpenEye, and ChemAxon.

3. How was InChI born?

IUPAC began a major reorganization at the turn of this century, led by Ted Becker (then the IUPAC Secretary General). He and Alan McNaught, realizing the need for new approaches to chemical nomenclature, assembled a broad spectrum of chemists involved in providing and using chemical information at the US National Academy of Sciences in March 2000 to examine this issue. The need for a computerized equivalent of an IUPAC name (i.e., a standard chemical identifier represented as a character string) was quickly appreciated. After some exploratory studies that year, and with the good fortune that the US Government standards organization, NIST, needed such a tool for its internal work, the project took off. Hence the InChI project (13) started under the auspices of two well-known authorities for establishing standards – IUPAC and NIST (14).

The goal of the project was two-fold. The first was to have an accepted world-wide, computer-based standard for structure representation created by the collaboration of chemists around the world under the auspices of IUPAC. The second was to use this unique representation, a unique character string, to link the many islands of information available on the internet. As part of this second goal it was made clear that the InChI string for a chemical structure was not a replacement for any existing system [such as SMILES or CAS (15)], but rather an addition to any computer record.

The approach, originally developed by Dmitrii Tchekhovskoi, Steve Stein, and Steve Heller at NIST, was to express a chemical structure in terms of separate layers of information (connectivity, stereochemical, isotopic, and tautomeric). The result of this effort was to create the International Chemical Identifier – InChI. In the final representation, the unique connectivity layer is essential, but the user can choose which other layers to keep. A useful overview is available on Wikipedia (16). Additional details are found in the scientific literature (17-21).

The InChI algorithm converts input structural information into the identifier in a three‑step process: normalization (to remove redundant information), canonicalization (to generate a unique set of atom labels), and serialization (to give a string of characters). The procedure generates a different identifier for every compound, but always gives the same identifier for a particular compound regardless of how the structure is input. Of course, the procedure is equally applicable to both known and, as yet, unknown compounds.
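The idea of "same compound, same string, however it is input" can be illustrated with a much smaller problem. The sketch below is only a loose analogy, not the InChI algorithm: it puts a list of atoms into Hill order (the ordering used for the formula layer) and serializes the result, so that two differently ordered inputs give one string. The function name `hill_formula` is hypothetical.

```python
from collections import Counter

def hill_formula(atoms):
    """Return a Hill-order formula string for a list of element symbols."""
    counts = Counter(atoms)
    if "C" in counts:
        # With carbon present: C first, then H, then the rest alphabetically.
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(e for e in counts if e not in ("C", "H"))
    else:
        # Without carbon: every element alphabetically, H included.
        order = sorted(counts)
    # Serialize; a count of 1 is left implicit.
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)

# The atoms of caffeine listed in two different input orders.
drawing_a = ["C"] * 8 + ["H"] * 10 + ["N"] * 4 + ["O"] * 2
drawing_b = ["O", "N", "H", "C"] * 2 + ["N"] * 2 + ["C"] * 6 + ["H"] * 8

print(hill_formula(drawing_a))
print(hill_formula(drawing_b))
```

Both calls print the same formula. The real canonicalization step works on the atom connectivity graph, which is a far harder problem than ordering a formula.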

As the project evolved it became clear that chemists did not always agree on what structural information should be incorporated. Thus a “Standard InChI” specifically designed for interoperability was created. This Standard InChI does not allow the user to select how to treat tautomerism, stereochemistry, and bonds to metal, thus allowing for easier comparisons between InChIs and between hashed InChIs – the InChIKeys described later in this article. The wider but variable set of options of the “non-standard” InChI still provides additional functionality for specific use cases.

The initial version of the InChI algorithm was released in 2009. From the usage and feedback of users we concluded that this initial version could handle almost all the chemicals with which scientists work every day. Additional work is underway to improve the treatment of tautomers, organometallics and inorganics, and to handle biopolymers, positional isomers, and chemical mixtures. The latest release, in 2017 (version 1.05) (22), added experimental polymer support and multithreading safety, among other novel features. The first version of RInChI (22), InChI for reactions, was released in 2017. The RInChI organizes InChIs involved in chemical reactions in a unique representation providing one layer each for reactants, products, agents (catalysts, solvents, etc.), and the direction of the reaction. This makes the RInChI a precise, robust, structure‑derived tag for chemical reactions. The use of QR codes for InChIs will similarly enable wider use in the labeling of containers.

4. What is a Googleable InChIKey?

As the InChI algorithm creates a string whose length grows with the size of the molecule, these strings can get extremely long. Entering a long string such as that for palytoxin into Google, Bing, Yahoo or a similar search engine is a problem, as search engines have their own ways of deleting characters from a long input and restricting the length of a search query. Furthermore, these search engines may have difficulties treating some non-alphabetical symbols (which may appear in the InChI string). Hence the need for a compressed or hashed version of the InChI string (expressed with a minimal set of characters) became apparent. To simplify indexing chemical structures in databases, and to make chemical structures readily and easily searchable on the Internet, the widely used SHA-256 algorithm (23) was used to create a hashed version of the string, reducing it to a more manageable 27 characters.
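The fixed-length property that makes hashing attractive here can be seen with Python's standard `hashlib` module. This is only a sketch of the underlying idea: it prints raw SHA-256 hex digests and does not reproduce the InChIKey, which applies its own truncation and letter encoding to the hash. The two InChI strings (methane and caffeine) are quoted from memory and should be checked against a database before reuse.

```python
import hashlib

# Example InChI strings (assumed correct; verify against a trusted source).
short_inchi = "InChI=1S/CH4/h1H4"  # methane
long_inchi = (
    "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
)  # caffeine

def sha256_hex(text):
    """Return the SHA-256 digest of an ASCII string as 64 hex characters."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()

for inchi in (short_inchi, long_inchi):
    digest = sha256_hex(inchi)
    # Input lengths differ, digest length does not.
    print(len(inchi), len(digest), digest)
```

However long the input string is, the digest has the same length, which is what allows a short, search-engine-friendly key to be derived from it.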

The InChIKey is simply a condensed representation of the full InChI string, consisting of 27 characters broken down as follows. The first 14 characters encode the core molecular skeleton (formula, connectivity, hydrogen positions, and charge). After a hyphen there is a second block of 10 characters: the first 8 encode the features complementing the core data (stereochemistry, tautomerism, isotopic substitution, and metal ligation), and the remaining 2 indicate whether the original InChI was a Standard InChI and the version number of the InChI software. After a second hyphen, the last character of the InChIKey indicates the (de)protonation state.
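The breakdown of the 27 characters can be sketched as a small parser. The names `InChIKeyParts` and `split_inchikey` are hypothetical, and the example key is the one commonly listed for caffeine (quoted from memory, so worth verifying). The checks are purely structural and say nothing about whether the key corresponds to a real compound.

```python
from typing import NamedTuple

class InChIKeyParts(NamedTuple):
    skeleton: str       # 14 characters: core molecular skeleton
    features: str       # 8 characters: stereo, tautomerism, isotopes, metal ligation
    standard_flag: str  # 1 character: Standard vs non-standard InChI
    version: str        # 1 character: InChI software version
    protonation: str    # 1 character: (de)protonation state

def split_inchikey(key):
    """Split a 27-character InChIKey into its documented fields."""
    blocks = key.split("-")
    if len(key) != 27 or [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError("expected 14-10-1 blocks separated by hyphens")
    if not all(b.isalpha() and b.isupper() for b in blocks):
        raise ValueError("blocks must contain uppercase letters only")
    first, second, third = blocks
    return InChIKeyParts(first, second[:8], second[8], second[9], third)

parts = split_inchikey("RYYVLZVUVIJVGH-UHFFFAOYSA-N")
print(parts)
```

For this key the last two characters of the second block are `S` and `A`, which in current software correspond to a Standard InChI and version 1, and the final `N` indicates that no protons were added or removed.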

For example, the structure, InChI, and InChIKey for palytoxin provided below show why the hashed InChIKey is the most practical and easy way to search for a structure on the internet.

The success of InChIKey is reflected in these comments from Chris Southan (24):

“While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open‑web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin (Lipitor) retrieves ~200 low‑redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false‑positive rate. The InChIKey indexing has therefore turned Google into a de‑facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records.”

As mentioned by Kutchukian and his colleagues (25):

“One of the challenges of leveraging various databases, and integrating them with internal chemogenomic data, is harmonizing identifiers. For example, different chemical identifiers might be used in each database to represent the same chemical entity. Furthermore, even in an internal database, there might be multiple identifiers that correspond to the same compound. One way to address this challenge is by representing compounds as desalted InChIKeys, and associating biological activity with the InChIKey, as opposed to some arbitrary registry number.”

Therefore, the InChI and InChIKey combine to provide a major advance in cheminformatics.

5. How can I generate an InChI or an InChIKey?

An InChI may be generated directly from the InChI software (available at the InChI Trust) or by using any of the common chemical drawing or cheminformatics packages.

Download the latest version of the InChI software: https://www.inchi-trust.org/downloads/

In addition, two websites provide the facility to generate InChIs online:

NCI Chemical Identifier Resolver. This service works as a resolver for different chemical structure identifiers and allows the conversion of a given structure identifier into another representation or structure identifier.
PubChem Server Side Structure Editor v 1.8 includes a facility for generating InChIs as you draw the structure.
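For scripted use, a resolver of this kind is typically called over HTTP. The sketch below assumes the NCI resolver's URL pattern is `https://cactus.nci.nih.gov/chemical/structure/<identifier>/<representation>` and that `stdinchi` and `stdinchikey` are accepted representation names; both are assumptions to confirm against the service's own documentation. The helper names are hypothetical, and only the URL builder is exercised here, since `resolve` needs network access.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the NCI Chemical Identifier Resolver.
RESOLVER_BASE = "https://cactus.nci.nih.gov/chemical/structure"

def resolver_url(identifier, representation="stdinchikey"):
    """Build a resolver URL, percent-encoding the identifier (name, SMILES, ...)."""
    return f"{RESOLVER_BASE}/{quote(identifier, safe='')}/{representation}"

def resolve(identifier, representation="stdinchikey", timeout=10):
    """Fetch the requested representation as text (requires network access)."""
    with urlopen(resolver_url(identifier, representation), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()

print(resolver_url("caffeine"))
print(resolver_url("CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "stdinchi"))
```

Percent-encoding matters for SMILES input, because characters such as `=`, `(`, and `#` are meaningful in URLs.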

6. How are InChIs Designed and Constructed?

As chemists often do not know the complete details of a structure, InChI was designed to take this into account and to allow an InChI to be created from however much of the structure is known. This layered design offers a number of advantages. If two structures for the same substance are drawn at different levels of detail, the one with the lower level of detail will, in effect, be contained within the other. Specifically, if one substance is drawn with stereo‑bonds and the other without, the layers in the latter will be a subset of the former. The same will hold for compounds treated by one author as tautomers and by another as exact structures with all H‑atoms fixed. In many cases, this can work at a finer level. For example, if one author includes double bond and tetrahedral stereochemistry, but another fully omits the latter, the latter InChI will be contained in the former (with a possible exception, which may appear if tetrahedral centers affect double bond stereo).
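The containment described above can be checked mechanically by splitting each InChI on its `/` separators. The sketch below compares alanine drawn without stereochemistry against L-alanine; both strings are quoted from memory and should be verified, and the helper name `inchi_layers` is hypothetical.

```python
def inchi_layers(inchi):
    """Return the '/'-separated layers of an InChI, without the version prefix."""
    prefix, _, rest = inchi.partition("/")
    if not prefix.startswith("InChI="):
        raise ValueError("not an InChI string")
    return rest.split("/")

# Alanine drawn without stereo bonds, and L-alanine drawn with them (assumed strings).
flat = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

flat_layers = inchi_layers(flat)
stereo_layers = inchi_layers(l_alanine)

# Every layer of the less detailed drawing also appears in the more detailed one.
print(set(flat_layers) <= set(stereo_layers))
# The extra layers carry the stereochemical information.
print([layer for layer in stereo_layers if layer not in flat_layers])
```

The second line of output lists only the stereo layers, which is the additional detail the second author supplied.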

The InChI layered design accounts for molecular

formula (standard Hill order);
connectivity (no formal bond orders, hydrogen positions are indicated instead); if the molecule contains metal(s), connectivity may be expressed separately for metal-disconnected version of molecule and for connected-metal (original) version;
charge and protonation/deprotonation;
stereochemistry of a) double bond (Z/E) and b) tetrahedral (sp3) stereogenic elements;
isotopic enrichment;
tautomerism (on or off)

Also, InChI design accounts for issues concerning metal ligation details (which may be completely omitted or specifically addressed).
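A rough way to see these layers in an actual string is to label each `/`-separated segment by its one-letter prefix. The prefix table below is a summary from memory and should be treated as an assumption to check against the InChI Technical Manual. The labeling is also deliberately naive: it assumes the formula comes first, and it ignores that some prefixes recur with a different meaning inside the isotopic and fixed-H sections. The names `LAYER_NAMES` and `label_layers` are hypothetical.

```python
# Assumed mapping of layer prefixes to informal names (verify before relying on it).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogen positions",
    "q": "charge",
    "p": "protonation/deprotonation",
    "b": "double-bond (Z/E) stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopic",
    "f": "fixed-H (non-standard)",
    "r": "reconnected metal",
}

def label_layers(inchi):
    """Return (label, content) pairs for each segment of an InChI string."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, *layers = inchi[len("InChI="):].split("/")
    labeled = [("version", version)]
    if layers:
        # Simplification: treat the first segment as the formula.
        labeled.append(("formula", layers[0]))
    for layer in layers[1:]:
        labeled.append((LAYER_NAMES.get(layer[:1], "unrecognized"), layer[1:]))
    return labeled

caffeine = "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
for label, content in label_layers(caffeine):
    print(f"{label}: {content}")
```

Caffeine has no stereocenters, charge, or isotopic labels, so only the version, formula, connectivity, and hydrogen segments appear.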

This list does not follow the exact sequence of InChI layers which is strictly determined and may sometimes seem really intricate. For example, InChI may contain two stereochemistry layers for an isotopically enriched molecule – the first one describes “isotope-less” stereo and the second one accounts for change of stereo configuration arising due to isotopic substitution. It is important to recognize, however, that InChI strings are intended for use by computers and end users need not understand any of their details (though they are of course documented). In fact, the open nature of InChI and its flexibility of representation, after implementation into software systems, may allow chemists to be even less concerned with the details of structure representation by computers.

General workflow of InChI/Key generation

InChI and InChIKey of caffeine

Note that different protonation states of the same compound will have Standard InChIKeys which differ only by a single character, the protonation flag (unless both states have more than 12 inserted/removed protons). Moreover, as neutral and zwitterionic states of the same molecule both have zero inserted/removed protons, they will have the same Standard InChIKey. Still, non-Standard InChIKeys generated from non-Standard InChIs (including the FixedH sublayer) will allow one to distinguish between the states. This is exemplified by InChIKeys for various ionization states of L-lysine.

Standard (upper line under each drawing) and FixedH (lower line) InChIKeys for the various ionization states of L-lysine

References

1. InChI Trust - https://www.inchi-trust.org
2. IUPAC Color Books - https://iupac.org/what-we-do/books/color-books
3. PubChem - https://pubchem.ncbi.nlm.nih.gov/
4. Open Access - https://en.wikipedia.org/wiki/Open_access
5. Open Data - https://en.wikipedia.org/wiki/Open_data
6. Open Standards - https://en.wikipedia.org/wiki/Open_standard
7. Warr, W. A. “Many InChIs and quite some feat”, J. Comput.-Aided Mol. Des. 2015, 29(8), 681-694 - https://doi.org/10.1007/s1082
8. ChemSpider - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3537679/
9. Meeting the Google Expectation for Chemical Safety Information: Chemical Risk Assessment in Academic Research and Teaching. Leah McEwen, Ralph Stuart, Chemistry International Volume 37, Issue 5-6 (Sep 2015) - https://doi.org/10.1515/ci-2015-0505
10. Chemical Health and Safety Data Management: Supporting Prudent Practices in Research Laboratories. Leah McEwen, Chemistry International Volume 39, Issue 3 (July 2017) - https://doi.org/10.1515/ci-2017-0308
11. InChI Extension for Mixture Composition. IUPAC Project No. 2015-025-4-800; Task group chair: Leah McEwen. - https://iupac.org/project/2000-025-1-800
12. SMILES - https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
13. International Chemical Identifier. IUPAC Project No. 2000-025-1-800; Task group chair: Alan McNaught. - https://iupac.org/project/2000-025-1-800
14. NIST - https://chemdata.nist.gov
15. CAS - https://en.wikipedia.org/wiki/CAS_Registry_Number
16. International Chemical Identifier Wikipedia Page - https://en.wikipedia.org/wiki/International_Chemical_Identifier
17. Heller, S. & McNaught, A. The status of the InChI project and the InChI trust. J. Cheminform (2010) 2(Suppl 1): P2. - https://doi.org/10.1186/1758-2946-2-S1-P2
18. Bachrach, S. M. InChI: a user’s perspective. J. Cheminform (2012) 4:34. - https://doi.org/10.1186/1758-2946-4-34
19. Pletnev, I., Erin, A., Blinov, K., Tchekhovskoi, D., Heller, S. InChIKey collision resistance: an experimental testing. J. Cheminform (2012) 4:39. - https://doi.org/10.1186/1758-2946-4-39
20. Grethe, G., Goodman, J., Allen, C. H. G. International chemical identifier for reactions (RInChI). J. Cheminform (2013) 5:45. - https://doi.org/10.1186/1758-2946-5-45
21. Heller, S. R., McNaught, A., Pletnev, I., Stein, S., Tchekhovskoi, D. InChI, the IUPAC International Chemical Identifier. J. Cheminform (2015) 7:23. - https://doi.org/10.1186/s13321-015-0068-4
22. InChI downloads - https://www.inchi-trust.org/downloads/
23. SHA-2 (Secure Hash Algorithm 2) Wikipedia Page - https://en.wikipedia.org/wiki/SHA-2
24. Southan, C. InChI in the Wild: an Assessment of InChIKey Searching in Google. J. Cheminform (2013) 5:10. - https://www.jcheminf.com/content/5/1/10
25. Kutchukian, P. S., Chang, C., Fox, S. J., Cook, E., Barnard, R. et al. CHEMGENIE: integration of chemogenomics data for applications in chemical biology. Drug Discovery Today, 2017 - https://doi.org/10.1016/j.drudis.2017.09.004

Citation

Boucher, R., Heller, S., Kidd, R., McNaught, A., Pletnev, I. (1 Feb 2018) "What on the Earth is InChI?" IUPAC 100 Stories. Retrieved from https://iupac.org/100/stories/what-on-earth-is-inchi/. (Accessed: day month year)