Project Details

Redesign of Handling of Tautomerism for InChI V2

Project No.: 2012-023-2-800
Start Date: 01 August 2012
End Date:
Division Name: Chemical Nomenclature and Structure Representation
Division No.: 800

Objective

To establish requirements and guidelines for improving the handling of tautomerism in the next generation of the InChI algorithm, addressing the shortcomings of the current algorithm. The intended outcome is worldwide adoption of the new standard, bringing more comprehensive and chemically more accurate handling of structures capable of tautomerism in, e.g., large databases, chemistry books, and journal articles.


Description

The IUPAC International Chemical Identifier (InChI) algorithm is now well established as a powerful means of denoting the basic chemical structure of a well-defined, small (<1024 atoms) organic molecule as a unique machine-readable character string, suitable for electronic data storage, searching, and exchange. The IUPAC Division VIII InChI Subcommittee is now starting work on a complete overhaul of the InChI algorithm, i.e., beginning to plan a version 2 of InChI. A crucial part of this work is intended to address the known shortcomings of the current InChI algorithm in its handling (or lack thereof) of various types of tautomerism.

Important issues intended to be addressed in this context are discussed in, e.g., M. Sitzmann, W.-D. Ihlenfeldt, M. C. Nicklaus, “Tautomerism in large databases”, J Comput Aided Mol Des (2010) 24:521-551; https://dx.doi.org/10.1007/s10822-010-9346-4.

The present project is devoted to analyzing how InChI currently handles the various types of tautomerism, the deficiencies of that handling, and its connection with metal disconnection and protonation/deprotonation; comparing InChI’s current algorithm with approaches published in the literature and/or used in other databases and software; and compiling a list of new requirements for how an InChI V2 algorithm should handle tautomerism and related issues.

Progress

June 2015 update – Since tautomeric equilibria are strongly condition-dependent, one task of the Working Group will be to reach a consensus on which types and ranges of conditions should be kept in mind when discussing, and deciding on, tautomerism in the context of InChI. Discussions about this within the Working Group have started, but the consensus was that so little systematic data is available in this field that there is essentially nothing to work with. It is important to note that the proposal is not to introduce conditions into the calculation of InChI, only into the decision-making about the rules. With the above in mind, activities designed to alleviate the paucity of available data are well underway.

December 2019 update – The scientific part of the project has been completed. More than 80 tautomeric transforms have been identified from various sources in the experimental literature. Two papers forming the scientific background of this project have been submitted for publication: “Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2” and “Tautomer Database: A Comprehensive Resource for Tautomerism Analyses” (preprints available at https://doi.org/10.26434/chemrxiv.10794962.v1 and https://doi.org/10.26434/chemrxiv.10790369.v1). See also L. Guasch, M. Sitzmann, M. C. Nicklaus, “Enumeration of Ring–Chain Tautomers Based on SMIRKS Rules”, J. Chem. Inf. Model. (2014), 54(9): 2423-2432; https://doi.org/10.1021/ci500363p. Their contents constitute the scientific material that the Working Group will use in deciding which types of tautomerism to recommend for inclusion in InChI V2.

The vast majority of the SMIRKS rules match at least hundreds (and many match millions) of molecules in large small-molecule databases such as PubChem. Fewer than ten of these rules are covered to a large extent by the current (V.1.05) Standard InChI. A somewhat higher number are covered by a non-standard InChI generated with the additional tautomerism options KET and 15T turned on. In general, about three times as many molecules would be affected if these rules were implemented in their entirety in InChI V2, compared with what InChI V1’s current algorithm does.
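To make concrete what a single tautomeric transform does, the following is a minimal sketch of one keto-enol 1,3-hydrogen shift applied to a toy molecular graph. It is not the InChI algorithm and not a SMIRKS engine; the function name `keto_to_enol`, the atom/bond representation, and the acetone example are all assumptions chosen for illustration only.

```python
from copy import deepcopy


def keto_to_enol(atoms, bonds):
    """Return every structure reachable by one keto-to-enol hydrogen shift.

    Toy rule (hypothetical, for illustration): H-C(alpha)-C=O -> C(alpha)=C-O-H.
    atoms: list of {"el": element symbol, "h": attached hydrogen count}
    bonds: dict mapping frozenset({i, j}) -> bond order
    """
    results = []
    for pair, order in bonds.items():
        if order != 2:
            continue
        a, b = tuple(pair)
        # Try both orientations to find which end is the carbonyl C and which the O.
        for c, o in ((a, b), (b, a)):
            if atoms[c]["el"] != "C" or atoms[o]["el"] != "O":
                continue
            # Look for a singly bonded alpha carbon that carries a hydrogen.
            for pair2, order2 in bonds.items():
                if order2 != 1 or c not in pair2:
                    continue
                (alpha,) = pair2 - {c}
                if atoms[alpha]["el"] == "C" and atoms[alpha]["h"] > 0:
                    new_atoms = deepcopy(atoms)
                    new_bonds = dict(bonds)
                    new_atoms[alpha]["h"] -= 1  # hydrogen leaves the alpha carbon
                    new_atoms[o]["h"] += 1      # and arrives on the oxygen
                    new_bonds[pair2] = 2        # C(alpha)=C double bond forms
                    new_bonds[pair] = 1         # C=O becomes C-O
                    results.append((new_atoms, new_bonds))
    return results


# Acetone, CH3-C(=O)-CH3, as a toy graph (heavy atoms with hydrogen counts).
ACETONE_ATOMS = [
    {"el": "C", "h": 3},
    {"el": "C", "h": 0},
    {"el": "O", "h": 0},
    {"el": "C", "h": 3},
]
ACETONE_BONDS = {
    frozenset({0, 1}): 1,
    frozenset({1, 2}): 2,
    frozenset({1, 3}): 1,
}

enols = keto_to_enol(ACETONE_ATOMS, ACETONE_BONDS)
```

For acetone the sketch produces two results, one per methyl group; they describe the same enol by symmetry. The keto and enol forms share a molecular formula but differ in hydrogen positions and bond orders, which is why an identifier that is meant to treat them as one compound needs explicit rules for which such transforms to recognize.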

First coding tests are being initiated to investigate how (subsets of) these new tautomeric rules can best be added to InChI, be it as an extension to the existing InChI code or as a rewrite of the code.

Page last updated 3 Dec 2019