Project Details InChI extension for mixture composition

Project No.: 2015-025-4-800
Start Date: 01 July 2016
End Date:
Division Name: Chemical Nomenclature and Structure Representation
Division No.: 800

Objective

The primary objective of this project is to establish requirements and guidelines for the generation of a unique computer-readable identifier for chemical mixtures for use in chemical stock inventories and information systems. Currently, many chemical identifiers exist, but very few systematically address multi-component systems. Furthermore, most existing identifiers are fixed systems that present cross-referencing challenges between databases designed around different initial applications and editorial principles.

An immediate and compelling use case for a mixtures identifier is to improve documentation for laboratory safety communication and planning. Recent highly damaging events in chemical laboratories and classrooms have led to an increasing focus on chemical information management in laboratory organizations. The diverse teaching and research environment of the academic sector in particular is raising awareness of the complexity of the chemical safety information resources and formats available. Documentation of chemicals with current identifiers is a persistent challenge for tracking and managing chemicals across the chemical enterprise, and for communicating critical information downstream, from research to waste disposal and emergency response.

The intended outcome of this project is global adoption of the InChI notation in chemical stock inventories and information systems across commercial, industrial, government, academic and educational sectors, to facilitate accurate documentation, handling and exchange of chemical information in support of safer management and use of chemicals. Mixtures and solutions likely make up a significant fraction of most working research laboratory stocks, and providing functionality in the InChI system to represent these forms will greatly extend its utility for the chemical community.

Description

This project proposes to encode the chemical composition of mixtures within the IUPAC International Chemical Identifier (InChI). This innovation will extend the utility of the InChI as a general purpose identifier of chemicals and chemical systems beyond indexing by individual compounds. The ability to chemically describe all components in a given chemical system is desirable for a variety of purposes. Computer-readable representation of mixture composition could support data management pertaining to reaction planning, property calculation, anticipation of potential hazards, and process optimization. Such chemical information systems are already in development or planning stages, including electronic laboratory notebooks, chemical hazard and risk management procedures, and various data analysis applications. Enabling the InChI algorithm to be implemented at the process development stage will improve data linking and interoperability and establish the best standard of practice up front.

In the context of developing safer and greener chemical management, additional information about concentration, purity, density and other issues related to mixtures becomes critical for planning and documentation. For common compounds with definitive core chemical structures, there may be multiple levels of purities manufactured and several variations of solutions stocked. Information systems that index by chemical structure will pool data across forms of the primary component without recourse for disambiguation. Other than the catalog numbers of chemical suppliers, no known publicly available identifier system exists that addresses these issues, and these identifiers suffer the same challenges as other internal record schemas. They are tied to functions of registration and record management and are not generally chemically meaningful outside of the original system, including the familiar and ubiquitous CAS Registry Number. These systems are not openly and globally scalable, transferable or computable.

The InChI system is particularly well designed to circumvent such limitations at the compound level through dynamic application of clearly defined property layers. Other InChI extension projects for representing salts, polymers and coordination compounds have set the stage for further consideration of multi-component scenarios. This project will scope the minimum additional information required to usefully specify the chemical composition of the common types of mixtures used in laboratory research, and will address issues of terminology and units. The project is not intended to definitively capture all properties relevant to formulations.

Progress

Visit the InChI subcommittee page to check other current InChI projects

January 2019 update – All chemicals exist as mixtures in practice, and mixed substances represent a significant fraction of chemical catalogs and laboratory stocks. Chemical composition impacts reactivity, whether unintentionally or by design. Communicating information about composition is critical for conducting safe, effective, and scientifically meaningful chemistry and other laboratory functions. The goal of this project is to articulate, definitively and in an actionable way, what is known about the chemical composition of a given mixed substance. Applying InChI notation in this context enables the development of an unambiguous machine-readable linear notation for mixed substances of uniform properties that can resolve to unique components. This supports the practical need to connect data and information on mixtures and individual components, and enables further downstream computation and analysis on properties, composition, etc.

The MInChI specification is now in draft, incorporating the standard InChI notation to express the co-occurrence of molecules in mixed substances, with a mechanism to convey information about their relative proportions. It is not intended as a canonical identifier of mixtures: mixed substances are not uniquely specified concepts but are inherently variable, depending on the process of combining other substances. Known components are unambiguously identified, enabling systems to track what is contained in a mixed substance, when compounds also occur in various mixed substances, and which mixtures contain some of the same components (i.e., are related). Some information about their relative proportion is also included (including non-stoichiometric ratios), although units and precision of concentration are context-dependent and not generically interconvertible. A recently launched CODATA initiative for developing digital representations of units of measure and interoperability services will likely enhance the computational utility of MInChI.
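As a rough illustration of the tracking described above, the following Python sketch indexes a small, made-up inventory by the standard InChI strings of its known components. It then looks up which mixed substances contain a given compound and which mixtures share at least one component. The inventory names and the helper functions build_index and related_mixtures are hypothetical and are not part of the draft MInChI specification. Only the InChI strings for water, ethanol and methanol are real identifiers.

```python
from collections import defaultdict

# Standard InChI strings for three common compounds.
WATER = "InChI=1S/H2O/h1H2"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"

# Hypothetical stock inventory: each entry maps a label to the set of
# component InChI strings known to be present (assumption for illustration).
inventory = {
    "aqueous ethanol": {ETHANOL, WATER},
    "aqueous methanol": {METHANOL, WATER},
    "neat ethanol": {ETHANOL},
}


def build_index(inventory):
    """Map each component InChI to the labels of the entries containing it."""
    index = defaultdict(set)
    for label, components in inventory.items():
        for inchi in components:
            index[inchi].add(label)
    return index


def related_mixtures(label, inventory, index):
    """Return labels of other entries sharing at least one component."""
    related = set()
    for inchi in inventory[label]:
        related |= index[inchi]
    related.discard(label)  # an entry is not "related" to itself
    return related


index = build_index(inventory)
water_containing = index[WATER]
related_to_aqueous_ethanol = related_mixtures("aqueous ethanol", inventory, index)
```

Because matching is done on the component identifiers rather than on a label for the whole mixture, the same compound is found whether it is stocked neat or as part of a solution.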

The notation consists of three layers: components, concentration (as mentioned above), and an intermediary layer specifying the order and hierarchy of components within the mixture. Specified components are represented by concatenated InChI strings, allowing for ready compound-level discovery, matching and other common functionality on the InChI notation. A so-called Long MInChIKey version is also in development that will include InChIKeys of individual components to further support these functions. The intermediary layer allows for expression of relationships among components (including nested mixtures), indexing of further properties (such as concentration), and possible additional permutations of the relationships among components within mixtures (i.e., via Boolean operators). The specification and scope of the project are currently being road-tested through a number of pilot implementations in development, including an open-source mixture editing tool from Collaborative Drug Discovery/CDD (project GitHub). This effort will produce an open-source version of the editor and an open library of machine-readable mixture representations. The potential of mixtures informatics has been discussed in blog posts by the senior developer at CDD working on the project (post1, post2), and a paper is forthcoming. Bio-Rad is also looking into the utility of MInChI in managing reported mixed substances in its free spectral database, SpectraBase (project outline).
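The three-layer idea can be sketched as a small data model. The Python below is an assumption-laden illustration, not the draft MInChI syntax. The classes Component and Mixture, the functions flatten and hierarchy, the brace-and-ampersand rendering of the hierarchy, and the concentration values are all hypothetical. Components carry standard InChI strings, nesting carries the hierarchy, and concentrations are stored as reported (value plus unit) without conversion, since units are context-dependent.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

WATER = "InChI=1S/H2O/h1H2"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
NACL = "InChI=1S/ClH.Na/h1H;/q;+1/p-1"


@dataclass
class Component:
    """A known component, identified by its standard InChI string."""
    inchi: str
    # (value, unit) as reported; no unit conversion is attempted.
    concentration: Optional[Tuple[float, str]] = None


@dataclass
class Mixture:
    """A mixed substance whose parts are components or nested mixtures."""
    parts: List[Union["Mixture", Component]] = field(default_factory=list)


def flatten(node):
    """Components layer: list every component in depth-first order."""
    if isinstance(node, Component):
        return [node]
    found = []
    for part in node.parts:
        found.extend(flatten(part))
    return found


def hierarchy(node, counter=None):
    """Intermediary layer: nesting expressed with 1-based component indexes."""
    if counter is None:
        counter = [0]
    if isinstance(node, Component):
        counter[0] += 1
        return str(counter[0])
    return "{" + "&".join(hierarchy(p, counter) for p in node.parts) + "}"


# Hypothetical example: a salt solution combined with ethanol.
saline = Mixture([Component(NACL, (0.9, "% w/v")), Component(WATER)])
sample = Mixture([saline, Component(ETHANOL, (10.0, "% v/v"))])

component_layer = [c.inchi for c in flatten(sample)]
hierarchy_layer = hierarchy(sample)
concentration_layer = [c.concentration for c in flatten(sample)]
```

Keeping the flattened component list separate from the nesting means compound-level matching can work on the component layer alone, while the index-based hierarchy ties each concentration back to the component it describes.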

Page last updated 22 January 2019