Project Details: InChI extension for mixture composition

Project No.: 2015-025-4-800
Start Date: 01 July 2016
End Date:
Division Name: Chemical Nomenclature and Structure Representation
Division No.: 800

Objective

The primary objective of this project is to establish requirements and guidelines for the generation of a unique computer-readable identifier for chemical mixtures for use in chemical stock inventories and information systems. Currently, many chemical identifiers exist, but very few systematically address multi-component systems. Furthermore, most existing identifiers are fixed systems that present cross-referencing challenges between databases designed around different initial applications and editorial principles.

An immediate and compelling use case for a mixtures identifier is to improve documentation for laboratory safety communication and planning. Recent highly damaging events in chemical laboratories and classrooms have led to an increasing focus on chemical information management in laboratory organizations. The diverse teaching and research environment of the academic sector, in particular, is raising awareness of the complexity of the chemical safety information resources and formats available. Documenting chemicals with current identifiers is a persistent challenge for tracking and managing chemicals across the chemical enterprise, and for communicating critical information downstream, from research to waste disposal and emergency response.

The intended outcome of this project is global adoption of the InChI notation in chemical stock inventories and information systems across commercial, industrial, government, academic and educational sectors, to facilitate accurate documentation, handling and exchange of chemical information in support of safer management and use of chemicals. Mixtures and solutions likely make up a significant fraction of most working research laboratory stocks, and providing functionality in the InChI system to represent these forms will greatly extend its utility for the chemical community.

Description

This project proposes to encode the chemical composition of mixtures within the IUPAC International Chemical Identifier (InChI). This innovation will extend the utility of the InChI as a general-purpose identifier of chemicals and chemical systems beyond indexing by individual compounds. The ability to chemically describe all components in a given chemical system is desirable for a variety of purposes. Computer-readable representation of mixture composition could support data management pertaining to reaction planning, property calculation, anticipation of potential hazards, and process optimization. Chemical information systems that would benefit are already in development or planning stages, including electronic laboratory notebooks, chemical hazard and risk management procedures, and various data analysis applications. Enabling the InChI algorithm to be implemented at the process development stage will improve data linking and interoperability and establish the best standard of practice up front.

In the context of developing safer and greener chemical management, additional information about concentration, purity, density and other issues related to mixtures becomes critical for planning and documentation. For common compounds with definitive core chemical structures, there may be multiple levels of purity manufactured and several variations of solutions stocked. Information systems that index by chemical structure will pool data across forms of the primary component, with no means of disambiguation. Other than the catalog numbers of chemical suppliers, no known publicly available identifier system addresses these issues, and catalog numbers suffer the same challenges as other internal record schemas. Such identifiers, including the familiar and ubiquitous CAS Registry Number, are tied to functions of registration and record management and are not generally chemically meaningful outside of the original system. These systems are not openly and globally scalable, transferable or computable.

The InChI system is particularly well designed to circumvent such limitations at the compound level through dynamic application of clearly defined property layers. Other InChI extension projects for representing salts, polymers and coordination compounds have set the stage for further consideration of multi-component scenarios. This project will scope the minimum additional information required to usefully specify the chemical composition of the common types of mixtures used in laboratory research, and will address issues of terminology and units. The project is not intended to definitively capture all properties relevant to formulations.

Progress

Visit InChI subcommittee page to check other current InChI projects

January 2019 update – All chemicals exist as mixtures in practice, and mixed substances represent a significant fraction of chemical catalogs and laboratory stocks. Chemical composition impacts reactivity, whether unintentionally or by design. Communicating information about composition is critical for conducting safe, effective, and scientifically meaningful chemistry and other laboratory functions. The goal of this project is to articulate, definitively and in an actionable way, what is known about the chemical composition of a given mixed substance. Applying InChI notation in this context enables the development of an unambiguous machine-readable linear notation for mixed substances of uniform properties that can resolve to unique components. This supports the practical need to connect data and information on mixtures and individual components, and enables further downstream computation and analysis of properties, composition, etc.

The MInChI specification is now in draft, incorporating the standard InChI notation to express the co-occurrence of molecules in mixed substances, with a mechanism to convey information about their relative proportions. It is not intended as a canonical identifier of mixtures: mixed substances are not uniquely specified concepts, but are inherently variable based on the process of combining other substances. Known components are unambiguously identified, enabling systems to track what is contained in a mixed substance, where the same compounds occur across various mixed substances, and which mixtures contain some of the same components (i.e., are related). Some information about their relative proportions is also included (including non-stoichiometric ratios), although units and precision of concentration are context-dependent and not generically interconvertible. A recently launched CODATA initiative for developing digital representations of units of measure and interoperability services will likely enhance the computational utility of MInChI.
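The component tracking described above can be illustrated with a minimal Python sketch. It assumes a hypothetical inventory in which each mixture is simply a set of standard InChI strings for its known components; the inventory labels and the function names `mixtures_containing` and `related_mixtures` are invented for illustration and are not part of the MInChI draft.

```python
from itertools import combinations

# Standard InChI strings for a few common compounds.
WATER = "InChI=1S/H2O/h1H2"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ACETONE = "InChI=1S/C3H6O/c1-3(2)4/h1-2H3"
SODIUM_CHLORIDE = "InChI=1S/ClH.Na/h1H;/q;+1/p-1"

# Hypothetical inventory: mixture label -> set of known component InChIs.
inventory = {
    "saline": {WATER, SODIUM_CHLORIDE},
    "ethanol_70": {WATER, ETHANOL},
    "acetone_neat": {ACETONE},
}


def mixtures_containing(inv, inchi):
    """Return the sorted labels of all mixtures that list the given component."""
    return sorted(name for name, parts in inv.items() if inchi in parts)


def related_mixtures(inv):
    """Return (label_a, label_b, shared_components) for every pair of
    mixtures that have at least one component in common."""
    related = []
    for a, b in combinations(sorted(inv), 2):
        shared = inv[a] & inv[b]
        if shared:
            related.append((a, b, shared))
    return related


if __name__ == "__main__":
    print(mixtures_containing(inventory, WATER))
    for a, b, shared in related_mixtures(inventory):
        print(a, "and", b, "share", sorted(shared))
```

Because components are matched by exact InChI string, no name normalization is needed; the sketch deliberately ignores proportions, which the text notes are not generically interconvertible.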

The notation consists of three layers: components, concentration (as mentioned above), and an intermediary layer specifying the order and hierarchy of components within the mixture. Specified components are represented by concatenated InChI strings, allowing for ready compound-level discovery, matching and other common functionality on the InChI notation. A so-called Long MInChIKey version is also in development that will include the InChIKeys of individual components to further support these functions. The intermediary layer allows for expression of relationships among components (including nested mixtures), indexing of further properties (such as concentration), and possible additional permutations of the relationships among components within mixtures (i.e., via Boolean operators). The specification and scope of the project are currently being road-tested through a number of pilot implementations in development, including an open-source mixture editing tool from Collaborative Drug Discovery/CDD (project GitHub). This effort will produce an open-source version of the editor and an open library of machine-readable mixture representations. The potential of mixtures informatics has been discussed in blog posts by the senior developer at CDD working on the project (post1, post2), and a paper will be forthcoming shortly. Bio-Rad is also looking into the utility of MInChI in managing reported mixed substances in its free spectral database, SpectraBase.
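To make the three-layer idea concrete, the following Python sketch splits a nested mixture into a component list, a hierarchy expression that refers to components by index, and a parallel list of concentrations. It is an illustration under stated assumptions only: the class names, the `{1&{2&3}}` hierarchy syntax and the free-text concentration strings are placeholders chosen for this example and should not be read as the draft MInChI syntax. The example mixture is likewise purely illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Component:
    inchi: str                           # standard InChI of a known component
    concentration: Optional[str] = None  # free text, e.g. "70 %v/v" (placeholder format)


@dataclass
class Mixture:
    parts: List[Union[Component, "Mixture"]] = field(default_factory=list)


def layers(mixture: Mixture) -> Tuple[List[str], str, List[Optional[str]]]:
    """Split a nested mixture into three parallel layers:
    component InChIs, an index-based hierarchy string, and concentrations."""
    components: List[str] = []
    concentrations: List[Optional[str]] = []

    def walk(node: Union[Component, Mixture]) -> str:
        if isinstance(node, Component):
            components.append(node.inchi)
            concentrations.append(node.concentration)
            return str(len(components))  # 1-based index into the component layer
        # Nested mixtures become nested braces (placeholder syntax).
        return "{" + "&".join(walk(part) for part in node.parts) + "}"

    hierarchy = walk(mixture)
    return components, hierarchy, concentrations


# Illustrative mixture: ethanol combined with a nested water/NaCl solution.
example = Mixture([
    Component("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "70 %v/v"),
    Mixture([
        Component("InChI=1S/H2O/h1H2"),
        Component("InChI=1S/ClH.Na/h1H;/q;+1/p-1", "0.9 %w/v"),
    ]),
])

components, hierarchy, concentrations = layers(example)

if __name__ == "__main__":
    print(components)
    print(hierarchy)
    print(concentrations)
```

Keeping the component layer as plain InChI strings means existing compound-level lookups continue to work, while the hierarchy and concentration layers index into it by position.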

February 2020 update – The MInChI specification drafted by the project has been implemented as a proof of concept by Collaborative Drug Discovery (CDD), as described by Clark et al.[1] Several follow-up presentations at InChI workshops and ACS meetings, including a recent webinar, have highlighted the outcomes of this work. CDD has recently received a grant to continue development of this application of MInChI for drug discovery, including potential incorporation of outputs from other InChI working groups exploring organometallics, polymers and other types of molecules.

On a technical level, the next step for MInChI is to incorporate the specification into the RInChI codebase for a combined executable. The two notations take similar approaches to developing their layers and share many of the same or similar points of information. Planning is underway to fold this work into the next RInChI code release, for which development is scheduled to start in mid-2020. The CDD prototype codebase for MInChI is open source and can inform the implementation of property information such as concentration.

While the technical implementations of RInChI and MInChI share many commonalities, the use cases for these notations likely span divergent chemistries and communities. The MInChI project will continue to explore approaches to notating more complex or specialized forms, such as formulations, buffers and hydrates. Establishing and expanding the user group for MInChI across sectors and into areas such as materials, agriculture, consumables and others will also be a high priority in the coming year.

[1] Clark, A. M.; McEwen, L. R.; Gedeck, P.; Bunin, B. A. (2019). Capturing mixture composition: an open machine-readable format for representing mixed substances. J. Cheminform. 11, 33; https://doi.org/10.1186/s13321-019-0357-4

The MInChI technical sub-team (Alex Clark, Leah McEwen, Gerd Blanke) has put together a small web-based demo that generates MInChI strings. The app takes either sketch input or InChIs for components, with a form for entering concentration information and any labels for viewing in the app. There is no automated input or export; it is intended only as a quick demo to illustrate the data model. See http://molmatinf.com/minchidemo/

Watch the CDD webinar "Capturing Mixtures: Bringing Informatics to the World of Practical Chemistry" (recorded live 19 Dec 2019) to hear about the task group’s work toward new data structures for capturing chemical mixtures in a machine-readable format, as well as the potential impact this will have on all industries that intersect with chemistry.

Page last updated 15 Feb 2020