Project Details IUPAC SMILES+ Specification

Project No.:
2019-002-2-024
Start Date:
12 April 2019
End Date:
Division Name:
Committee on Publications and Cheminformatics Data Standards
Division No.:
024

Objective

The two most popular machine-readable chemical line notations are the IUPAC InChI (International Chemical Identifier) and SMILES (Simplified Molecular-Input Line-Entry System). InChI is a standardized IUPAC chemical descriptor notation that has become essential for the chemistry community. SMILES is a chemical representation file format that plays a distinct and complementary role alongside InChI in the automated retrieval of structural information. However, unlike InChI, whose notation is standardized and well documented, the SMILES file format has no current standard specification. As a result, there are multiple SMILES dialects, implementations, and extensions, with no up-to-date authoritative description of the format. This situation has proved to be a significant barrier to interoperability and to the accurate exchange of chemical structure information among researchers, chemical databases, cheminformatics toolkits, and software drawing tools.

This project seeks to establish a formalized recommended up-to-date specification of the SMILES format. The project will develop open reference documentation that articulates standard interpretation of SMILES. As the bulk of chemical data worldwide is organized using chemical structure information, standardizing the SMILES file format will promote accurate exchange of scientific information alongside InChI. The proposed name of the standard is IUPAC SMILES+, where the ‘+’ would allow future approved extensions to a core IUPAC SMILES specification.

Description

1. What is the proposed IUPAC SMILES+ Specification project?
SMILES is one of the most common digital file formats used to exchange molecular information. The original 1988 Daylight SMILES publication and the associated Daylight webpages are no longer updated, and they now contain ambiguities about how SMILES should be interpreted, because new use cases have developed and evolved since they were written. An earlier community-driven effort, OpenSMILES, described a SMILES specification more fully and was last updated in 2016. Unfortunately, the OpenSMILES specification has not been widely adopted or documented within cheminformatics toolkits. In our preliminary analysis of toolkit SMILES documentation, only 3 out of 13 toolkits specifically mentioned support for the OpenSMILES specification. As such, it is unclear which SMILES specification each toolkit follows. This lack of clarity about SMILES interpretation leads to difficulties in exchanging and storing chemical information, most notably data corruption and data loss.
This project seeks to establish a formalized recommended up-to-date specification of the SMILES format. The project will develop open reference documentation that articulates standard interpretation of SMILES. As the bulk of chemical data worldwide is organized using chemical structure information, standardizing the SMILES file format will promote accurate exchange of scientific information alongside InChI. The proposed name of the standard is IUPAC SMILES+, where the ‘+’ would allow future approved extensions to a core IUPAC SMILES specification (e.g., SMARTS).

2. How would the IUPAC SMILES+ specification relate to past SMILES specification efforts like OpenSMILES?
The IUPAC SMILES+ specification is a natural continuation of the great community efforts within the OpenSMILES specification and, we believe, the best way forward to increase the interoperability of SMILES across toolkits and software.

3. Why should IUPAC support a SMILES specification since the community already has InChI?
The SMILES linear notation plays a complementary role alongside InChI in the exchange of structural information. Because InChI is by nature a descriptor, normalization of structures is inherent to it. In contrast, generating SMILES does not require normalization, and in some applications this is preferred for accurate information analysis and exchange. Moreover, SMILES can support variability and is human-interpretable. SMILES is also part of a larger ecosystem that includes SMARTS (substructure searching, molecular patterns) and SMIRKS (reaction transforms). There are other differences, but ultimately the community needs both InChI and SMILES. As such, there are several specific reasons why IUPAC should be interested in maintaining a current SMILES specification:
• The continued ubiquitous use of SMILES without up-to-date documentation is limiting the accurate global exchange of chemical information. Accurate chemical representations are core to chemical data exchange.
• In order to enhance the use of InChI, we need to prevent corruption of the complementary SMILES notation.
• An authoritative steward like IUPAC that has committed to maintain a current SMILES specification will encourage toolkit and software providers to adopt the standard. We hypothesize that if toolkit providers have access to an up-to-date SMILES documentation, examples for testing their toolkits, and a forum to contribute developments, feedback, and enhancements, they may be willing to support IUPAC SMILES+.
• InChI and SMILES communities can work together to solve complementary goals.

4. How will we complete the IUPAC SMILES+ specification project?
Phase 1 – Establish dedicated communication channels with key stakeholders. We want to engage appropriate stakeholders from the very beginning of our efforts (e.g., InChI community, cheminformatics toolkit developers, researchers).

Phase 2 – Collect and organize current SMILES documentation and use cases. Create a copy of the current OpenSMILES specification (e.g., a GitHub fork) to start working from. The OpenSMILES specification is released under a license that permits modification and redistribution. The OpenSMILES community has already invested a significant amount of effort and thought into producing a comprehensive yet readable document. This document will serve as the starting point for the IUPAC SMILES+ specification.

Phase 3 – Identify SMILES edge cases that toolkits interpret differently, and use this data to identify ambiguities within SMILES. NextMove Software has created a SMILES reading benchmark, which will greatly help with identifying edge cases (smilesreading GitHub and Presentation). We have also started evaluating this data to help inform the creation of the IUPAC SMILES+ specification. The analyzed data is available on the Open Science Framework: SMILES Reading Benchmark Data Analysis. Next, compile a subset of SMILES strings, including numerous edge cases, that is useful for developing a test suite for incorporation into the IUPAC SMILES+ specification. An initial test suite can be as simple as a carefully curated list of SMILES along with an image depiction of the preferred structure.
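As a rough illustration of the comparison step in Phase 3, the sketch below flags SMILES strings for which toolkits report different readings. It is only a minimal sketch under stated assumptions: the function name find_disagreements, the toolkit names, and the example data are all hypothetical placeholders invented for illustration. They are not taken from the NextMove benchmark or from any real toolkit, and the project may compare interpretations in a different way.

```python
from collections import defaultdict


def find_disagreements(readings):
    """Return the SMILES strings whose toolkit readings are not all identical.

    `readings` maps a SMILES string to a dict of {toolkit_name: reading}.
    A reading is whatever summary is being compared (here a molecular
    formula string), or None if the toolkit rejected the input.
    """
    disagreements = {}
    for smiles, by_toolkit in readings.items():
        # Group toolkit names by the reading they produced.
        groups = defaultdict(list)
        for toolkit, reading in by_toolkit.items():
            groups[reading].append(toolkit)
        # More than one distinct reading means the toolkits disagree.
        if len(groups) > 1:
            disagreements[smiles] = {r: sorted(t) for r, t in groups.items()}
    return disagreements


# Invented placeholder data (hypothetical toolkits, NOT real benchmark results).
example_readings = {
    "CCO": {"toolkit_a": "C2H6O", "toolkit_b": "C2H6O", "toolkit_c": "C2H6O"},
    "c1ccc1": {"toolkit_a": "C4H4", "toolkit_b": None, "toolkit_c": "C4H4"},
}

edge_cases = find_disagreements(example_readings)
for smiles, groups in edge_cases.items():
    print(smiles, groups)
```

Strings flagged this way would be candidates for the curated test suite, where each entry could be paired with a depiction of the preferred structure once the community has agreed on the interpretation.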

Phase 4 – Write version 1 of the IUPAC SMILES+ specification and outline ongoing curation requirements. We will use an open robust versioning system (e.g., markdown with Git, Open Science Framework) where we can engage the community and seek comments for review and approval. This will allow the entire community to then be a part of the process and develop IUPAC SMILES+, particularly where ambiguities may require judicious choice on the preferred interpretation. Our focus will first be on the core specification of SMILES, not on any extensions.

Phase 5 – Explore licensing options for the IUPAC SMILES+ specification with the IUPAC Committee on Publications and Cheminformatics Data Standards (note: the original OpenSMILES documentation is licensed under the GNU Free Documentation License 1.2). Then, collaborate with software developers to incorporate support for IUPAC SMILES+ into their software. We will target both open source and proprietary cheminformatics toolkit offerings such as RDKit, CDK, Open Babel, ChemAxon JChem, CACTVS, and OEChem. Discussions will include how best to implement the specification (e.g., as a specific option or a separate conversion tool). Further, we will work with database providers to update their SMILES based on the IUPAC specification. We will also engage stakeholders such as the GO FAIR Chemistry Implementation Network in these efforts.

Phase 6 – Develop a robust ongoing community procedure and maintenance recommendation plan for updating the IUPAC SMILES+ specification with extensions. Finishing the initial version of the specification is an important step, but it is just the beginning of our work. Our goal is to have a specification that evolves with the needs of the chemistry community, and allows innovation, while maintaining the core formal specification. As we proceed with version 1 of the IUPAC SMILES+ specification, we will also be collecting pointers to (and documentation about) SMILES extensions as well as the use cases that motivated them. Moreover, we will explore various long-term sustainability business models to support ongoing development and stewardship of the IUPAC SMILES+ specification.

Progress

Page last updated 13 April 2019