Project Details

Prototype analysis of glossary terms to establish biological context by text data mining

Project No.:
2005-049-1-700
Start Date:
01 March 2006
End Date:
31 December 2013
Division Name:
Chemistry and Human Health Division
Division No.:
700

Objective

To extend the usefulness and applicability of the glossaries, it would be worthwhile to explore methods for identifying the various contexts in which the terms appear in the scientific literature.

Description

A prototype project using a text data mining tool, LexiMine, from LexiQuest, an SPSS company, will evaluate the ability to automatically, objectively and exhaustively analyze downloaded journal articles in terms of their syntactical construction. This analysis will generate a concept map of all concepts within the analyzed articles. The map will be compared with the list of terms from the glossaries to establish their presence within the literature and their interactions and relationships, both among themselves and with other concepts, and to show the link to the original citation in the text. In this manner it will be possible to identify and evaluate how the contexts in which the glossary terms appear extend their definitions. The results can be used either to develop a parallel and complementary glossary that may be published directly or as a web-enabled product, or to augment the existing glossaries and compendium. The activity proposed for this prototype study will involve the selection of one of the three problems listed below, access to and use of any related glossaries, and analysis, as described above, of two ACS journals, namely Biochemistry and Journal of Medicinal Chemistry, for the years 1998-2003, inclusive.
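The comparison of glossary terms against article text can be illustrated with a minimal sketch. It is not LexiMine and does not reproduce its syntactic analysis; it assumes plain-text articles, a naive sentence splitter and simple case-insensitive matching. The function name find_term_contexts, the sample terms and the citation labels are all hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def find_term_contexts(glossary_terms, articles):
    """Locate glossary terms in article text, sentence by sentence.

    glossary_terms: iterable of term strings (hypothetical sample input).
    articles: dict mapping a citation label to the article's plain text.
    Returns (occurrences, cooccurrence):
      occurrences maps term -> list of (citation, sentence) pairs,
      cooccurrence maps a sorted term pair -> number of shared sentences.
    """
    # Whole-word, case-insensitive pattern for each glossary term.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in glossary_terms
    }
    occurrences = defaultdict(list)
    cooccurrence = defaultdict(int)
    for citation, text in articles.items():
        # Naive sentence split: ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            found = sorted(t for t, p in patterns.items() if p.search(sentence))
            for term in found:
                # Keep the link back to the source citation.
                occurrences[term].append((citation, sentence))
            for pair in combinations(found, 2):
                cooccurrence[pair] += 1
    return dict(occurrences), dict(cooccurrence)


# Hypothetical sample data, for illustration only.
glossary = ["agonist", "receptor", "prodrug"]
articles = {
    "Example A (hypothetical)": (
        "The agonist binds the receptor. "
        "A prodrug is inactive until metabolized."
    ),
    "Example B (hypothetical)": "Receptor density was measured.",
}
occurrences, cooccurrence = find_term_contexts(glossary, articles)
```

Each occurrence keeps the sentence and the citation it came from, which corresponds to the link to the original text mentioned above. Terms that share a sentence are counted as a crude stand-in for the interactions a concept map would capture.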

Progress

Project completed.

The project's initial findings are summarized in a 2011 publication of the IEEE (Institute of Electrical and Electronics Engineers, Division of Bioinformatics and Biocomputing) titled “Hypothesis Generation and Evaluation in Clinical Trial Design”. The abstract and full text are available on ieee.org, https://dx.doi.org/10.1109/BIBM.2011.62. The project was a collaboration with the National Research Council of Italy (Pisa).

The project evolved from an initial concept: to identify how text data mining for standard terms, which IUPAC Divisions commonly compile as glossaries, should take into account the interplay between rapid technological change in scientific domains and the ongoing use and evolution of such definitions. The study concentrated on Bioinformatics in Drug Development, as this was a relatively new area that was undergoing such changes and was also of interest to Division VII, Chemistry and Human Health, and its sub-committee in Medicinal Chemistry and Drug Development.

The IUPAC funding served as seed money to start this activity, at first in collaboration with a commercial group, BioBase, whose business and products involve extensive literature evaluation and text mining.

As the project evolved, it became apparent that, given the significant ongoing changes in the field, a glossary-based approach would not be able to adapt in concert with the evolution of terms and their use. An alternative, ontology-based methodology was adopted. It focused on developing a data model for the diverse knowledge, based not on specific terms but on concepts and relationships. This provided the flexibility to adapt and extend rapidly, so that the model could “learn” as it is used and thus stay current with actual practice in a way that a glossary-based approach cannot. This prototype has been implemented in a computational platform that is being successfully applied to a range of problems in pharmaceutical research and development and is generalizable to other domains as well (https://www.ipqanalytics.com).
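The difference between a term list and a model built on concepts and relationships can be sketched in a few lines. This is an illustrative toy, not the project's data model or the platform's implementation. The class ConceptGraph, the concept identifiers, the relation name and the sample terms are all hypothetical.

```python
from collections import defaultdict


class ConceptGraph:
    """Toy concept-and-relationship store; every name here is hypothetical."""

    def __init__(self):
        # concept id -> surface terms observed for that concept
        self.labels = defaultdict(set)
        # (subject id, relation, object id) triples
        self.relations = set()

    def add_label(self, concept, term):
        # A new term for an existing concept extends the model
        # without redefining the concept itself.
        self.labels[concept].add(term.lower())

    def relate(self, subject, relation, obj):
        self.relations.add((subject, relation, obj))

    def concepts_for(self, term):
        # Resolve a surface term to every concept it is currently used for.
        return sorted(
            c for c, terms in self.labels.items() if term.lower() in terms
        )

    def neighbors(self, concept):
        # Relationships in which the concept is the subject.
        return sorted((r, o) for s, r, o in self.relations if s == concept)


# Hypothetical sample data, for illustration only.
graph = ConceptGraph()
graph.add_label("C1", "biomarker")
graph.add_label("C1", "biological marker")  # later usage, same concept
graph.add_label("C2", "clinical endpoint")
graph.relate("C1", "may_predict", "C2")
```

Because terms attach to concepts rather than define them, a newly observed usage becomes one more label or relationship instead of a revision to a fixed definition.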

The results of this project suggest that an ontology-based approach should be considered in IUPAC projects that propose to develop glossaries, as it can bridge the gap between formal definitions, actual use and ongoing evolution in a systematic manner.

Short report published in Chem. Int. Nov-Dec 2014, p. 22; DOI: 10.1515/ci-2014-0619

last updated 18 Nov 2014