Research Article: Getting Started in Text Mining

Date Published: January 25, 2008

Publisher: Public Library of Science

Author(s): K. Bretonnel Cohen, Lawrence Hunter, Olga Troyanskaya

Abstract: None

Partial Text: Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature. There are at least as many motivations for doing text mining work as there are types of bioscientists. Model organism database curators have been heavy participants in the development of the field because they need to process large numbers of publications in order to populate the many data fields for every gene in their species of interest. Bench scientists have built biomedical text mining applications to aid in the development of tools for interpreting the output of high-throughput assays and to improve searches of sequence databases (see [1] for a review). Bioscientists of every stripe have built applications to deal with two problems: the double-exponential growth in the scientific literature over the past few years, and the particular difficulties of searching PubMed/MEDLINE for genomics-related publications. A surprising phenomenon can be noted in the recent history of biomedical text mining: although several systems have been built and deployed in the past few years (for example, Chilibot, Textpresso, and PreBIND; see Text S1 for these and most other citations), the ones that are seeing high usage rates and are making productive contributions to the working lives of bioscientists have been built not by text mining specialists, but by bioscientists. We speculate on why this might be so below.

Text mining systems can easily be as complex as any applications built in computational biology (Figure 2 at [10] shows the levels of analysis that might be built into a representative system), and good software engineering practices can be crucial in building them successfully. An important first step is to define the desired behavior of the system. For example, consider a system that aims to extract gene/disease relations from text. Is the output meant for human consumption, or is it to be the input to some later automatic processing step? Is the input meant to be fields from a database (e.g., GeneRIFs from Entrez Gene or SUMMARY fields from Swiss-Prot), abstracts, or full-text journal articles? Each presents its own challenges and opportunities. Is the intended output lists of genes and diseases? If so, should the system make it possible to click through to the full texts from which a given gene/disease pair was extracted? Is it enough to simply output the text strings that were found in the text, or must the output be in the form of database identifiers (e.g., Entrez Gene IDs and OMIM IDs for our gene/disease example) if it is to be truly useful? Specifying these requirements early may make it possible to avoid any number of false paths in the development process.
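To make these requirement questions concrete, the following is a minimal sketch of what a gene/disease extractor's output record might look like once the decisions are written down. It is not the method of any system cited here. It assumes the simplest possible approach: dictionary lookup plus sentence-level co-occurrence. The lexicons, the identifier strings (deliberate placeholders, not real Entrez Gene or OMIM IDs), and the names GeneDiseasePair, find_mentions, and extract_pairs are all hypothetical and introduced only for illustration.

```python
import re
from dataclasses import dataclass

# Toy lexicons mapping a surface string to a database identifier.
# ASSUMPTION: the identifiers are placeholders, not real Entrez Gene or OMIM IDs.
GENE_LEXICON = {
    "brca1": "GENE_ID_PLACEHOLDER_1",
    "tp53": "GENE_ID_PLACEHOLDER_2",
}
DISEASE_LEXICON = {
    "breast cancer": "DISEASE_ID_PLACEHOLDER_1",
}


@dataclass(frozen=True)
class GeneDiseasePair:
    """One output record: raw strings, normalized IDs, and the evidence."""
    gene_text: str      # string as it appeared in the text
    gene_id: str        # normalized database identifier
    disease_text: str
    disease_id: str
    sentence: str       # evidence sentence, so a reader can trace the pair back


def find_mentions(sentence, lexicon):
    """Return (matched_text, identifier) for each lexicon term in the sentence."""
    found = []
    for term, identifier in lexicon.items():
        match = re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE)
        if match:
            found.append((match.group(0), identifier))
    return found


def extract_pairs(text):
    """Pair every gene with every disease mentioned in the same sentence."""
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = []
    for sentence in sentences:
        genes = find_mentions(sentence, GENE_LEXICON)
        diseases = find_mentions(sentence, DISEASE_LEXICON)
        for gene_text, gene_id in genes:
            for disease_text, disease_id in diseases:
                pairs.append(
                    GeneDiseasePair(gene_text, gene_id, disease_text, disease_id, sentence)
                )
    return pairs


if __name__ == "__main__":
    sample = ("Mutations in BRCA1 are associated with breast cancer. "
              "TP53 is a tumor suppressor.")
    for pair in extract_pairs(sample):
        print(pair)
```

Each field in the record corresponds to one of the questions above. Keeping both the raw string and an identifier serves human readers as well as downstream programs, and storing the evidence sentence is one way to support clicking through to the source. Co-occurrence in a sentence is only a stand-in for relation extraction; a real system would need something stronger.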

In the introduction, we pointed out that all or most of the demonstrably useful biomedical text mining systems have been built not by text mining specialists, but by computational biologists. Why might this be? Although this has not been systematically investigated, we speculate that it is related to cultural differences between the two groups. Text mining specialists tend to build systems that are likely to get them published in computational linguistics conferences. Such systems are not domain-dependent, are usable for a wide variety of tasks, and, if fashionable, rely more on statistical approaches than on knowledge sources. In contrast, computational biologists do not hesitate to build systems that are extremely domain-specific, that do not attempt more than a single highly relevant task (e.g., the RLIMS-P system [7], which targets assertions about phosphorylation and nothing else), and that are not dogmatic about avoiding knowledge-based approaches. Ultimately, biologists seem to be better at one of the crucial first steps identified above: defining the goals of the system, and not hesitating to define those goals based on utility, rather than on presumed publishability in the computational linguistics literature. The key to exploiting this ability for the purpose of building a better text mining system is for the computational biologist to pay particular attention to that initial step. None of this is meant to claim that there is no role for computational linguists in biomedical text mining, but rather that at this time there seem to be clear roles for each. Text mining specialists continue to excel at building system components and designing datasets for evaluation; computational biologists currently appear to be much better at producing useful task definitions. Perhaps the most fruitful approaches are characterized by combined efforts that leverage the abilities of each type of scientist.

Text S1 provides coverage of additional technical issues in system design and construction, and includes a number of helpful references. Additionally, text mining and natural language processing have a long history outside of the bioscience world, and have produced a sizable literature that is well worth the computational biologist’s attention. [8] is an excellent starting point. [9] is the standard reference work, and is a good second step. For bioscience-specific text mining, there are a number of review papers and three useful book-length treatments. [4] takes a task-based approach to text mining, and lists a number of additional tools for most of the tasks mentioned in this short tutorial. [10] describes the state of the art, and lists a number of computational-biologist–built applications that provide good examples of high-utility systems. [11] is a collection of chapters on a wide variety of aspects of biomedical text mining. [12] focusses on document retrieval, but also contains stimulating coverage of a number of related topics in text mining. Finally, [13] provides an in-depth treatment of statistical approaches to biomedical text mining.