Research Article: Tough Mining

Date Published: November 17, 2003

Publisher: Public Library of Science

Author(s): Steven Dickman

Abstract: With more than 400,000 new research articles listed each year in PubMed alone, more sophisticated tools are being developed to extract relevant information.

Partial Text: The standard “front end” for biomedical literature search is MEDLINE and its Entrez query system. Huge, well-managed, and nearly exhaustive, MEDLINE and its 11 million references provide incredible ease and facility for anyone who can type a Boolean query. Though not quite a parallel for Google—which runs a kind of popularity contest for Web links in real time—the Entrez search has opened up the literature to anyone with a Web browser. To those who grew up chasing citations and papers through the aisles of a scientific library, Entrez is a dream come true.

Language-processing software tools have been successfully applied to text-mining of nonscientific sources, especially newswire content. Computer programs can already perform all three levels of text-mining (Figure 1) effectively: retrieving documents relevant to a given subject; extracting lists of entities or relationships among entities; and answering questions about the material, delivering specific facts in response to natural-language queries.

The techniques applied by Clearforest and others fall into two broad categories, statistical and heuristic. Statistical techniques are the next step up from keyword searches. They count how often terms, such as the names of genes or gene products, appear close to one another, but apply no linguistic insights, such as whether an adjective modifies a noun. By contrast, heuristic approaches use hand-crafted rules designed for specific datasets: e.g., January, February, March, etc., are months; the word following “Mr.” is a name; and so forth. This approach is labor-intensive but especially useful when there is only a limited amount of data, as is the case with single scientific papers or small groups of papers.
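To make the distinction concrete, here is a minimal sketch of both styles in Python. It is an illustration only, not the method used by Clearforest or any other vendor: the term list, the window size, and the function names `cooccurrence_counts` and `tag_entities` are all assumptions invented for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical term list; a real system would draw on a curated gene lexicon.
TERMS = {"lin-12", "glp-1", "lag-2"}

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\b")
# Rule from the text: the capitalized word following "Mr." is a name.
MR_RE = re.compile(r"\bMr\.\s+([A-Z][a-z]+)")


def cooccurrence_counts(text, terms=TERMS, window=10):
    """Statistical style: count pairs of known terms that appear within
    `window` tokens of each other. No grammar is consulted."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text.lower())
    hits = [(i, tok) for i, tok in enumerate(tokens) if tok in terms]
    counts = Counter()
    for (i, a), (j, b) in combinations(hits, 2):
        if a != b and j - i <= window:
            counts[tuple(sorted((a, b)))] += 1
    return counts


def tag_entities(text):
    """Heuristic style: apply two hand-written rules and return
    (string, label) pairs in order of appearance."""
    found = [(m.start(1), m.group(1), "MONTH") for m in MONTH_RE.finditer(text)]
    found += [(m.start(1), m.group(1), "NAME") for m in MR_RE.finditer(text)]
    return [(s, label) for _, s, label in sorted(found)]


if __name__ == "__main__":
    sample = "lin-12 interacts with lag-2. glp-1 also binds lag-2."
    print(cooccurrence_counts(sample, window=3))
    print(tag_entities("Mr. Smith arrived in March."))
```

The counting function knows nothing about sentence structure, so it cannot tell "A activates B" from "A does not activate B". The rule-based function is precise only as far as its rules go: it would, for instance, mislabel the verb in "May we begin?" as a month, which is why such rules tend to need hand-tuning for each dataset.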

Although the march toward better text-mining systems is building momentum, there are two issues that could stop it in its tracks. The first is access.

The second threat to text-searching programs ever becoming widely useful has more of the ring of linguistics jargon. The so-called “ontology problem” threatens successful searching based on the very specific nature of biological terminology.

Staying within one’s narrow domain, then, could be a recipe for success, as long as the vocabulary and user questions remain tightly constrained, especially if there is a way to tiptoe around the access problem. That is apparently the case at Wormbase, though the newly available tool there, called Textpresso, is still being built. The motivation for Textpresso was simple, says Hans-Michael Mueller, a postdoctoral fellow in the lab of Paul Sternberg at Caltech in Pasadena, California, where Wormbase—the genetic database for the nematode worm Caenorhabditis elegans—is curated. “We want the user to be able to avoid going to the library to read all those papers [on genes and proteins] that your favorite gene interacts with. That is very tedious.” The other goal is equally recognizable in the biology community: no mere mortal can hope to keep up with the burgeoning literature, even in the relatively narrow field of worm biology.


