Research Article: Literature Retrieval and Mining in Bioinformatics: State of the Art and Challenges

Date Published: June 21, 2012

Publisher: Hindawi Publishing Corporation

Author(s): Andrea Manconi, Eloisa Vargiu, Giuliano Armano, Luciano Milanesi.


The way people communicate, acquire, and store information has changed profoundly. Hundreds of millions of people perform information retrieval tasks every day, in particular when using a Web search engine or searching their e-mail, which has made information retrieval the dominant form of information access, overtaking traditional database-style searching. Handling this huge amount of information has become a challenging issue. In this paper, after recalling the main topics concerning information retrieval, we survey the main works on literature retrieval and mining in bioinformatics. While claiming that information retrieval approaches are useful in bioinformatics tasks, we discuss some challenges aimed at showing the effectiveness of these approaches applied therein.

Partial Text

Nowadays, most scientific publications are electronically available on the Web, which makes retrieving and mining documents and data a challenging task. To this end, automated document management systems have gained a central role in the field of intelligent information access [1]. Research and development in the area of bioinformatics literature retrieval and mining is thus aimed at providing intelligent and personalized services to biologists and bioinformaticians who search for useful information in scientific publications. In particular, the main goal of bioinformatics text analysis is to provide access to unstructured knowledge by improving searches, providing automatically generated summaries, linking publications with structured resources, visualizing contents for better understanding, and guiding researchers in formulating novel hypotheses and discovering knowledge.

Supporting users in handling the huge and widely distributed amount of Web information is becoming a primary issue. Information retrieval is the task of representing, storing, organizing, and accessing information items. Information retrieval has changed considerably in recent years: initially with the expansion of the World Wide Web and the advent of modern and inexpensive graphical user interfaces and mass storage [2], and then with the advent of modern Internet technologies [3] and of Web 2.0 [4].
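To make the four activities in this definition concrete, the following is a minimal sketch and not a method described in this article. It assumes a toy setting in which each information item is a short text: items are represented as term counts, stored and organized in an inverted index, and accessed through a TF-IDF-style ranked search. The class name `TinyIndex`, the helper `tokenize`, the particular IDF formula, and the sample documents are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


class TinyIndex:
    """Hypothetical toy index: term counts per document plus an inverted index."""

    def __init__(self):
        self.docs = {}                    # representing/storing: doc_id -> term counts
        self.postings = defaultdict(set)  # organizing: term -> ids of docs containing it

    def add(self, doc_id, text):
        counts = Counter(tokenize(text))
        self.docs[doc_id] = counts
        for term in counts:
            self.postings[term].add(doc_id)

    def search(self, query, top_k=3):
        """Accessing: rank documents by a simple TF-IDF score for the query terms."""
        n_docs = len(self.docs)
        scores = defaultdict(float)
        for term in tokenize(query):
            matching = self.postings.get(term, ())
            if not matching:
                continue  # term occurs in no document
            # Smoothed inverse document frequency (one of several common variants).
            idf = math.log(1 + n_docs / len(matching))
            for doc_id in matching:
                scores[doc_id] += self.docs[doc_id][term] * idf
        # Highest score first; ties broken by document id for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _ in ranked[:top_k]]


# Made-up example titles, used only to exercise the sketch.
index = TinyIndex()
index.add("d1", "BRCA1 mutations and breast cancer risk")
index.add("d2", "Protein folding in yeast")
index.add("d3", "Breast cancer gene expression profiling")
print(index.search("breast cancer"))
```

Real literature search systems add much more on top of this (stemming, stop-word handling, field weighting, controlled vocabularies), but the split between an index built once and queries answered against it is the same.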

A great deal of biological information accumulated over the years is currently available in online text repositories such as Medline. These resources are essential for biomedical researchers in their everyday activities, helping them plan and perform experiments and verify the results.

As already pointed out, the steady work of researchers has led to a huge increase in life science publications. This volume of scientific literature demands extra work from researchers, who typically try to keep up to date with all information related to their research topics of interest. This effort mainly depends on two factors: the continuous growth of scientific production and the limited communication among life science disciplines [72]. In this scenario, devising suitable strategies, techniques, and tools to support researchers in automatically retrieving relevant information from the Web (in particular, from text documents) has become an issue of paramount importance.

Research and development in the analysis of bioinformatics literature aims to provide bioinformaticians with effective means to access and exploit the knowledge contained in scientific publications. Although the majority of scientific publications are nowadays electronically available, keeping up to date with recent findings remains a tedious task, hampered by the difficulty of accessing the relevant literature. Bioinformatics text analysis aims to improve access to unstructured knowledge by easing searches, providing auto-generated summaries, linking publications with structured resources, visualizing content for better understanding, and supporting researchers in formulating novel hypotheses and discovering knowledge. Research over recent years has improved fundamental methods in bioinformatics text mining, ranging from document retrieval to the extraction of relationships. Consequently, more and more integrative literature analysis tools have been put forward, targeting a broad audience of life scientists. In this paper, after briefly introducing information retrieval, text mining, and literature retrieval and mining, we first recalled the state of the art on literature retrieval and mining in bioinformatics. In the second part of the paper, we discussed some challenges deemed worthy of further investigation, with the goal of improving bioinformatics literature retrieval and mining tools and systems. In summary, the scientific community is strongly involved in addressing different problems in literature retrieval and mining, and several solutions have been proposed and adopted. Nevertheless, they will remain largely ineffective until the scientific community makes further significant steps towards common standards for the way existing knowledge is published and shared among researchers, with particular emphasis on the structure of scientific publications.