Date Published: July 5, 2017
Publisher: Public Library of Science
Author(s): Seung Han Baek, Dahee Lee, Minjoo Kim, Jong Ho Lee, Min Song, Neil R. Smalheiser.
Most earlier studies in the field of literature-based discovery have adopted Swanson’s ABC model, which links pieces of knowledge contained in disjoint literatures. However, their practicability remains an open issue, since most of them did not deal with the context surrounding the discovered associations and were usually not accompanied by clinical confirmation. In this study, we propose a method that expands and elaborates an existing hypothesis using advanced text mining techniques for capturing context. We extend the ABC model to allow for multiple B terms of various biological types.
Using the proposed method, we were able to concretize a specific, metabolite-related hypothesis with abundant contextual information. Starting from the relationship between lactosylceramide and arterial stiffness, the hypothesis was extended to suggest a potential pathway consisting of lactosylceramide, nitric oxide, malondialdehyde, and arterial stiffness. An experiment conducted by domain experts showed that the extended hypothesis is clinically valid.
The proposed method is designed to provide plausible candidates for the concretized hypothesis, based on extracted heterogeneous entities and detailed relation information, along with a reliable ranking criterion. Unlike previous studies, statistical tests conducted in collaboration with biomedical experts support the validity and practical usefulness of the method. Applied to other cases, the method would help biologists support an existing hypothesis and more easily explain the logical process within it.
Medical informatics has become a fast-growing field with the help of a vast amount of biomedical data. Researchers in medical informatics have strived to make sense of a huge number of academic publications and of unstructured data, including clinical notes and certain categories of test results such as echocardiograms and radiology reports. Text mining methods were developed for effective information extraction, knowledge discovery, and hypothesis generation from the literature [1–12]. In the late 1980s, Swanson’s pioneering studies established the foundation for literature-based discovery (LBD) [13,14]. Developments in text mining and hypothesis discovery systems stemming from the early work of Swanson coincided with the emergence of conceptual biology. According to Swanson’s LBD model, when it is known that an A term is related to a B term and the B term is associated with a C term in some way, the implicit relationship between A and C can be suggested as a new plausible hypothesis. With this model, Swanson discovered the relationship between Raynaud’s disease and fish oil, which was later validated through a clinical trial. Later, several studies utilized or extended Swanson’s model to design discovery systems with better performance or to generate new hypotheses [2–6].
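The A-B-C inference described above can be illustrated with a minimal Python sketch. It is not the method proposed in this study: the function name `abc_candidates` and the toy term pairs are hypothetical, the pairs are hand-written for illustration rather than extracted from any corpus, and a real system would add entity typing, relation context, and a ranking score.

```python
from collections import defaultdict


def abc_candidates(known_pairs, a_term):
    """Suggest C terms reachable from a_term through at least one B term.

    known_pairs: iterable of (term, term) associations assumed to be
    already reported in the literature (treated as undirected).
    Returns {C term: sorted list of bridging B terms}, excluding any C
    that is already directly associated with a_term.
    """
    neighbors = defaultdict(set)
    for x, y in known_pairs:
        neighbors[x].add(y)
        neighbors[y].add(x)

    direct = neighbors[a_term]          # the B terms
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbors[b]:
            # Keep only implicit links: C is neither A itself nor a known B.
            if c != a_term and c not in direct:
                candidates[c].add(b)
    return {c: sorted(bs) for c, bs in candidates.items()}


# Illustrative toy associations (hand-written assumption, not mined data).
toy_pairs = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's disease"),
    ("platelet aggregation", "Raynaud's disease"),
]

hypotheses = abc_candidates(toy_pairs, "fish oil")
for c_term, b_terms in hypotheses.items():
    print(f"fish oil -> {b_terms} -> {c_term}")
```

In this sketch, a C term supported by several B terms keeps all of them, which is one simple way to picture an A-C link that rests on more than one intermediate concept.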
The field of LBD has experienced substantial growth in recent years as databases, ontologies, and text mining tools are actively and competitively being developed. In particular, text mining techniques such as named entity recognition (NER), event extraction, and dependency parsing enable the mining of richer and more diverse information and knowledge from the academic literature, whose volume is too large to be handled and digested manually. Investigators have indeed attempted to find not only hidden relationships between different pairs of entity types, such as protein-disease or drug-disease [2,8,9], but also biomarkers and drug indications. For example, Vos et al. derived new plausible multimorbidity patterns of psychiatric and somatic diseases using automated concept recognition and profiling. However, they studied only pairwise associations of diseases, which differs from our study in the limited scope and number of biological entities within one hypothesis.
The automatic generation of plausible new hypotheses is a daunting challenge, particularly when multiple entities and relationships are interconnected at different levels. In addition, a confirmation step for the generated hypotheses ought to be considered to make such a difficult, complicated task of new hypothesis generation meaningful. To this end, we present a new method for hypothesis development and enrichment, which helps biologists extend their hypotheses or explain a logical process within validated hypotheses. The method is developed by integrating state-of-the-art text mining techniques with a unique ranking score, which differentiates it from existing similar systems. We demonstrate how the method can be applied to elaborate on a specific metabolite-related hypothesis. As a result, we found that the proposed method is reliable and practically applicable to the biomedical field, based on experiments involving domain experts.