Date Published: March 12, 2018
Publisher: Public Library of Science
Author(s): Jinbo Chen, Uwe Scholz, Ruonan Zhou, Matthias Lange, Timothée Poisot
Abstract: Full-text search is a widely applied query interface for accessing and filtering the content of life-science databases. But its high flexibility and intuitiveness are paid for with potentially imprecise and incomplete query results. To reduce this drawback, query assistance systems suggest those combinations of keywords with the highest potential to match most of the relevant data records. Widespread approaches are syntactic query corrections that avoid misspellings and support the expansion of words by suffixes and prefixes. Synonym expansion approaches apply thesauri, ontologies, and query logs. All of these need laborious curation and maintenance. Furthermore, access to query logs is generally restricted. Approaches that infer related queries from the query profile, such as research field, geographic location, co-authorship, or affiliation, require user registration and publicly accessible profile data, which conflicts with privacy concerns. To overcome these drawbacks, we implemented LAILAPS-QSM, a machine learning approach that reconstructs possible linguistic contexts of a given keyword query. The context is inferred from the text records stored in the databases that are to be queried or, for general-purpose query suggestion, extracted from PubMed abstracts and UniProt data. The supplied tool suite enables the pre-processing of these text records and the subsequent computation of customized distributed word vectors. The latter are used to suggest alternative keyword queries. The query suggestion quality was evaluated for plant science use cases. Locally present experts enabled a cost-efficient quality assessment in the categories trait, biological entity, taxonomy, affiliation, and metabolic function, which was performed using ontology term similarities. The mean information content similarity of LAILAPS-QSM suggestions for 15 representative queries is 0.70, and 34% have a score above 0.80. In comparison, the information content similarity for query suggestions made by human experts is 0.90. The software is available either as a tool set to build and train dedicated query suggestion services or as an already trained general-purpose RESTful web service. The service uses open interfaces so that it can be seamlessly embedded into database frontends. The Java implementation uses highly optimized data structures and streamlined code to provide fast and scalable responses to web service calls. The source code of LAILAPS-QSM is available under the GNU General Public License version 2 in a Bitbucket Git repository: https://bitbucket.org/ipk_bit_team/bioescorte-suggestion
Partial Text: Query interfaces are required to retrieve and explore database content. In a simplified view, they are brokers between the user’s information needs and the database content, which is accessible through declarative query languages like SQL or imperative application programming interfaces (APIs). Human-computer interfaces (HCIs) make use of these APIs to provide frontends for interacting with the databases.
The benefit of alternative queries is to increase the search sensitivity while keeping the precision high. Using LAILAPS-QSM, we go beyond syntactic alteration or query log mining and substantially increase the number of potentially relevant search results. For example, the query “salt stress” results in 4,677 PubMed abstracts. The semantically similar query “salinity stress”, suggested by LAILAPS-QSM, results in 1,363 abstracts. Both have 474 (8%) abstracts in common. Replacing a single query term therefore clearly increases sensitivity while staying close to the original semantic meaning.
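The arithmetic behind this example can be checked with a short sketch. It assumes that the reported 8% refers to the shared abstracts as a fraction of the combined (union) result set, which the text does not state explicitly; the function and variable names are illustrative only.

```python
def combined_hits(hits_a: int, hits_b: int, shared: int) -> int:
    """Size of the union of two result sets (inclusion-exclusion)."""
    return hits_a + hits_b - shared


# Hit counts reported in the text above.
salt_stress_hits = 4677
salinity_stress_hits = 1363
shared_hits = 474

# Abstracts reachable when both queries are issued.
union_hits = combined_hits(salt_stress_hits, salinity_stress_hits, shared_hits)

# Assumption: overlap is measured relative to the union of both result sets.
overlap_fraction = shared_hits / union_hits

# Abstracts found only by the suggested query, i.e. the sensitivity gain.
additional_hits = salinity_stress_hits - shared_hits
relative_gain = additional_hits / salt_stress_hits

print(f"union: {union_hits}")
print(f"overlap: {overlap_fraction:.1%}")
print(f"additional abstracts: {additional_hits} ({relative_gain:.0%} more)")
```

Under this reading, the suggested query contributes 889 abstracts that the original query misses, roughly 19% on top of the original result set.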
In this software paper, we presented the LAILAPS-QSM query suggestion RESTful web service, which has been implemented with a focus on flexibility, reusability, and efficiency. It suggests semantically related terms to improve the sensitivity of text queries. In comparison to query suggestion implementations based on thesauri or query logs, the applied machine learning approach suggests even indirectly existing nouns. This performs especially well for trait queries. A good example of its useful suggestions is the trait query “heading date”. Its query intention is to retrieve data about the average date by which a certain percentage of a crop has formed seed heads, an important trait in plant breeding. The suggestions include “tillering”, “panicle”, “ghd7”, and “hd3a”. The first two suggestions are relevant traits; the latter two are genes that play an important role in rice heading date.
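How distributed word vectors can yield such suggestions may be illustrated with a minimal sketch that ranks candidate terms by cosine similarity to the query vector. This is not the LAILAPS-QSM implementation, which is written in Java and trained on real text corpora; the three-dimensional vectors below are made-up toy values, and the `suggest` function is a hypothetical helper.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def suggest(query: str, vectors: dict[str, list[float]], top_k: int = 3) -> list[str]:
    """Return the top_k terms whose vectors are most similar to the query's."""
    query_vec = vectors[query]
    scored = [
        (cosine(query_vec, vec), term)
        for term, vec in vectors.items()
        if term != query
    ]
    # Highest similarity first.
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_k]]


# Toy vectors, invented for illustration only (not trained embeddings).
toy_vectors = {
    "heading date": [0.9, 0.8, 0.1],
    "tillering": [0.8, 0.7, 0.2],
    "panicle": [0.7, 0.9, 0.1],
    "hd3a": [0.9, 0.6, 0.3],
    "salt stress": [0.1, 0.2, 0.9],
}

suggestions = suggest("heading date", toy_vectors, top_k=3)
print(suggestions)
```

With trained vectors, terms that occur in similar linguistic contexts end up close together, which is why a trait query can surface related genes that no thesaurus lists as synonyms.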