Date Published: March 27, 2017
Publisher: Public Library of Science
Author(s): Marina M.-C. Vidovic, Marius Kloft, Klaus-Robert Müller, Nico Görnitz, Bin Liu.
High prediction accuracy is not the only objective to consider when solving problems with machine learning. Many scientific applications also require some explanation of the learned prediction function. In computational biology, positional oligomer importance matrices (POIMs) have been successfully applied to explain the decisions of support vector machines (SVMs) using weighted-degree (WD) kernels. To extract relevant biological motifs from POIMs, the motifPOIM method has been devised and has shown promising results on real-world data. Our contribution in this paper is twofold. First, as an extension to POIMs, we propose gPOIM, a general measure of feature importance for arbitrary learning machines and feature sets (including, but not limited to, SVMs and CNNs), and devise a sampling strategy for its efficient computation. Second, we derive a convex formulation of motifPOIMs that leads to more reliable motif extraction from gPOIMs. Empirical evaluations confirm the usefulness of our approach on artificially generated data as well as on real-world datasets.
Machine learning is emerging as a crucial technology in science and industry [1–4]. The optimal choice of a learning method depends on the quality and quantity of the data, on the intrinsic noise characteristics and complexity underlying the data, and on the choice of an appropriate representation embracing the prior knowledge available. Lately, rather sophisticated non-linear learning machines, such as kernel machines and deep neural networks, have become a gold standard in several application domains, including computational biology, image and speech recognition, and text mining. Unlike linear methods, these non-linear learning methods do not provide an explanation of the underlying prediction out of the box and are therefore generally considered black boxes [6, 7].
In this section, we discuss the feature explanation techniques on which the proposed method builds: positional oligomer importance matrices (POIMs) and motifPOIMs, which are specifically designed for DNA sequences, and their generalization, the feature importance measure (FIRM), which can be used for arbitrary feature sets. An overview of the discussed methods and their respective definitions can be found in Table 1.
The contribution of this section is twofold. First, we devise a feature importance measure, which we call gPOIM, based on POIMs and their generalization (FIRM), and show that there is a simple way of assessing feature importances that enables their extraction from arbitrary learning machines. Second, we devise a convex version of the previously proposed motifPOIM approach and discuss its properties. Both methods combined form the basis of our motif extraction approach (ML2Motif, cf. Fig 3). ML2Motif follows the same principles as SVM2Motif (cf. Fig 2).
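To make the idea of sampling-based importance extraction more concrete, the following is a minimal sketch and not the gPOIM definition itself, which is not reproduced in this section. It assumes the POIM-style quantity "expected score given that a k-mer occurs at a position, minus the overall expected score" and estimates it by Monte Carlo over uniformly sampled DNA sequences. The names `toy_score` and `sampled_importance` are hypothetical, and `toy_score` merely stands in for the scoring function of a trained learning machine.

```python
import numpy as np

ALPHABET = "ACGT"  # nucleotides are integer-encoded as 0..3 in this order


def toy_score(seqs, motif=np.array([0, 2, 3]), pos=5):
    """Hypothetical stand-in for a trained model's score function.

    Counts how many letters of `motif` match the window starting at `pos`.
    """
    window = seqs[:, pos:pos + len(motif)]
    return (window == motif).sum(axis=1).astype(float)


def sampled_importance(score_fn, length, k, n_samples, rng):
    """Monte Carlo estimate of E[s(X) | X[j:j+k] = y] - E[s(X)].

    Returns an array of shape (4**k, length - k + 1): one row per k-mer y
    (base-4 encoded), one column per start position j. Sequences are drawn
    uniformly at random, which is an assumption of this sketch.
    """
    seqs = rng.integers(0, 4, size=(n_samples, length))
    scores = score_fn(seqs)
    baseline = scores.mean()

    n_kmers = 4 ** k
    n_pos = length - k + 1
    powers = 4 ** np.arange(k - 1, -1, -1)  # base-4 place values

    importance = np.zeros((n_kmers, n_pos))
    for j in range(n_pos):
        kmer_ids = seqs[:, j:j + k] @ powers
        sums = np.bincount(kmer_ids, weights=scores, minlength=n_kmers)
        counts = np.bincount(kmer_ids, minlength=n_kmers)
        # k-mers never sampled at position j fall back to the baseline,
        # so their estimated importance is zero.
        cond_mean = np.divide(sums, counts,
                              out=np.full(n_kmers, baseline),
                              where=counts > 0)
        importance[:, j] = cond_mean - baseline
    return importance


rng = np.random.default_rng(0)
Q = sampled_importance(toy_score, length=12, k=1, n_samples=20000, rng=rng)
best_letter_at_5 = ALPHABET[int(np.argmax(Q[:, 5]))]
```

Because the estimate only requires calling `score_fn` on sampled inputs, the same loop works for any model that returns a real-valued score, which is the property the text attributes to gPOIM. The cost grows with the number of samples and with 4**k, so larger k would need more samples per k-mer than this toy setting uses.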
The empirical evaluation has three parts: First, we investigate and discuss the properties of our proposed methods gPOIM and the corresponding convex motif extraction method when compared to their predecessors on artificially generated data. In the second part, we apply ML2Motif (=gPOIM and convex motifPOIM) to find driving motifs in real-world human splice-site data where ground truth motifs are known. Here, we compare motif reconstruction accuracies against state-of-the-art competitors under various experimental settings. Finally, we perform an analysis of the publicly available enhancer dataset and try to find and verify the driving motifs in a real-world setting where no ground truth motifs are given.
Our experiments show very promising results, so a natural question arises: what are promising further applications, even beyond sequence analysis and computational biology, and what are the limitations of ML2Motif? The answer must be split into two parts, since ML2Motif itself consists of two distinct components, gPOIM and convex motifPOIM, which need to be discussed separately in this context.
In this work, we have contributed to opening the black box of non-linear learning machines. Our proposed approach, ML2Motif, consists of two techniques: gPOIMs and convex motifPOIMs. ML2Motif extends the DNA motif finding approach SVM2Motif to cope with arbitrary learning machines and feature representations.