Research Article: aPPRove: An HMM-Based Method for Accurate Prediction of RNA-Pentatricopeptide Repeat Protein Binding Events

Date Published: August 25, 2016

Publisher: Public Library of Science

Author(s): Thomas Harrison, Jaime Ruiz, Daniel B. Sloan, Asa Ben-Hur, Christina Boucher, Vasilis J Promponas.


Pentatricopeptide repeat-containing proteins (PPRs) bind to RNA transcripts originating from mitochondria and plastids. There are two classes of PPR proteins: the P class contains tandem P-type motif sequences, and the PLS class contains alternating P-, L-, and S-type sequences. In this paper, we describe a novel tool that predicts PPR-RNA interactions. Given a PPR protein and one or more RNA transcripts, our method, which we call aPPRove, determines where and how a PLS-class PPR protein will bind to RNA by using a combinatorial binding code for site specificity proposed by Barkan et al. Our results demonstrate that aPPRove successfully identifies how and where a PPR protein belonging to the PLS class can bind to RNA. For each binding event, it outputs the binding site, the amino-acid-nucleotide interaction, and its statistical significance. Furthermore, we show that our method can be used to predict binding events for PLS-class proteins using a known edit site and the statistical significance of aligning the PPR protein to that site. In particular, we use our method to make a conjecture regarding an interaction between CLB19 and the second intronic region of ycf3. The aPPRove web server can be found at

Partial Text

Post-transcriptional control of RNA, which includes splicing, polyadenylation, and RNA editing, can have a significant impact on the expression of a gene. One of the key factors that influences post-transcriptional control of RNA is the availability and ability of specific proteins to bind to RNA. In short, RNA-binding proteins are those that bind to single- or double-stranded RNA and participate in forming ribonucleoprotein complexes. These complexes, in turn, play a major role in post-transcriptional control of RNA [1, 2]. In this paper, we build a computational method for predicting where and how a family of RNA-binding proteins, the pentatricopeptide repeat (PPR) proteins, will bind to RNA. PPR proteins have generated significant interest and are known to be widespread in eukaryotes, in particular land plants. Approximately 450 different PPR-encoding genes have been found in Arabidopsis thaliana and rice (Oryza sativa). These proteins have vital interactions with RNA transcripts in mitochondria and plastids [3], have been demonstrated to be involved in RNA editing [4], and have been shown to silence genes that encode cytoplasmic male sterility (CMS) in flowering plants [5]. This latter role is of particular importance because male-sterile plants are used to generate hybrid seed, on which commercial agriculture relies heavily for higher yield, and it underscores the interest in this class of proteins. Our method, which we call aPPRove, builds upon the recent work of Barkan et al. [6] that determines sequence-specific binding rules for PPR proteins.

Barkan et al. [6] proposed a combinatorial binding code of PPR-RNA interaction that accounts for P and S motif sequences and adheres to the rules shown in Table 1. This binding code was expanded by the findings of Yagi et al. [8] and Takenaka et al. [7], who discovered binding preferences of L-type sequences. Both found that a proline at position 6 of an L-type sequence is likely to bind to uracil. Furthermore, the results of Takenaka et al. [7] showed that asparagine at position 1′ of L-type sequences likely binds to adenine or uracil if it is paired with isoleucine, leucine, proline, threonine, or methionine at position 6. The model used in these three papers aligns the PPR sequences of PLS proteins to the target RNA binding sites such that the terminal S-type sequence is in contact with the nucleotide four nucleotides upstream of an edit site on the target transcript. Okuda et al. [14] provide further evidence that PLS-class proteins align in this fashion. The pairing of positions 6 and 1′ in the PPR protein reinforced the previous findings of Fujii et al. [9]. Lastly, the results of Kotera et al. [4] demonstrated that PLS-class proteins are required for RNA editing.
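The two L-type preferences and the edit-site anchoring described above can be illustrated with a short Python sketch. It encodes only the rules stated in this paragraph (the Table 1 rules for P and S motifs are not reproduced), and the function names, the 0-based indexing, the ungapped window, and the choice to test the asparagine rule before the proline rule are all assumptions made for illustration, not part of aPPRove.

```python
from typing import FrozenSet, Optional, Tuple

# Position-6 residues that, paired with asparagine (N) at position 1',
# are reported in the text to favor adenine or uracil.
_POS6_PAIRED_WITH_ASN = frozenset("ILPTM")


def l_type_preference(pos6: str, pos1_prime: str) -> Optional[FrozenSet[str]]:
    """Return the nucleotides an L-type motif likely prefers.

    Returns None when neither rule stated in the text applies.
    Assumption: the more specific asparagine rule is checked first.
    """
    pos6, pos1_prime = pos6.upper(), pos1_prime.upper()
    if pos1_prime == "N" and pos6 in _POS6_PAIRED_WITH_ASN:
        return frozenset("AU")
    if pos6 == "P":
        return frozenset("U")
    return None


def footprint_window(edit_site: int, n_motifs: int) -> Tuple[int, int]:
    """Half-open [start, end) window of n_motifs nucleotides.

    The last nucleotide of the window sits four positions upstream of the
    0-based edit_site index. Gaps are ignored (an assumption of this sketch).
    """
    last = edit_site - 4
    start = last - n_motifs + 1
    if start < 0:
        raise ValueError("transcript is too short upstream of the edit site")
    return start, last + 1
```

For example, `footprint_window(20, 5)` covers indices 12 through 16, so the terminal motif contacts index 16, four positions upstream of the edit site at index 20.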

The aim of aPPRove is to build a predictive model of PPR-RNA binding using sequence-specific binding rules. This can be cast as an alignment problem. Let S6 and S1′ be the amino acid sequences defined by position 6 and position 1′ of all adjacent motif sequences in the primary structure of a PPR protein S. If S contains ℓ adjacent motif sequences, S6 and S1′ both have length ℓ − 1. Hence, the inputs to our problem are a PPR protein S, an RNA transcript R, and a scoring function ρ. More formally,
where N = {A, G, C, U, −} and aa = {all possible amino acids and −}, where − signifies an insertion or deletion. The goal is to find the w top-scoring alignments between R, S6 and S1′ with respect to ρ. The following definition formalizes the problem that aPPRove solves.
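To make the inputs concrete, the following Python sketch shows one way S6 and S1′ might be derived from a list of motif sequences and how the w top-scoring placements on a transcript could be enumerated. It is an illustration under stated assumptions, not the formal definition and not aPPRove's HMM: it assumes that position 1′ is the first residue of the next motif, that positions are 1-based within each motif string, that ρ maps a nucleotide and an amino-acid pair to a real-valued score, and that alignments are ungapped (the − symbol is not modeled). The names `build_code_sequences` and `top_alignments` are hypothetical.

```python
import heapq
from typing import Callable, List, Tuple

# Assumed shape of rho: (nucleotide, residue at 6, residue at 1') -> score.
Score = Callable[[str, str, str], float]


def build_code_sequences(motifs: List[str]) -> Tuple[str, str]:
    """Build S6 and S1' from l adjacent motifs; both have length l - 1."""
    if len(motifs) < 2:
        raise ValueError("need at least two adjacent motifs")
    if any(len(m) < 6 for m in motifs):
        raise ValueError("every motif must have at least six residues")
    s6 = "".join(m[5] for m in motifs[:-1])   # position 6 of motif i
    s1p = "".join(m[0] for m in motifs[1:])   # position 1 of motif i + 1
    return s6, s1p


def top_alignments(
    rna: str, s6: str, s1p: str, rho: Score, w: int
) -> List[Tuple[float, int]]:
    """Return the w best (score, start) pairs over all ungapped placements."""
    n = len(s6)
    scored = []
    for start in range(len(rna) - n + 1):
        total = sum(rho(rna[start + i], s6[i], s1p[i]) for i in range(n))
        scored.append((total, start))
    # Ties are broken toward the larger start index by tuple comparison.
    return heapq.nlargest(w, scored)
```

A gapped or probabilistic model, such as the HMM the paper describes, would replace the sliding window here; the sketch only shows how R, S6, S1′, ρ, and w fit together.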

aPPRove can be broken down into five main steps: (1) defining the repeat structure of the PPR by the motif sequence and number of repeats, (2) constructing S6 and S1′, (3) building a distribution of random alignments of S6 and S1′ to a database of RNA transcripts, (4) aligning S6 and S1′ to one or more RNA target transcripts, and (5) calculating the statistical significance (p-value) of the w top-scoring alignments of the PPR to target RNA transcripts.
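Steps (3) and (5) relate an observed alignment score to a background of scores from random alignments. One simple way to illustrate the idea is a rank-based empirical p-value, sketched below in Python. This is a swapped-in technique chosen for brevity; the source does not specify here how aPPRove computes its p-values, and the function name and the add-one correction are assumptions.

```python
from typing import Sequence


def empirical_p_value(observed: float, background: Sequence[float]) -> float:
    """Estimate P(background score >= observed) from sampled scores.

    Uses the add-one correction so the estimate is never exactly zero,
    which is a common convention for permutation-style tests.
    """
    if len(background) == 0:
        raise ValueError("background distribution is empty")
    hits = sum(1 for score in background if score >= observed)
    return (hits + 1) / (len(background) + 1)
```

In use, `background` would hold the best alignment score of the PPR against each transcript in the random database, and `observed` would be the score of one of the w top-scoring alignments against the target transcript.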

We have presented a method that uses the primary binding code of PPR proteins to predict how a protein will bind to a target transcript or binding footprint. Our method is unique in that it can be used to detect where and how a PPR protein binds to an RNA, as opposed to assessing only the likelihood of interaction. Again, we note that the hidden Markov model was parametrized with a dataset involving protein-RNA interactions of only PLS-class proteins. Thus, aPPRove captures the intricacies of how the PLS class of PPR proteins binds to its targets, but it may not accurately portray how a P-class PPR protein will bind to its target. The lack of data regarding P-class PPR protein interactions prevents us from adapting the model specifically for this subfamily of proteins. It is possible that the advent of high-throughput methods of quantifying protein-RNA interactions [31] may allow for future progress in modeling the interaction of P-class proteins and their target transcripts. Finally, if there is a known edit site, aPPRove can be used to detect putative binding events. Detecting these events is one of the most beneficial and powerful uses of aPPRove.

Software and data can be accessed from the aPPRove web page or from our GitHub repository. The data include edit sites and FASTA files for the following editing factors: CLB19 (AEE27887.1), CRR21 (NP_200385.1), CRR22 (NP_172596.1), CRR28 (NP_176180.1), CRR4 (NP_182060.2), LPA66 (AED95742.1), MEF1 (AED96243.1), MEF11 (AEE83509.1), MEF14 (Q9LW33), MEF18 (AED92640.1), MEF19 (AEE74210.1), MEF21 (AEC07025.1), MEF22 (AEE75244.1), MEF26 (Q9SS60), MEF29 (Q9SUH6), MEF3 (Q9LND4), MEF7 (Q9FIB2.1), MEF9 (O04590), OGR1 (ACL79585.1), OTP80 (AED97156.1), OTP81 (AEC08301.1), OTP82 (AEE28239.1), OTP84 (Q7Y211), OTP85 (AEC05651.1), OTP87 (NP_177599.1), PpPPR_77 (BAD67156.2), PpPPR_91 (BAD67154.1), RARE1 (AED91873.1), REME1 (NP_178481.1), SLG1 (Q9FNN9), SLO1 (Q9SJZ3), YS1 (F4J1L5).