Date Published: March 21, 2017
Publisher: Public Library of Science
Author(s): Héloïse Philippon, Alexia Souvane, Céline Brochier-Armanet, Guy Perrière, Olivier Lespinet.
The reliability of molecular phylogenies is strongly dependent on the quality of the assembled datasets. In the case of eukaryotes, the selection of only one protein isoform per genomic locus is mandatory to avoid biases linked to redundancy. Here, we present IsoSel, a tool devoted to the selection of alternative isoforms in the context of phylogenetic reconstruction. It provides a better alternative to the widely used approach of selecting the longest isoform, and it performs better than Guidance, its only available counterpart. IsoSel is publicly available at http://doua.prabi.fr/software/isosel.
Alternative splicing, a process by which a single coding gene may give rise to different transcripts and thus to different protein isoforms, is common in eukaryotes. For instance, about 20% of plant genes and 90% of human genes undergo alternative splicing. In molecular phylogeny, datasets of homologous sequences are usually built with similarity-based procedures, which cannot distinguish among the various isoforms, so all of them are gathered during the process. However, most of the time only one isoform per gene is kept for phylogenetic analyses, for three reasons. First, isoforms of the same gene carry redundant information. Second, because some exons are present in some isoforms and absent in others, aligning them frequently introduces many gaps in Multiple Sequence Alignments (MSAs). Finally, trimming programs like Gblocks or BMGE select alignment regions based on their conservation level. Including many isoforms therefore leads to an overestimation of conservation, because the same residue is artefactually represented many times in the MSA.
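The effect of redundancy on conservation can be illustrated with a minimal sketch. It assumes a deliberately simple conservation measure (the fraction of sequences sharing the most frequent character in a column), which is not the criterion actually used by Gblocks or BMGE, and the toy alignments are invented for illustration only.

```python
from collections import Counter

def column_conservation(msa):
    """Per-column fraction of sequences carrying the most frequent character.

    `msa` is a list of equal-length aligned sequences. This is a simplified,
    hypothetical measure, not the one implemented in Gblocks or BMGE.
    """
    n = len(msa)
    return [Counter(col).most_common(1)[0][1] / n for col in zip(*msa)]

# Toy alignment with one sequence per gene (invented data).
one_per_gene = ["MKTA", "MRSA", "MQTG"]

# Same alignment after adding two extra isoforms of the first gene;
# here they are identical to it over the aligned region.
with_isoforms = one_per_gene + ["MKTA", "MKTA"]

baseline = column_conservation(one_per_gene)
inflated = column_conservation(with_isoforms)

for i, (b, r) in enumerate(zip(baseline, inflated)):
    print(f"column {i}: one per gene = {b:.2f}, with isoforms = {r:.2f}")
```

In this toy case every variable column looks more conserved once the redundant isoforms are added, although no additional gene supports that conservation.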
The minimal input required by IsoSel is an unaligned set of protein sequences in Fasta format. The output is a text file containing the score of each input sequence (Fig 3). Optionally, the user can provide a file giving the locus tag of each transcript. In this case, IsoSel also creates a file in Fasta format containing the filtered dataset (i.e., in which only the isoform having the best score for a given gene is kept).
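The filtering step described above can be sketched as follows. This is not IsoSel's code: the function name, the in-memory dictionaries, the identifiers and the score values are all hypothetical, and the actual file formats read and written by IsoSel are not reproduced here. The sketch only assumes that a higher score is better and that each sequence identifier maps to at most one locus tag.

```python
from typing import Dict

def select_best_isoforms(scores: Dict[str, float],
                         locus_of: Dict[str, str]) -> Dict[str, str]:
    """Return a mapping {locus tag: identifier of its best-scoring isoform}.

    Sequences without a locus tag are treated as their own locus, so they
    are always kept. On ties, the first isoform encountered is retained.
    """
    best: Dict[str, str] = {}
    for seq_id, score in scores.items():
        locus = locus_of.get(seq_id, seq_id)
        current = best.get(locus)
        if current is None or score > scores[current]:
            best[locus] = seq_id
    return best

# Hypothetical per-sequence scores and locus tags (invented values).
scores = {"geneA.1": 0.91, "geneA.2": 0.97, "geneB.1": 0.88,
          "geneC.1": 0.75, "geneC.2": 0.60, "orphan": 0.50}
locus_of = {"geneA.1": "geneA", "geneA.2": "geneA", "geneB.1": "geneB",
            "geneC.1": "geneC", "geneC.2": "geneC"}

kept = select_best_isoforms(scores, locus_of)
for locus, seq_id in kept.items():
    print(locus, seq_id, scores[seq_id])
```

Writing the retained sequences back to a Fasta file would then amount to keeping only the records whose identifiers appear among the values of `kept`.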
IsoSel is a command-line program designed for the automatic selection of protein isoforms in the framework of phylogenetic analyses. Based on the SP score, it produces datasets that are optimized for tree reconstruction. The only other software that can be compared to IsoSel is Guidance, but this program has some limitations. First, it requires the independent installation of a broad range of tools (namely Perl, BioPerl and Ruby), while IsoSel is self-sufficient and is distributed with all the binaries required for its functioning. Second, it can only be run with the JTT substitution model, while IsoSel allows the use of all standard site-homogeneous models. From a practical point of view, Guidance is usually slower than IsoSel when multithreading is enabled (data not shown). This is probably because Guidance was not designed for the selection of alternative isoforms but rather as a general tool for assessing MSA quality. With this broader purpose, Guidance has to compute many scores in addition to SP, which lowers its speed.