Date Published: January 16, 2018
Publisher: Public Library of Science
Author(s): Giulio Di Minin, Andreas Postlmayr, Anton Wutz, Dina Schneidman
Abstract: Haploid cells are increasingly used for screening of complex pathways in animal genomes. Hemizygous mutations introduced through viral insertional mutagenesis can be directly selected for phenotypic changes. Here we present HaSAPPy a tool for analysing sequencing datasets of screens using insertional mutations in large pools of haploid cells. Candidate gene prediction is implemented through identification of enrichment of insertional mutations after selection by simultaneously evaluating several parameters. We have developed HaSAPPy for analysis of genetic screens for silencing factors of X chromosome inactivation in haploid mouse embryonic stem cells. To benchmark the performance, we further analyse several datasets of genetic screens in human haploid cells for which candidates have been validated. Our results support the effective candidate prediction strategy of HaSAPPy. HaSAPPy is implemented in Python, licensed under the MIT license, and is available from https://github.com/gdiminin/HaSAPPy.
Partial Text: Next generation sequencing (NGS) has facilitated the exploration of animal genomes in a number of areas including recent approaches for genetic screening. In mammals, screening of important biomedical pathways can be performed orders of magnitude faster in cell cultures than in the organism and at a fraction of the cost. Cell based screens have been performed using RNA interference , sequence specific nucleases , and mutagenesis of haploid cells. The latter strategy has been originally implemented in a haploid human leukemia cell line  and recently extended to haploid embryonic stem cells (ESCs). A number of successful screens illustrate the power of pooled mammalian haploid cell screening. Clinically relevant pathways [4,5] including pathogen [3,6–8] and toxin [9–12] mechanisms have been studied in human haploid cells. Haploid ESCs from mouse [13,14] and from human embryos  have been used to characterize mechanisms of drug action [13,14,16] and developmental questions [17,18].
HaSAPPy uses NGS datasets for predicting candidates from insertional mutagenesis screens in haploid cells based on selection for specific phenotypes including genetically encoded reporters, cell survival, and physical isolation of cells (Fig 1A). Gene-trap insertion sites are identified by reads starting with the first base of the genomic sequence flanking the vector insertion  (Fig 1B). Reads are preprocessed for eliminating regions with consistently low base quality and maintaining a maximum of sequence information. Subsequently, adaptor removal is performed and reads that become shorter than a threshold (default 26nt) are discarded. HaSAPPy has been preconfigured with three read mappers including the Burrows-Wheeler transform based Bowtie2 , and nvBowtie (https://github.com/NVlabs/nvbio/tree/master/nvBowtie), and the hash table index structure based NextGenMap  using a test suite . HaSAPPy also accepts pre-aligned datasets in Sequence Alignment/Map (SAM) format and a threshold for alignment quality (MAPQ).
We developed HaSAPPy by analysing a screen for silencing factors in X inactivation , which used an inducible Xist expression system in haploid mouse ESCs for identifying mutations that mediate cell survival by escaping inactivation of the X chromosome. 7 control (SRX1060416) and 7 selected (SRX1060407) datasets with a total of 300 million reads were analysed on workstations with Ubuntu Linux version 14.04, at least 32 gigabytes memory, and graphics processing units (GPUs). Runtimes between 90 minutes and 3 hours were predominated by read mapping and subsequent analyses were also performed on a Macbook (Intel Core i7, 2.3GHz) in 30 minutes. We have evaluated different read mappers and alignment parameters by benchmarking on experimental and generated read datasets using a previously published test suite . As a result HaSAPPy is preconfigured to run with Bowtie2 , nvBowtie, and NextGenMap . The latter two aligners utilize accelerators for speeding up read alignment by taking advantage of recent hardware developments. Although some differences in the number of insertions assigned to candidate genes were observed (Fig 3A, S2 Fig, Table A in S1 Text), our results suggest that GPUs can be effective in speeding up read mapping without a loss in sensitivity consistent with earlier results [25,26].
Analyses of genetic screens require sensitivity to avoid missing individual candidates. Pathways can be represented by single candidates, when redundancy or lethality reduces opportunities of discovery. Conversely, false predictions can lead to costly experimental validation, which limits the number of genes that are followed up. Therefore there is a need for effective candidate ranking. NGS datasets from genetic screens require strategies to identify evidence for selection and reduce technical noise that differ from other sequencing analysis problems. Our study introduces three methodical procedures for candidate prediction in haploid screens.