Date Published: October 10, 2018
Publisher: Public Library of Science
Author(s): Yuval Lieberman, Lior Rokach, Tal Shay, Lars Kaderali.
Single-cell RNA sequencing (scRNA-seq) is an emerging technology for profiling the gene expression of thousands of cells at single-cell resolution. Currently, the labeling of cells in an scRNA-seq dataset is performed by manually characterizing clusters of cells or by fluorescence-activated cell sorting (FACS). Both methods have inherent drawbacks: the first depends on the clustering algorithm used and on the knowledge and arbitrary decisions of the annotator, and the second involves an experimental step in addition to the sequencing and cannot be incorporated into the higher-throughput scRNA-seq methods. We therefore suggest a different approach for cell labeling, namely, classifying cells from scRNA-seq datasets by using a model transferred from different (previously labeled) datasets. This approach can complement existing methods and, in some cases, even replace them. Such a transfer-learning framework requires selecting informative features and training a classifier. The specific implementation of the framework that we propose, designated "CaSTLe" (classification of single cells by transfer learning), is based on a robust feature engineering workflow and an XGBoost classification model built on these features. Evaluation of CaSTLe against two benchmark feature-selection and classification methods showed that it outperformed the benchmark methods in most cases and consistently yielded satisfactory classification accuracy. CaSTLe has the additional advantage of being parallelizable and well suited to large datasets. We showed that it was possible to classify cell types using transfer learning, even when the databases contained a very small number of genes, and our study thus indicates the potential applicability of this approach for analysis of scRNA-seq datasets.
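As a rough illustration of the transfer-learning framework described above (select informative features on a labeled source dataset, train a classifier, then label the cells of a target dataset), the following minimal sketch may help. It is not the CaSTLe implementation: a nearest-centroid classifier stands in for the XGBoost model, feature selection is reduced to keeping the genes with the highest mean expression, and all gene names, expression values, and function names are made up for illustration.

```python
from statistics import mean

def top_mean_genes(cells, genes, k):
    """Return the k genes with the highest mean expression across cells."""
    means = {g: mean(cell[g] for cell in cells) for g in genes}
    return sorted(genes, key=lambda g: means[g], reverse=True)[:k]

def fit_centroids(cells, labels, features):
    """Mean expression vector per label over the selected features."""
    rows_by_label = {}
    for cell, label in zip(cells, labels):
        rows_by_label.setdefault(label, []).append([cell[g] for g in features])
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in rows_by_label.items()}

def predict(cell, centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    vec = [cell[g] for g in features]
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Hypothetical labeled source dataset (toy expression values).
source = [
    {"CD3E": 9, "CD19": 0, "ACTB": 5, "GENE_Y": 0, "GENE_X": 1},
    {"CD3E": 8, "CD19": 1, "ACTB": 6, "GENE_Y": 1, "GENE_X": 0},
    {"CD3E": 0, "CD19": 9, "ACTB": 5, "GENE_Y": 1, "GENE_X": 1},
    {"CD3E": 1, "CD19": 8, "ACTB": 6, "GENE_Y": 0, "GENE_X": 0},
]
source_labels = ["T cell", "T cell", "B cell", "B cell"]

# Hypothetical unlabeled target dataset; it does not measure GENE_X.
target = [
    {"CD3E": 7, "CD19": 0, "ACTB": 4, "GENE_Y": 0},
    {"CD3E": 0, "CD19": 6, "ACTB": 5, "GENE_Y": 1},
]

# Only genes measured in both datasets can be transferred.
shared = sorted(set(source[0]) & set(target[0]))
features = top_mean_genes(source, shared, k=3)
centroids = fit_centroids(source, source_labels, features)
predictions = [predict(cell, centroids, features) for cell in target]
print(features, predictions)
```

The point of the sketch is the order of operations: features are chosen and the model is fitted on the source dataset only, and the target cells are touched only at prediction time.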
Single-cell RNA sequencing (scRNA-seq) is an emerging technology that measures, in a single experiment, the expression profile of up to 10^5 cells, at the level of the single cell. There are currently hundreds of scRNA-seq datasets in the public domain, and the number of new datasets is growing rapidly. Intensive attention has thus been devoted to addressing, by various methods, the unique analytical challenges posed by the analysis of scRNA-seq datasets. The labeling of the cells (e.g., in terms of cell type, cell state, and cell cycle stage) in an scRNA-seq dataset that profiles a non-homogeneous cell population is currently performed by one of two approaches, one experimental and the other computational, namely, fluorescence-activated cell sorting (FACS) or clustering the cells based on gene expression data, followed by manual annotation of each cell cluster. Both these approaches have inherent drawbacks. The first approach, FACS, requires an additional experimental step (beyond the actual sequencing experiment) and is limited in throughput, as it is necessary to track the cells, typically by sorting from the cell sorter to multiwell plates. This approach is thus not practical for new scRNA-seq methods, such as drop-seq, in which large numbers of cells are profiled. The second approach, clustering and manual annotation [5,6], depends not only on a dimensionality reduction method [typically principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE)] and a clustering algorithm used to define distinct cell types but also on the knowledge and arbitrary decisions of the annotator of each cell type. The labeling is therefore subjective. As a result, comparisons of cells of presumably the same cell type between experiments become complicated, if not impossible. In addition, the annotator typically uses knowledge of existing cell type markers. However, those known markers are defined and used at the protein level.
RNA levels can explain about 40–80% of the variance in protein levels, meaning that reliable protein markers are not necessarily reliable markers at the RNA level. For example, natural killer cells express CD8a RNA, even though they do not carry CD8 protein on their cell surface. An additional drawback is that the inherently low sampling and the noise in measurements at the single-cell level make classification based on a small number of marker genes very inaccurate. Classification based on a larger number of genes is much more robust to noise and sampling depth. Thus, although the labeling of cells of known cell types is, by definition, a supervised learning task, it is currently achieved by unsupervised methods with manual input. Recent attempts to address the above-described problems have led to the development of several different approaches for automatic annotation of cell types, including our own, which is presented in this article.
We showed that it is possible to classify single-cell RNA sequencing gene expression data in terms of cell types according to an independent labeled dataset containing similar cell types. CaSTLe, the method we developed for this process, is composed of a robust feature selection and transformation step, followed by XGBoost classification. This method was shown to be parallelizable, efficient, and consistent across various test cases. For the multi-class scenario, in 10 out of 12 cases, CaSTLe outperformed a simple benchmark of highest mean features and linear model classification. In 8 out of 12 cases, CaSTLe outperformed a more sophisticated benchmark, the beta-Poisson single cell differentially expressed genes and linear model classifier. The strength and robustness of CaSTLe, compared to the two benchmark methods, were demonstrated by the high accuracy levels achieved for the larger and more imbalanced datasets. For the binary-class scenario, out of 18 cell types that appeared both in the source and target datasets, AUC values above 95% were obtained for 16 cell types. Out of 15 cell types that appeared only in the source dataset, a sensitivity higher than 97% was obtained for all 15 cell types, which means that an erroneous cell-type identification was probably made for fewer than 3% of cells. The performance in this stage was much better than in the multi-class classification, probably because we performed steps 5–9 separately for each cell type.
Source: http://doi.org/10.1371/journal.pone.0205499
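For readers who want to see how the two kinds of binary-class figures quoted above can be computed, here is a minimal sketch. It assumes each target cell receives a score between 0 and 1 for a given cell type. AUC is computed through the rank-sum identity (the probability that a randomly chosen cell of the type scores higher than a randomly chosen cell of another type), and for a cell type absent from the target dataset the sketch simply reports the fraction of cells flagged at a threshold. The scores, the 0.5 threshold, and the function names are illustrative assumptions, not values or code from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels: 1 for cells of the cell type, 0 for all other cells.
    Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative cell")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flagged_fraction(scores, threshold=0.5):
    """Fraction of cells whose score reaches the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Cell type present in both source and target: toy scores and true labels.
scores_present = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels_present = [1, 1, 1, 0, 0, 0]
auc_value = auc(scores_present, labels_present)

# Cell type present only in the source: every flagged target cell is an error.
scores_absent = [0.1, 0.05, 0.7, 0.2]
flagged = flagged_fraction(scores_absent)

print(round(auc_value, 3), flagged)
```

In the second case AUC is undefined because the target contains no cells of that type, which is why a threshold-based error fraction is the natural summary there.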