Date Published: October 31, 2018
Publisher: Public Library of Science
Author(s): Yang Shen, Benjamin Chaigne-Delalande, Richard W. J. Lee, Wolfgang Losert, Niklas K. Björkström.
New cytometric techniques continue to push the boundaries of multi-parameter quantitative data acquisition at the single-cell level, particularly in immunology and medicine. Sophisticated analysis methods for these ever higher-dimensional datasets are rapidly emerging, with advanced data representations and dimensionality reduction approaches. However, these methods are not yet standardized, and clinical scientists and cell biologists are not yet experienced in their interpretation. More fundamentally, their range of statistical validity is not yet fully established. We therefore propose a new method for the automated and unbiased analysis of high-dimensional single-cell datasets that is simple and robust, with the goal of reducing this complex information to a familiar 2D scatter plot representation that is of immediate utility in a range of biomedical and clinical settings. Using publicly available flow cytometry and mass cytometry datasets, we demonstrate that this method (termed CytoBinning) recapitulates the results of traditional manual cytometric analyses and leads to new and testable hypotheses.
Cytometry is a multi-parameter single-cell measurement technique that is widely used in biological and clinical studies [1–6]. One of the main uses of flow cytometry, which has had a major impact across the fields of immunology and medicine, is to differentiate immune cell compositions among cell types or patients. Modern flow cytometers can routinely measure 15–20 cellular markers on millions of cells from dozens of samples in one experiment, and can sort cells into subpopulations based on those markers. Recently, mass cytometry has expanded the number of markers that can be measured simultaneously to 100, though the technique is destructive to cells and does not allow for sorting. The conventional way of analyzing flow cytometry data uses a gating strategy, which requires the manual selection of regions of interest (ROI) on sequential 2D scatter plots. This type of analysis is labor intensive and inefficient for such large datasets, and it is subjective in both the sequence of 2D scatter plots and the selection of thresholds (ROI) [3,4,7–10]. Therefore, as both the number of cells analyzed and the number of markers quantified for each cell have grown over the past decade, novel automated and unbiased analysis methods for flow cytometry data are emerging.
We synthesized two point-patterns based on the expression of two virtual markers: marker A and marker B. Ten samples were generated for each point-pattern. The first point-pattern, called pattern A, consists of three point-clusters. Two large clusters each contain 5,000 points, and a third, relatively small cluster contains about 2,000 points. The three clusters are randomly sampled from Gaussian distributions centered at points (0, 4), (0, -4) and (4, 0) with standard deviations of 2, 2, and 1, respectively. The second point-pattern, called pattern B, also consists of three point-clusters. The two large clusters are generated in the same way as in pattern A; however, the third, smaller cluster contains only 200 to 500 points, sampled from a Gaussian distribution centered at point (-4, 6) with standard deviation 1 (S1 Fig).
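The synthetic patterns described above can be reproduced approximately with a few lines of NumPy. This is a minimal sketch, not the authors' code. Several details are assumptions that the text does not specify: the function name `make_sample`, the random seed, isotropic (uncorrelated) Gaussians, exactly 2,000 points for the small cluster of pattern A (the text says "about 2,000"), and a uniform draw between 200 and 500 points for the small cluster of pattern B.

```python
import numpy as np


def make_sample(pattern, rng):
    """Return an (n_points, 2) array of (marker A, marker B) values for one sample.

    Hypothetical helper: cluster centers, standard deviations and the sizes of
    the large clusters follow the text; everything else is an assumption.
    """
    # Two large clusters shared by both patterns: 5,000 points each, SD 2.
    big_top = rng.normal(loc=(0.0, 4.0), scale=2.0, size=(5000, 2))
    big_bottom = rng.normal(loc=(0.0, -4.0), scale=2.0, size=(5000, 2))

    if pattern == "A":
        # Assumption: exactly 2,000 points (the text says "about 2,000").
        small = rng.normal(loc=(4.0, 0.0), scale=1.0, size=(2000, 2))
    elif pattern == "B":
        # Assumption: cluster size drawn uniformly from 200..500 inclusive.
        n_small = int(rng.integers(200, 501))
        small = rng.normal(loc=(-4.0, 6.0), scale=1.0, size=(n_small, 2))
    else:
        raise ValueError(f"unknown pattern: {pattern!r}")

    # Rows are ordered: large cluster 1, large cluster 2, small cluster.
    return np.vstack([big_top, big_bottom, small])


rng = np.random.default_rng(0)  # assumption: arbitrary fixed seed
# Ten samples per point-pattern, as in the text.
samples_a = [make_sample("A", rng) for _ in range(10)]
samples_b = [make_sample("B", rng) for _ in range(10)]
```

Each element of `samples_a` and `samples_b` is one synthetic sample whose columns play the roles of marker A and marker B. The small cluster makes up roughly 17% of each pattern A sample but only about 2% to 5% of each pattern B sample, so the two patterns differ in both the location and the relative size of their minority population.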
The complexity of cytometry data has increased significantly in the last few years, driven by advances in experimental techniques that enable measurements of dozens of parameters on each of millions of cells. To deal with this data deluge, novel analysis algorithms are being introduced at a rapid pace that identify clusters of cells and project the high-dimensional information graphically in innovative ways. However, many biomedical researchers and clinicians do not (yet) have the intuition to interpret these novel graphic representations and translate them into hypotheses and actions. A further concern is that nearest-neighbor relationships become less meaningful in high dimensions, a phenomenon referred to as the “curse of dimensionality” [33,34]. Here we introduce a simpler, alternative approach that we term CytoBinning. Our analysis approach combines automation of a more traditional workflow (as advocated in ) with machine learning that links the high-dimensional data back to two biomarkers, which can be represented as 2D scatter plots. The 2D scatter plot outputs are designed to be directly interpretable by biomedical researchers and clinicians, who have an established intuition for the meaning of these graphics. Thus, we are able to leverage their existing expertise in interpreting these kinds of scatter plots. When the differences in phenotype are small, CytoBinning is able to further focus the researcher’s or clinician’s attention by identifying which specific regions of the scatter plot exhibit the most notable differences between two groups of donors, allowing subtle shifts in the immune phenotype to be highlighted.
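The remark above about nearest neighbors losing meaning in high dimensions can be checked numerically. The sketch below is an illustration only and is not part of CytoBinning. It assumes points drawn uniformly from the unit hypercube and Euclidean distance, and the helper name `relative_contrast` is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance, as the number of dimensions grows.

```python
import numpy as np


def relative_contrast(n_points, n_dims, rng):
    """Return (farthest - nearest) / nearest distance from a random query point
    to a cloud of points drawn uniformly from the unit hypercube."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)  # Euclidean distances
    return (dists.max() - dists.min()) / dists.min()


rng = np.random.default_rng(0)  # assumption: arbitrary fixed seed
# Dimensions chosen for illustration; 2 mimics a scatter plot, 40 to 100 a
# mass cytometry panel.
contrast = {d: relative_contrast(1000, d, rng) for d in (2, 10, 40, 100)}

for n_dims, value in contrast.items():
    print(f"{n_dims:>3} dimensions: relative contrast = {value:.2f}")
```

In two dimensions the nearest point is far closer than the farthest one, so the contrast is large. As the number of dimensions grows, the contrast shrinks and "nearest" carries less information. Real marker panels are correlated rather than uniform, so the effect on actual cytometry data may be weaker than in this idealized setting.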