Date Published: January 16, 2019
Publisher: Public Library of Science
Author(s): Daniel Lichtblau, Catalin Stoean, Marco Magalhaes.
This study addresses the automated differentiation of histopathological slides of colon tissue into four classes (healthy tissue and cancerous tissue of grades 1, 2 or 3) through an optimized ensemble of predictors. Six distinct classifiers, with prediction accuracies ranging from 87% to 95%, are considered for the task. The proposed method of combining them takes into account the probabilities that each individual classifier assigns to each sample for each of the four classes, optimizes a weight for each technique by differential evolution, and attains an accuracy that is significantly better than the individual results. Moreover, a degree of confidence is defined that allows pathologists to separate the data into two distinct sets: one that is correctly classified with a high level of confidence, and the rest, which needs their further attention. The combined approach is also validated on other benchmark data sets. The proposed methodology proves efficient in improving on the classification accuracy of each algorithm taken separately and performs reasonably well on other data sets, even with default weights. In addition, by establishing a degree of confidence, the method becomes more viable for use by actual practitioners.
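The combination step described above can be illustrated with a minimal sketch. It assumes that each classifier outputs a probability per class, that the ensemble score is a weighted sum of those probabilities, and that the degree of confidence is the margin between the two highest combined scores. The margin-based confidence, the threshold value, the toy probabilities and all function names are assumptions made for illustration; the text here does not give the exact definitions used in the study.

```python
from typing import List, Sequence, Tuple

CLASSES = ["G0", "G1", "G2", "G3"]  # healthy, then cancer grades 1 to 3


def combine(probabilities: Sequence[Sequence[float]],
            weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-classifier class probabilities, rescaled to sum to 1."""
    n_classes = len(probabilities[0])
    scores = [0.0] * n_classes
    for probs, weight in zip(probabilities, weights):
        for c in range(n_classes):
            scores[c] += weight * probs[c]
    total = sum(scores)
    if total <= 0:
        raise ValueError("weights and probabilities must give a positive total")
    return [s / total for s in scores]


def predict_with_confidence(probabilities: Sequence[Sequence[float]],
                            weights: Sequence[float]) -> Tuple[int, float]:
    """Return the predicted class index and an assumed confidence value."""
    scores = combine(probabilities, weights)
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Assumption: confidence is the gap between the top two combined scores.
    return best, scores[best] - scores[runner_up]


def triage(samples, weights, threshold):
    """Split samples into confidently classified ones and ones needing review."""
    confident, needs_review = [], []
    for index, probs in enumerate(samples):
        label, confidence = predict_with_confidence(probs, weights)
        target = confident if confidence >= threshold else needs_review
        target.append((index, CLASSES[label], confidence))
    return confident, needs_review


# Toy data: two samples, three hypothetical classifiers, four classes each.
samples = [
    [[0.05, 0.10, 0.80, 0.05], [0.00, 0.10, 0.90, 0.00], [0.10, 0.10, 0.70, 0.10]],
    [[0.10, 0.50, 0.40, 0.00], [0.00, 0.40, 0.60, 0.00], [0.10, 0.45, 0.45, 0.00]],
]
weights = [1.0, 1.0, 1.0]  # default (equal) weights
confident, needs_review = triage(samples, weights, threshold=0.3)
```

In this toy run the three classifiers agree strongly on the first sample, so it lands in the confident set, while the second sample is nearly split between G1 and G2 and is routed to the pathologist.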
The best chance of curing cancer currently lies in detecting it at an early stage. Individuals whose history puts them at increased risk of developing cancer are therefore advised to take screening tests from an early age and to repeat them at regular intervals. In some countries, screening tests are recommended for all adults above a certain age, depending on the cancer type. There have also been important investments worldwide in acquiring advanced microscopy hardware for hospitals. As a result, an increasing number of histological slides have to be analyzed. Computational approaches can support medical professionals through autonomous learning and direct diagnosis, especially by providing a second opinion, or even by identifying evidently benign cases so that the human experts can concentrate on the more problematic slides.
The image data set comes from the University Hospital of Craiova, Romania, and contains 357 images of 800×600 pixels: 62 healthy (G0) records, 96 of the first grade (G1), 99 of the second grade (G2) and 100 of the third grade (G3). The grades for the samples were established by two pathologists who reached a consensus diagnosis. This reduces, but does not remove, the possibility of classification errors in cases where the pathologists must distinguish between difficult diagnostic categories. Examples of samples from each class can be observed in Fig 1. Based on the name of the project that put forward the data, it is referred to in the rest of the article as the IMEDIATREAT data set. It was initially introduced in earlier work and is available for download.
Although the traditional sequence of preprocessing, segmentation, feature extraction, feature selection and classification is still preferred by many studies in image analysis, fully automated classification of cancer histological images has emerged as an alternative, human-independent methodology. This means that the pathologists do not need to pre-annotate the regions of interest, which spares the human experts the additional effort of assisting the machine. More unusual means of diagnosis have also been proposed recently, such as studying the eye movements of the pathologists, but these still require the pathologist to perform the classification task. A very recent and broad literature review of clinical information extraction, including extraction from histological slides, is also available.
A combination of 6 machine learning techniques (5 state-of-the-art methods and a relatively new approach) is employed for the histopathological image classification task. The images are transformed into numerical vectors using the AlexNet CNN, with the pre-trained weights and without fine-tuning. Each vector consists of 1024 numerical features extracted from the CNN.
Two experiments are undertaken.
The current research focuses on improving the classification accuracy on the IMEDIATREAT collection. The collection comprises 357 histopathological slides separated into four classes: healthy, or cancer of grades 1, 2 or 3. Six classifiers are tested on the data set: they learn from a training set consisting of 2/3 of the entire data collection and are tested on the remaining 1/3 of the samples. The classification accuracies of these six approaches are superior or similar to those of the techniques previously attempted on the same data set. However, the research goes further and combines the six classifiers, establishing weights for their significance in order to boost the accuracy further. The values of the weights are discovered (that is, optimized) via a differential evolution approach, using a fitness function that additionally penalizes errors in which the predicted class is wrong by more than one grade.
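The weight optimization can be sketched as follows, as a minimal illustration rather than the authors' implementation. It assumes a basic rand/1/bin differential evolution scheme with weights bounded to [0, 1], and a cost that counts 1 for an error of one grade and twice the grade distance for larger errors. The exact penalty, the DE settings, the toy data and all names are assumptions; the text only states that errors of more than one grade are penalized further.

```python
import random

def ensemble_predict(probs, weights):
    """Class index with the highest weighted sum of classifier probabilities."""
    n_classes = len(probs[0])
    scores = [sum(w * p[c] for w, p in zip(weights, probs)) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: scores[c])

def penalized_error(weights, samples, labels):
    """Assumed cost: 1 per off-by-one error, 2 * distance for larger errors."""
    cost = 0.0
    for probs, true_label in zip(samples, labels):
        distance = abs(ensemble_predict(probs, weights) - true_label)
        cost += distance if distance <= 1 else 2.0 * distance
    return cost

def differential_evolution(objective, dim, initial, pop_size=20,
                           generations=50, f=0.8, cr=0.9, seed=0):
    """Minimal rand/1/bin differential evolution over the box [0, 1]^dim."""
    rng = random.Random(seed)
    # Seed the population with the default weights so the result is never worse.
    population = [list(initial)]
    population += [[rng.random() for _ in range(dim)] for _ in range(pop_size - 1)]
    costs = [objective(individual) for individual in population]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = population[a][d] + f * (population[b][d] - population[c][d])
                    value = min(1.0, max(0.0, value))  # clip to the bounds
                else:
                    value = population[i][d]
                trial.append(value)
            trial_cost = objective(trial)
            if trial_cost <= costs[i]:  # greedy selection
                population[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=lambda i: costs[i])
    return population[best], costs[best]

# Toy data: two hypothetical classifiers; the second is badly wrong on two samples.
samples = [
    [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.1, 0.8]],
    [[0.1, 0.7, 0.2, 0.0], [0.1, 0.6, 0.3, 0.0]],
    [[0.0, 0.2, 0.6, 0.2], [0.9, 0.0, 0.1, 0.0]],
    [[0.0, 0.0, 0.3, 0.7], [0.0, 0.1, 0.5, 0.4]],
]
labels = [0, 1, 2, 3]
default_weights = [1.0, 1.0]

def objective(w):
    return penalized_error(w, samples, labels)

baseline = objective(default_weights)
best_weights, best_cost = differential_evolution(objective, 2, default_weights)
```

Because the default weights are placed in the initial population and selection is greedy, the optimized cost cannot exceed the default-weight cost. In practice the weights would be tuned on training data only, and a library routine such as SciPy's `differential_evolution` could replace the hand-written loop.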