Date Published: November 6, 2018
Publisher: Public Library of Science
Author(s): John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, Eric Karl Oermann
Abstract:
Background: There is interest in using convolutional neural networks (CNNs) to analyze medical imaging to provide computer-aided diagnosis (CAD). Recent work has suggested that image classification CNNs may not generalize to new data as well as previously believed. We assessed how well CNNs generalized across three hospital systems for a simulated pneumonia screening task.
Methods and findings: A cross-sectional design with multiple model training cohorts was used to evaluate model generalizability to external sites using split-sample validation. A total of 158,323 chest radiographs were drawn from three institutions: National Institutes of Health Clinical Center (NIH; 112,120 from 30,805 patients), Mount Sinai Hospital (MSH; 42,396 from 12,904 patients), and Indiana University Network for Patient Care (IU; 3,807 from 3,683 patients). These patient populations had a mean (SD) age of 46.9 (16.6), 63.2 (16.5), and 49.6 (17) years and were 43.5%, 44.8%, and 57.3% female, respectively. We assessed individual models using the area under the receiver operating characteristic curve (AUC) for radiographic findings consistent with pneumonia and compared performance on different test sets with DeLong’s test. The prevalence of pneumonia was high enough at MSH (34.2%) relative to NIH and IU (1.2% and 1.0%) that merely sorting by hospital system achieved an AUC of 0.861 (95% CI 0.855–0.866) on the joint MSH–NIH dataset. Models trained on data from either NIH or MSH had equivalent performance on IU (P values 0.580 and 0.273, respectively) and inferior performance on data from each other relative to an internal test set (i.e., new data from within the hospital system used for training data; P values both <0.001). The highest internal performance was achieved by combining training and test data from MSH and NIH (AUC 0.931, 95% CI 0.927–0.936), but this model demonstrated significantly lower external performance at IU (AUC 0.815, 95% CI 0.745–0.885, P = 0.001). To test the effect of pooling data from sites with disparate pneumonia prevalence, we used stratified subsampling to generate MSH–NIH cohorts that differed only in disease prevalence between training data sites. When both training data sites had the same pneumonia prevalence, the model performed consistently on external IU data (P = 0.88). When a 10-fold difference in pneumonia rate was introduced between sites, internal test performance improved compared to the balanced model (10× MSH risk P < 0.001; 10× NIH P = 0.002), but this outperformance failed to generalize to IU (MSH 10× P < 0.001; NIH 10× P = 0.027). CNNs were able to directly detect the hospital system of a radiograph for 99.95% of NIH (22,050/22,062) and 99.98% of MSH (8,386/8,388) radiographs. The primary limitation of our approach and the available public data is that we cannot fully assess what other factors might be contributing to hospital system–specific biases.
Conclusions: Pneumonia-screening CNNs achieved better internal than external performance in 3 out of 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from these sites but not on external data. CNNs robustly identified hospital system and department within a hospital, which can have large differences in disease burden and may confound predictions.
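The stratified subsampling described in the abstract can be illustrated with a short sketch (pandas and the column names are illustrative assumptions, not the published code): each site's radiographs are downsampled so the retained cohort hits a target pneumonia prevalence, allowing pooled training cohorts to be built that differ only in that prevalence.

# Sketch: downsample one site's radiographs to a target pneumonia prevalence.
# `site_df` is assumed to have a binary `pneumonia` column (hypothetical name).
import pandas as pd

def subsample_to_prevalence(site_df: pd.DataFrame, target_prev: float, seed: int = 0) -> pd.DataFrame:
    pos = site_df[site_df["pneumonia"] == 1]
    neg = site_df[site_df["pneumonia"] == 0]
    # Keep all positives and draw enough negatives so that
    # len(pos) / (len(pos) + n_neg) == target_prev.
    n_neg = int(len(pos) * (1 - target_prev) / target_prev)
    if n_neg <= len(neg):
        neg = neg.sample(n=n_neg, random_state=seed)
    else:
        # Not enough negatives at this site: downsample positives instead.
        n_pos = int(len(neg) * target_prev / (1 - target_prev))
        pos = pos.sample(n=n_pos, random_state=seed)
    # Shuffle the combined cohort before returning it.
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)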
Partial Text: There is significant interest in using convolutional neural networks (CNNs) to analyze radiology, pathology, or clinical imaging for the purposes of computer-aided diagnosis (CAD) [1–5]. These studies are generally performed utilizing CNN techniques that were pioneered on well-characterized computer vision datasets, including the ImageNet Large Scale Visual Recognition Competition (ILSVRC) and the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits [6,7]. Training CNNs to classify images from these datasets is typically done by splitting the data into three subsets: train (data directly used to learn parameters for models), tune (data used to choose hyperparameter settings, also commonly referred to as “validation”), and test (data used exclusively for performance evaluation of models learned using train and tune data). CNNs are trained to completion with the first two, and the final set is used to estimate the model’s expected performance on new, previously unseen data.
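A minimal sketch of that three-way split follows, assuming a pandas DataFrame of image paths and labels with a patient_id column (the column names and the use of scikit-learn are illustrative assumptions, not the authors' pipeline); splitting by patient rather than by image keeps one patient's radiographs from leaking across subsets:

# Sketch: patient-level train/tune/test split for a table of radiographs.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(df: pd.DataFrame, seed: int = 0):
    # Hold out ~20% of patients as the test set, used only for final evaluation.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
    trainval_idx, test_idx = next(outer.split(df, groups=df["patient_id"]))
    trainval, test = df.iloc[trainval_idx], df.iloc[test_idx]
    # Split the remainder into train (fits model weights) and tune
    # (selects hyperparameters; often called "validation").
    inner = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=seed)
    train_idx, tune_idx = next(inner.split(trainval, groups=trainval["patient_id"]))
    return trainval.iloc[train_idx], trainval.iloc[tune_idx], test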
We have demonstrated that pneumonia-screening CNNs trained on data from individual or multiple hospital systems did not consistently generalize to external sites, nor did they make predictions exclusively based on underlying pathology. Given the significant interest in using deep learning to analyze radiological imaging, our findings should give pause to those considering rapid deployment of such systems without first assessing their performance in a variety of real-world clinical settings. To our knowledge, no prior studies have assessed whether radiological CNNs generalize to external datasets. We note that poor external generalization is distinct from typical train/test performance degradation, in which overfitting to training data leads to lower performance on test data: in our experiments, all results, internal and external, are reported exclusively on held-out test data. Performance of the jointly trained MSH–NIH model on the joint test set (AUC 0.931) was higher than its performance on either individual dataset (AUC 0.805 and 0.733, respectively), likely because the model could calibrate to the different pneumonia prevalences across hospital systems in the joint test set but not in the individual test sets. A simple non-CNN baseline that used hospital system pneumonia prevalence alone to make predictions, ignoring image features entirely, achieved an AUC of 0.861 because of the large difference in pneumonia prevalence between the MSH and NIH test sets. Calibration plots confirmed that a model trained on NIH data was poorly calibrated to MSH and vice versa.
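To see how such a prevalence-only baseline can reach a high AUC without examining a single pixel, a minimal sketch (hypothetical `site` and `pneumonia` column names; not the authors' code) scores every radiograph with its hospital system's training-set pneumonia rate:

# Sketch: a "model" that ignores the image and predicts each site's pneumonia rate.
import pandas as pd
from sklearn.metrics import roc_auc_score

def prevalence_only_auc(train_df: pd.DataFrame, test_df: pd.DataFrame) -> float:
    # Site-level pneumonia rates estimated from training data only.
    site_rate = train_df.groupby("site")["pneumonia"].mean()
    # Every test radiograph is scored with its site's rate; with ~34%
    # prevalence at MSH and ~1% at NIH, most positives receive the higher
    # score, so ranking by site alone separates the classes fairly well.
    scores = test_df["site"].map(site_rate)
    return roc_auc_score(test_df["pneumonia"], scores)

On a single-site test set this score is constant for every radiograph and the AUC collapses to 0.5, which is why such a shortcut can inflate pooled-test performance without conferring any benefit externally.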
Pneumonia-screening CNNs achieved better internal than external performance in 3 out of 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from these sites but not on external data. CNNs robustly identified hospital system and department within a hospital, which can have large differences in disease burden and may confound predictions.