Date Published: November 20, 2018
Publisher: Public Library of Science
Author(s): Pranav Rajpurkar, Jeremy Irvin, Robyn L. Ball, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis P. Langlotz, Bhavik N. Patel, Kristen W. Yeom, Katie Shpanskaya, Francis G. Blankenberg, Jayne Seekins, Timothy J. Amrhein, David A. Mong, Safwan S. Halabi, Evan J. Zucker, Andrew Y. Ng, Matthew P. Lungren, Aziz Sheikh
Abstract:
Background: Chest radiograph interpretation is critical for the detection of thoracic diseases, including tuberculosis and lung cancer, which affect millions of people worldwide each year. This time-consuming task typically requires expert radiologists to read the images, leading to fatigue-based diagnostic error and a lack of diagnostic expertise in areas of the world where radiologists are unavailable. Recently, deep learning approaches have achieved expert-level performance in medical image interpretation tasks, powered by large network architectures and fueled by the emergence of large labeled datasets. The purpose of this study is to investigate the performance of a deep learning algorithm in detecting pathologies in chest radiographs compared with practicing radiologists.
Methods and findings: We developed CheXNeXt, a convolutional neural network that concurrently detects the presence of 14 different pathologies, including pneumonia, pleural effusion, pulmonary masses, and nodules, in frontal-view chest radiographs. CheXNeXt was trained and internally validated on the ChestX-ray8 dataset, with a held-out validation set of 420 images, sampled to contain at least 50 cases of each of the original pathology labels. On this validation set, the majority vote of a panel of 3 board-certified cardiothoracic specialist radiologists served as the reference standard. We compared CheXNeXt's discriminative performance on the validation set to the performance of 9 radiologists using the area under the receiver operating characteristic curve (AUC). The radiologists included 6 board-certified radiologists (average experience 12 years, range 4–28 years) and 3 senior radiology residents from 3 academic institutions. We found that CheXNeXt achieved radiologist-level performance on 11 pathologies and did not achieve radiologist-level performance on 3 pathologies. The radiologists achieved statistically significantly higher AUCs on cardiomegaly, emphysema, and hiatal hernia, with AUCs of 0.888 (95% confidence interval [CI] 0.863–0.910), 0.911 (95% CI 0.866–0.947), and 0.985 (95% CI 0.974–0.991), respectively, whereas CheXNeXt's AUCs were 0.831 (95% CI 0.790–0.870), 0.704 (95% CI 0.567–0.833), and 0.851 (95% CI 0.785–0.909), respectively. CheXNeXt performed better than the radiologists in detecting atelectasis, with an AUC of 0.862 (95% CI 0.825–0.895), statistically significantly higher than the radiologists' AUC of 0.808 (95% CI 0.777–0.838); there were no statistically significant differences in AUCs for the other 10 pathologies. The average time to interpret the 420 images in the validation set was substantially longer for the radiologists (240 minutes) than for CheXNeXt (1.5 minutes). The main limitations of our study are that neither CheXNeXt nor the radiologists were permitted to use patient history or review prior examinations, and that evaluation was limited to a dataset from a single institution.
Conclusions: In this study, we developed and validated a deep learning algorithm that classified clinically important abnormalities in chest radiographs at a performance level comparable to that of practicing radiologists. Once tested prospectively in clinical settings, the algorithm could have the potential to expand patient access to chest radiograph diagnostics.
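Detecting 14 pathologies "concurrently" means the network is a multi-label classifier: one independent sigmoid probability per pathology (rather than a softmax over mutually exclusive classes), trained with binary cross-entropy. The paper does not publish its implementation here, so the sketch below is illustrative only: the linear "backbone", the three label names, and the random weights are assumptions standing in for a real convolutional feature extractor over all 14 labels.

```python
import numpy as np

PATHOLOGIES = ["atelectasis", "cardiomegaly", "emphysema"]  # 3 of the 14 labels, for brevity

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_probs(features, weights, bias):
    """One independent sigmoid per pathology (multi-label, not softmax):
    each image can be positive for any subset of the pathologies."""
    return sigmoid(features @ weights + bias)

def multilabel_bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over all (image, pathology) pairs."""
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))               # 4 images, 8 pooled features (toy stand-in for a CNN)
weights = rng.normal(size=(8, len(PATHOLOGIES))) # toy classification head
bias = np.zeros(len(PATHOLOGIES))

probs = predict_probs(features, weights, bias)   # shape (4, 3): one probability per image per pathology
labels = (probs > 0.5).astype(float)             # toy labels for demonstration
loss = multilabel_bce(probs, labels)
```

In a real system the `features` would come from a deep convolutional backbone; the point of the sketch is only the output structure: a per-pathology probability that can later be thresholded or fed into the AUC comparison described below.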
Partial Text: Chest radiography is the most common type of imaging examination in the world, with over 2 billion procedures performed each year. This technique is critical for screening, diagnosis, and management of thoracic diseases, many of which are among the leading causes of mortality worldwide. A computer system to interpret chest radiographs as effectively as practicing radiologists could thus provide substantial benefit in many clinical settings, from improved workflow prioritization and clinical decision support to large-scale screening and global population health initiatives.
The ROC curves for each of the pathologies on the validation set are illustrated in Fig 1, and AUCs with CIs are reported in Table 1; statistically significant differences in AUCs were assessed with Bonferroni-corrected CIs (coverage 1 − 0.05/14, correcting for the 14 comparisons). The CheXNeXt algorithm performed as well as the radiologists on 10 pathologies and better than the radiologists on 1 pathology. It achieved an AUC of 0.862 (95% CI 0.825–0.895) for atelectasis, statistically significantly higher than the radiologists' AUC of 0.808 (95% CI 0.777–0.838). The radiologists achieved statistically significantly higher AUCs on cardiomegaly, emphysema, and hiatal hernia, with AUCs of 0.888 (95% CI 0.863–0.910), 0.911 (95% CI 0.866–0.947), and 0.985 (95% CI 0.974–0.991), respectively, whereas CheXNeXt's AUCs were 0.831 (95% CI 0.790–0.870), 0.704 (95% CI 0.567–0.833), and 0.851 (95% CI 0.785–0.909), respectively. There were no statistically significant differences in the AUCs for the other 10 pathologies.
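The comparison above rests on two pieces of machinery: the empirical AUC (equivalent to the Mann–Whitney U statistic) and a CI for the AUC difference whose coverage is widened to 1 − 0.05/14 so that 14 simultaneous comparisons keep a family-wise error rate of 0.05. A minimal sketch of both follows; the percentile-bootstrap approach, the resample count, and the toy scores are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def auc(scores, labels):
    """Empirical AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs ranked correctly, with ties counting 0.5."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def bootstrap_auc_diff_ci(scores_a, scores_b, labels, n_boot=2000,
                          alpha=0.05, n_comparisons=14, seed=0):
    """Percentile-bootstrap CI for AUC(a) - AUC(b) on paired scores,
    with Bonferroni-adjusted coverage 1 - alpha/n_comparisons."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)           # resample cases with replacement
        if labels[idx].min() == labels[idx].max():
            continue                          # resample had only one class; AUC undefined
        diffs.append(auc(scores_a[idx], labels[idx]) - auc(scores_b[idx], labels[idx]))
    a = alpha / n_comparisons                 # Bonferroni: 0.05 / 14 here
    lo, hi = np.percentile(diffs, [100 * a / 2, 100 * (1 - a / 2)])
    return lo, hi                             # difference is significant if the interval excludes 0

# Toy example: two score sets on the same 6 cases (hypothetical numbers)
labels = np.array([0, 0, 0, 1, 1, 1])
scores_model = np.array([0.2, 0.3, 0.4, 0.7, 0.8, 0.9])  # perfectly ranked
scores_rads = np.array([0.3, 0.6, 0.2, 0.5, 0.9, 0.4])   # imperfect ranking
lo, hi = bootstrap_auc_diff_ci(scores_model, scores_rads, labels)
```

Because the coverage is 1 − 0.05/14 ≈ 99.6% rather than 95%, the interval is wider than an uncorrected one, which is why a per-pathology difference must be fairly large before it is declared significant.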
The results presented in this study demonstrate that deep learning can be used to develop algorithms that automatically detect and localize many pathologies in chest radiographs at a level comparable to practicing radiologists. Clinical integration of this system could transform patient care by decreasing time to diagnosis and increasing access to chest radiograph interpretation.
We present CheXNeXt, a deep learning algorithm that performs comparably to practicing board-certified radiologists in the detection of multiple thoracic pathologies in frontal-view chest radiographs. This technology may have the potential to improve healthcare delivery and increase access to chest radiograph expertise for the detection of a variety of acute diseases. Further studies are necessary to determine the feasibility of these outcomes in a prospective clinical setting.