Research Article: Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives

Date Published: February 15, 2018

Publisher: Public Library of Science

Author(s): Sebastian Gehrmann, Franck Dernoncourt, Yeran Li, Eric T. Carlson, Joy T. Wu, Jonathan Welt, John Foote, Edward T. Moseley, David W. Grant, Patrick D. Tyler, Leo A. Celi, Jen-Hsiang Chuang.


In secondary analysis of electronic health records, a crucial task is correctly identifying the patient cohort under investigation. In many cases, the most valuable and relevant information for an accurate classification of medical conditions exists only in clinical narratives. Therefore, it is necessary to use natural language processing (NLP) techniques to extract and evaluate these narratives. The most commonly used approach to this problem relies on extracting a number of clinician-defined medical concepts from text and using machine learning techniques to identify whether a particular patient has a certain condition. However, recent advances in deep learning and NLP enable models to learn a rich representation of (medical) language. Convolutional neural networks (CNNs) for text classification can augment the existing techniques by leveraging the representation of language to learn which phrases in a text are relevant for a given medical condition. In this work, we compare concept extraction based methods with CNNs and other commonly used models in NLP in ten phenotyping tasks using 1,610 discharge summaries from the MIMIC-III database. We show that CNNs outperform concept extraction based methods in almost all of the tasks, with improvements of up to 26 percentage points in F1-score and up to 7 percentage points in area under the ROC curve (AUC). We additionally assess the interpretability of both approaches by presenting and evaluating methods that calculate and extract the most salient phrases for a prediction. The results indicate that CNNs are a valid alternative to existing approaches in patient phenotyping and cohort identification, and should be further investigated. Moreover, the deep learning approach presented in this paper can be used to assist clinicians during chart review or support the extraction of billing codes from text by identifying and highlighting relevant phrases for various medical conditions.

Partial Text

The secondary analysis of data from electronic health records (EHRs) is crucial to better understand the heterogeneity of treatment effects and to individualize patient care [1]. With the growing adoption rate of EHRs [2], researchers gain access to rich data sets, such as the Medical Information Mart for Intensive Care (MIMIC) database [3, 4], and the Informatics for Integrating Biology and the Bedside (i2b2) datamarts [5–10]. These data sets can be explored and mined in numerous ways [11]. EHR data comprise both structured data, such as International Classification of Diseases (ICD) codes, laboratory results and medications, and unstructured data, such as clinician progress notes. While structured data do not require complex processing prior to statistical tests and machine learning tasks, the majority of data exist in unstructured form [12]. Natural language processing methods can extract these valuable data, which, in conjunction with the analysis of structured data, can lead to a better understanding of health and diseases [13] and to a more accurate phenotyping of patients to compare tests and treatments [14–16]. Patient phenotyping is a classification task for determining whether a patient has a medical condition or for pinpointing patients who are at risk of developing one. Further, intelligent applications for patient phenotyping can support clinicians by reducing the time they spend on chart reviews, which take up a significant fraction of their daily workflow [17, 18].

Accurate patient phenotyping is required for secondary analysis of EHRs to correctly identify the patient cohort and to better understand the clinical context [36, 37]. Studies employing a manual chart review process for patient phenotyping are naturally limited to a small number of preselected patients. Therefore, NLP is necessary to identify information that is contained in text but may be inconsistently captured in the structured data, such as recurrence in cancer [20, 38], whether a patient smokes [5], classification within the autism spectrum [39], or drug treatment patterns [40]. However, unstructured data in EHRs, for example progress notes or discharge summaries, are not typically amenable to simple text searches because of spelling mistakes, synonyms, and ambiguous terms [41]. To help address these issues, researchers utilize dictionaries and ontologies for medical terminologies such as the Unified Medical Language System (UMLS) [42] and the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) [43].
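To illustrate why a terminology dictionary helps where a plain text search fails, the following is a minimal sketch of dictionary-based concept lookup. The mapping, the function name extract_concepts, and the concept identifiers are hypothetical placeholders invented for this example; they are not real UMLS CUIs or SNOMED CT codes, and this is not the extraction pipeline used in the study.

```python
import re

# Hypothetical mini-dictionary: several surface forms (synonyms and one
# misspelling) map to one placeholder concept ID. These IDs are illustrative
# only and are NOT real UMLS CUIs or SNOMED CT codes.
SURFACE_TO_CONCEPT = {
    "smoker": "CONCEPT_TOBACCO_USE",
    "smokes": "CONCEPT_TOBACCO_USE",
    "tobacco use": "CONCEPT_TOBACCO_USE",
    "tobaco use": "CONCEPT_TOBACCO_USE",  # common misspelling
    "obese": "CONCEPT_OBESITY",
    "obesity": "CONCEPT_OBESITY",
}


def extract_concepts(note: str) -> set:
    """Return the placeholder concept IDs whose surface forms occur in the note."""
    text = note.lower()
    found = set()
    for surface, concept in SURFACE_TO_CONCEPT.items():
        # Word boundaries prevent matches inside longer, unrelated words.
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            found.add(concept)
    return found


if __name__ == "__main__":
    for note in [
        "Pt is a long-time smoker.",
        "History of tobaco use and morbid obesity.",
        "No acute distress.",
    ]:
        print(note, "->", sorted(extract_concepts(note)))
```

A literal search for "tobacco use" would miss the first two notes, whereas the lookup maps all variants to the same concept. The sketch ignores negation and ambiguity, so a phrase such as "denies tobacco use" would still match.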

We show an overview of the F1-scores for different models and phenotypes in Fig 2. For almost all phenotypes, the CNN outperforms all other approaches. For some of the phenotypes, such as Obesity and Psychiatric Disorders, the CNN outperforms the other models by a large margin. A χ² test confirms that the CNN’s improvements over both the filtered and the full cTAKES models are statistically significant at the 0.01 level. The filtered cTAKES model, which requires much more effort from clinicians, yields only a minimal improvement over the full cTAKES model. The χ² test indicates that this improvement is not statistically significant on our data (p = 0.86). We also note that the TF-IDF transformation of the CUIs yielded a small average improvement in AUC of 0.02 (σ = 0.03) over all the considered models.
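As an illustration of the TF-IDF transformation mentioned above, the following minimal sketch reweights a bag of extracted concept identifiers per note. It assumes the textbook weighting tf × log(N / df); the exact variant used in the study is not stated in this excerpt, and the function name tfidf and the identifiers C_A, C_B, C_C are hypothetical placeholders rather than real CUIs.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight concept counts by inverse document frequency.

    docs: list of notes, each a list of concept IDs (repeats allowed).
    Returns one {concept: weight} dict per note, using tf * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of notes in which each concept occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this note
        weighted.append(
            {c: count * math.log(n_docs / df[c]) for c, count in tf.items()}
        )
    return weighted


if __name__ == "__main__":
    # Placeholder concept IDs, not real CUIs.
    notes = [["C_A", "C_A", "C_B"], ["C_B"], ["C_B", "C_C"]]
    for weights in tfidf(notes):
        print(weights)
```

With this weighting, a concept that occurs in every note (C_B here) receives weight zero, while concepts concentrated in a few notes are emphasized, which is the intended effect of the transformation.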

Our results show that CNNs provide a valid alternative approach to the identification of patient conditions from text. However, we notice a strong variation in the results between phenotypes, with AUCs between 73 and 100 and F1-scores between 57 and 97, even with consistent annotation schemes. Some concepts, such as Chronic Pain, are especially challenging to detect, even with 321 positive examples in the data set. This makes it difficult to compare our results to other reported metrics in the literature, since studies typically consider different concepts for detection. This problem is further amplified by the scarcity of available studies that investigate unstructured data [13] and by the lack of standardized datasets for this task. We hope that the release of our annotations will support work towards more comparable performance evaluation in text-based phenotyping.
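For reference, the two metrics quoted above can be computed from binary labels and model outputs as in the following minimal sketch. The function names f1_score and roc_auc and the toy data are assumptions made for illustration; they are not the evaluation code of the study, and the values are returned on a 0 to 1 scale, whereas the text reports them multiplied by 100.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero/undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, invented for this example.
    y_true = [1, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 1]
    scores = [0.9, 0.4, 0.3, 0.6, 0.8]
    print("F1 :", f1_score(y_true, y_pred))
    print("AUC:", roc_auc(y_true, scores))
```

The pairwise AUC formulation is quadratic in the number of examples, which is acceptable for a sketch but slower than rank-based implementations on large data sets.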



