Research Article: Can machine-learning improve cardiovascular risk prediction using routine clinical data?

Date Published: April 4, 2017

Publisher: Public Library of Science

Author(s): Stephen F. Weng, Jenna Reps, Joe Kai, Jonathan M. Garibaldi, Nadeem Qureshi, Bin Liu.


Current approaches to predict cardiovascular risk fail to identify many people who would benefit from preventive treatment, while others receive unnecessary intervention. Machine-learning offers an opportunity to improve accuracy by exploiting complex interactions between risk factors. We assessed whether machine-learning can improve cardiovascular risk prediction.

This was a prospective cohort study using routine clinical data of 378,256 patients from UK family practices who were free from cardiovascular disease at outset. Four machine-learning algorithms (random forest, logistic regression, gradient boosting machines, neural networks) were compared to an established algorithm (American College of Cardiology guidelines) to predict a first cardiovascular event over 10 years. Predictive accuracy was assessed by the area under the receiver operating characteristic curve (AUC), and by sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) at the 7.5% cardiovascular risk threshold (the threshold for initiating statins).
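As a rough illustration of the two kinds of evaluation described above, the sketch below computes an AUC by pairwise comparison and applies a 7.5% risk cut-off. The `risks` and `events` values are hypothetical toy data, not study data, and treating a risk of exactly 7.5% as high-risk is an assumption made for the example.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen case receives a higher
    score than a randomly chosen non-case (ties count as half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    non_cases = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in non_cases)
    return wins / (len(cases) * len(non_cases))


def classify(scores, threshold=0.075):
    """Flag a patient as high-risk when predicted 10-year risk reaches the
    threshold (>= is an assumption of this sketch)."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical predicted 10-year risks and observed outcomes (1 = CVD event)
risks = [0.02, 0.05, 0.08, 0.10, 0.30, 0.04, 0.12, 0.06]
events = [0, 0, 0, 1, 1, 0, 1, 1]

print(auc(risks, events))   # 15 of 16 case/non-case pairs are ranked correctly
print(classify(risks))      # 1 marks patients at or above the 7.5% threshold
```

The AUC summarises ranking quality across all possible thresholds, whereas sensitivity, specificity, PPV and NPV depend on the single cut-off chosen.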

24,970 incident cardiovascular events (6.6%) occurred. Compared to the established risk prediction algorithm (AUC 0.728, 95% CI 0.723–0.735), machine-learning algorithms improved prediction (gains are absolute differences in AUC, expressed in percentage points): random forest +1.7% (AUC 0.745, 95% CI 0.739–0.750), logistic regression +3.2% (AUC 0.760, 95% CI 0.755–0.766), gradient boosting +3.3% (AUC 0.761, 95% CI 0.755–0.766), neural networks +3.6% (AUC 0.764, 95% CI 0.759–0.769). The best-performing algorithm (neural networks) correctly predicted 4,998/7,404 cases (sensitivity 67.5%, PPV 18.4%) and 53,458/75,585 non-cases (specificity 70.7%, NPV 95.7%), identifying 355 (+7.6%) more patients who developed cardiovascular disease than the established algorithm.
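The reported percentages can be reproduced from the counts given above. The following is a minimal arithmetic check, not the authors' analysis code. The variable names are illustrative, and the baseline count of correctly predicted cases is derived here by subtracting 355 from 4,998 rather than taken from the text.

```python
# Counts reported for the neural network
cases_total, cases_detected = 7404, 4998
non_cases_total, non_cases_excluded = 75585, 53458

tp = cases_detected                  # true positives
fn = cases_total - tp                # cases the model missed
tn = non_cases_excluded              # true negatives
fp = non_cases_total - tn            # non-cases flagged as high-risk

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)

# Derived, not reported: cases correctly predicted by the established algorithm
baseline_tp = tp - 355
relative_gain = 355 / baseline_tp

# Absolute AUC difference, neural network vs established algorithm
auc_gain = 0.764 - 0.728

for name, value in [("sensitivity", sensitivity), ("specificity", specificity),
                    ("PPV", ppv), ("NPV", npv),
                    ("relative gain in cases", relative_gain),
                    ("AUC gain", auc_gain)]:
    print(f"{name}: {100 * value:.1f}%")
```

The low PPV alongside the high NPV reflects the low event rate: most patients flagged at the 7.5% threshold do not go on to have an event, while most patients below it are correctly excluded.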

Machine-learning significantly improves accuracy of cardiovascular risk prediction, increasing the number of patients identified who could benefit from preventive treatment, while avoiding unnecessary treatment of others.

Partial Text

Globally, cardiovascular disease (CVD) is the leading cause of morbidity and mortality. In 2012, there were 17.5 million deaths from CVD, with 7.4 million deaths due to coronary heart disease (CHD) and 6.7 million deaths due to stroke [1]. Established approaches to CVD risk assessment, such as that recommended by the American College of Cardiology/American Heart Association (ACC/AHA), predict future risk of CVD based on well-established risk factors such as hypertension, cholesterol, age, smoking, and diabetes. These risk factors have recognised aetiological associations with CVD and feature within most CVD risk prediction tools (e.g. ACC/AHA [2], QRISK2 [3], Framingham [4], Reynolds [5]). There remain a large number of individuals at risk of CVD who fail to be identified by these tools, while some individuals not at risk are given preventive treatment unnecessarily. For instance, approximately half of myocardial infarctions (MIs) and strokes will occur in people who are not predicted to be at risk of cardiovascular disease [6].

Compared to an established ACC/AHA risk prediction algorithm, we found all machine-learning algorithms tested were better at identifying individuals who will develop CVD and those who will not. Unlike established approaches to risk prediction, the machine-learning methods used were not limited to a small set of risk factors, and incorporated more pre-existing medical conditions. Neural networks performed the best, with predictive accuracy improving by 3.6%. This is an encouraging step forward. By comparison, the addition of emerging biochemical risk factors, such as high sensitivity C-reactive protein, has recently achieved less than 1% improvement in CVD risk prediction [31].

Compared to an established risk prediction approach, this study has shown machine-learning algorithms are better at predicting the absolute number of cardiovascular disease cases correctly, whilst successfully excluding non-cases. This has been demonstrated in a large and heterogeneous primary care patient population using routinely collected electronic health data.



