Research Article: Evaluating the predictability of medical conditions from social media posts

Date Published: June 17, 2019

Publisher: Public Library of Science

Author(s): Raina M. Merchant, David A. Asch, Patrick Crutchley, Lyle H. Ungar, Sharath C. Guntuku, Johannes C. Eichstaedt, Shawndra Hill, Kevin Padrez, Robert J. Smith, H. Andrew Schwartz, Sreeram V. Ramagopalan.


We studied whether medical conditions across 21 broad categories were predictable from social media content across approximately 20 million words written by 999 consenting patients. Facebook language significantly improved upon the prediction accuracy of demographic variables for 18 of the 21 disease categories; it was particularly effective at predicting diabetes and mental health conditions including anxiety, depression and psychoses. Social media data are a quantifiable link into the otherwise elusive daily lives of patients, providing an avenue for study and assessment of behavioral and environmental disease risk factors. Analogous to the genome, social media data linked to medical diagnoses can be banked with patients’ consent, and an encoding of social media language can be used as a marker of disease risk, serve as a screening tool, and elucidate disease epidemiology. In what we believe to be the first report linking electronic medical record data with social media data from consenting patients, we identified that patients’ Facebook status updates can predict many health conditions, suggesting opportunities to use social media data to determine disease onset or exacerbation and to conduct social media-based health interventions.

Partial Text

Over two billion people regularly share information about their daily lives over social media, often revealing who they are, including their sentiments, personality, demographics, and population behavior [1–4]. Because such content is constantly being created outside the context of health care systems and clinical studies, it can reveal disease markers in patients’ daily lives that are otherwise invisible to clinicians and medical researchers.

We evaluated whether consenting patients’ Facebook posts could be used to predict the diagnoses evident in their electronic medical record (EMR). This study was approved by the University of Pennsylvania Institutional Review Board.

We identified that: 1) all 21 medical condition categories were predictable from Facebook language beyond chance (multi-test corrected p < .05), 2) 18 categories were better predicted from a combination of demographics and Facebook language than by demographics alone (multi-test corrected p < .05), and 3) 10 categories were better predicted by Facebook language than by the standard demographic factors (age, sex, and race). These results are depicted in Fig 2, which shows the accuracies of the three predictive models across all 21 diagnosis categories.
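
To illustrate what a "multi-test corrected p < .05" threshold involves when 21 categories are tested at once, the following is a minimal sketch of one common correction, the Benjamini-Hochberg step-up procedure. The text above does not name the correction that was used, so the choice of Benjamini-Hochberg is an assumption made for illustration; the function name `benjamini_hochberg` and the example p-values are likewise hypothetical and are not results from the study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag which hypotheses are rejected under the Benjamini-Hochberg
    step-up procedure at false discovery rate `alpha`.

    Assumption: this correction is used only as an illustration; the
    source text does not say which multi-test correction was applied.
    Returns a list of booleans in the same order as `p_values`.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis whose rank is at most max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Illustrative p-values only (not taken from the study).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(example_p, alpha=0.05)
print(sum(flags), "of", len(example_p), "tests pass after correction")
```

In this made-up example, five of the ten raw p-values fall below .05, but only the two smallest survive the correction, which is why a corrected threshold is a stricter claim than an uncorrected one.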