Date Published: June 7, 2018
Publisher: Public Library of Science
Author(s): Kelly Walters, Dimitri A. Christakis, Davene R. Wright, Arsham Alamian.
Amazon’s Mechanical Turk (MTurk) is frequently used to administer health-related surveys and experiments at a low cost, but little is known about its representativeness with regard to health status and behaviors.
A cross-sectional survey composed of questions from the nationally representative 2014 Behavioral Risk Factor Surveillance System (BRFSS) and 2014 National Health and Nutrition Examination Survey (NHANES) was administered to 591 MTurk workers and 393 masters in 2016. Health status (asthma, depression, BMI, and general health), health behaviors (influenza vaccination, health insurance, smoking, and physical activity), and demographic characteristics of the two MTurk populations (workers and masters) were compared to each other and, using Poisson regression, to nationally representative BRFSS and NHANES samples.
Worker and master demographics were similar. MTurk users were more likely to be aged under 50 years than the national sample (86% vs. 55%) and more likely to have completed a college degree (50% vs. 26%). Adjusting for covariates, MTurk users were less likely to be vaccinated for influenza, to smoke, to have asthma, to self-report being in excellent or very good health, to exercise, and to have health insurance, but over twice as likely to screen positive for depression, relative to a national sample. Results were fairly consistent among different age groups.
MTurk workers are not a generalizable population with regard to health status and behaviors; deviations did not follow a trend. Appropriate health-related uses for MTurk and ways to improve upon the generalizability of MTurk health studies are proposed.
Mechanical Turk is a crowdsourcing platform developed by Amazon through which, broadly speaking, “requestors” may hire “workers” to complete “human intelligence tasks” for a small cost. In this case, crowdsourcing refers to accomplishing a task by opening it up to the public. Given the diversity of the worker sample, the large number of workers, the quick turnaround time, and the low cost of work, an increasing number of researchers have identified Mechanical Turk (MTurk) workers as an effective participant pool for surveys. Evidence from behavioral science experiments conducted on MTurk suggests that workers can produce results that are just as valid and reliable as those from field and laboratory experiments.[2, 3] Quality is moderated by the requestor: after a worker submits a completed task (e.g., a survey), the requestor may approve or reject the work based on its quality. Amazon tracks workers’ rejection rates, which gives workers both motivation and accountability to perform their tasks diligently.
In total, 1,086 surveys were initiated, and 102 were dropped from analysis due to duplicate IP addresses (n = 25), incomplete surveys (n = 65), or failed attention checks (n = 12), leaving 984 individual respondents. There were no statistically significant differences detected between workers (n = 591) and masters (n = 393), except with respect to gender (workers = 41% female, masters = 48% female, p = 0.046). Thus, all subsequent analyses were performed grouping workers and masters together, hereafter collectively known as Turkers. The geographic distribution of respondents was similar to that of the US population, with California, Texas, Illinois, Florida, North Carolina, and New York having the highest proportions of respondents.
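The sample flow reported above can be checked with a few lines of arithmetic. The sketch below is only a consistency check on the counts given in the text; the variable names are illustrative and are not taken from the study's own analysis code.

```python
# Counts as reported in the text; names are illustrative only.
initiated = 1086
exclusions = {
    "duplicate_ip": 25,
    "incomplete_survey": 65,
    "failed_attention_check": 12,
}

excluded = sum(exclusions.values())   # total surveys dropped
analyzed = initiated - excluded       # respondents retained for analysis

groups = {"workers": 591, "masters": 393}

# The two MTurk groups should account for every retained respondent.
groups_match = sum(groups.values()) == analyzed

print(f"excluded={excluded}, analyzed={analyzed}, groups_match={groups_match}")
```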
Because Mechanical Turk has been shown to produce data quickly and at a low cost, it has been used extensively to conduct health research.[15–26] It is therefore important to understand any biases within this population and its external validity. This study found that Turkers were generally younger, of lower socioeconomic status, and less racially/ethnically diverse than the national population. These demographics are consistent with other recent studies examining Turkers crowdsourced as a study population.[3, 6, 27] But in this sample, Turkers differed significantly from nationally representative samples in almost every health-related variable that was measured, even after controlling for demographic covariates. Even if sample weights could be employed to make the demographics representative, Turkers’ health behaviors are not representative of the national population for the purposes of health research, independent of demographic differences, and MTurk surveys should be clear that findings are not generalizable. Most notable was the large difference in depressive symptoms between the two groups: Turkers were more than twice as likely to exhibit depressive symptoms as the national sample, and relative risks were >3 among 30–49 year olds. This is consistent with previous findings that Turkers are more likely to experience anxiety or depression than other traditional community or epidemiological samples. Researchers will therefore find a readily available population for depression-related health tasks, but publications should report on other sample health characteristics so readers can assess potential biases in data.
While MTurk may be an expedient means to recruit survey respondents, its workers are not a generalizable population with regard to health status and health behaviors. Sample weights may need to be employed in data analysis of MTurk surveys to ensure representativeness, but even if demographic representativeness is achieved, Turkers’ health behaviors and health status may not be representative of the U.S. population as measured by large national health surveillance surveys. In particular, our findings raise questions about the validity of MTurk surveys that relate to health conditions that affect older populations, which are not prevalent among MTurk workers.
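To make the idea of sample weights concrete, the following is a minimal sketch of simple post-stratification on a single variable, age. It uses the shares reported above (86% of MTurk respondents versus 55% of the national sample aged under 50). The 50-and-over shares are derived here as complements, and the function name and the two-stratum split are assumptions for illustration. The text does not say that this weighting scheme was applied in the study.

```python
def poststratification_weights(sample_share, population_share):
    """Return one weight per stratum: population share / sample share.

    Both arguments map stratum name -> proportion (each dict sums to 1).
    This is a hypothetical helper, not the study's method.
    """
    return {
        stratum: population_share[stratum] / sample_share[stratum]
        for stratum in sample_share
    }


# Under-50 shares come from the text; 50+ shares are their complements.
sample_share = {"under_50": 0.86, "50_plus": 0.14}
population_share = {"under_50": 0.55, "50_plus": 0.45}

weights = poststratification_weights(sample_share, population_share)

# After weighting, each stratum's share equals its population share.
weighted_share = {s: sample_share[s] * weights[s] for s in sample_share}

for stratum, w in weights.items():
    print(f"{stratum}: weight={w:.2f}, weighted share={weighted_share[stratum]:.2f}")
```

Under these assumptions, each respondent aged 50 or over would carry a weight above 3, so a small number of older respondents would stand in for a large share of the population. Weighting of this kind aligns only the demographic margins; it does not address the health differences that remained after adjusting for demographic covariates.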