Research Article: Measuring the Prevalence of Problematic Respondent Behaviors among MTurk, Campus, and Community Participants

Date Published: June 28, 2016

Publisher: Public Library of Science

Author(s): Elizabeth A. Necka, Stephanie Cacioppo, Greg J. Norman, John T. Cacioppo, Jelte M. Wicherts.


The reliance on small samples and underpowered studies may undermine the replicability of scientific findings. Large sample sizes may be necessary to achieve adequate statistical power. Crowdsourcing sites such as Amazon’s Mechanical Turk (MTurk) have been regarded as an economical means of achieving larger samples. Because MTurk participants may engage in behaviors that adversely affect data quality, much recent research has focused on assessing the quality of data obtained from MTurk samples. However, participants from traditional campus- and community-based samples may also engage in behaviors that adversely affect the quality of the data they provide. We compare MTurk, campus, and community samples to measure how frequently participants report engaging in problematic respondent behaviors. We report evidence suggesting that participants from all three samples engage in problematic respondent behaviors at comparable rates. Because statistical power is influenced by factors beyond sample size, including data integrity, methodological controls must be refined to better identify and reduce participant engagement in problematic respondent behaviors.

Partial Text

Concerns have been raised in recent years about the replicability of published scientific studies and the accuracy of reported effect sizes, which are often distorted as a function of underpowered research designs [1–4]. The typical means of increasing statistical power is to increase sample size. Although increasing sample size was once seen as an impractical solution due to funding, logistical, and time constraints, crowdsourcing websites such as Amazon’s Mechanical Turk (MTurk) are increasingly making this solution a reality. Within a day, data from hundreds of MTurk participants can be collected inexpensively (MTurk participants are customarily paid less than minimum wage; [5–9]). Further, data collected on MTurk have been shown to be generally comparable to data collected in the laboratory and the community for many psychological tasks, including cognitive, social, and judgment and decision-making tasks [10–13]. This has generally been taken as evidence that data from MTurk are of high quality, reflecting an assumption that laboratory-based data collection is a gold standard in scientific research.

Table 2 presents frequency estimates based on self-admission (FS condition) and assessments of other participants’ behavior (FO condition).

Underpowered research designs can misrepresent true effect sizes, making it difficult to replicate published research even when the reported results are true. Recognition of the costs of underpowered research designs has led to the sensible recommendation that scientists make sample size decisions with regard to statistical power (e.g., [38]). In response, many researchers have turned to crowdsourcing sites such as MTurk as an appealing solution to the need for larger samples in behavioral studies. MTurk appears to be a source of high-quality and inexpensive data, and effect sizes obtained in the laboratory are comparable to those obtained on MTurk. Yet this is seemingly inconsistent with reports that MTurk participants engage in behaviors that could reasonably be expected to adversely influence effect sizes, such as participant crosstalk (e.g., through forums) and participating in similar studies more than once. One possibility is that laboratory participants are equally likely to engage in behaviors that have troubling implications for the integrity of the data they provide.
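The link the authors draw between data integrity and statistical power can be illustrated with a back-of-the-envelope calculation (not taken from the article): under the classical measurement-error attenuation model, careless responding that lowers score reliability shrinks the observed standardized effect size by a factor of √reliability, which in turn inflates the sample size needed to reach a given power. The sketch below uses a standard normal approximation for a two-sided, two-sample comparison; the specific reliability value is an assumption chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample comparison,
    using the normal approximation n = 2 * ((z_alpha + z_beta) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

def attenuated_d(true_d, reliability):
    """Observed effect size under classical measurement-error attenuation:
    d_observed = d_true * sqrt(reliability)."""
    return true_d * sqrt(reliability)

# A medium effect (d = 0.5) measured cleanly, versus the same effect
# measured with noisy responding that cuts reliability to 0.64
# (an illustrative value, not an estimate from the article):
clean = required_n_per_group(0.5)                      # ~63 per group
noisy = required_n_per_group(attenuated_d(0.5, 0.64))  # ~99 per group
```

On this simplified model, the degraded measurement raises the required sample size by more than half, which is one way to make concrete the article's point that recruiting more participants does not by itself compensate for problematic respondent behavior.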