Date Published: February 26, 2018
Publisher: Public Library of Science
Author(s): Joseph M. Northrup, Brian D. Gerber, Yong Deng.
Understanding patterns of species occurrence and the processes underlying these patterns is fundamental to the study of ecology. One of the more commonly used approaches to investigate species occurrence patterns is occupancy modeling, which can account for imperfect detection of a species during surveys. In recent years, there has been a proliferation of Bayesian modeling in ecology, which includes fitting Bayesian occupancy models. The Bayesian framework is appealing to ecologists for many reasons, including the ability to incorporate prior information through the specification of prior distributions on parameters. While ecologists almost exclusively intend to choose priors so that they are “uninformative” or “vague”, such priors can easily be unintentionally highly informative. Here we report on how the specification of a “vague” normally distributed (i.e., Gaussian) prior on coefficients in Bayesian occupancy models can unintentionally influence parameter estimation. Using both simulated data and empirical examples, we illustrate how this issue likely compromises inference about species-habitat relationships. While the extent to which these informative priors influence inference depends on the data set, researchers fitting Bayesian occupancy models should conduct sensitivity analyses to ensure intended inference, or employ less commonly used priors that are less informative (e.g., logistic or t prior distributions). We provide suggestions for addressing this issue in occupancy studies, and an online tool for exploring this issue under different contexts.
Understanding species distributions and the environmental factors that influence occurrence is fundamental to ecology. Our knowledge of many well-studied topics in ecology, including niche partitioning, trophic interactions and metapopulation dynamics, depends on knowing which species occur in an area and why. Furthermore, occurrence patterns are critical for making conservation and management decisions; the placement of reserve boundaries and assessments of whether development will impact threatened and endangered species depend entirely on knowing whether a target species is present. Research on the patterns and drivers of species occurrence has been ongoing for many years, with major advancements over the past two decades. These advancements have stemmed from a combination of enhanced computational power, the advent of geographical information systems (GIS), and the development of a diversity of field-sampling and statistical modeling approaches that allow for detailed assessments of species-habitat relationships and the ensuing distribution patterns.
We first demonstrate the influence of a Normal prior by simulating example occupancy datasets (using ψ = 0.9, p = 0.2, n = 10, where n is the number of occasions) at a varying number of sites (50, 100, 200, and 400). For each dataset, we first fit the data in a maximum likelihood framework using the statistical program MARK via the R package ‘RMark’ in the programming language R. Next, to illustrate the influence of the prior relative to the likelihood, we fit these models in a Bayesian framework (the above model without covariates: logit(ψi) = α, with prior α ∼ Normal(0, τ)) using JAGS via the ‘rjags’ package (see S1 and S2 Files for example code). In JAGS, the uncertainty parameter for the Normal distribution is specified as the precision (τ), which is 1/σ², where σ is the standard deviation of the Normal distribution. We fit the Bayesian model with normally distributed priors with σ values of 0.25, 0.5, 1, 2, 5, 10, 100, 500, and 1,000; algorithms were run for 10,000 Markov chain Monte Carlo (MCMC) iterations, removing the first 5,000 as a period of burn-in. Lastly, we fit the same simulated data using the t and Logistic distributions and compared posterior distributions with maximum likelihood estimates (MLE). We investigated convergence in both paradigms by fitting the models with random initial values and checking for estimate consistency. The parameters from the Bayesian analysis were also investigated for convergence by visually examining posterior distribution trace plots to ensure proper mixing and by calculating the Gelman-Rubin diagnostic to ensure values were close to 1, which they always were. We compare the likelihood results, which are not influenced by the assumed prior distribution, with the Bayesian results by plotting posterior distributions of ψ for each dataset and prior, along with the MLE.
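The simulation design and the precision parameterization described above can be sketched in a few lines. The following Python code is only an illustration, not the S1 or S2 code, which uses R and JAGS. The function names, the seed, and the choice of 100 sites are assumptions made for the example; the parameter values ψ = 0.9, p = 0.2, and n = 10 come from the text.

```python
import math
import random


def sd_to_precision(sigma):
    """Convert a standard deviation to the precision JAGS expects (tau = 1 / sigma^2)."""
    return 1.0 / sigma ** 2


def simulate_detections(n_sites, psi, p, n_occasions, rng):
    """Return the number of detections at each site for constant psi and p."""
    counts = []
    for _ in range(n_sites):
        occupied = rng.random() < psi
        # An unoccupied site can never yield a detection.
        y = sum(rng.random() < p for _ in range(n_occasions)) if occupied else 0
        counts.append(y)
    return counts


def log_likelihood(alpha, p, counts, n_occasions):
    """Log-likelihood of the intercept-only model, logit(psi) = alpha.

    Binomial coefficients are omitted because they do not depend on alpha or p.
    """
    psi = 1.0 / (1.0 + math.exp(-alpha))
    total = 0.0
    for y in counts:
        if y > 0:
            total += (math.log(psi) + y * math.log(p)
                      + (n_occasions - y) * math.log(1.0 - p))
        else:
            # All-zero history: occupied but never detected, or not occupied.
            total += math.log(psi * (1.0 - p) ** n_occasions + (1.0 - psi))
    return total


rng = random.Random(1)  # arbitrary seed, used only for reproducibility
counts = simulate_detections(n_sites=100, psi=0.9, p=0.2, n_occasions=10, rng=rng)
print(len(counts), sd_to_precision(2.0))
```

A prior standard deviation of σ = 1,000 corresponds to a precision of 10⁻⁶, which is why very small values of τ in JAGS model code indicate a very diffuse Normal prior on the logit scale.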
Assuming convergence and a sufficiently large number of samples, the discrepancy between the posterior mode (i.e., most probable value) and the MLE is a consequence of the assumed prior. We note that our focus is different from many simulation studies, where the aim is to evaluate the discrepancies between estimated and true parameter values. Here, we are strictly interested in the unintended consequences of prior specification and its influence on parametric inference.
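To make the comparison between the posterior mode and the MLE concrete, the sketch below evaluates both on a grid of ψ values instead of using MCMC. It rests on several assumptions that are not part of the study: the detection counts are hypothetical, detection probability is treated as known (p = 0.2) so that the problem is one-dimensional, and only small to moderate σ values are used so that a simple grid is adequate. The prior on ψ is obtained from α = logit(ψ) ∼ Normal(0, σ) by a change of variables.

```python
import math

# Hypothetical detection counts for 50 sites surveyed on 10 occasions
# (illustrative values, not data from the study).
COUNTS = [0] * 10 + [1] * 11 + [2] * 14 + [3] * 9 + [4] * 4 + [5] * 2
N_OCCASIONS = 10
P_DETECT = 0.2  # simplifying assumption: detection probability treated as known


def log_likelihood(psi, counts=COUNTS, n=N_OCCASIONS, p=P_DETECT):
    """Log-likelihood of psi with p fixed, dropping terms that do not involve psi."""
    n_zero = sum(1 for y in counts if y == 0)
    n_detected = len(counts) - n_zero
    never_detected = psi * (1.0 - p) ** n + (1.0 - psi)
    return n_detected * math.log(psi) + n_zero * math.log(never_detected)


def log_prior_on_psi(psi, sigma):
    """Log-density on psi implied by logit(psi) ~ Normal(0, sigma), up to a constant."""
    alpha = math.log(psi / (1.0 - psi))
    # Change of variables: |d alpha / d psi| = 1 / (psi * (1 - psi)).
    return -0.5 * (alpha / sigma) ** 2 - math.log(psi * (1.0 - psi))


GRID = [i / 1000.0 for i in range(1, 1000)]  # psi = 0.001, ..., 0.999


def mle():
    """Grid value of psi that maximizes the likelihood."""
    return max(GRID, key=log_likelihood)


def posterior_mode(sigma):
    """Grid value of psi that maximizes likelihood times the induced prior."""
    return max(GRID, key=lambda psi: log_likelihood(psi) + log_prior_on_psi(psi, sigma))


for sigma in (0.5, 1.0, 2.0):
    print(sigma, posterior_mode(sigma), mle())
```

Because the likelihood and the grid are identical in both calculations, any gap between `posterior_mode(sigma)` and `mle()` in this sketch comes from the prior alone.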
For the simulated datasets with normally distributed priors, when the prior standard deviation was small (i.e., σ < 2) the posterior mode was always smaller than the MLE (Fig 3), as these priors drew the posterior towards a probability of 0.5. With a standard deviation of 2, the posterior mode was approximately the MLE. At intermediate values of the standard deviation (between 5 and 10), the posterior mode was close to the MLE, but the proximity was influenced by the number of surveyed sites (Fig 3). As σ became large (>100), the posterior became bimodal, with one mode close to the MLE and the other close to 1 (Fig 2). Importantly, having a large number of sampled sites only mitigated the influence of the prior in a relatively narrow band of values. Generally, the nature of the influence of the prior on the posterior and subsequent ecological inference depends on a combination of effects including: 1) the true underlying detection and occupancy probabilities; 2) the number of sampled sites; 3) the number of surveys per site; and, 4) the linear combination of coefficients and covariates. Importantly, the linear combination (i.e., α + xiβ′) is the quantity that is transformed, and thus in some cases very large magnitude values for coefficients, when combined with certain values of covariates, could lead to scenarios where the transformation is not impactful (e.g., when there is a strong effect of a covariate that ranges over a very small set of values). However, in other cases, the use of Normal prior distributions with a large σ can seriously affect parametric estimates. Occupancy models using the Logistic and t distribution priors estimated posterior modes corresponding to the MLE under all sample sizes (Fig 2).
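One way to see why a large σ can produce a mode near 1 is to compute how much prior probability a Normal(0, σ) prior on the logit scale places near the boundaries of ψ. The sketch below is an illustrative calculation, not part of the original analysis; the function name and the boundary width `eps = 0.01` are assumptions chosen for the example, while the σ values are those listed in the text.

```python
import math


def prior_mass_near_boundaries(sigma, eps=0.01):
    """P(psi < eps or psi > 1 - eps) when logit(psi) ~ Normal(0, sigma)."""
    cutoff = math.log((1.0 - eps) / eps)  # logit(1 - eps)
    # The two tails are symmetric, and 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)).
    return math.erfc(cutoff / (sigma * math.sqrt(2.0)))


for sigma in (0.25, 0.5, 1, 2, 5, 10, 100, 500, 1000):
    print(f"sigma = {sigma:>6}: {prior_mass_near_boundaries(sigma):.4f}")
```

Under this calculation the share of prior mass within 0.01 of either boundary grows steadily with σ, so a prior that looks flat on the logit scale is strongly concentrated at ψ ≈ 0 and ψ ≈ 1 on the probability scale.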
The results that we present above, combined with the potential prevalence of this issue in the literature, raise concerns about the inference made regarding species-habitat relationships and the resulting distribution patterns. Our literature review, though relatively basic, indicates that this issue might be widespread. Further, most species in a given area are rare, meaning that researchers are likely fitting models for species with little data (though this depends on the size of the sampling site relative to the species' distribution), which allows priors to be more influential. However, the true magnitude of the issue is unknown because the circumstances under which a seemingly uninformative prior distribution is in fact informative vary with the data. As illustrated in our empirical example, there are scenarios under which the specification of the prior will have negligible impact on inference. However, there will also be times when specifying a prior that does not impact inference is difficult and requires iterative model fitting: models are fit, posteriors are plotted to assess the potential influence of the prior, and the process is repeated until inference appears to be unaffected by the prior. The potential implications of this issue for conservation and management-based studies are significant. We note that camera trapping and occupancy models have become common for studying rare or cryptic, threatened and endangered species. In these studies, both sample sizes and detection probabilities tend to be low, two aspects that can lead to potential issues if informative priors are used. Even small overestimates in occupancy for such species can have major implications for conservation and management action.