Research Article: The curse of dimensionality: Animal-related risk factors for pediatric diarrhea in western Kenya, and methods for dealing with a large number of predictors

Date Published: April 26, 2019

Publisher: Public Library of Science

Author(s): Julianne Meisner, Stephen J. Mooney, Peter M. Rabinowitz, Mohammad Ali.


Pediatric diarrhea, a leading cause of under-five mortality, is predominantly infectious in etiology. As many putative causal agents are zoonotic, animal exposure is a likely risk factor. To evaluate the effect of animal-related factors on moderate to severe childhood diarrhea in rural Kenya, where animal contact is common, Conan et al. studied 73 matched case-control pairs from 2009-2011, collecting rich exposure data on many dimensions of animal contact. We review the challenges associated with analyzing moderately sized datasets with a large number of predictors and present two alternative methodological approaches.

We conducted a simulation study to demonstrate that forward stepwise selection results in overfit models when data are high-dimensional, and that p values reported directly from the data used to train these models are misleading. We described how automated methods of variable selection, attractive when the number of predictors is large, can result in overadjustment bias. We proposed an alternative a priori regression approach not subject to this bias. Applied to Conan et al.’s data, this approach found a non-significant but positive trend for household’s sharing of water sources with livestock or poultry, child’s presence for poultry slaughter, and child’s habit of playing where poultry sleep or defecate. For many of the predictors evaluated, few pairs were discordant, suggesting matching compromised the power of this analysis. Finally, we proposed latent variable modeling as a complementary approach and performed Item Response Theory modeling on Conan et al.’s data, with animal contact as the latent trait. We found a moderate but non-significant effect (OR 1.21, 95% CI 0.78, 1.87, unit = 1 standard deviation).

Automated methods of model selection are appropriate for prediction models when fit and evaluated on separate samples. However, when the goal is inference, these methods can produce misleading results. Furthermore, case-control matching should be done with caution.

Partial Text

Diarrheal disease is the leading cause of pediatric malnutrition and the second leading cause of under-five mortality, with over 1.7 billion childhood cases and more than 500,000 under-five deaths each year [1]. Pediatric diarrhea is predominantly caused by bacterial, viral, and parasitic infections, and animals are known reservoirs of several important diarrheal pathogens, including Campylobacter spp., Salmonella spp., and Escherichia coli [2]. In low-resource settings, where animal keeping is a predominant source of income, animal contact is common in both pediatric and adult populations.

R version 3.4.4 was used for all analyses [8].

Of the 624 variables in the cleaned dataset, 530 had a standard deviation greater than 0. Most variables with SD = 0 were pig-related variables, which were mostly missing.
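The screening step described above, dropping variables whose standard deviation is 0, can be sketched as follows. This is an illustration in Python rather than the R used for the study's analyses, and it is not the authors' code. The function name `nonconstant_columns`, the toy column names, and the rule that a variable with fewer than two observed values is dropped are all assumptions.

```python
import math
import statistics


def nonconstant_columns(table):
    """Return the names of columns whose observed (non-missing) values vary."""
    keep = []
    for name, values in table.items():
        # Treat None and NaN as missing (an assumption about the data encoding).
        observed = [v for v in values if v is not None and not math.isnan(v)]
        # Keep a column only if it has at least two observed values and SD > 0.
        if len(observed) >= 2 and statistics.pstdev(observed) > 0:
            keep.append(name)
    return keep


# Hypothetical toy data, not taken from the study.
toy = {
    "owns_poultry": [1, 0, 1, 1, 0],
    "owns_pigs": [0, None, None, 0, None],  # mostly missing, constant when observed
    "shares_water": [0, 0, 1, 0, 1],
}
kept = nonconstant_columns(toy)
print(kept)
```

A constant variable cannot separate cases from controls, so removing it loses no information and spares the later models an inestimable coefficient.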

Using simulation, we have demonstrated the risks of automated variable selection in datasets with a high P:N ratio, and the risks of presenting p values from a forward stepwise selected model without validation. Across the 100 simulated datasets, the variables selected for inclusion in each dataset varied markedly. Furthermore, under the null model, forward stepwise selection performed on a training dataset produced significant findings that were not replicated in a test dataset. We propose two alternative approaches appropriate to this setting: an a priori model selection approach and latent variable modeling.
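A minimal sketch of a null simulation of this kind is given below, in Python rather than the R used for the study's analyses. It is not the authors' simulation. The sample size of 146 (73 pairs), the 100 independent Gaussian predictors, the continuous outcome, the use of ordinary least squares, and the fixed five selection steps are all assumptions chosen for illustration. The outcome is generated independently of every predictor, so any apparent fit in the training data is noise.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def simulate_null(n=146, p=100, steps=5, seed=0):
    """Forward stepwise OLS on pure-noise data; returns (selected, R2 train, R2 test)."""
    rng = random.Random(seed)

    def draw():
        cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # outcome unrelated to predictors
        return cols, y

    x_tr, y_tr = draw()
    x_te, y_te = draw()

    # Centre both samples using TRAINING means only (this is the fitted intercept).
    mean_y = sum(y_tr) / n
    r_tr = [v - mean_y for v in y_tr]  # training residual
    r_te = [v - mean_y for v in y_te]  # test-sample prediction error
    for j in range(p):
        m = sum(x_tr[j]) / n
        x_tr[j] = [v - m for v in x_tr[j]]
        x_te[j] = [v - m for v in x_te[j]]

    tss_tr = dot(r_tr, r_tr)
    mean_te = sum(y_te) / n
    tss_te = sum((v - mean_te) ** 2 for v in y_te)

    selected, remaining = [], set(range(p))
    for _ in range(steps):
        # Greedy step: pick the candidate giving the largest drop in training RSS.
        best = max(
            remaining,
            key=lambda j: dot(x_tr[j], r_tr) ** 2 / dot(x_tr[j], x_tr[j]),
        )
        q_tr, q_te = x_tr[best], x_te[best]
        qq = dot(q_tr, q_tr)
        gamma = dot(q_tr, r_tr) / qq  # coefficient estimated on training data
        r_tr = [r - gamma * q for r, q in zip(r_tr, q_tr)]
        r_te = [r - gamma * q for r, q in zip(r_te, q_te)]
        remaining.discard(best)
        # Residualise the remaining candidates on the chosen one, applying the
        # training-derived coefficients to the test sample as well.
        for k in remaining:
            beta = dot(x_tr[k], q_tr) / qq
            x_tr[k] = [a - beta * q for a, q in zip(x_tr[k], q_tr)]
            x_te[k] = [a - beta * q for a, q in zip(x_te[k], q_te)]
        selected.append(best)

    r2_train = 1 - dot(r_tr, r_tr) / tss_tr
    r2_test = 1 - dot(r_te, r_te) / tss_te
    return selected, r2_train, r2_test


selected, r2_train, r2_test = simulate_null()
print("selected predictors:", selected)
print("training R^2:", round(r2_train, 3), "test R^2:", round(r2_test, 3))
```

Because the selection step searches 100 noise variables for the best fit, the training R² is positive by construction, while the same fitted model applied to an independent sample explains nothing. Re-running with different `seed` values also changes which predictors are chosen, echoing the instability across simulated datasets described above.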

Rich datasets, like that generated by the GEMS-ZED substudy, provide the opportunity to answer many research questions in a single analysis. When the data were collected on a vulnerable or hard-to-access population, or the research questions are particularly sensitive, this opportunity should not be overlooked. However, there are analytical challenges attendant to the analysis of these datasets, which will not present themselves as errors in statistical software. To ensure that the scientific value of such analyses, and thus their public health impact, is optimized, we propose an a priori approach to variable selection, complemented by latent variable modeling if it is reasonable to hypothesize the presence of a latent trait and its effects are of scientific interest. We also urge caution in the selection of matching variables; in particular, matching variables must be strong risk factors for the outcome of interest (determined by subject-matter knowledge), and care should be taken before matching on variables that may be strongly associated with the exposure of interest. If such variables are truly strong confounders, sample size should be increased accordingly, if at all possible.
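The cost of matching can be illustrated with the standard conditional odds ratio for 1:1 matched pairs and a binary exposure, which depends only on discordant pairs (pairs in which exactly one member was exposed). The sketch below is in Python and is not the authors' analysis. The function name `matched_pair_or`, the Wald-type interval, and the counts are hypothetical and chosen only to show how the interval widens when few pairs are discordant.

```python
import math


def matched_pair_or(case_only, control_only, z=1.96):
    """Conditional odds ratio and Wald interval for 1:1 matched pairs.

    case_only:    pairs in which only the case was exposed
    control_only: pairs in which only the control was exposed
    Concordant pairs do not enter the estimate at all.
    """
    if case_only <= 0 or control_only <= 0:
        raise ValueError("both discordant counts must be positive")
    or_hat = case_only / control_only
    se_log = math.sqrt(1 / case_only + 1 / control_only)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Hypothetical counts out of 73 pairs: same point estimate, different information.
few = matched_pair_or(8, 4)     # 12 discordant pairs
many = matched_pair_or(32, 16)  # 48 discordant pairs
print("few discordant pairs: ", [round(v, 2) for v in few])
print("many discordant pairs:", [round(v, 2) for v in many])
```

When a matching variable is strongly associated with the exposure, cases and controls tend to share exposure status. Most pairs are then concordant and contribute nothing to the estimate, which is why a larger sample is needed to recover the lost precision.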