Research Article: Efficient estimation of generalized linear latent variable models

Date Published: May 1, 2019

Publisher: Public Library of Science

Author(s): Jenni Niku, Wesley Brooks, Riki Herliansyah, Francis K. C. Hui, Sara Taskinen, David I. Warton, Jin Li.


Generalized linear latent variable models (GLLVMs) are popular tools for modeling multivariate, correlated responses. Such data are often encountered, for instance, in ecological studies, where presence-absences, counts, or biomass of interacting species are collected from a set of sites. Until very recently, the main challenge in fitting GLLVMs has been the lack of computationally efficient estimation methods. For likelihood-based estimation, several closed-form approximations for the marginal likelihood of GLLVMs have been proposed, but efficient implementations of them have been lacking in the literature. To fill this gap, we show in this paper how to obtain computationally convenient estimation algorithms by combining either the Laplace approximation or the variational approximation with automatic optimization techniques implemented in R software. An extensive set of simulation studies is used to assess the performance of the different methods, and shows that the variational approximation used in conjunction with automatic optimization offers a powerful tool for estimation.

Partial Text

High-dimensional multivariate abundance data, which consist of records (e.g. species counts, presence-absence records, and biomass) of a large number of interacting species at a set of units or sites, are routinely collected in ecological studies. When analyzing multivariate abundance data, the interest is often in visualizing correlation patterns across species, testing hypotheses about environmental effects, and predicting abundances. Classical methods for analyzing such data, including algorithm-based approaches such as non-metric multidimensional scaling (nMDS) and correspondence analysis (CA), are based on distance matrices computed from some pre-specified dissimilarity measure [1]. As such, they often make incorrect assumptions about key properties of the data at hand (e.g. the mean-variance relationship), which can lead to misleading inferential results [2, 3].

Consider a sample of observations consisting of responses for m species collected at n sites, such that yij denotes the response for species j = 1, …, m at site i = 1, …, n. A generalized linear latent variable model (GLLVM) regresses the mean response, denoted here as μij, against a vector of d ≪ m latent variables, ui = (ui1, …, uid)′, along with the vector of covariates xi = (xi1, …, xik)′. That is,
    g(μij) = αi + β0j + xi′βj + ui′γj,

where g(·) is a known link function, αi is an optional site effect, β0j is a species-specific intercept, and βj and γj are vectors of species-specific coefficients related to the covariates and latent variables, respectively. It is the term ui′γj that captures the residual correlation across species not accounted for by the observed covariates xi. Moreover, a key advantage of this type of model is that it handles correlation across response variables flexibly and parsimoniously: the number of parameters characterizing the correlation structure grows only linearly in the number of responses m. This allows GLLVMs to be feasibly fitted to datasets with relatively large m, as often arises in practice [8].
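
To make the parsimony argument concrete, the following minimal Python sketch evaluates the linear predictor for one site-species pair and contrasts the number of loadings in a d-dimensional latent structure with the number of free entries in an unstructured m × m covariance matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names, the log link, the standard normal latent variables, and all numeric values are hypothetical, and identifiability constraints on the loadings (which would lower the count slightly) are ignored.

```python
import math

def linear_predictor(beta0_j, beta_j, gamma_j, x_i, u_i):
    """Intercept + x_i' beta_j + u_i' gamma_j for one site i and species j.

    beta0_j is a species-specific intercept, included for completeness.
    """
    fixed = sum(x * b for x, b in zip(x_i, beta_j))
    latent = sum(u * g for u, g in zip(u_i, gamma_j))
    return beta0_j + fixed + latent

def residual_covariance(gamma):
    """Gamma Gamma': covariance of u_i' gamma_j across species on the
    linear predictor scale, assuming u_i ~ N(0, I) (an assumption)."""
    m = len(gamma)
    return [[sum(a * b for a, b in zip(gamma[j], gamma[k])) for k in range(m)]
            for j in range(m)]

def loading_count(m, d):
    # One loading per species and latent dimension: linear in m.
    return m * d

def unstructured_count(m):
    # Free entries of a symmetric m x m covariance matrix: quadratic in m.
    return m * (m + 1) // 2

# Hypothetical loadings for m = 3 species and d = 2 latent variables.
gamma = [[1.0, 0.0], [0.5, 0.5], [-0.3, 0.8]]

eta = linear_predictor(0.2, [0.4], gamma[1], x_i=[1.5], u_i=[0.6, -1.0])
mu = math.exp(eta)  # mean response under a log link (assumed here)
cov = residual_covariance(gamma)

print("eta:", eta, "mu:", mu)
print("residual covariance:", cov)
for m in (10, 100, 1000):
    print(m, loading_count(m, 2), unstructured_count(m))
```

The final loop shows how the gap widens as m increases: the loading count grows in proportion to m, while the unstructured count grows with m squared.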

Two advances are made in this paper, which enable faster and more reliable fitting of GLLVMs than previous implementations of the Laplace or variational approximations. First, we implement the estimation algorithms using the automatic differentiation provided by the TMB package [22]. Second, we make strategic choices for the starting values of the parameters in the GLLVM, in order to improve the speed and stability of the estimation algorithms. Our simulations presented later demonstrate that these changes improve speed by an order of magnitude, and improve reliability by increasing the accuracy of the estimates.
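
The idea behind automatic differentiation can be illustrated with a short Python sketch. It uses forward-mode dual numbers to differentiate a single-covariate Poisson log-likelihood (log link, constant term omitted) and compares the result with the hand-derived gradient. This is only a toy under stated assumptions: the `Dual` class, the function names, and the data are hypothetical, and the sketch does not describe how TMB works internally.

```python
import math

class Dual:
    """Number val + der * eps with eps**2 = 0; der carries the derivative."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.val - o.val, self.der - o.der)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule.
        return Dual(self.val * o.val, self.val * o.der + self.der * o.val)
    __rmul__ = __mul__

def dexp(z):
    """exp for dual numbers (chain rule)."""
    z = Dual.lift(z)
    e = math.exp(z.val)
    return Dual(e, e * z.der)

def poisson_loglik(b, xs, ys):
    """Sum of y * eta - exp(eta) with eta = b * x; log(y!) is omitted."""
    total = Dual(0.0)
    for x, y in zip(xs, ys):
        eta = b * x
        total = total + y * eta - dexp(eta)
    return total

def manual_gradient(b, xs, ys):
    """Hand-derived derivative of the same log-likelihood with respect to b."""
    return sum(x * (y - math.exp(b * x)) for x, y in zip(xs, ys))

# Hypothetical data.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1, 2, 2, 5]
b = 0.7

# Seeding der = 1.0 asks for the derivative with respect to b.
ad = poisson_loglik(Dual(b, 1.0), xs, ys)
print("automatic:", ad.der, "manual:", manual_gradient(b, xs, ys))
```

The point of the comparison is that only the likelihood itself had to be written down; the derivative is obtained by propagating derivative information through each arithmetic operation.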

We performed a series of simulation studies to compare model fitting algorithms along three dimensions: with and without automatic differentiation via TMB, using either the Laplace approximation or the variational approximation, and under different starting value strategies (res, res3, zero, random). For the fitting algorithms without automatic differentiation, we implemented both the Laplace and variational approximations in plain R code, manually defining their respective approximate likelihoods and gradient functions. Details of the simulation design are given below.
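
As a rough illustration of what a manually defined approximate likelihood involves, the Python sketch below applies the Laplace approximation to the marginal log-likelihood of a single site with one latent variable, and checks it against brute-force numerical integration. It is a sketch under stated assumptions rather than the implementation compared in the simulations: Poisson responses with a log link, a standard normal latent variable, no covariates, and all names and numeric values are hypothetical.

```python
import math

def log_joint(u, y, beta0, gamma):
    """log p(y | u) + log p(u): Poisson responses, log link, u ~ N(0, 1)."""
    total = -0.5 * u * u - 0.5 * math.log(2 * math.pi)
    for yj, b0, g in zip(y, beta0, gamma):
        eta = b0 + g * u
        total += yj * eta - math.exp(eta) - math.lgamma(yj + 1)
    return total

def grad_hess(u, y, beta0, gamma):
    """First and second derivatives of log_joint with respect to u."""
    mu = [math.exp(b0 + g * u) for b0, g in zip(beta0, gamma)]
    grad = sum(g * (yj - m) for g, yj, m in zip(gamma, y, mu)) - u
    hess = -sum(g * g * m for g, m in zip(gamma, mu)) - 1.0
    return grad, hess

def laplace_log_marginal(y, beta0, gamma, iters=50):
    """log p(y) ~ h(u_hat) + 0.5 * log(2 * pi) - 0.5 * log(-h''(u_hat))."""
    u = 0.0
    for _ in range(iters):  # Newton iterations for the mode u_hat
        grad, hess = grad_hess(u, y, beta0, gamma)
        step = grad / hess
        u -= step
        if abs(step) < 1e-12:
            break
    _, hess = grad_hess(u, y, beta0, gamma)
    value = log_joint(u, y, beta0, gamma)
    return value + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(-hess), u

def numeric_log_marginal(y, beta0, gamma, lo=-10.0, hi=10.0, n=20000):
    """Trapezoidal integration of exp(log_joint) over u, as a reference."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        w = 0.5 if k in (0, n) else 1.0
        total += w * math.exp(log_joint(lo + k * h, y, beta0, gamma))
    return math.log(total * h)

# Hypothetical counts, intercepts and loadings for m = 3 species at one site.
y = [3, 2, 1]
beta0 = [0.5, 1.0, 0.0]
gamma = [0.8, -0.4, 0.3]

laplace_value, u_hat = laplace_log_marginal(y, beta0, gamma)
numeric_value = numeric_log_marginal(y, beta0, gamma)
print("Laplace:", laplace_value, "numerical:", numeric_value)
```

In a full model the latent variable is d-dimensional and the integral has to be approximated at every site, which is why both the hand-coded gradients and the starting values matter for speed.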

In this article, we studied two closed-form approximations (the Laplace approximation and the variational approximation) for the marginal log-likelihood of a generalized linear latent variable model. We showed how the closed-form approximations can be implemented efficiently using automatic optimization techniques implemented in R with the help of the package TMB. In addition, a new method for choosing the starting values for our estimation algorithms was proposed. The performance of the two approximation methods and the different starting value strategies was compared using several simulation studies for overdispersed count and binary data, which are often encountered in biological and ecological studies. Results indicated that for both response types the variational approximation implementations tended to outperform the Laplace approximation implementations, in terms of computation speed as well as estimation and inferential accuracy. These findings are consistent with the results of Hui et al. [7], where the performance of the variational approximation method was compared to the Laplace approximation method and the MCEM algorithm for count and binary data, and also to Gauss-Hermite quadrature in the case of binary data. However, more comprehensive comparisons between the variational approximation method and other estimation methods, e.g. Gauss-Hermite quadrature, would be useful and interesting in the future.
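
For readers unfamiliar with the variational approximation, the following Python sketch shows why it yields a closed form in a simple case. With a Gaussian variational density q(u) = N(a, A) and Poisson responses under a log link, the expectation of exp(η) under q is available analytically, so the lower bound on the marginal log-likelihood needs no numerical integration. The sketch is an illustration under stated assumptions, not the method as implemented in the paper: a single site, one standard normal latent variable, Poisson rather than overdispersed counts, a crude grid search in place of a proper optimizer, and hypothetical names and values throughout.

```python
import math

def elbo(a, A, y, beta0, gamma):
    """Variational lower bound on log p(y) with q(u) = N(a, A).

    Uses E_q[exp(b0 + g * u)] = exp(b0 + g * a + 0.5 * g**2 * A).
    """
    # E_q[log p(u)] plus the entropy of q; the log(2 * pi) terms cancel.
    total = -0.5 * (a * a + A) + 0.5 * math.log(A) + 0.5
    for yj, b0, g in zip(y, beta0, gamma):
        mean_term = math.exp(b0 + g * a + 0.5 * g * g * A)
        total += yj * (b0 + g * a) - mean_term - math.lgamma(yj + 1)
    return total

def numeric_log_marginal(y, beta0, gamma, lo=-10.0, hi=10.0, n=20000):
    """Reference value of log p(y) by trapezoidal integration over u."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        u = lo + k * h
        log_f = -0.5 * u * u - 0.5 * math.log(2 * math.pi)
        for yj, b0, g in zip(y, beta0, gamma):
            eta = b0 + g * u
            log_f += yj * eta - math.exp(eta) - math.lgamma(yj + 1)
        total += (0.5 if k in (0, n) else 1.0) * math.exp(log_f)
    return math.log(total * h)

def best_elbo(y, beta0, gamma):
    """Crude grid search over the variational mean a and variance A."""
    best = (-math.inf, None, None)
    for i in range(201):
        a = -2.0 + 0.02 * i
        for k in range(1, 101):
            A = 0.02 * k
            value = elbo(a, A, y, beta0, gamma)
            if value > best[0]:
                best = (value, a, A)
    return best

# Hypothetical counts, intercepts and loadings for m = 3 species at one site.
y = [3, 2, 1]
beta0 = [0.5, 1.0, 0.0]
gamma = [0.8, -0.4, 0.3]

bound, a_hat, A_hat = best_elbo(y, beta0, gamma)
exact = numeric_log_marginal(y, beta0, gamma)
print("lower bound:", bound, "at a =", a_hat, "A =", A_hat)
print("numerical log-likelihood:", exact)
```

Every choice of (a, A) gives a valid lower bound, so maximizing the bound jointly over the variational and model parameters is what turns it into an estimation criterion.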



