Introduction
Geostatistics refers to the study of a spatially distributed variable of interest, which in theory is defined at every point over a bounded study region of interest. Statistical modelling and analysis for spatially oriented point-referenced outcomes play a crucial role in diverse scientific applications such as earth and environmental sciences, ecology, epidemiology, and economics. With the advent of Markov chain Monte Carlo (MCMC) algorithms, Bayesian hierarchical models have gained massive popularity in analyzing such point-referenced or, geostatistical data. These models involve latent spatial processes characterized by spatial process parameters, which besides lacking substantive relevance in scientific contexts, are also weakly identified and hence, impedes convergence of MCMC algorithms. Thus, even for moderately large datasets (~ or higher), the computation for MCMC becomes too onerous for practical use.
We introduce the R package spStack
that implements
Bayesian inference for a class of geostatistical models, where we
obviate the issues mentioned by sampling from analytically available
posterior distributions conditional upon some candidate values of the
spatial process parameters and, subsequently assimilate inference from
these individual posterior distributions using Bayesian predictive
stacking. Besides delivering competitive predictive performance as
compared to fully Bayesian inference using MCMC, our proposed algorithm
is embarrassingly parallel, thus drastically improves runtime and
elevating the utility of the package for a diverse group of
practitioners with limited computational resources at their disposal.
This package, to the best of our knowledge, is the first to implement
stacking for Bayesian analysis of spatial data.
Technical details surrounding the methodology can be found in the
articles Zhang, Tang, and Banerjee (2024) which discuss the case
where the distribution of the point-referenced outcomes are Gaussian,
and, in Pan et al. (2024) where the case of
non-Gaussian outcomes is explored. The code for this package is written
primarily in C/C++ with additional calls to FORTRAN routines for
optimized linear algebra operations. We leverage the
F77_NAME
macro to interface with legacy FORTRAN functions
in conjunction with efficient matrix computation libraries such as BLAS (Basic Linear Algebra
Subprograms) and LAPACK (Linear
Algebra Package) to implement our stacking algorithm.
The remainder of the vignette evolves as follows - the next two sections discuss Bayesian hierarchical spatial models for Gaussian and non-Gaussian outcomes, which is followed by a section providing brief details on predictive stacking and a section dedicated for illustration of functions in the package.
Bayesian Gaussian spatial regression models
Let be a be a set of spatial locations yielding measurements with known values of predictors at these locations collected in the full rank matrix . A customary geostatistical model is where is the vector of slopes, is a zero-centered spatial Gaussian process on with spatial correlation function characterized by process parameters , is the spatial variance parameter (“partial sill”) and are i.i.d. with variance (“nugget”) capturing measurement error. The spatial process is assumed to be independent of the measurement errors . Let denotes the realization of the spatial process on and correlation matrix . We build a conjugate Bayesian hierarchical spatial model, where we fix the noise-to-spatial variance ratio , the process parameters and the hyperparameters , , and . In this package, we use the Matern covariogram specified by spatial decay parameter and smoothness parameter i.e., , given by We utilize a composition sampling strategy to sample the model parameters from their joint posterior distribution which can be written as We proceed by first sampling from its marginal posterior, then given the samples of , we sample and subsequently, we sample conditioned on the posterior samples of and (Banerjee 2020). More details can be found in Zhang, Tang, and Banerjee (2024).
The function spLMexact()
delivers samples from this
posterior distribution conditional on fixed hyperparameters. For
predictive stacking, use the function spLMstack()
.
Bayesian non-Gaussian spatial regression models
Analyzing non-Gaussian spatial data typically requires introducing spatial dependence in generalized linear models through the link function of an exponential family distribution. Let be the outcome at location endowed with a probability law from the natural exponential family, which we denote by for some positive parameter and unit log partition function . Fixed effects regression and spatial dependence, e.g., , is introduced in the natural parameter, where is a vector of predictors referenced with respect to , is a vector of slopes measuring the trend, is a zero-centered spatial process on specified by a scale parameter and a spatial correlation function with consisting of spatial-temporal decay and smoothness parameters.
Unlike in Gaussian likelihoods, inference is considerably encumbered by the inability to analytically integrate out the random effects and reduce the dimension of the parameter space. Iterative algorithms such as Markov Chain Monte Carlo (MCMC), thus attempt to sample from a very high-dimensional posterior distribution, and convergence is often hampered by high auto-correlations and weakly identified spatial process parameters .
This model is implemented using the function
spGLMexact()
when using fixed hyperparameters, and
spGLMstack()
when using predictive stacking.
We consider the following three cases -
Point-referenced Poisson count data: Here and . This is accessed by setting
family = "poisson"
in the above functions.Point-referenced binomial count data: Here for each and . This is accessed by setting
family = "binomial"
in the above functions.Point-referenced binary data: Here and . This is accessed by setting
family = "binary"
in the above functions.
Following Bradley and Clinch (2024), we introduce a Bayesian hierarchical spatial model as where denotes the discrepancy parameter. We fix the spatial process parameters , the boundary adjustment parameter and the hyperparameters , , and . The term is known as the fine-scale variation term which is given a conditional generalized conjugate multivariate distribution () as prior. For details, see Pan et al. (2024).
Bayesian non-Gaussian spatial-temporal regression model
We consider a rich family of Bayesian spatial-temporal model with spatially-temporally varying regression coefficients. Suppose , with location and time , denote a spatial-temporal coordinate in .
Let be a fixed set of distinct space-time coordinates in , where , which we simply denote by , is the vector of observed outcomes, each distributed as a member of the natural exponential family with log partition function . Suppose, is a vector of predictors, is the corresponding vector of slopes (fixed effects), is () consisting of predictors in that are posited to have spatially-temporally varying regression coefficients , where each is a spatially-temporally varying coefficient for the predictor , is a fine-scale variation term and is the discrepancy parameter (see above). We introduce spatially-temporally varying coefficients in as where is a multivariate Gaussian process with a separable cross-covariance function , characterized by process parameters which controls the within-process spatial-temporal correlation, and which controls the between-process covariance matrix . Given , the vector , where for , follows a multivariate Gaussian distribution with mean and covariance matrix .
This model is implemented by the function stvcGLMexact()
under fixed hyperparameters, and stvcGLMstack()
when using
predictive stacking. We also implement different specifications for
in this package as follows.
Independent spatial-temporal process: We consider Gaussian spatial-temporal processes where is the variance parameter corresponding to process . This corresponds to the covariance matrix with , where denotes covariance kernel parameters for the th process, and . This is accessed by setting the option
process.type = "independent"
in the above functions.Independent shared spatial-temporal process: This corresponds to the above with and for all . This is accessed by setting the option
process.type = "independent.shared"
in the above functions.Multivariate spatial-temporal process: We can introduce dependence among the elements of the vector using where is an multivariate Gaussian process with matrix-valued cross-covariance function and is an positive definite random matrix. This corresponds to the spatial-temporal covariance matrix with . We place an inverse-Wishart prior on the scale parameter with shape and positive definite scale matrix , given by . This is accessed by setting the option
process.type = "multivariate"
in the above functions.
Predictive stacking
Following Yao et al. (2018), we consider a set of candidate models based on a grid of values of the parameters in for the Gaussian case, and for the non-Gaussian case, as will be supplied by the user. We build a set of candidate models based on the Cartesian product of the collection of values for each individual parameter as . Then, for each , we sample from the posterior distribution under the model and find leave-one-out predictive densities . Then we solve the optimization problem to find the optimal stacking weights . After obtaining the optimal stacking weights, posterior inference of any quantity of interest subsequently proceed from the stacked posterior,