
Sparse Gaussian Processes

One central limitation of Gaussian Processes (GPs) is their poor scaling with the number of datapoints. GPs incur a cubic \(\mathcal{O}(N^3)\) time complexity in the number of datapoints \(N\), both for computing the posterior and making predictions, and for evaluating the marginal likelihood used to train the model's hyperparameters. This is because both quantities involve the inverse of an \(N \times N\) covariance matrix. If we want to scale GPs to larger datasets, we cannot account for correlations between all pairs of datapoints, because of the cubic cost of inverting the associated covariance matrix. It is clear that we will have to give up some aspect of the true model in order to scale GPs to larger datasets.
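To make the cubic cost concrete, recall the standard GP regression expressions (introducing notation not used above: \(K\) for the \(N \times N\) covariance matrix of the training inputs, \(\sigma^2\) for the observation noise variance, \(\mathbf{y}\) for the targets and \(\mathbf{k}_*\) for the covariances between a test input and the training inputs):

\[
\log p(\mathbf{y}) = -\tfrac{1}{2}\,\mathbf{y}^\top (K + \sigma^2 I)^{-1}\mathbf{y} - \tfrac{1}{2}\log\lvert K + \sigma^2 I\rvert - \tfrac{N}{2}\log 2\pi,
\qquad
\mathbb{E}[f_* \mid \mathbf{y}] = \mathbf{k}_*^\top (K + \sigma^2 I)^{-1}\mathbf{y}.
\]

Both expressions require the inverse (in practice, a Cholesky factorisation) of the \(N \times N\) matrix \(K + \sigma^2 I\), which costs \(\mathcal{O}(N^3)\) time and \(\mathcal{O}(N^2)\) memory.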

One idea for achieving this is to make approximations to the structure of the covariance matrix, such as an independence assumption between certain variables, in order to make the matrix inversion cheaper. Over the years, numerous approaches have been devised for scaling GPs by making an approximation that sparsifies the covariance matrix and leveraging this sparse structure to train, predict and sample faster. Examples of such early approximations are the Deterministic Training Conditionals (DTC) [SWL03], Fully Independent Training Conditionals (FITC) [SG05] and Partially Independent Training Conditionals (PITC) [QCR05] approximations. These methods can be viewed from a unified perspective [QCR05] as replacing the exact GP model with an approximate model whose sparse covariance structure, imposed by a set of inducing (or pseudo-) points, makes training and prediction cheaper; the sketch below illustrates the computational saving.
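As a rough illustration of how a sparse covariance structure pays off, the following sketch (plain NumPy, not the API of any particular GP library) forms the rank-\(M\) approximation \(Q_{NN} = K_{NM} K_{MM}^{-1} K_{MN}\) that DTC, FITC and PITC all build on, and uses the Woodbury identity to solve the resulting linear system in \(\mathcal{O}(NM^2 + M^3)\) rather than \(\mathcal{O}(N^3)\). The kernel, inducing inputs and noise level are arbitrary choices for the example.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def low_rank_solve(X, y, Z, noise_var=0.1):
    """Solve (Qnn + noise_var * I) alpha = y, where Qnn = Knm Kmm^{-1} Kmn
    is the rank-M inducing-point approximation of the N x N covariance.
    The Woodbury identity brings the cost down to O(N M^2 + M^3), and
    Qnn itself is never formed explicitly."""
    Knm = rbf_kernel(X, Z)                           # N x M cross-covariance
    Kmm = rbf_kernel(Z, Z) + 1e-6 * np.eye(len(Z))   # M x M, with jitter
    # Woodbury: (noise*I + Knm Kmm^{-1} Kmn)^{-1} y
    #         = y/noise - Knm (Kmm + Kmn Knm / noise)^{-1} Kmn y / noise^2
    A = Kmm + Knm.T @ Knm / noise_var                # only an M x M system
    return y / noise_var - Knm @ np.linalg.solve(A, Knm.T @ y) / noise_var**2

# Toy usage: N = 2000 datapoints, M = 50 inducing inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=2000)
Z = np.linspace(-3, 3, 50)[:, None]
alpha = low_rank_solve(X, y, Z)   # the expensive step in prediction and training
```

FITC and PITC refine this low-rank structure by adding a diagonal (respectively block-diagonal) correction built from \(K_{NN} - Q_{NN}\), so that the approximate covariance better matches the exact one while remaining cheap to invert.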

More recently, it has been realised that instead of approximating the model and performing exact inference in it, it is possible, and more desirable, to retain the exact model and instead approximate its posterior. The Variational Free Energy (VFE) approximation [Tit09] allows us to do precisely this, retaining the true GP and approximating its posterior using a pseudo-point approximation. This method uses the variational free energy (also called the ELBO) to approximate the posterior, but other methods have also been developed, based for example on Expectation Propagation (EP), and a unifying perspective on these posterior approximation methods has since emerged [BYT17].
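For GP regression, the VFE approach yields a collapsed lower bound on the log marginal likelihood, which (in the notation used above, with \(Q_{NN} = K_{NM} K_{MM}^{-1} K_{MN}\)) can be written as

\[
\log p(\mathbf{y}) \ge \log \mathcal{N}\!\left(\mathbf{y} \,\middle|\, \mathbf{0},\; Q_{NN} + \sigma^2 I\right) - \frac{1}{2\sigma^2}\,\mathrm{tr}\!\left(K_{NN} - Q_{NN}\right).
\]

The first term coincides with the marginal likelihood of the DTC-style approximate model, while the trace term penalises inducing points that explain the training data poorly, so maximising the bound over the inducing inputs and hyperparameters approximates the exact posterior rather than altering the model.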

References

BYT17

Thang Bui, Josiah Yan, and Richard Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. The Journal of Machine Learning Research, 2017.

QCR05

Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 2005.

SWL03

Matthias W. Seeger, Christopher K. I. Williams, and Neil D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.

SG05

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 2005.

Tit09

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, 567–574. PMLR, 2009.