# Variational Inference Revisited¶

Variational inference (VI) is one of the standard methods for working with analytically intractable models, such as non-conjugate probabilistic models. VI approximates an analytically intractable distribution with another distribution from a family of tractable distributions. A typical choice which practitioners and researchers make is to choose a Gaussian family of distributions. An $$N$$-dimensional multivariate Gaussian requires $$N$$ parameters to specify its mean and $$\frac{N(N+1)}{2}$$ parameters to specify its covariance. Opper and Archembeau [OA09] however have shown that for a certain class of probabilistic models, those which have a Gaussian prior and a factorising likelihood, the optimal posterior can be written in terms of the prior using only $$N$$, rather than $$\frac{N(N+1)}{2}$$, parameters. This means that the optimal approximate posterior requires only need $$2N$$ parameters in total. Here we derive the family of Gaussian variational posteriors introduced by Opper and Archembeau, and apply it to two examples of non-conjugate Gaussian Process (GP) models [WR06] for heavy-tailed regression and classification.

## Prior asssumptions¶

Consider a model with latent variables $$\mathbf{z} = (z_1, \dots, z_N)$$, distributed according to a Gaussian prior and observed variables $$\mathbf{y} = (y_1, \dots, y_N)$$ which, conditionally on $$\mathbf{z}$$, are distributed according to a factorising likelihood

$$$p(\mathbf{y}, \mathbf{z} | \theta) = p(\mathbf{z} | \theta) \prod_{n=1}^N p(y_n | z_n, \theta) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{K}) \prod_{n=1}^N p(y_n | z_n, \theta).$$$

In general, unless the likelihood is Gaussian, such a model will be analytically intractable, and will require approximations both for inferring the distribution over the latent variables $$\mathbf{z}$$ as well as learning good hyperparameters $$\theta$$. Variational inference (VI) is a standard approximation for a setting such as this.

## Optimal variational posterior¶

We are interested in both inferring the distribution over the latent variables $$\mathbf{z}$$, as well as in learning good hyperparameters $$\theta$$. We can achive this by approximating the posterior $$p(\mathbf{z} | \mathbf{y}, \theta)$$ with a Gaussian distribution $$q(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}, \mathbf{\Sigma})$$, and maximising the Variational Free Energy (VFE) objective

\begin{align} \mathcal{F}(q, \theta) = \int q(\mathbf{z}) \log \frac{p(\mathbf{y}, \mathbf{z} | \mathbf{x}, \theta)}{q(\mathbf{z})} d\mathbf{z} \leq \log p(\mathbf{y} | \theta), \end{align}

also known as the Evidence Lower Bound (ELBO), with respect to both the variatinal parameters $$\boldsymbol{\mu}$$ and $$\boldsymbol{\Sigma}$$, as well as the hyperparameters $$\theta$$. It would appear that parameterising $$\boldsymbol{\Sigma}$$ would require $$\frac{N(N+1)}{2}$$ parameters. However, Opper and Archembeau [OA09] show that the optimal posterior covariance can be written down with only $$N$$ parameters.

Lemma (Free energy for factorising model with a Gaussian prior) Consider a model with a Gaussian prior and a factorising likelihood, written as

$$$p(\mathbf{y}, \mathbf{z} | \theta) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{K}) \prod_{n=1}^N p(y_n | z_n, \theta).$$$

Assuming a Gaussian variational posterior $$q(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}, \mathbf{K})$$, the free energy $$\mathcal{F}(q, \theta)$$ of the model is

$$$\mathcal{F}(q, \theta) = \sum_{n=1}^N \mathbb{E}_{q_n}\left[\log p(y_n | z_n, \theta)\right] - \frac{1}{2} \text{Tr}(\mathbf{K}^{-1}\boldsymbol{\Sigma}) - \boldsymbol{\mu}^\top\mathbf{K}^{-1}\boldsymbol{\mu} + \frac{1}{2} \log |\boldsymbol{\Sigma}| - \frac{1}{2} \log |\mathbf{K}| + C,$$$

where $$C$$ is a constant which does not depend on $$\boldsymbol{\mu}, \boldsymbol{\Sigma}$$ or $$\theta$$, and $$q_n$$ is the marginal distribution of $$z_n$$ under the variational posterior $$q$$. Further, for fixed $$\boldsymbol{\mu}$$ and $$\theta$$, the $$\boldsymbol{\Sigma}$$ which minimises $$\mathcal{F}(q, \theta)$$ has the form

$$$\boldsymbol{\Sigma} = \left(\mathbf{K}^{-1} + \boldsymbol{\Lambda} \right)^{-1},$$$

where $$\boldsymbol{\Lambda}$$ is a diagonal matrix.

Derivation: Free energy for factorising model with a Gaussian prior
\begin{split}\begin{align} \mathcal{F}(q, \theta) &= \int q(\mathbf{z}) \log \frac{p(\mathbf{y}, \mathbf{z} | \theta)}{q(\mathbf{z})} d\mathbf{z} \\ &= \underbrace{\int q(\mathbf{z}) \log p(\mathbf{y} | \mathbf{z}, \theta) d\mathbf{z}}_{=~ \sum_{n=1}^N \mathbb{E}_{q_n}\left(\log p(y_n | z_n, \theta)\right)} + \int q(\mathbf{z}) \log p(\mathbf{z} | \theta) d\mathbf{z} - \int q(\mathbf{z}) \log q(\mathbf{z}) d\mathbf{z}, \end{align}\end{split}

where we have used the factorising structure of the model. Substituting the expressions $$q(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$$ and $$p(\mathbf{z} | \theta) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{K})$$ we obtain

\begin{split}\begin{align} \mathcal{F}(q, \theta) &= \sum_{n=1}^N \mathbb{E}_{q_n}\left[\log p(y_n | z_n, \theta)\right] \\ &~~~~~~~~~~~~~~~~~~+ \frac{1}{2} \int q(\mathbf{z}) \left(- \log |\mathbf{K}| - \mathbf{z}^\top \mathbf{K}^{-1}\mathbf{z}\right)d\mathbf{z} \\ &~~~~~~~~~~~~~~~~~~- \frac{1}{2} \int q(\mathbf{z}) \left(-\log |\boldsymbol{\Sigma}| - (\mathbf{z} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{z} - \boldsymbol{\mu})\right) d\mathbf{z} \\ &~~~~~~~~~~~~~~~~~~+ \text{ const.} \\ &= \sum_{n=1}^N \mathbb{E}_{q_n}\left[\log p(y_n | z_n, \theta)\right] - \frac{1}{2} \log |\mathbf{K}| - \text{Tr}(\mathbf{K}^{-1} \boldsymbol{\Sigma}) - \frac{1}{2} \boldsymbol{\mu}^\top \mathbf{K}^{-1} \boldsymbol{\mu} + \frac{1}{2} \log |\boldsymbol{\Sigma}| + \text{ const.} \end{align}\end{split}

where the constant terms do not depend on $$\boldsymbol{\mu}, \boldsymbol{\Sigma}$$ or $$\boldsymbol{\theta}$$. Further, we can take a derivative with respect to $$\boldsymbol{\Sigma}$$, and set the resulting expression to zero to solve for the optimal $$\boldsymbol{\Sigma}$$. We note that the $$n^{th}$$ log-likelihood term depends only on the $$n^{th}$$ diagonal entry of $$\boldsymbol{\Sigma}$$, and none of the non-diagonal entries. Thus, taking a derivative with respect to $$\boldsymbol{\Sigma}_{ij}$$ we have

\begin{align} \frac{d}{d\boldsymbol{\Sigma}_{ij}} \mathcal{F}(q, \theta) &= \underbrace{\delta_{ij} \frac{d}{d\boldsymbol{\Sigma}_{ii}} \mathbb{E}_{q_i}\left(\log p(y_i | z_i, \theta)\right)}_{- \boldsymbol{\Lambda}_{ij}} - \mathbf{K}_{ij}^{-1} + \boldsymbol{\Sigma}^{-1}_{ij} \end{align}

where $$\delta_{\cdot, \cdot}$$ is the Kronecker delta function. Defining the matrix $$\boldsymbol{\Lambda}$$ as shown above, setting the derivative equal to zero and rearraning, we obtain

\begin{align} \boldsymbol{\Sigma} = \left(\mathbf{K}^{-1} + \boldsymbol{\Lambda}\right)^{-1}. \end{align}

For given $$\boldsymbol{\mu}$$ and $$\boldsymbol{\theta}$$, the optimal precision is the prior precision plus a diagonal term. Therefore, a sensible parameterisation for $$\mathbf{\Sigma}$$ is

$$$\boldsymbol{\Sigma} \stackrel{\text{def}}{=} \left(\mathbf{K}^{-1} + \boldsymbol{\Lambda}\right)^{-1},$$$

where we let $$\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \dots, \lambda_N)$$ be diagonal matrix with $$N$$ parameters which we constrain to be positive. Using $$N$$ further parameters for the mean $$\boldsymbol{\mu}$$, we can parameterise a family of Gaussian variational posteriors which contains the optimal Gaussian variational posterior, using only $$2N$$ parameters. This is a significant improvement in the memory cost over naive covariance parameterisations, such as parameterising a Cholesky factor of the $$\boldsymbol{\Sigma}$$ matrix.

## Implementation¶

We now implement this approximate posterior, in the context of Gaussian process (GP) modelling. In these experiments, we will use a GP with a zero mean and an Exponentiated Quadratic (EQ) covariancce, implemented by the EQCovariance class below.

class EQCovariance(tf.keras.Model):

def __init__(self,
log_coeff,
log_scales,
dim,
dtype,
trainable,
name='eq_covariance',
**kwargs):

super().__init__(name=name, dtype=dtype, **kwargs)

# Convert parameters to tensors
log_coeff = tf.convert_to_tensor(log_coeff, dtype=dtype)
log_scales = tf.convert_to_tensor(log_scales, dtype=dtype)

# Reshape parameter tensors
log_coeff = tf.squeeze(log_coeff)
log_scales = tf.reshape(log_scales, (-1,))

# Set input dimensionality
self.dim = dim

# Set EQ parameters
self.log_scales = tf.Variable(log_scales, trainable=trainable)
self.log_coeff = tf.Variable(log_coeff, trainable=trainable)

def __call__(self,
x1,
x2,
diag=False,
epsilon=None):

# Convert to tensors
x1 = tf.convert_to_tensor(x1, dtype=self.dtype)
x2 = tf.convert_to_tensor(x2, dtype=self.dtype)

# Get vector of lengthscales
scales = self.scales

if not diag:

x1 = x1[:, None, :]
x2 = x2[None, :, :]

scales = self.scales[None, None, :] ** 2

# Compute quadratic, exponentiate and multiply by coefficient
quad = - 0.5 * (x1 - x2) ** 2 / scales
eq_cov = self.coeff ** 2 * tf.exp(quad)

if epsilon is not None:
eps = tf.eye(eq_cov.shape[0], dtype=self.dtype)
eq_cov = eq_cov + epsilon * eps

return eq_cov

@property
def scales(self):
return tf.math.exp(self.log_scales)

@property
def coeff(self):
return tf.math.exp(self.log_coeff)


We can now implement a general non-conjugate GP, which should work with arbitrary factorising likelihood functions. The NonConjugateGP class below takes in the training data, a covariance function and the training data. It supports computing the VFE and sampling the approximate posterior.

class NonConjugateGP(tfk.Model):

def __init__(self,
x_train,
y_train,
covariance,
likelihood,
dtype,
epsilon=1e-5,
name="non_conjugate_gp",
**kwargs):

super().__init__(name=name, dtype=dtype, **kwargs)

self.dim = dim
self.N = x_train.shape[0]

check_shape([x_train, y_train], [(self.N, dim), (self.N,)])

# Set training data and inducing point initialisation
self.x_train = x_train
self.y_train = y_train

# Set mean, covariance and likelihood functions
self.covariance = covariance
self.likelihood = likelihood

# Set variational parameters
self.mu = tf.Variable(
tf.zeros(shape=(self.N,),
dtype=self.dtype)
)

self.log_lamda = tf.Variable(
-4. * tf.ones(shape=(self.N,), dtype=self.dtype)
)

# Set epsilon for jitter
self.epsilon = epsilon

@property
def lamda(self):
return tf.exp(self.log_lamda)

def posterior(self, x_pred, num_samples):

Kpp = self.covariance(x_pred, x_pred, epsilon=self.epsilon)
Ktp = self.covariance(self.x_train, x_pred)
Kpt = self.covariance(x_pred, self.x_train)
Ktt = self.covariance(self.x_train, self.x_train, epsilon=self.epsilon)

Sigma = self.q_covariance
Sigma_chol = tf.linalg.cholesky(Sigma)

qz = tfd.MultivariateNormalTriL(
loc=self.mu,
scale_tril=Sigma_chol
)

samples = qz.sample(sample_shape=(num_samples,))
samples = tf.transpose(samples, [1, 0])

mcond = Kpt @ tf.linalg.solve(Ktt, samples)
mcond = tf.transpose(mcond, [1, 0])

Kcond = Kpp - Kpt @ tf.linalg.solve(Ktt, Ktp)
Kcond_chol = tf.linalg.cholesky(Kcond)

qcond = tfd.MultivariateNormalTriL(
loc=tf.zeros_like(mcond[0]),
scale_tril=Kcond_chol
)

samples = mcond + qcond.sample(sample_shape=(num_samples,))
transformed_samples = self.likelihood(samples)

return samples, transformed_samples

def free_energy(self, num_samples=1):

cond_lik = tf.cast(tf.convert_to_tensor(0.), dtype=self.dtype)

q = tfd.MultivariateNormalFullCovariance(
loc=self.mu,
covariance_matrix=self.q_covariance
)

p = tfd.MultivariateNormalFullCovariance(
loc=tf.zeros(shape=self.y_train.shape, dtype=self.dtype),
covariance_matrix=self.p_covariance
)

# Compute log likelihood contribution from each sample
for i in tf.range(num_samples):

z = q.sample()

cond_lik = cond_lik + tf.reduce_sum(
self.likelihood.log_prob(self.y_train, z)
)

cond_lik = cond_lik / num_samples

# Compute KL divergence contribution
kl = tf.reduce_sum(q.kl_divergence(p))

free_energy = cond_lik - kl

return free_energy, cond_lik, kl

@property
def q_covariance(self):

eye = tf.eye(self.N, dtype=self.dtype)

Ktt = self.covariance(
self.x_train,
self.x_train,
epsilon=self.epsilon
)

Ktt_chol = tf.linalg.cholesky(Ktt)
iKtt = tf.linalg.cholesky_solve(Ktt_chol, eye)

Lamda = tf.linalg.diag(self.lamda)

iSigma_tril = tf.linalg.cholesky(iKtt + Lamda)
W = tf.linalg.triangular_solve(iSigma_tril, eye)
Sigma = tf.matmul(W, W, transpose_a=True)

return Sigma

@property
def p_covariance(self):

p_cov = self.covariance(
self.x_train,
self.x_train,
epsilon=self.epsilon
)

return p_cov


## Experiments¶

Now that we have the NonConjugateGP model in place, we can use it with arbitrary factorising likelihoods. First, we consider a non-Gaussian regression task with Cauchy noise, followed by a binary classification task.

### Cauchy regression¶

The likelihood function for a Cauchy noise model has the form

$$$p(y_n | z_n, \sigma^2) = \pi\sigma ~ \frac{1}{1 + \left(\frac{y_n - z_n}{\sigma}\right)^2},$$$

which together with a Gaussian prior leads to a non-conjugate model. The Cauchy likelihood has heavy tails, much heavier than those of a Gaussian likelihood. Because of this, Cauchy likelihoods are much more robust to outliers than Gaussian likelihoods. Below we generate data from a GP with an EQ covariance and a Cauchy noise model, which results in a few heavy outliers.

tf.random.set_seed(0)
np.random.seed(0)

dtype = tf.float64

num_grid = 400
num_train = 20

x_min = -5.
x_max = 5.

dim = 1
log_coeff = 0.
log_scales = dim * [0.]
log_noise = -2.
trainable = False

covariance = EQCovariance(
log_coeff,
log_scales=log_scales,
dim=dim,
dtype=dtype,
trainable=trainable,
)

x = tf.cast(tf.linspace(x_min, x_max, num_grid), dtype=dtype)[:, None]

Ktt = covariance(x, x) + 1e-9 * tf.eye(num_grid, dtype=dtype)
Lxx = tf.linalg.cholesky(Ktt)
z = tf.linalg.matvec(Lxx, tf.random.normal(shape=(num_grid,), dtype=dtype))
noise = tf.cast(tf.exp(log_noise), dtype=dtype)

idx = np.random.choice(np.arange(num_grid), size=(num_train,), replace=False)
idx = tf.convert_to_tensor(idx)[:, None]
x_train = tf.gather_nd(x, idx)
z_train = tf.gather_nd(z, idx)
y_train = tfp.distributions.Cauchy(loc=z_train, scale=noise*tf.ones_like(z_train)).sample()

plt.figure(figsize=(12, 4))

plt.plot(x[:, 0], z, color="black", zorder=1)
plt.scatter(x_train[:, 0], y_train, s=100, linewidth=3, marker="+", c="tab:red", zorder=2)

plt.xlim([-5., 5.])
plt.ylim([-3.5, 3.5])

plt.xticks(np.linspace(-5., 5., 5), fontsize=18)
plt.yticks(np.linspace(-3., 3., 3), fontsize=18)

plt.xlabel("$x$", fontsize=24)
plt.ylabel("$y$", fontsize=24)

plt.show()


We now implement a CauchyLikelihood class, which really is just a wrapper around the Cauchy distribution, together with a learnable noise level.

class CauchyLikelihood(tfk.Model):

def __init__(self,
log_noise,
dtype,
name="binary_classification_likelihood",
**kwargs):

super().__init__(name=name, dtype=dtype, **kwargs)

# Set log noise parameter
self.log_noise = tf.cast(
tf.convert_to_tensor(log_noise),
dtype=self.dtype
)

self.log_noise = tf.Variable(self.log_noise)

def call(self, z):
return z

def log_prob(self, y, z):
return tfd.Cauchy(loc=z, scale=self.noise).log_prob(y)

@property
def noise(self):
return tf.exp(self.log_noise)


Now we can initialise the full model, and train it on the synthetic data we generated earlier. We will fix the covariance parameters to their ground truth values, and only train the variational parameters. This is a good way to assess the quality of the posterior approximation without confounding this with the quality of the hyperparameters learnt by the model.

# We decorate a single gradient descent step with tf.function. On the first
# call of single_step, tensorflow will compile the computational graph first.
# After that, all calls to single_step will use the compiled graph which is
# much faster than the default eager mode execution. In this case, the gain
# is roughly a x20 speedup (with a CPU), which can be checked by commenting
# out the decorator and rerunning the training script.

@tf.function
def single_step(model, optimiser):

free_energy, cond_lik, kl = model.free_energy()
loss = - free_energy / x.shape[0]

return free_energy, cond_lik, kl

# =============================================================================
# Define the model
# =============================================================================

# Model parameters
trainable_covariance = False
dim = 1
log_coeff = 1.
log_scales = dim * [0.]
log_noise = -2.

covariance = EQCovariance(
log_coeff=log_coeff,
log_scales=log_scales,
dim=dim,
dtype=dtype,
trainable=trainable_covariance
)

Cauchy_likelihood = CauchyLikelihood(
log_noise=log_noise,
dtype=dtype
)

non_conjugate_gp = NonConjugateGP(
x_train=x_train,
y_train=y_train,
covariance=covariance,
likelihood=Cauchy_likelihood,
dtype=dtype
)

# =============================================================================
# Train the model
# =============================================================================

# Number of steps to train for
num_steps = 50000

# Initialise optimiser

# Set progress bar and suppress warnings
progress_bar = tqdm(range(1, num_steps+1))
tf.get_logger().setLevel('ERROR')

# Set tensors for keeping track of quantities of interest
train_vfe = []
train_cond_lik = []
train_kl = []

# Train model
for i in progress_bar:

free_energy, cond_lik, kl = single_step(
model=non_conjugate_gp,
optimiser=optimiser
)

if i % 1000 == 0:

progress_bar.set_description(
f"VFE {free_energy:.1f}, "
f"Cond-lik. {cond_lik:.1f}, "
f"KL {kl:.1f}"
)

train_vfe.append(free_energy)
train_cond_lik.append(cond_lik)
train_kl.append(kl)

VFE -19.6, Cond-lik. 2.3, KL 21.9: 100%|██████████| 50000/50000 [00:30<00:00, 1617.46it/s]


Before plotting the posterior, let’s quickly verify that the model has trained to convergence, by looking at the curve of the training objective.

# Helper for computing moving average
def moving_average(array, n):

cumsum = np.cumsum(array)
cumsum[n:] = cumsum[n:] - cumsum[:-n]

moving_average = cumsum[n - 1:] / n

return moving_average

# Plot training objective
plt.figure(figsize=(12, 2))

plt.plot(
moving_average(tf.stack(train_vfe).numpy(), n=100),
color="black",
alpha=0.5
)

plt.xlim([-50, 50000])
plt.ylim([-70, -10])

plt.xticks(fontsize=18)
plt.yticks(np.linspace(-60, -20, 3), fontsize=18)

plt.xlabel("# optimisation steps", fontsize=24)
plt.ylabel("VFE", fontsize=24)

plt.show()


Below we see that the posterior for the noiseless function $$z$$ is reasonable. Notably, the model gracefully handles the outliers around the $$x = 1$$ region, where instead of overfitting to these points, the model recognises they are outliers and simply increases its uncertainty in $$z$$.

num_samples = 500

sample_z, _ = non_conjugate_gp.posterior(
x_pred=x,
num_samples=num_samples)

median = tfp.stats.percentile(sample_z, q=50, axis=0)

# Plot model fit
plt.figure(figsize=(12, 4))

plt.plot(
x[:, 0],
z,
color="black",
zorder=1
)

plt.plot(
x[:, 0],
median,
color="tab:blue",
label="Median $z$"
)

plt.plot(
x[:, 0],
sample_z.numpy().T,
color="tab:blue",
alpha=0.01,
zorder=2
)

plt.scatter(
x_train[:, 0],
y_train,
s=100,
linewidth=3,
marker="+",
c="tab:red",
zorder=3
)

plt.xlim([-5., 5.])
plt.ylim([-3.5, 3.5])

plt.xticks(np.linspace(-5., 5., 5), fontsize=18)
plt.yticks(np.linspace(-3., 3., 3), fontsize=18)

plt.xlabel("$x$", fontsize=24)
plt.ylabel("$y$", fontsize=24)

plt.legend(fontsize=20, ncol=2)

plt.show()


### Binary classification¶

The likelihood for a binary classification task with a sigmoid activation has the form

$$$p(y_n | z_n) = p_n^{y_n} (1 - p_n)^{1 - y_n}, ~\text{ where }~ p_n = \frac{1}{1 + e^{-z_n}}.$$$

Again, together with a Gaussian prior leads to a non-conjugate model which requires approximations. Below we generate some synthetic data by first sampling $$z$$ from a GP with an EQ covariance, passing $$z$$ through a sigmoid to obtain the class probablities and sampling the class to which each point belongs.

tf.random.set_seed(0)
np.random.seed(0)

dtype = tf.float64

num_grid = 400
num_train = 50

x_min = -5.
x_max = 5.

dim = 1
log_coeff = 1.
log_scales = dim * [0.]
trainable = False

covariance = EQCovariance(
log_coeff,
log_scales=log_scales,
dim=dim,
dtype=dtype,
trainable=trainable,
)

x = tf.cast(tf.linspace(x_min, x_max, num_grid), dtype=dtype)[:, None]

Ktt = covariance(x, x) + 1e-9 * tf.eye(num_grid, dtype=dtype)
Lxx = tf.linalg.cholesky(Ktt)

z = tf.linalg.matvec(Lxx, tf.random.normal(shape=(num_grid,), dtype=dtype))
p = tf.nn.sigmoid(z)

idx = np.random.choice(np.arange(num_grid), size=(num_train,), replace=False)
idx = tf.convert_to_tensor(idx)[:, None]

x_train = tf.gather_nd(x, idx)
p_train = tf.gather_nd(p, idx)
y_train = tfp.distributions.Bernoulli(probs=p_train).sample()

# Plot ground truth and sampled data
plt.figure(figsize=(12, 4))

plt.plot(
x[:, 0],
p,
color="black",
zorder=1
)

plt.scatter(
x_train[:, 0],
tf.zeros_like(y_train),
s=100,
linewidth=3,
marker="+",
c=y_train,
cmap="RdYlGn",
zorder=2
)

plt.xlim([-5., 5.])
plt.ylim([-0.1, 1.1])

plt.xticks(np.linspace(-5., 5., 5), fontsize=18)
plt.yticks(np.linspace(0., 1., 3), fontsize=18)

plt.xlabel("$x$", fontsize=24)
plt.ylabel("$p(x)$", fontsize=24)

plt.show()


We now implement a BernoulliLikelihood which, just like the CauchyLikelihood, is a wrapper around a tensorflow distribution.

class BernoulliLikelihood(tfk.Model):

def __init__(self,
dtype,
name="binary_classification_likelihood",
**kwargs):

super().__init__(name=name, dtype=dtype, **kwargs)

def log_prob(self, y, z):
return tfd.Bernoulli(logits=z).log_prob(y)

def call(self, z):
return tf.nn.sigmoid(z)


We can now initialise and train the model. Because we made the NonConjugateGP class agnostic to the likelihood model, the only part that’s different to the previous model initialisation is that now we have passed a different likelihood.

# We decorate a single gradient descent step with tf.function. On the first
# call of single_step, tensorflow will compile the computational graph first.
# After that, all calls to single_step will use the compiled graph which is
# much faster than the default eager mode execution. In this case, the gain
# is roughly a x20 speedup (with a CPU), which can be checked by commenting
# out the decorator and rerunning the training script.

@tf.function
def single_step(model, optimiser):

free_energy, cond_lik, kl = model.free_energy()
loss = - free_energy / x.shape[0]

return free_energy, cond_lik, kl

# =============================================================================
# Define the model
# =============================================================================

# Model parameters
trainable_covariance = False
dim = 1
log_coeff = 1.
log_scales = dim * [0.]

covariance = EQCovariance(
log_coeff=log_coeff,
log_scales=log_scales,
dim=dim,
dtype=dtype,
trainable=trainable_covariance
)

binary_classification_likelihood = BernoulliLikelihood(
dtype=dtype
)

non_conjugate_gp = NonConjugateGP(
x_train=x_train,
y_train=y_train,
covariance=covariance,
likelihood=binary_classification_likelihood,
dtype=dtype
)

# =============================================================================
# Train the model
# =============================================================================

# Number of training steps
num_steps = 50000

# Initialise optimiser

# Set progress bar and suppress warnings
progress_bar = tqdm(range(1, num_steps+1))
tf.get_logger().setLevel('ERROR')

# Set tensors for keeping track of quantities of interest
train_vfe = []
train_cond_lik = []
train_kl = []

# Train model
for i in progress_bar:

free_energy, cond_lik, kl = single_step(
model=non_conjugate_gp,
optimiser=optimiser
)

if i % 1000 == 0:

progress_bar.set_description(
f"VFE {free_energy:.1f}, "
f"Cond-lik. {cond_lik:.1f}, "
f"KL {kl:.1f}"
)

train_vfe.append(free_energy)
train_cond_lik.append(cond_lik)
train_kl.append(kl)

VFE -23.7, Cond-lik. -17.8, KL 5.9: 100%|██████████| 50000/50000 [00:42<00:00, 1187.53it/s]


Again, we verify that the model has trained to convergence by looking at the training objective.

# Plot training objective
plt.figure(figsize=(12, 2))

plt.plot(
moving_average(tf.stack(train_vfe).numpy(), n=100),
color="black",
alpha=0.5
)

plt.xlim([-50, 50000])
plt.ylim([-60, -20])

plt.xticks(fontsize=18)
plt.yticks(np.linspace(-60, -20, 3), fontsize=18)

plt.xlabel("# optimisation steps", fontsize=24)
plt.ylabel("VFE", fontsize=24)

plt.show()


Last we plot the model fit. The model has learnt a reasonable posterior for $$p = \frac{1}{1 + e^{-z}}$$. Note that even in regions where several data are available, the model maintains quite a bit of uncertainty in $$p$$. This is likely because a few binary observations in some region are not enough to pin down $$p$$ with enough certainty.

# Number of posterior samples to draw
num_samples = 500

q = tfd.MultivariateNormalFullCovariance(
loc=non_conjugate_gp.mu,
covariance_matrix=non_conjugate_gp.q_covariance
)

sample_z, sample_p = non_conjugate_gp.posterior(
x_pred=x,
num_samples=num_samples
)

mean_p = tf.reduce_mean(
sample_p,
axis=0
)

median = tfp.stats.percentile(sample_p, q=50, axis=0)

# Plot model fit
plt.figure(figsize=(12, 4))

plt.plot(
x[:, 0],
p,
color="black",
zorder=1
)

plt.plot(
x[:, 0],
median,
color="tab:blue",
label="Median $p$"
)

plt.plot(
x[:, 0],
sample_p.numpy().T,
color="tab:blue",
alpha=0.01,
zorder=2
)

plt.scatter(
x_train[:, 0],
tf.zeros_like(y_train),
s=100,
linewidth=3,
marker="+",
c=y_train,
cmap="RdYlGn",
zorder=3
)

plt.xlim([-5., 5.])
plt.ylim([-0.1, 1.2])

plt.xticks(fontsize=18)
plt.yticks(np.linspace(-0., 1., 3), fontsize=18)

plt.legend(fontsize=20, ncol=2, loc="upper right")

plt.xlabel("$x$", fontsize=24)
plt.ylabel("$p(x)$", fontsize=24)

plt.show()


## Conclusion¶

We saw that when performing VI for models with a Gaussian prior and a factorising likelihood, while using a Gaussian variational posterior, then following Opper and Archembeau, [OA09] we can write the optimal variational posterior in terms of the prior and an additional $$2N$$ variational parameters. We saw that parameterising the variational posterior in this way produces sensible fits in two non-conjugate Gaussian process models.

## References¶

OA09(1,2,3)

Manfred Opper and Cédric Archambeau. The variational gaussian approximation revisited. Neural computation, 2009.

WR06

Christopher K Williams and Carl Edward Rasmussen. Gaussian processes for machine learning. MIT press Cambridge, MA, 2006.