Random Fourier features#
One central difficulty with Gaussian Processes (GPs), and more generally all kernel methods such as Support Vector Machines (SVMs), is their computational cost. Exact GP regression scales cubically in the number of datapoints $N$, that is as $\mathcal{O}(N^3)$, which quickly becomes prohibitive for large datasets. Random Fourier Features (RFF) [Rahimi and Recht, 2007] are a way to sidestep this cost.
RFF relies on the fact that kernels of stationary processes can be expressed as the Fourier transform of a probability density function.
This Fourier transform can then be efficiently approximated by Monte Carlo.
The samples drawn to approximate this integral are called Random Fourier Features and can be used as features for a linear-in-the-parameters Bayesian regression model, which approximates exact GP regression.
The computational cost of Bayesian linear regression with $M$ features is $\mathcal{O}(NM^2 + M^3)$, which is linear in the number of datapoints and therefore far cheaper than exact GP regression whenever $M \ll N$.
The RFF approximation#
The starting point for deriving RFF is Bochner’s theorem, which relates stationary kernels with probability distributions over frequencies via the Fourier transform.
Theorem 121 (Bochner’s theorem)
A continuous function of the form $k(x, x') = k(x - x')$ on $\mathbb{R}^D$ is a positive definite kernel if and only if it is the Fourier transform of a finite non-negative measure. If the kernel is normalised so that $k(0) = 1$, this measure is a probability distribution $p(\omega)$ over frequencies, and

$$k(x - x') = \int p(\omega)\, e^{i \omega^\top (x - x')}\, d\omega = \mathbb{E}_{\omega \sim p(\omega)}\left[e^{i \omega^\top (x - x')}\right].$$
Note that this statement slightly abuses the notation, since the same symbol $k$ is used both for the two-argument kernel $k(x, x')$ and for the single-argument function $k(\tau)$ of the difference $\tau = x - x'$.
Thus, we can get an unbiased estimate of $k(x - x')$ by Monte Carlo, sampling frequencies from $p(\omega)$ and averaging the integrand. Since the kernel is real valued, we can also take the real part of the expression above and write

$$k(x - x') = \int p(\omega) \cos\big(\omega^\top (x - x')\big)\, d\omega = \mathbb{E}_{\omega \sim p(\omega)}\Big[\cos\big(\omega^\top (x - x')\big)\Big],$$

which has a real-valued integrand, so a Monte Carlo estimate of it will always be real valued.
Next, we would like to manipulate the expression above into an expectation of the form $\mathbb{E}\big[z(x)\, z(x')\big]$, that is the expectation of a product of a feature evaluated at $x$ and the same feature evaluated at $x'$, since an expression of this kind corresponds to a linear-in-the-parameters model. Using the product-to-sum identity $2\cos A \cos B = \cos(A - B) + \cos(A + B)$, if $\phi \sim \text{Uniform}[0, 2\pi]$ then

$$\mathbb{E}_{\phi}\Big[2 \cos\big(\omega^\top x + \phi\big) \cos\big(\omega^\top x' + \phi\big)\Big] = \cos\big(\omega^\top (x - x')\big)$$

for all $\omega$, because the $\cos\big(\omega^\top (x + x') + 2\phi\big)$ term averages to zero.
We can therefore get an unbiased, real valued estimate of the kernel by drawing a single pair $\omega \sim p(\omega)$, $\phi \sim \text{Uniform}[0, 2\pi]$ and evaluating $2 \cos(\omega^\top x + \phi) \cos(\omega^\top x' + \phi)$. In fact, we can go a bit further by drawing $M$ independent pairs $(\omega_m, \phi_m)$ and averaging:

$$\hat{k}(x, x') = \frac{2}{M} \sum_{m=1}^{M} \cos\big(\omega_m^\top x + \phi_m\big) \cos\big(\omega_m^\top x' + \phi_m\big).$$

This is also an unbiased estimate of the kernel, however its variance is lower than in the single-sample case, since averaging $M$ independent estimates divides the variance by $M$.
Definition 143 (Random Fourier Features)
Given a translation invariant kernel $k$ with variance $\sigma^2$ and spectral density $p(\omega)$, the random Fourier features of an input $x \in \mathbb{R}^D$ are

$$z_m(x) = \sigma \sqrt{\frac{2}{M}}\, \cos\big(\omega_m^\top x + \phi_m\big), \quad m = 1, \dots, M,$$

where we have used the notation $\omega_m \sim p(\omega)$ and $\phi_m \sim \text{Uniform}[0, 2\pi]$, all drawn independently. By construction, the inner product of the feature vectors is an unbiased estimate of the kernel, $\mathbb{E}\big[\sum_{m=1}^{M} z_m(x)\, z_m(x')\big] = k(x, x')$.
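As a quick sanity check of this construction, the sketch below (an illustration, not part of the original implementation) draws frequencies from a standard Gaussian, which is the spectral density of the EQ kernel with unit lengthscale and variance, and compares the resulting feature inner product against the exact kernel value.

import numpy as np

# Sanity check: RFF estimate vs the exact EQ kernel k(x, x') = exp(-0.5 (x - x')^2)
np.random.seed(0)

M = 10000               # number of features
x1, x2 = 0.3, -1.2      # two scalar inputs

omega = np.random.normal(size=M)                 # spectral density of the EQ kernel
phi = np.random.uniform(0., 2 * np.pi, size=M)   # random phases

z1 = (2 / M) ** 0.5 * np.cos(omega * x1 + phi)   # features of x1
z2 = (2 / M) ** 0.5 * np.cos(omega * x2 + phi)   # features of x2

print(z1 @ z2)                        # RFF estimate of the kernel
print(np.exp(-0.5 * (x1 - x2) ** 2))  # exact value, approximately 0.325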
RFF and Bayesian regression#
Now we are in a position to interpret this expression in terms of a Bayesian regression model. Consider the regressor

$$f(x) = \sum_{m=1}^{M} w_m\, z_m(x),$$

where $w_m \sim \mathcal{N}(0, 1)$ are independent weights and $z_m$ are the random Fourier features defined above. Conditioned on the sampled features, the covariance of this random function is

$$\mathbb{E}_{w}\big[f(x)\, f(x')\big] = \sum_{m=1}^{M} z_m(x)\, z_m(x') = \hat{k}(x, x'),$$

which is exactly the RFF estimate of the kernel. In other words, a Bayesian linear model with standard normal weights on the random Fourier features approximates a GP with kernel $k$: sampling weights gives approximate prior samples, and conditioning the weights on data gives approximate posterior predictions, at the much lower cost of Bayesian linear regression.
Rates of convergence#
Now there remains the question of how large the error of the RFF estimator is. In other words, how closely does the estimator $\hat{k}(x, x')$ approximate the exact kernel $k(x, x')$? The following two results quantify this, first for a fixed pair of inputs and then uniformly over a compact input domain.
Lemma 60 (Hoeffding for RFF)
The RFF estimator $\hat{k}(x, x')$ with $M$ features is an average of $M$ independent terms, each bounded in $[-2\sigma^2, 2\sigma^2]$, where $\sigma^2$ is the kernel variance. Hoeffding's inequality therefore gives, for any fixed inputs $x, x'$ and any $\epsilon > 0$,

$$\mathbb{P}\Big(\big|\hat{k}(x, x') - k(x, x')\big| \geq \epsilon\Big) \leq 2 \exp\left(-\frac{M \epsilon^2}{8 \sigma^4}\right).$$
Therefore, for any given input pair, the probability of a large error in the RFF estimator decays exponentially quickly with the number of features $M$.
Lemma 61 (Uniform convergence of RFF)
Let $\mathcal{X} \subset \mathbb{R}^D$ be a compact subset of the input space with diameter $\ell$. Then for any $\epsilon > 0$, the probability that

$$\sup_{x, x' \in \mathcal{X}} \big|\hat{k}(x, x') - k(x, x')\big| \geq \epsilon$$

decays exponentially in the number of features $M$, with a prefactor that grows only polynomially in $\ell$, $1/\epsilon$ and the second moment of the spectral density $p(\omega)$; see Rahimi and Recht [2007] for the precise constants. The proof combines the pointwise bound above with a covering argument over $\mathcal{X}$, so the RFF estimate converges to the exact kernel uniformly, not just pointwise.
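This behaviour is easy to check empirically. The sketch below (illustrative only; the input grid and kernel choice are assumptions) measures the maximum absolute error of the RFF estimate of the unit EQ kernel over a grid of input pairs, for increasing numbers of features.

np.random.seed(0)

x_grid = np.linspace(-3., 3., 50)
k_exact = np.exp(-0.5 * (x_grid[:, None] - x_grid[None, :]) ** 2)

for M in [10, 100, 1000, 10000]:
    
    omega = np.random.normal(size=M)                 # EQ spectral density
    phi = np.random.uniform(0., 2 * np.pi, size=M)   # random phases
    
    # Feature matrix of shape (M, 50)
    Z = (2 / M) ** 0.5 * np.cos(omega[:, None] * x_grid[None, :] + phi[:, None])
    
    # Maximum error over all pairs in the grid, shrinks roughly as M^-0.5
    print(M, np.max(np.abs(Z.T @ Z - k_exact)))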
Implementation#
We now turn to an implementation of RFF. We consider approximating the same three kernels which were presented in the RFF paper [Rahimi and Recht, 2007], namely the EQ (or Gaussian) kernel, the Laplace kernel and the Cauchy kernel. Their spectral densities are, respectively, Gaussian, Cauchy and Laplace distributions over frequencies, which is exactly what sample_rff samples from below.
import numpy as np

def sample_rff(
    x,
    kernel,
    lengthscale,
    coefficient,
    num_functions,
    num_features,
):
    
    # Dimension of data space
    x_dim = x.shape[-1]
    
    omega_shape = (num_functions, num_features, x_dim)
    
    # Handle each of three possible kernels separately, sampling
    # frequencies from the corresponding spectral density
    if kernel == 'eq':
        omega = np.random.normal(size=omega_shape)
        
    elif kernel == 'laplace':
        omega = np.random.standard_cauchy(size=omega_shape)
        
    elif kernel == 'cauchy':
        omega = np.random.laplace(size=omega_shape)
        
    else:
        raise ValueError(f'Unknown kernel: {kernel}')
        
    # Scale omegas by lengthscale
    omega = omega / lengthscale
    
    # Weights of the Bayesian linear model, w_m ~ N(0, 1)
    weights = np.random.normal(
        loc=0.,
        scale=1.,
        size=(num_functions, num_features),
    )
    
    # Random phases, phi_m ~ Uniform[0, 2*pi]
    phi = np.random.uniform(
        low=0.,
        high=(2 * np.pi),
        size=(num_functions, num_features, 1),
    )
    
    # Features z_m(x) = coefficient * (2 / M)^0.5 * cos(omega_m^T x + phi_m)
    features = np.cos(np.einsum('sfd, nd -> sfn', omega, x) + phi)
    features = (2 / num_features) ** 0.5 * features * coefficient
    
    # Prior function samples f(x) = sum_m w_m z_m(x)
    functions = np.einsum('sf, sfn -> sn', weights, features)
    
    return functions, features
Sampling from the prior#
We can now visualise functions sampled from the prior induced by RFF.
Each function corresponds to a different sample set of frequencies $\omega_m$, phases $\phi_m$ and weights $w_m$.
# Kernels to approximate
kernels = ['eq', 'laplace', 'cauchy']
# Kernel parameters, # function samples, # features for each function
lengthscale = 1.
coefficient = 1.
num_functions = 100
num_features = 200
# Input locations
x = np.linspace(-3., 3., 100)[:, None]
# Draw random fourier features and functions
samples = [
    sample_rff(
        x=x,
        kernel=kernel,
        lengthscale=lengthscale,
        coefficient=coefficient,
        num_functions=num_functions,
        num_features=num_features,
    )
    for kernel in kernels
]
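The sampled functions can then be plotted, for instance with a sketch along the following lines (the figure layout here is an assumption).

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)

for ax, kernel, (functions, _) in zip(axes, kernels, samples):
    
    # Plot a handful of the sampled functions for this kernel
    ax.plot(x[:, 0], functions[:10].T, color='tab:blue', alpha=0.5)
    ax.set_title(kernel)
    ax.set_xlabel('$x$')

axes[0].set_ylabel('$f(x)$')
plt.show()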
Regression with RFF features#
First we generate a toy dataset from a noisy sinusoid.
We place a gap in the data to observe the quality of the uncertainty estimates of the RFF regressor.
We purposefully use a large number of datapoints, so that the difference in computational cost between RFF regression and exact GP regression becomes apparent.
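The exact dataset used in the original experiment is not shown here, but a sketch along the following lines (the sinusoid, noise level, dataset size and gap location are all assumptions) produces a qualitatively similar dataset.

np.random.seed(0)

noise = 0.2

# Sample inputs in two clumps, leaving a gap in the middle
x_data = np.concatenate([
    np.random.uniform(-3., -1., size=(1000, 1)),
    np.random.uniform(1., 3., size=(1000, 1)),
])

# Noisy sinusoid observations
y_data = np.sin(2 * np.pi * x_data / 4.) + noise * np.random.normal(size=x_data.shape)

# Locations at which to make predictions
x_pred = np.linspace(-4., 4., 200)[:, None]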
The function rff_posterior
below implements Bayesian linear regression with randomly sampled Fourier features.
For more details on Bayesian linear regression, see Chapter 3 of Bishop's PRML book [Bishop, 2006].
def rff_posterior(
    x_data,
    y_data,
    x_pred,
    kernel,
    lengthscale,
    coefficient,
    num_features,
    noise,
):
    
    num_data = x_data.shape[0]
    
    # Stack prediction and data inputs, so both use the same feature sample
    x_full = np.concatenate([x_pred, x_data])
    
    # Draw a single set of random Fourier features at all input locations
    _, features = sample_rff(
        x=x_full,
        kernel=kernel,
        lengthscale=lengthscale,
        coefficient=coefficient,
        num_functions=1,
        num_features=num_features,
    )
    features = features[0]
    
    # Split features into prediction and data parts
    features_pred = features[:, :-num_data]
    features_data = features[:, -num_data:]
    
    # Posterior precision of the weights: S^-1 = I + noise^-2 * Phi_data Phi_data^T
    iS = np.eye(features_data.shape[0]) + \
        features_data @ features_data.T * noise ** -2
    
    # Posterior predictive mean: noise^-2 * Phi_pred^T S Phi_data y
    mean_pred = noise ** -2 * \
        features_pred.T @ np.linalg.solve(iS, features_data @ y_data)[:, 0]
    
    # Posterior predictive variance (noise-free): diag(Phi_pred^T S Phi_pred)
    var_pred = np.einsum(
        'fn, fn -> n',
        features_pred,
        np.linalg.solve(iS, features_pred),
    )
    
    return mean_pred, var_pred
Now we perform Bayesian linear regression with RFF features for each of the three kernels.
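A usage sketch for the RFF regressor might look as follows (the dataset and noise level are the assumed quantities defined above, and the kernel parameters are those used when sampling from the prior).

rff_posteriors = {}

for kernel in kernels:
    
    # Bayesian linear regression on randomly sampled Fourier features
    mean_pred, var_pred = rff_posterior(
        x_data=x_data,
        y_data=y_data,
        x_pred=x_pred,
        kernel=kernel,
        lengthscale=lengthscale,
        coefficient=coefficient,
        num_features=num_features,
        noise=noise,
    )
    
    rff_posteriors[kernel] = (mean_pred, var_pred)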
The two helpers below implement exact GP regression [Rasmussen, 2003], for comparison with the RFF approximation. The covariance helper implements the EQ, Laplace and Cauchy covariance functions, and the exact_gp_posterior helper computes the exact GP posterior mean and variance.
def covariance(x1, x2, kernel, lengthscale, coefficient, noise=None):
    
    # Pairwise differences, shape (N1, N2, D)
    diff = x1[:, None, :] - x2[None, :, :]
    
    if kernel == 'eq':
        l2 = np.sum((diff / lengthscale) ** 2, axis=2)
        cov = coefficient ** 2 * np.exp(-0.5 * l2)
        
    elif kernel == 'laplace':
        # exp(-|x - x'| / lengthscale), matching the Cauchy spectral
        # density used in sample_rff
        l1 = np.sum(np.abs(diff / lengthscale), axis=2)
        cov = coefficient ** 2 * np.exp(-l1)
        
    elif kernel == 'cauchy':
        l2 = np.sum((diff / lengthscale) ** 2, axis=2)
        cov = coefficient ** 2 / (1 + l2)
        
    else:
        raise ValueError(f'Unknown kernel: {kernel}')
        
    if noise is not None:
        cov = cov + noise ** 2 * np.eye(cov.shape[0])
        
    return cov
def exact_gp_posterior(x_data, y_data, x_pred, kernel, lengthscale, coefficient, noise):
    
    # Data, cross and prediction covariance matrices
    Kdd = covariance(x_data, x_data, kernel, lengthscale, coefficient, noise)
    Kpd = covariance(x_pred, x_data, kernel, lengthscale, coefficient, None)
    Kpp = covariance(x_pred, x_pred, kernel, lengthscale, coefficient, None)
    
    # Posterior predictive mean and (noise-free) marginal variances
    mean = Kpd @ np.linalg.solve(Kdd, y_data[:, 0])
    var = np.diag(Kpp - Kpd @ np.linalg.solve(Kdd, Kpd.T))
    
    return mean, var
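The timings below appear to come from IPython's %time magic. A plain-Python sketch of an equivalent measurement (illustrative only, using the EQ kernel and the assumed dataset from above) is:

import time

start = time.perf_counter()
exact_gp_posterior(x_data, y_data, x_pred, 'eq', lengthscale, coefficient, noise)
print(f'Exact GP regression took {time.perf_counter() - start:.3f} s')

start = time.perf_counter()
rff_posterior(x_data, y_data, x_pred, 'eq', lengthscale, coefficient, num_features, noise)
print(f'RFF regression took {time.perf_counter() - start:.3f} s')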
Timing the computation shows that the RFF implementation takes milliseconds to execute, as opposed to exact GP regression, which takes much longer.
Timing exact GP regression:
CPU times: user 40.1 s, sys: 13.3 s, total: 53.4 s
Wall time: 13.1 s
Timing approximate RFF regression:
CPU times: user 2.06 s, sys: 1.55 s, total: 3.61 s
Wall time: 806 ms
RFF has produced sensible regressors in each case. The approximate posteriors roughly match the exact posteriors, while being significantly quicker to compute.
Variance starvation#
One issue with RFF, like all other finite basis function models, is that it has a limited number of degrees of freedom. In some situations the datapoints may pin down the RFF model, significantly reducing the variance of the approximate regressor. This effect is known as variance starvation. To illustrate it, we sample a slightly different dataset, with a smaller gap between the two clumps of data.
The variance of the approximate posterior in between the data is, in some cases, significantly smaller than that of the exact posterior, so the speedup that RFF gives does not come without a cost: in certain cases we can end up with approximate posteriors which are significantly overconfident. This can be alleviated by increasing the number of RFF features, but doing so increases the computational cost of performing regression and may defeat the purpose of using RFF features in the first place.
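To make the effect concrete, a sketch like the following (the narrower gap location and dataset size are assumptions) compares the predictive uncertainty of the RFF and exact GP regressors in the region between the two clumps.

np.random.seed(1)

# Same noisy sinusoid, but with a narrower gap between the two clumps
x_narrow = np.concatenate([
    np.random.uniform(-3., -0.5, size=(1000, 1)),
    np.random.uniform(0.5, 3., size=(1000, 1)),
])
y_narrow = np.sin(2 * np.pi * x_narrow / 4.) + noise * np.random.normal(size=x_narrow.shape)

# Predict only inside the gap
x_gap = np.linspace(-0.5, 0.5, 50)[:, None]

_, var_rff = rff_posterior(
    x_narrow, y_narrow, x_gap, 'eq', lengthscale, coefficient, num_features, noise,
)
_, var_exact = exact_gp_posterior(
    x_narrow, y_narrow, x_gap, 'eq', lengthscale, coefficient, noise,
)

# Average predictive standard deviation in the gap; the RFF value may be noticeably smaller
print(var_rff.mean() ** 0.5, var_exact.mean() ** 0.5)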
Conclusion#
Random Fourier Features are a cheap and effective way to sample from a Gaussian Process (GP) prior and to perform approximate GP regression. By using a finite number of features $M$, we replace the cubic cost of exact GP regression with the much smaller cost of Bayesian linear regression, and the approximation error shrinks quickly as $M$ grows. The price to pay is that, with too few features, the approximate posterior can suffer from variance starvation and become overconfident in between the data.
References#
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Geoffrey Grimmett and David Stirzaker. Probability and Random Processes. Oxford University Press, 2020.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, 2007.
Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, 63–71. Springer, 2003.
James T. Wilson, Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, and Marc Peter Deisenroth. Efficiently sampling functions from Gaussian process posteriors. 2020.