Estimation by score matching
Score matching is a useful trick due to Hyvärinen [Hyvärinen and Dayan, 2005] for learning the parameters of intractable probabilistic models. Score matching can be used to train probabilistic models whose likelihood function takes the form

\[ p_\theta(x) = \frac{q_\theta(x)}{Z_\theta}, \qquad Z_\theta = \int q_\theta(x)\, dx, \]

where \(q_\theta(x)\) is a positive function we can evaluate, but \(Z_\theta\) is a normalising constant which we cannot evaluate in closed form. Non-Gaussian Markov random fields are examples of such models.
The score matching trick
The first step of the score-matching trick is to notice that taking the log and then the gradient with respect to \(x\) of both sides eliminates the intractable \(Z_\theta\):

\[ \nabla_x \log p_\theta(x) = \nabla_x \log q_\theta(x) - \nabla_x \log Z_\theta = \nabla_x \log q_\theta(x), \]

since \(Z_\theta\) does not depend on \(x\). The gradient of the log-density with respect to \(x\) is called the score function

\[ \psi_\theta(x) = \nabla_x \log p_\theta(x) = \nabla_x \log q_\theta(x). \]
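To make this concrete, here is a minimal sketch (not part of the original derivation) of how the score of an unnormalised model can be obtained by automatic differentiation; the particular density \(q_\theta(x)\) below is a hypothetical illustrative choice, and JAX is used purely as a convenient autodiff tool.

```python
import jax
import jax.numpy as jnp

def log_q(x, theta):
    # Log of an unnormalised, non-Gaussian density q_theta(x); the normalising
    # constant Z_theta never appears anywhere in this computation.
    # Hypothetical heavy-tailed model, chosen only for illustration.
    return -theta * jnp.sum(jnp.log(1.0 + x ** 2))

# psi_theta(x) = grad_x log q_theta(x) = grad_x log p_theta(x),
# since the intractable log Z_theta is constant in x.
score = jax.grad(log_q, argnums=0)

x = jnp.array([0.5, -1.2, 2.0])
print(score(x, 1.5))  # score evaluated at x, without ever knowing Z_theta
```

Note that only \(\log q_\theta(x)\) is ever evaluated; the normalising constant plays no role in the computation.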
The second step is to find a way to use the score function \(\psi_\theta(x)\), along with some observed data, to estimate the parameters \(\theta\). We can achieve this by defining the following score matching objective.
(Score matching objective)
Given a data distribution \(p_d(x)\) and an approximating distribution \(p_\theta(x)\) with parameters \(\theta\), we define the score matching objective as

\[ J(\theta) = \frac{1}{2} \mathbb{E}_{p_d(x)}\left[ \left\| \psi_\theta(x) - \psi_d(x) \right\|^2 \right], \]

where \(\psi_\theta(x) = \nabla_x \log p_\theta(x)\) and \(\psi_d(x) = \nabla_x \log p_d(x)\) are the score functions of the model and the data distribution, and the derivatives are with respect to \(x\).
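In general \(\psi_d(x)\) is unknown, but in a toy setting where the data distribution is known we can estimate \(J(\theta)\) by Monte Carlo directly from its definition. The sketch below assumes standard normal data (so \(\psi_d(x) = -x\)) and an unnormalised zero-mean Gaussian model with precision \(\theta\) (so \(\psi_\theta(x) = -\theta x\)); both choices are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)       # samples from p_d (standard normal in this toy example)

def j_hat(theta):
    psi_model = -theta * x            # score of the unnormalised Gaussian model
    psi_data = -x                     # score of p_d, known only because this is a toy case
    return 0.5 * np.mean((psi_model - psi_data) ** 2)

for theta in [0.5, 1.0, 2.0]:
    print(theta, j_hat(theta))        # smallest (close to 0) at theta = 1
```

The objective vanishes at \(\theta = 1\), the parameter for which the model matches the data distribution, consistent with the result formalised below.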
We observe that if \(J(\theta) = 0\), then \(\psi_\theta(x) = \psi_d(x)\) almost everywhere (with respect to \(p_d\)). So we might expect that in this case the model distribution \(p_\theta(x)\) and the data distribution \(p_d(x)\) will also be equal. This intuition is formalised by the following result.
(Matching scores \(\iff\) matching distributions)

Suppose that the probability density function of \(x\) satisfies \(p_d(x) = p_{\theta^*}(x)\) for some \(\theta^*\), and also that if \(\theta \neq \theta^*\) then \(p_\theta(x) \neq p_d(x)\). Suppose also that \(p_\theta(x) > 0\) for all \(x\) and \(\theta\). Then

\[ J(\theta) = 0 \iff \theta = \theta^*. \]
Proof: Matching scores \(\iff\) matching distributions
Is implied by: We can see that \(\theta = \theta^* \implies J(\theta) = 0\) by substituting \(p_d(x) = p_{\theta^*}(x)\), and hence \(\psi_d(x) = \psi_{\theta^*}(x)\), into \(J(\theta^*)\):

\[ J(\theta^*) = \frac{1}{2} \mathbb{E}_{p_d(x)}\left[ \left\| \psi_{\theta^*}(x) - \psi_d(x) \right\|^2 \right] = 0. \]
Implies: Going in the other direction, we can show that \(J(\theta) = 0 \implies \theta = \theta^*\) by considering

\[ 0 = J(\theta) = \frac{1}{2} \int p_d(x) \left\| \psi_\theta(x) - \psi_{\theta^*}(x) \right\|^2 dx. \]
Since \(p_d(x) = p_{\theta^*}(x) > 0\), the above can hold only if \(\psi_\theta(x) = \psi_{\theta^*}(x)\) for every \(x\). This means that

\[ \log p_\theta(x) = \log p_{\theta^*}(x) + \text{const.}, \]
and since \(p_\theta(x)\) is a normalised probability distribution, the constant must be zero and we arrive at \(p_\theta(x) = p_{\theta^*}(x)\). Now, since by assumption no \(\theta \neq \theta^*\) gives a density equal to \(p_{\theta^*}(x)\), we conclude that \(\theta = \theta^*\).
This result confirms the intuition that if the score functions (of the data distribution and the model) are equal, then the distributions are equal as well. However, we note that this theorem assumes that \(p_d(x) = p_{\theta^*}(x)\) for some \(\theta^*\). In other words, it is assumed that the true data distribution \(p_d(x)\) is within the space of models we have hypothesised. In general this will not hold for real data, but at least the present result confirms that the expected behaviour is recovered in this idealised setting.
The last challenge is that, in its definition, \(J(\theta)\) depends explicitly on \(\psi_d(x)\), a function we do not have access to. In particular, expanding the squared norm, we see that

\[ J(\theta) = \mathbb{E}_{p_d(x)}\left[ \frac{1}{2} \left\| \psi_\theta(x) \right\|^2 - \psi_\theta(x)^\top \psi_d(x) + \frac{1}{2} \left\| \psi_d(x) \right\|^2 \right]. \]
The first term is computable because we have access to \(\psi_\theta(x)\). The last term does not depend on \(\theta\) and is therefore irrelevant for the optimisation of \(J(\theta)\). However, the middle term depends both on \(\theta\) and on the inaccessible score function of the data, \(\psi_d(x)\): it cannot be ignored in the optimisation of \(J(\theta)\), but it is also not directly computable. By using integration by parts, one can show [Hyvärinen and Dayan, 2005] that this term can be rewritten in a way such that \(J(\theta)\) can be estimated empirically.
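As a quick numerical sanity check of this claim (again in a toy setting where \(\psi_d\) happens to be known: standard normal data, so \(\psi_d(x) = -x\), and an illustrative model score \(\psi_\theta(x) = -\theta \tanh(x)\), i.e. \(q_\theta(x) = \cosh(x)^{-\theta}\)), the problematic cross term \(\mathbb{E}_{p_d}[\psi_\theta(x)\,\psi_d(x)]\) agrees with \(-\mathbb{E}_{p_d}[\partial_x \psi_\theta(x)]\), which involves only the model score:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)                 # samples from p_d
theta = 1.3

psi_model = -theta * np.tanh(x)                  # psi_theta(x)
dpsi_model = -theta * (1.0 - np.tanh(x) ** 2)    # d psi_theta / dx
psi_data = -x                                    # psi_d(x), known only in this toy case

print(np.mean(psi_model * psi_data))             # the intractable cross term, E[psi_theta * psi_d]
print(-np.mean(dpsi_model))                      # -E[d psi_theta / dx]; agrees up to Monte Carlo error
```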
(Equivalent form of \(J\))

Let \(\psi_\theta(x)\) be a score function which is differentiable with respect to \(x\). Then, under some weak regularity conditions on \(\psi_\theta(x)\), the score-matching objective \(J\) can be written as

\[ J(\theta) = \mathbb{E}_{p_d(x)}\left[ \sum_i \left( \partial_i \psi_{\theta,i}(x) + \frac{1}{2} \psi_{\theta,i}(x)^2 \right) \right] + \text{const.}, \]

where the \(i\)-subscript denotes the \(i^{th}\) entry of the vector being indexed and \(\partial_i\) denotes the partial derivative with respect to \(x_i\). The constant term is independent of \(\theta\).
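Before looking at the proof, note that this expression can be turned into code almost directly. The sketch below is an illustrative implementation of the empirical objective (the specific unnormalised model, a zero-mean Gaussian with precision \(\theta\), is an assumption made for the example), using JAX to obtain \(\psi_\theta\) and its derivatives by automatic differentiation:

```python
import jax
import jax.numpy as jnp

def log_q(x, theta):
    # Unnormalised zero-mean Gaussian with precision theta (illustrative choice).
    return -0.5 * theta * jnp.sum(x ** 2)

score = jax.grad(log_q, argnums=0)        # psi_theta(x) = grad_x log q_theta(x)
score_jac = jax.jacfwd(score, argnums=0)  # Jacobian d psi_theta / d x

def j_hat(theta, xs):
    """Empirical J(theta), up to the theta-independent constant."""
    def per_sample(x):
        psi = score(x, theta)
        # sum_i d_i psi_{theta,i}(x)  +  1/2 sum_i psi_{theta,i}(x)^2
        return jnp.trace(score_jac(x, theta)) + 0.5 * jnp.sum(psi ** 2)
    return jnp.mean(jax.vmap(per_sample)(xs))

# Samples from p_d; here p_d is a standard normal, so the minimiser should be near theta = 1.
xs = jax.random.normal(jax.random.PRNGKey(0), (1000, 2))
print(j_hat(0.5, xs), j_hat(1.0, xs), j_hat(2.0, xs))
```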
Proof: Equivalent form of \(J\)
Writing out \(J\),

\[ J(\theta) = \mathbb{E}_{p_d(x)}\left[ \frac{1}{2} \left\| \psi_\theta(x) \right\|^2 - \psi_\theta(x)^\top \psi_d(x) + \frac{1}{2} \left\| \psi_d(x) \right\|^2 \right], \]

we see that the last term in the brackets evaluates to a constant that is independent of \(\theta\). Using the fact that

\[ \psi_d(x) = \nabla_x \log p_d(x) = \frac{\nabla_x p_d(x)}{p_d(x)}, \]

and applying integration by parts, we obtain

\[ -\mathbb{E}_{p_d(x)}\left[ \psi_\theta(x)^\top \psi_d(x) \right] = -\int \sum_i \psi_{\theta,i}(x)\, \partial_i p_d(x)\, dx = \int p_d(x) \sum_i \partial_i \psi_{\theta,i}(x)\, dx = \mathbb{E}_{p_d(x)}\left[ \sum_i \partial_i \psi_{\theta,i}(x) \right], \]

where the boundary terms vanish under the regularity conditions. Substituting this into the expression for \(J\), we arrive at

\[ J(\theta) = \mathbb{E}_{p_d(x)}\left[ \sum_i \left( \partial_i \psi_{\theta,i}(x) + \frac{1}{2} \psi_{\theta,i}(x)^2 \right) \right] + \text{const.} \]
The proof has used integration by parts to replace the intractable \(\psi_d(x)\) term with an expectation under \(p_d(x)\) that involves only \(\psi_\theta(x)\) and its derivatives. This expression can easily be estimated if samples of \(x\) are available, by replacing the expectation with respect to \(p_d(x)\) by the empirical average over the samples. In this way, we can estimate the parameters \(\theta\) of a non-normalised model without resorting to computing estimates of the density \(p_d(x)\) or of the intractable normalising constant \(Z_\theta\).
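As a final sketch (all modelling choices here are illustrative assumptions, not prescribed by the text), the empirical objective from the previous listing can be minimised with plain gradient descent to recover the parameter of a simple unnormalised Gaussian model, without ever evaluating \(Z_\theta\) or estimating \(p_d(x)\):

```python
import jax
import jax.numpy as jnp

def log_q(x, theta):
    # Unnormalised zero-mean Gaussian with precision theta; Z_theta is never evaluated.
    return -0.5 * theta * jnp.sum(x ** 2)

score = jax.grad(log_q, argnums=0)
score_jac = jax.jacfwd(score, argnums=0)

def j_hat(theta, xs):
    def per_sample(x):
        psi = score(x, theta)
        return jnp.trace(score_jac(x, theta)) + 0.5 * jnp.sum(psi ** 2)
    return jnp.mean(jax.vmap(per_sample)(xs))

# Data drawn from a zero-mean Gaussian with true precision 2.0.
xs = jax.random.normal(jax.random.PRNGKey(1), (2000, 1)) / jnp.sqrt(2.0)

theta, lr = 0.5, 0.1
grad_j = jax.grad(j_hat, argnums=0)         # gradient of the empirical objective w.r.t. theta
for _ in range(200):
    theta = theta - lr * grad_j(theta, xs)  # plain gradient descent
print(theta)  # approximately 2.0: the precision is recovered without the normalising constant
```

The empirical average over samples stands in for the expectation under \(p_d(x)\), exactly as described above; more refined optimisers or models could of course be swapped in.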