# Estimation by score matching

Score matching is a useful trick due to Hyvärinen [Hyvärinen and Dayan, 2005] for learning the parameters of intractable probabilistic models. It can be used to train probabilistic models whose likelihood function takes the form

\[p_\theta(x) = \frac{1}{Z_\theta} q_\theta(x),\]

where \(q_\theta(x)\) is a positive function we can evaluate, but \(Z_\theta\) is a normalising constant which we cannot evaluate in closed form. Non-Gaussian Markov random fields are examples of such models.

## The score matching trick

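The first step of the score matching trick is to notice that taking the log and then the gradient with respect to \(x\) of both sides eliminates the intractable \(Z_\theta\):

\[\nabla \log p_\theta(x) = \nabla \log q_\theta(x) - \nabla \log Z_\theta = \nabla \log q_\theta(x),\]

since \(Z_\theta\) does not depend on \(x\). The gradient of the log-density with respect to \(x\) is called the score function

\[\psi_\theta(x) = \nabla \log p_\theta(x) = \nabla \log q_\theta(x).\]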
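As a minimal sketch of this cancellation, the snippet below compares the score computed from an unnormalised density with the score computed from its normalised counterpart. The one-dimensional Gaussian model, the parameter values and the helper names (`log_q`, `log_p`, `grad_x`) are assumptions chosen for illustration only; the Gaussian is used because its normalising constant happens to be known, which lets us check the result.

```python
import numpy as np

def log_q(x, mu=1.0, sigma=2.0):
    """Log of an unnormalised Gaussian density (hypothetical example model)."""
    return -0.5 * ((x - mu) / sigma) ** 2

def log_p(x, mu=1.0, sigma=2.0):
    """Log of the normalised density: log q minus log Z, with Z = sigma * sqrt(2 pi)."""
    return log_q(x, mu, sigma) - np.log(sigma * np.sqrt(2.0 * np.pi))

def grad_x(f, x, eps=1e-5):
    """Central finite-difference derivative of f with respect to x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x = np.linspace(-3.0, 5.0, 9)
score_from_q = grad_x(log_q, x)       # uses only the unnormalised density
score_from_p = grad_x(log_p, x)       # uses the normalised density
score_exact = -(x - 1.0) / 2.0**2     # analytic score of N(1, 2^2)

print(np.max(np.abs(score_from_q - score_from_p)))
```

The two numerical scores agree up to floating-point error, because \(\log Z_\theta\) is a constant in \(x\) and drops out of the derivative.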

The second step is to find a way to use the score function \(\psi_\theta(x)\) along with some observed data, to estimate the parameters \(\theta\). We can achieve this by defining the following score matching objective.

**Definition** (Score matching objective)

Given a data distribution \(p_d(x)\) and an approximating distribution \(p_\theta(x)\) with parameters \(\theta\), we define the score matching objective as

\[J(\theta) = \frac{1}{2} \int p_d(x) \left\| \psi_\theta(x) - \psi_d(x) \right\|^2 dx,\]

where \(\psi_\theta(x) = \nabla \log p_\theta(x)\) and \(\psi_d(x) = \nabla \log p_d(x)\), and the derivatives are with respect to \(x\).
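To make the objective concrete, here is a small sketch that estimates \(J(\theta)\) by Monte Carlo in a toy setting where the data score is known. It assumes, purely for illustration, that the data distribution is a standard normal and that the model is a Gaussian with mean `mu` and standard deviation `s`; the function names are hypothetical. In practice \(\psi_d(x)\) is not available, so this direct computation is only possible in such artificial cases.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)  # samples from the assumed data distribution N(0, 1)

def psi_data(x):
    """Score of the data distribution N(0, 1)."""
    return -x

def psi_model(x, mu, s):
    """Score of the model N(mu, s^2)."""
    return -(x - mu) / s**2

def J_monte_carlo(mu, s):
    """Monte Carlo estimate of J: half the mean squared score difference."""
    diff = psi_model(x, mu, s) - psi_data(x)
    return 0.5 * np.mean(diff**2)

def J_exact(mu, s):
    """Closed form of J for this Gaussian pair, for comparison."""
    return 0.5 * ((1.0 - 1.0 / s**2) ** 2 + mu**2 / s**4)

print(J_monte_carlo(0.5, 1.5), J_exact(0.5, 1.5))
print(J_monte_carlo(0.0, 1.0))  # model equals the data distribution
```

The estimate matches the closed form up to sampling error, and the objective vanishes when the model parameters coincide with those of the data distribution.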

We observe that if \(J(\theta) = 0\), then \(\psi_\theta(x) = \psi_d(x)\) almost everywhere. So we might expect that in this case the model distribution \(p_\theta(x)\) and the data distribution \(p_d(x)\) will also be equal. This intuition is formalised by the following result.

**Theorem** (Matching scores \(\iff\) matching distributions)

Suppose that the probability density function of \(x\) satisfies \(p_d(x) = p_{\theta^*}(x)\) for some \(\theta^*\), and that if \(\theta \neq \theta^*\) then \(p_\theta(x) \neq p_d(x)\). Suppose also that \(p_\theta(x) > 0\) for all \(x\) and \(\theta\). Then

\[J(\theta) = 0 \iff \theta = \theta^*.\]

## Proof: Score matching \(\iff\) maximum likelihood

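**Is implied by:**
We can see that \(\theta = \theta^* \implies J(\theta) = 0\) by substituting \(p_d(x) = p_{\theta^*}(x)\) into \(J(\theta^*)\):

\[J(\theta^*) = \frac{1}{2} \int p_{\theta^*}(x) \left\| \psi_{\theta^*}(x) - \psi_{\theta^*}(x) \right\|^2 dx = 0.\]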

**Implies:**
Going the other direction, we can show that \(J(\theta) = 0 \implies \theta = \theta^*\) by considering

\[J(\theta) = \frac{1}{2} \int p_{\theta^*}(x) \left\| \psi_\theta(x) - \psi_{\theta^*}(x) \right\|^2 dx = 0.\]

Since \(p_{\theta^*}(x) > 0\), the above can hold only if \(\psi_\theta(x) = \psi_{\theta^*}(x)\) for every \(x\). This means that

\[\log p_\theta(x) = \log p_{\theta^*}(x) + c\]

for some constant \(c\), and since \(p_\theta(x)\) is a normalised probability distribution, we must have \(c = 0\), so that \(p_\theta(x) = p_{\theta^*}(x)\). Now since \(p_{\theta^*}(x)\) is unique for the particular \(\theta^*\), we have that \(\theta = \theta^*\).

This result confirms the intuition that if the score functions of the data distribution and the model are equal, then the distributions are equal as well. However, we note that this theorem assumes that \(p_d(x) = p_{\theta^*}(x)\) for some \(\theta^*\). In other words, it is assumed that the true distribution \(p_d(x)\) lies within the space of models we have hypothesised. In general this will not hold for real data, but at least the present result confirms that the expected behaviour is recovered in this idealised setting.

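The last challenge is that, as defined, \(J(\theta)\) depends explicitly on \(\psi_d(x)\), a function we do not have access to. In particular, expanding the squared norm, we see that

\[J(\theta) = \frac{1}{2} \int p_d(x) \left[ \left\| \psi_\theta(x) \right\|^2 - 2\, \psi_\theta(x)^\top \psi_d(x) + \left\| \psi_d(x) \right\|^2 \right] dx.\]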

The first term is computable because we have access to \(\psi_\theta(x)\). The last term does not depend on \(\theta\) and is therefore irrelevant for the optimisation of \(J(\theta)\). However, the middle term depends on both \(\theta\) and the inaccessible score function of the data \(\psi_d(x)\). This term must therefore be considered in the optimisation of \(J(\theta)\), but it is not directly computable. By using integration by parts, one can show [Hyvärinen and Dayan, 2005] that this term can be rewritten in such a way that \(J(\theta)\) can be estimated empirically.
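As a numerical sanity check of this claim in one dimension, the sketch below compares the cross term \(\mathbb{E}_{p_d}[\psi_\theta(x) \psi_d(x)]\), which needs the data score, with the quantity \(-\mathbb{E}_{p_d}[\partial_x \psi_\theta(x)]\) that integration by parts yields, which needs only samples. The standard normal data distribution, the model score \(\psi_\theta(x) = -\tanh(\theta x)\) and the value of `theta` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(400_000)  # assumed data distribution N(0, 1), so psi_d(x) = -x

theta = 0.7  # hypothetical model parameter

psi_model = -np.tanh(theta * x)                        # model score
dpsi_model = -theta * (1.0 - np.tanh(theta * x) ** 2)  # its derivative w.r.t. x
psi_data = -x                                          # data score (unknown in practice)

cross_term = np.mean(psi_model * psi_data)  # requires psi_d
rewritten = -np.mean(dpsi_model)            # requires only samples from p_d

print(cross_term, rewritten)
```

The two averages agree up to Monte Carlo error, even though the second never evaluates \(\psi_d(x)\).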

**Theorem** (Equivalent form of \(J\))

Let \(\psi_\theta(x)\) be a score function which is differentiable with respect to \(x\). Then, under some weak regularity conditions on \(\psi_\theta(x)\), the score matching objective \(J\) can be written as

\[J(\theta) = \int p_d(x) \sum_i \left[ \partial_i \psi_{\theta, i}(x) + \frac{1}{2} \psi_{\theta, i}(x)^2 \right] dx + \text{const.},\]

where the \(i\)-subscript denotes the \(i^{th}\) entry of the vector being indexed and \(\partial_i\) denotes the partial derivative with respect to \(x_i\). The constant term is independent of \(\theta\).
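The following sketch evaluates the sample version of this expression for a one-dimensional Gaussian model, where \(\psi_\theta(x) = -(x - \mu)/\sigma^2\) and \(\partial_x \psi_\theta(x) = -1/\sigma^2\). The synthetic data, the parameterisation by mean and variance, and the name `J_hat` are assumptions for illustration. For this particular model the minimiser can be found by hand: it is the sample mean and the (biased) sample variance.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=50_000)  # hypothetical observed samples

def J_hat(mu, var, x):
    """Empirical score matching objective (up to the constant) for N(mu, var)."""
    psi = -(x - mu) / var  # model score at each sample
    dpsi = -1.0 / var      # derivative of the score w.r.t. x (the same for all x)
    return np.mean(dpsi + 0.5 * psi**2)

# Closed-form minimiser of J_hat for this model.
mu_hat = x.mean()
var_hat = np.mean((x - mu_hat) ** 2)

print(mu_hat, var_hat)
print(J_hat(mu_hat, var_hat, x), J_hat(mu_hat + 0.1, var_hat, x))
```

Perturbing either parameter away from `mu_hat` or `var_hat` increases `J_hat`, and no normalising constant or data score is evaluated at any point.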

## Proof: Equivalent form of \(J\)

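Writing out \(J\)

\[J(\theta) = \int p_d(x) \left[ \frac{1}{2} \left\| \psi_\theta(x) \right\|^2 - \psi_\theta(x)^\top \psi_d(x) + \frac{1}{2} \left\| \psi_d(x) \right\|^2 \right] dx,\]

we see that the last term in the brackets evaluates to a constant that is independent of \(\theta\). Using the fact that

\[p_d(x)\, \psi_d(x) = p_d(x)\, \nabla \log p_d(x) = \nabla p_d(x),\]

and applying integration by parts, we obtain

\[-\int p_d(x)\, \psi_\theta(x)^\top \psi_d(x)\, dx = -\sum_i \int \psi_{\theta, i}(x)\, \partial_i p_d(x)\, dx = \sum_i \int p_d(x)\, \partial_i \psi_{\theta, i}(x)\, dx,\]

where the boundary term vanishes provided that \(p_d(x)\, \psi_\theta(x) \to 0\) as \(\|x\| \to \infty\). Substituting this into the expression for \(J\) we arrive at

\[J(\theta) = \int p_d(x) \sum_i \left[ \partial_i \psi_{\theta, i}(x) + \frac{1}{2} \psi_{\theta, i}(x)^2 \right] dx + \text{const.}\]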

The proof uses integration by parts to replace the intractable \(\psi_d(x)\) term with an expectation with respect to \(p_d(x)\). This expression can be estimated easily when samples of \(x\) are available, by replacing the expectation with respect to \(p_d(x)\) with the empirical average over the samples. In this way, we can estimate the parameters \(\theta\) of a non-normalised model without computing the normalising constant \(Z_\theta\) or estimating the data density \(p_d(x)\).
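As a sketch of the full procedure in more than one dimension, consider the unnormalised model \(q_\Lambda(x) = \exp(-\tfrac{1}{2} x^\top \Lambda x)\) with a symmetric matrix parameter \(\Lambda\). Its score is \(\psi_\Lambda(x) = -\Lambda x\) and \(\sum_i \partial_i \psi_{\Lambda, i}(x) = -\mathrm{tr}(\Lambda)\), so the empirical objective is quadratic in \(\Lambda\) and is minimised at the inverse of the sample second-moment matrix. The zero-mean Gaussian model, the synthetic data and all names below are assumptions chosen so that the answer can be checked.

```python
import numpy as np

rng = np.random.default_rng(2)
true_precision = np.array([[2.0, 0.6], [0.6, 1.0]])  # hypothetical ground truth
cov = np.linalg.inv(true_precision)
X = rng.multivariate_normal(np.zeros(2), cov, size=100_000)  # synthetic samples

def J_hat(Lam, X):
    """Empirical score matching objective (up to the constant) for q = exp(-x^T Lam x / 2)."""
    psi = -X @ Lam               # rows are psi(x_n) = -Lam x_n, using symmetry of Lam
    divergence = -np.trace(Lam)  # sum_i d psi_i / d x_i, identical for every sample
    return divergence + 0.5 * np.mean(np.sum(psi**2, axis=1))

# Closed-form minimiser: inverse of the sample second-moment matrix.
S = X.T @ X / len(X)
Lam_hat = np.linalg.inv(S)

print(Lam_hat)
print(J_hat(Lam_hat, X), J_hat(true_precision, X))
```

The estimate `Lam_hat` is close to `true_precision`, and it was obtained from samples alone, without ever evaluating the normalising constant of the model.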