Gumbel distribution

The Gumbel distribution has enjoyed a fair amount of attention by the machine learning community. It has a number of useful properties, including the Gumbel trick, and has proved very useful in applications such as A* sampling and the Concrete distribution. Here we discuss some of the properties of the Gumbel, which are useful for these applications and beyond.

Gumbels and exponentials

We say that a random variable \(\Gamma\) is Gumbel distributed, with parameter \(\kappa\), if its CDF is

\[\begin{align} p(\Gamma \leq \gamma) = e^{-e^{- (\gamma - \kappa)}}. \end{align}\]

The when \(\kappa = 0\), we call the distribution a standard Gumbel. The Gumbel distribution is closely related to the exponential distribution. We say that a random variable \(\Xi\) is exponentially distributed, with parameter \(\lambda\), if its CDF is

\[\begin{align} p(\Xi \leq \xi) = 1 - e^{-\lambda \xi}. \end{align}\]

When \(\lambda = 1\), we call the distribution a standard exponential. Now, to draw a standard Gumbel \(\Gamma\) or a standard exponential \(\Xi\) random variable, we can first sample a uniformly distributed variable \(U \sim \mathcal{U}[0, 1]\) and apply the corresponding inverse CDFs to \(U\)

\[\begin{split}\begin{align} \Gamma &= - \log \left(- \log U\right), \\ \Xi &= - \log U. \end{align}\end{split}\]

A standard Gumbel random variable is therefore distributed identically to the negative logarithm of a standard exponential random variable

\[\begin{align} \Gamma &= - \log \Xi. \end{align}\]

More generally, a Gumbel with parameter \(\kappa\) is identically distributed to an exponential \(\Xi\) with parameter \(\lambda = \log \kappa\) since

\[\begin{equation} p(\Gamma \leq \gamma) = e^{-e^{- (\gamma - \kappa)}} = e^{-\lambda e^{- \gamma}} \implies p(\Xi \geq e^{-\gamma}) = e^{-\lambda v} \implies p(\Xi \leq e^{-\gamma} = v) = 1 - e^{-\lambda v}. \end{equation}\]

Minimum of exponentials

The exponential distribution has numerous interesting properties. One of these properties is that the minimum of \(K\) exponentially distributed random variables \((\Xi_1, \dots, \Xi_K)\) with parameters \((\lambda_1, \dots, \lambda_K)\) respectively, is also exponentially distributed according to

\[\begin{align} L = \max\left(\Xi_1, \dots, \Xi_K\right) \sim \text{Exp}\left(\lambda_1 + \dots + \lambda_K\right), \end{align}\]

whilst the index of the minimiser is categorically distributed according to

\[\begin{align} I = \text{argmin} \left(\Xi_1, \dots, \Xi_K\right) \sim \text{Cat}\left(\pi_1, \cdots, \pi_K\right), \text{ where } \pi_i = \frac{\lambda_i}{\lambda_1 + \dots + \lambda_K}. \end{align}\]

Further, and somewhat remarkably, the random variables \(L\) and \(I\) are independent. We can derive this result directly, following Bach. Let us consider the joint distribution

\[\begin{split}\begin{align} p(I = i, L \geq l) &= p\left(\Xi_i \geq l \text{ and } \Xi_j \geq \Xi_i \text{ for all } j \neq i\right) \\ &= \int_l^\infty p(\Xi_i = \xi_i) \prod_{j \neq i} p\left(\Xi_j \geq \xi_i \right) \text{d}\xi_i \\ &= \int_l^\infty \lambda_i e^{-\lambda_i \xi_i} \prod_{j \neq i} e^{-\lambda_j \xi_i} \text{d}\xi_i \\ &= \underbrace{\frac{\lambda_i}{\lambda_1 + \dots + \lambda_K}}_{p(I = i)}~~\underbrace{\exp\left(-\left[\sum_{k=1}^K \lambda_k\right] l\right)}_{p(L \geq l)}. \end{align}\end{split}\]

The distributions of \(I\) and \(M\) therefore factorise and the variables are independent, whilst their marginal distributions are

\[\begin{split}\begin{align} p(I = i) &= \frac{\lambda_i}{\lambda_1 + \dots + \lambda_K} \\ p(L \geq l) &= \exp\left[-\left(\sum_{k=1}^K \lambda_k\right) l\right]. \end{align}\end{split}\]

The Gumbel trick

The Gumbel trick is the same property, packaged in a different way. In particular

\[\begin{split}\begin{align} I &= \text{argmin} \left(\Xi_1, \dots, \Xi_K\right) \\ &= \text{argmin} \left(\log \Xi_1, \dots, \log \Xi_K\right) \\ &= \text{argmax} \left(-\log \Xi_1, \dots, -\log \Xi_K\right) \\ &= \text{argmax} \left(\Gamma_1, \dots, \Gamma_K\right) \end{align}\end{split}\]

where each \(\Gamma_i\) is Gumbel distributed with parameter \(\kappa_i = \log \lambda_i\). Using the fact that

\[\begin{align} \Gamma_i = \kappa_i + Z_i, \end{align}\]

where \(Z_i\) is a standard Gumbel random variable, we arrive at

\[\begin{align} I = \text{argmax} \left(\kappa_1 + Z_1, \dots, \kappa_K + Z_K\right). \end{align}\]

Therefore, if we suppose \(\pi_{1:K}\) are the probabilities of a categorical

\[\begin{align} \pi_i \geq 0, \sum_{k=1}^K \pi_k = 1, \end{align}\]

and we take their logarithms \(\kappa_{1:K} = \log \pi_{1:K}\) and define the random variable

\[\begin{align} I = \text{argmax} \left(\kappa_1 + Z_1, \dots, \kappa_K + Z_K\right). \end{align}\]

where \(Z_i\) is a standard Gumbel, then the \(I\) is distribued according to the categorical

\[\begin{align} p\left(I = i\right) = \pi_i. \end{align}\]