Main limit theorems#

In this chapter we introduce the idea of convergence for random variables, which may be in any of three senses: (1) in mean square, (2) in probability or (3) in distribution. These three senses are not equivalent: each implies the next in the order given above, but the converse implications do not hold in general. We present important theorems involving limits of random variables, such as the law of large numbers, the central limit theorem and the large deviation theorem. We also present the continuity theorems for moment generating functions and characteristic functions.

Convergence in mean-square#

We are often interested in the convergence of a sequence of random variables to another random variable. Unlike real sequences, where convergence has one meaning only, the convergence of a sequence of random variables can be defined in different ways. One such way is convergence in mean-square, as defined below.

Definition 46 (Mean-square convergence)

We say that a sequence of random variables \(X_1, X_2, ...\) converges in mean square to a random variable \(X\) if

\[\begin{align} \mathbb{E}\left[(X_n - X)^2\right] \to 0 \text{ as } n \to \infty, \end{align}\]

and write this as \(X_n \to X\) in mean square as \(n \to \infty\).
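As a quick illustration of the definition (an example of our own, not used later), suppose \(X_n\) takes the value \(1\) with probability \(1/n\) and \(0\) otherwise, and let \(X = 0\). Then

\[\begin{align} \mathbb{E}\left[(X_n - X)^2\right] = \mathbb{E}\left(X_n^2\right) = \frac{1}{n} \to 0 \text{ as } n \to \infty, \end{align}\]

so \(X_n \to 0\) in mean square.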

Note that for mean-square convergence to be meaningful, the second moments \(\mathbb{E}\left(X_n^2\right)\) and \(\mathbb{E}\left(X^2\right)\) must be finite, so that each \(\mathbb{E}\left[(X_n - X)^2\right]\) exists. We can use this definition to state one version of the law of large numbers, the mean-square law of large numbers, as follows.

Theorem 44 (Mean square law of large numbers)

Let \(X_1, X_2, ...\) be a sequence of independent random variables each with mean \(\mu\) and variance \(\sigma^2.\) Then

\[\begin{align} \frac{1}{N}\sum_{n = 1}^N X_n \to \mu \text{ as } N \to \infty \text{ in mean square}. \end{align}\]
Proof: Mean-square law of large numbers

The sample mean \(\frac{1}{N}S_N\), where \(S_N = X_1 + X_2 + ... + X_N\), has expectation

\[\begin{align} \mathbb{E}\left(\frac{1}{N}S_N\right) = \frac{1}{N}\sum_{n = 1}^N \mathbb{E}(X_n) = \mu \end{align}\]

and variance

\[\begin{equation} \text{Var}\left(\frac{1}{N} S_N\right) = \frac{1}{N^2}\text{Var}\left(S_N\right) = \frac{1}{N} \sigma^2. \end{equation}\]

where we have used the facts that \(\text{Var}(aX) = a^2\text{Var}(X)\) and that \(\text{Var}(Y_1 + Y_2) = \text{Var}(Y_1) + \text{Var}(Y_2)\) for any independent variables \(Y_1\) and \(Y_2\). Therefore

\[\begin{align} \mathbb{E}\left[\left(\frac{1}{N}S_N - \mu\right)^2\right] = \text{Var}\left(\frac{1}{N} S_N\right) = \frac{\sigma^2}{N} \to 0 \text{ as } N \to \infty, \end{align}\]

so \(\frac{1}{N}S_N \to \mu\) in mean square.
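As a quick numerical sanity check of the \(\sigma^2/N\) decay above, here is a minimal simulation sketch (assuming NumPy is available; the Exponential(1) summands, sample sizes and seed are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 1.0  # mean and variance of the Exponential(1) distribution

for N in [10, 100, 1000]:
    # Draw 10000 independent sample means, each built from N Exponential(1) variables.
    sample_means = rng.exponential(scale=1.0, size=(10_000, N)).mean(axis=1)
    mse = np.mean((sample_means - mu) ** 2)  # Monte Carlo estimate of E[(S_N/N - mu)^2]
    print(f"N = {N:>4}:  MSE ~ {mse:.5f},  sigma^2/N = {sigma2 / N:.5f}")
```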

Convergence in probability#

A weaker sense in which a sequence of random variables can converge is that of convergence in probability.

Definition 47 (Convergence in probability)

We say that a sequence of random variables \(X_1, X_2, ...\) converges in probability to \(X\) as \(n \to \infty\) if for all \(\epsilon > 0\)

\[\begin{align} \mathbb{P}\left(|X_n - X| > \epsilon\right) \to 0, \text{ as } n \to \infty, \end{align}\]

and write this as \(X_n \to X\) in probability as \(n \to \infty.\)
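A standard example (our own, not used later) showing that this sense of convergence is genuinely weaker: let \(X_n\) take the value \(n\) with probability \(1/n\) and \(0\) otherwise. Then for any \(\epsilon > 0\)

\[\begin{align} \mathbb{P}(|X_n - 0| > \epsilon) \leq \frac{1}{n} \to 0 \text{ as } n \to \infty, \end{align}\]

so \(X_n \to 0\) in probability, yet \(\mathbb{E}\left[(X_n - 0)^2\right] = n \to \infty\), so \(X_n\) does not converge to \(0\) in mean square.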

Convergence in probability is a weaker condition than convergence in mean square, in the sense that the latter implies the former, as stated by the following theorem.

Theorem 45 (Convergence in mean square \(\implies\) convergence in probability)

If \(X_1, X_2, ...\) is a sequence of random variables and \(X_n \to X\) in mean square, then \(X_n \to X\) in probability.

To prove the above we use Chebyshev’s inequality, a useful tool in probability theory, which is presented and proved below.

Proof: Convergence in mean square \(\implies\) convergence in probability

To show this result, we use Chebyshev’s inequality

\[\begin{align} \mathbb{P}(|Z| \geq t) \leq \frac{\mathbb{E}\left(Z^2\right)}{t^2}, \end{align}\]

which is proved below. Now set \(Z = X_n - X\) and \(t = \epsilon > 0\) and let \(n \to \infty\) to obtain

\[\begin{align} \mathbb{P}(|X_n - X| \geq \epsilon) \leq \frac{\mathbb{E}\left((X_n - X)^2\right)}{\epsilon^2} \to 0 \text{ as } n \to \infty, \end{align}\]

where we have used the assumption that \(X_n \to X\) in mean square, so \(\mathbb{P}\left(|X_n - X| \geq \epsilon\right) \to 0\) as \(n \to \infty.\)

Theorem 46 (Chebyshev’s inequality)

If \(X\) is a random variable and \(\mathbb{E}\left(X^2\right)\) is finite, then for any \(t > 0\)

\[\begin{align} \mathbb{P}(|X| \geq t) \leq \frac{\mathbb{E}\left(X^2\right)}{t^2}. \end{align}\]
Proof: Chebyshev’s inequality

Since \(|X| \geq t \iff X^2 \geq t^2\) for \(t > 0\), we have

\[\begin{align} \mathbb{P}(|X| \geq t) = \mathbb{P}(X^2 \geq t^2) \leq \frac{\mathbb{E}\left(X^2\right)}{t^2}, \end{align}\]

by the Markov inequality, arriving at Chebyshev's inequality.
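The inequality is easy to check numerically. Below is a minimal sketch (assuming NumPy; the standard normal \(X\), the values of \(t\) and the seed are arbitrary illustrative choices) comparing \(\mathbb{P}(|X| \geq t)\) with the bound \(\mathbb{E}(X^2)/t^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)  # samples of X ~ N(0, 1), so E(X^2) = 1

for t in [0.5, 1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(x) >= t)  # Monte Carlo estimate of P(|X| >= t)
    bound = np.mean(x ** 2) / t ** 2     # Chebyshev bound E(X^2) / t^2
    print(f"t = {t}:  P(|X| >= t) ~ {empirical:.4f}  <=  bound {bound:.4f}")
```

Note that for small \(t\) the bound exceeds \(1\) and is trivially true; it only becomes informative for \(t > \sqrt{\mathbb{E}(X^2)}\).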

Just as the law of large numbers can be stated in the mean-square sense, it can also be stated in the weaker sense of convergence in probability.

Theorem 47 (Weak law of large numbers)

Let \(X_1, X_2, ...\) be a sequence of independent random variables each with mean \(\mu\) and variance \(\sigma^2.\) Then

\[\begin{align} \frac{1}{N}\sum_{n = 1}^N X_n \to \mu \text{ as } N \to \infty \text{ in probability.} \end{align}\]

This weak version is a direct implication of the fact that convergence in mean-square implies convergence in probability, as shown in the following proof.

Proof: Weak law of large numbers

We have shown that as \(N \to \infty\), the sample mean

\[\begin{align} \frac{1}{N}S_N = \frac{1}{N}\left(X_1 + X_2 + ... + X_N\right) \end{align}\]

converges to \(\mu\) in mean square. Convergence in mean square implies convergence in probability, therefore \(\frac{1}{N}S_N\) converges to \(\mu\) in probability too.
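To see the weak law in action, here is a minimal simulation sketch (assuming NumPy; the Uniform(0, 1) summands, the tolerance \(\epsilon\) and the seed are arbitrary illustrative choices) estimating \(\mathbb{P}\left(\left|\frac{1}{N}S_N - \mu\right| > \epsilon\right)\) for increasing \(N\):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, eps = 0.5, 0.05  # mean of Uniform(0, 1) and a fixed tolerance

for N in [10, 100, 1000]:
    # 10000 independent realisations of the sample mean of N Uniform(0, 1) draws.
    sample_means = rng.uniform(size=(10_000, N)).mean(axis=1)
    prob = np.mean(np.abs(sample_means - mu) > eps)  # estimate of P(|S_N/N - mu| > eps)
    print(f"N = {N:>4}:  P(|S_N/N - mu| > {eps}) ~ {prob:.4f}")
```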

Unlike the mean-square law of large numbers, the weak law holds even when \(\sigma^2\) is infinite, provided that the \(X_n\) all come from the same distribution.

Central limit theorem#

Theorem 48 (Central limit theorem)

Let \(X_1, X_2, \dots\) be a sequence of independent and identically distributed random variables, each with mean \(\mu\) and variance \(\sigma^2.\) Then the random variable

\[\begin{align} Z_N = \frac{S_N - N\mu}{\sqrt{N}\sigma}, \text{ where } S_N = X_1 + X_2 + ... + X_N \end{align}\]

satisfies, as \(N \to \infty\),

\[\begin{align} \mathbb{P}(Z_N \leq z) \to \int^z_{-\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2} du, \text{ for } z \in \mathbb{R}. \end{align}\]

Theorem 49 (Continuity theorem with mgfs)

Let \(X_1, X_2, ...\) be a sequence of random variables with moment generating functions \(M_1, M_2, \dots\) and suppose that as \(n \to \infty\)

\[\begin{align} M_n(t) \to e^{\frac{1}{2}t^2} \text{ for } t \in \mathbb{R}. \end{align}\]

Then

\[\begin{align} \mathbb{P}(X_n \leq z) \to \int^z_{-\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2} du, \text{ for } z \in \mathbb{R}. \end{align}\]
Proof: Central limit theorem

Let \(U_n = X_n - \mu\), so that the \(U_n\) are independent and identically distributed with mean \(0\) and variance \(\sigma^2\). We will show that the mgf of

\[\begin{align} Z_N = \frac{S_N - N\mu}{\sigma\sqrt{N}}, \text{ where } S_N = X_1 + X_2 + ... + X_N \end{align}\]

converges to the mgf of the standard normal. Then, by applying the continuity theorem for mgfs, the distribution function of \(Z_N\) converges to the standard normal distribution. Writing out the mgf of \(Z_N\)

\[\begin{split}\begin{align} M_{Z_N}(t) &= \mathbb{E}\left(\exp(tZ_N)\right)\\ &= \mathbb{E}\left[\exp\left(\frac{t}{\sqrt{N}\sigma} \sum^N_{n=1} U_n \right)\right]\\ &= \left[ M_U\left(\frac{t}{\sigma\sqrt{N}}\right) \right]^N \end{align}\end{split}\]

where \(M_U\) denotes the common mgf of the \(U_n\), and we have used the fact that the mgf of a sum of independent random variables is equal to the product of their mgfs. Now, taking the Taylor expansion of \(M_U(t)\) about \(0\)

\[\begin{split}\begin{align} M_{U}(t) &= 1 + t\mathbb{E}(U_n) + \frac{1}{2}t^2\mathbb{E}(U_n^2) + o(t^2)\\ &= 1 + \frac{1}{2} \sigma^2 t^2 + o(t^2) \end{align}\end{split}\]

so that

\[\begin{align} M_{Z_N}(t) = \left[ 1 + \frac{1}{2} \frac{t^2}{N} + o\left(\frac{1}{N}\right) \right]^N \to e^{\frac{1}{2}t^2} \text{ as } N \to \infty. \end{align}\]

Therefore \(M_{Z_N}(t) \to e^{\frac{1}{2}t^2}\) as \(N \to \infty\) and by the continuity theorem for mgfs, the distribution function of \(Z_N\) converges to the standard normal distribution:

\[\begin{align} \mathbb{P}(Z_N \leq z) \to \int^z_{-\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2} du, \text{ for } z \in \mathbb{R}. \end{align}\]
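As an illustrative check of the theorem (a minimal sketch assuming NumPy; the Exponential(1) summands, \(N\), the number of repetitions and the seed are arbitrary choices), we can compare the empirical distribution of \(Z_N\) against the standard normal cdf. The sketch uses the fact that a sum of \(N\) independent Exponential(1) variables has a Gamma\((N, 1)\) distribution, so \(S_N\) can be sampled directly:

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(z):
    # Standard normal cdf written via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(3)
N, reps = 1000, 200_000
mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)

# S_N = X_1 + ... + X_N with X_i ~ Exponential(1) is Gamma(N, 1) distributed.
s_n = rng.gamma(shape=N, scale=1.0, size=reps)
z_n = (s_n - N * mu) / (sigma * np.sqrt(N))  # standardised sums Z_N

for z in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"z = {z:+.1f}:  P(Z_N <= z) ~ {np.mean(z_n <= z):.4f},  Phi(z) = {std_normal_cdf(z):.4f}")
```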

Large deviations#

We are sometimes interested in the probability that a sum of i.i.d. random variables \(S_N = X_1 + X_2 + ... + X_N\) will deviate from its mean by an amount proportional to \(N\), i.e. \(\mathbb{P}(S_N - \mu N > aN).\) The large deviation theorem relates the probability of a large deviation to a quantity called the Fenchel-Legendre transform, defined below.

Definition 48 (Fenchel-Legendre transform)

Given a random variable \(X\), with moment generating function \(M\), its Fenchel-Legendre transform is

\[\begin{align} \Lambda^*(a) = \sup\left\{at - \Lambda(t) : t \in \mathbb{R}\right\}, \text{ where } a \in \mathbb{R}, \end{align}\]

where \(\Lambda(t) = \log M(t).\)
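As a concrete example (worked here for illustration, not taken from the text), if \(X \sim N(0, \sigma^2)\) then \(M(t) = e^{\frac{1}{2}\sigma^2 t^2}\), so \(\Lambda(t) = \frac{1}{2}\sigma^2 t^2\) and

\[\begin{align} \Lambda^*(a) = \sup\left\{at - \tfrac{1}{2}\sigma^2 t^2 : t \in \mathbb{R}\right\} = \frac{a^2}{2\sigma^2}, \end{align}\]

where the supremum is attained at \(t = a/\sigma^2\).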

The large deviation theorem then takes the following form.

Theorem 50 (Large deviation theorem)

Let \(X_1, X_2, ...\) be independent, identically distributed random variables with mean \(0\) and a common moment generating function \(M(t) = \mathbb{E}(e^{tX})\) which is finite in some neighbourhood \([-\delta, \delta]\) of the origin. Let \(a > 0\) be such that \(\mathbb{P}(X > a) > 0\). Then \(\Lambda^*(a) > 0\) and

\[\begin{align} \frac{1}{N} \log \mathbb{P}(S_N > aN) \to - \Lambda^*(a), \text{ as } N \to \infty. \end{align}\]

The proof of this theorem can be found in Probability and Random Processes [Grimmett and Stirzaker, 2001] (Section 5.11), but is omitted from the book and these notes. Instead we prove below that

\[\begin{align} \frac{1}{N}\log\mathbb{P}(S_N > aN) \leq -\Lambda^*(a). \end{align}\]
Proof: Partial proof of the large deviation theorem

Let \(X_1, X_2, ...\) be a sequence of independent and identically distributed random variables with zero mean and common moment generating function \(M(t)\), and define \(S_N = X_1 + ... + X_N.\) For any strictly increasing function \(g: \mathbb{R} \to \mathbb{R}\), we have

\[\begin{align} S_N > aN \iff g(S_N) > g(aN). \end{align}\]

For \(t > 0\), the function \(g(x) = e^{tx}\) is strictly increasing and non-negative, so

\[\begin{align} \mathbb{P}(S_N > aN) = \mathbb{P}\left(e^{tS_N} > e^{taN}\right) \leq \frac{\mathbb{E}\left(e^{tS_N}\right)}{e^{taN}} = \frac{M(t)^N}{e^{taN}}, \text{ for } t > 0, \end{align}\]

by the Markov inequality, where \(\mathbb{E}\left(e^{tS_N}\right) = M(t)^N\) follows from the independence of the \(X_n\). Taking the infimum over \(t > 0\) we obtain

\[\begin{align} \mathbb{P}(S_N > aN) \leq \left(\inf \left\{ \frac{M(t)}{e^{ta}} : t > 0 \right\}\right)^N. \end{align}\]

Taking logarithms and dividing by \(N\), we arrive at the result

\[\begin{align} \frac{1}{N}\log \mathbb{P}(S_N > aN) \leq - \sup \left\{ ta - \Lambda(t) : t > 0 \right\} = - \Lambda^*(a), \end{align}\]

where the final equality holds because restricting the supremum to \(t > 0\) changes nothing: since \(\mathbb{E}(X_n) = 0\), Jensen's inequality gives \(\Lambda(t) \geq t\,\mathbb{E}(X_n) = 0\), so \(ta - \Lambda(t) \leq 0\) for \(t \leq 0\), whereas \(ta - \Lambda(t) \to 0\) as \(t \to 0^+\).
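This upper bound can be checked numerically. Below is a minimal simulation sketch (assuming NumPy; the choice of Rademacher summands taking values \(\pm 1\) with probability \(1/2\), for which \(\Lambda(t) = \log\cosh(t)\) and \(\Lambda^*(a) = \frac{1+a}{2}\log(1+a) + \frac{1-a}{2}\log(1-a)\), as well as the values of \(a\), \(N\) and the seed, are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
a, reps = 0.2, 500_000

def rate(a):
    # Fenchel-Legendre transform Lambda*(a) for Rademacher (+/-1, each with prob 1/2)
    # summands, for which Lambda(t) = log cosh(t).
    return 0.5 * (1 + a) * np.log(1 + a) + 0.5 * (1 - a) * np.log(1 - a)

for N in [25, 50, 100, 200]:
    # S_N, a sum of N Rademacher variables, equals 2 * Binomial(N, 1/2) - N.
    s_n = 2 * rng.binomial(N, 0.5, size=reps) - N
    p = np.mean(s_n > a * N)  # Monte Carlo estimate of P(S_N > aN)
    print(f"N = {N:>3}:  (1/N) log P(S_N > aN) ~ {np.log(p) / N:+.4f}  <=  {-rate(a):+.4f} = -Lambda*(a)")
```

The estimated exponents sit below \(-\Lambda^*(a)\) and creep towards it as \(N\) grows, consistent with the bound and with the full theorem.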

Convergence in distribution#

Definition 49 (Convergence in distribution)

The sequence \(X_1, X_2, ...\) is said to converge in distribution, or to converge weakly, to \(X\) as \(n \to \infty\) if

\[\begin{align} \mathbb{P}(X_n \leq x) \to \mathbb{P}(X \leq x), \text{ for any } x \in C, \end{align}\]

where \(C\) is the set of reals at which the distribution function of \(X\) is continuous. If the above holds, we write \(X_n \implies X.\)

Convergence in distribution is a weaker condition than convergence in probability. If the condition \(x \in \mathbb{R}\) were used instead of \(x \in C\) above, there would exist sequences of random variables that converge in probability but not in distribution. By requiring the criterion to hold only at points where \(\mathbb{P}(X \leq x)\) is continuous, we avoid these cases and ensure that convergence in probability implies convergence in distribution, as stated in Theorem 51 below.
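A standard example (our own, not from the notes) of why the restriction to continuity points is needed: let \(X_n = 1/n\) with probability \(1\) and let \(X = 0\). Then \(X_n \to X\) in probability, but

\[\begin{align} \mathbb{P}(X_n \leq 0) = 0 \not\to 1 = \mathbb{P}(X \leq 0), \end{align}\]

so the distribution functions fail to converge at \(x = 0\). Since \(x = 0\) is a discontinuity point of \(\mathbb{P}(X \leq x)\), it is excluded from \(C\), and \(X_n \implies X\) still holds.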

Theorem 51 (Convergence in probability \(\implies\) convergence in distribution)

If \(X_1, X_2, ...\) is a sequence of random variables and \(X_n \to X\) in probability, then \(X_n \implies X\).

Below are two proofs of this result. The first is a textbook proof from [Grimmett et al., 1986], based on inequalities.

Proof: (a) Convergence in probability \(\implies\) convergence in distribution

Let \(X_1, X_2, ...\) be a sequence of random variables which converges in probability to another random variable \(X\). Let \(x\) be any point where the distribution function of \(X\) is continuous. Now consider

\[\begin{split}\begin{align} \mathbb{P}(X_n \leq x) &= \mathbb{P}(X_n \leq x, X \leq x + \epsilon) + \mathbb{P}(X_n \leq x, X > x + \epsilon)\\ &\leq \mathbb{P}(X \leq x + \epsilon) + \mathbb{P}(X - X_n > \epsilon)\\ &\leq \mathbb{P}(X \leq x + \epsilon) + \mathbb{P}(|X - X_n| > \epsilon) \end{align}\end{split}\]

Similarly we can switch the roles of \(X_n\) and \(X\) to obtain

\[\begin{split}\begin{align} \mathbb{P}(X \leq x - \epsilon) &= \mathbb{P}(X \leq x - \epsilon, X_n \leq x) + \mathbb{P}(X \leq x - \epsilon, X_n > x)\\ &\leq \mathbb{P}(X_n \leq x) + \mathbb{P}(X_n - X > \epsilon)\\ &\leq \mathbb{P}(X_n \leq x) + \mathbb{P}(|X_n - X| > \epsilon). \end{align}\end{split}\]

Combining these two inequalities we obtain

\[\begin{align} \mathbb{P}(X \leq x - \epsilon) - \mathbb{P}(|X_n - X| > \epsilon) \leq \mathbb{P}(X_n \leq x) \leq \mathbb{P}(X \leq x + \epsilon) + \mathbb{P}(|X - X_n| > \epsilon), \end{align}\]

Letting \(n \to \infty\), the terms \(\mathbb{P}(|X_n - X| > \epsilon)\) vanish, because \(X_n \to X\) in probability, so

\[\begin{align} \mathbb{P}(X \leq x - \epsilon) \leq \liminf_{n \to \infty} \mathbb{P}(X_n \leq x) \leq \limsup_{n \to \infty} \mathbb{P}(X_n \leq x) \leq \mathbb{P}(X \leq x + \epsilon). \end{align}\]

Letting \(\epsilon \to 0\) and using the fact that \(\mathbb{P}(X \leq x)\) is continuous at \(x\), both outer bounds tend to \(\mathbb{P}(X \leq x)\), so \(\mathbb{P}(X_n \leq x) \to \mathbb{P}(X \leq x)\) as required.

The second is an alternative argument, illustrated with a diagram, which is hopefully more intuitive.

Proof: (b) Convergence in probability \(\implies\) convergence in distribution

The set \(Z_n \leq z\) corresponds to the yellow, red, green and pink regions below, and the set \(Z \leq z\) corresponds to the blue, orange, green and pink regions. Therefore the difference \(\mathbb{P}(Z_n \leq z) - \mathbb{P}(Z \leq z)\) is equal to the sum of the probability measure on the blue and orange regions, minus that of the yellow and red regions. We wish to show this difference goes to \(0\) as \(n \to \infty.\)

Since \(|Z_n - Z| \geq \epsilon\) corresponds to \((\text{white + green + blue + yellow})\) and \(\mathbb{P}(|Z_n - Z| \geq \epsilon) \to 0\) as \(n \to \infty\), the yellow and blue areas have \(0\) probability in the \(n \to \infty\) limit. Assuming \(\mathbb{P}(Z \leq z)\) is continuous in \(z\) at \(z = z_0\), then the red and orange areas must also have \(0\) probability in the \(n \to \infty\) limit, otherwise the continuity condition would be contradicted. Therefore

\[\lim_{n \to \infty} \big | \mathbb{P}(Z_n \leq z) - \mathbb{P}(Z \leq z) \big | = 0,\]

and the variables converge in distribution.

[Figure: ../../../_images/conv-in-prob-proof.svg]

Fig. 1 Illustration for the argument that convergence in probability implies convergence in distribution. The presence of a darker border indicates that the corresponding region contains the border itself. The cyan region does not contain its border. See text for explanation.#

Theorem 52 (Convergence in distribution to \(c\) \(\implies\) convergence in probability to \(c\))

Let \(X_1, X_2, ...\) be a sequence of random variables which converges in distribution to a constant \(c\). Then \(X_n\) converges to \(c\) in probability as well.

Limits of characteristic functions#

Theorem 53 (Continuity theorem with characteristic functions)

Let \(X, X_1, X_2, ...\) be random variables with characteristic functions \(\phi, \phi_1, \phi_2 \dots .\) Then \(X_N \implies X\) as \(N \to \infty\) if and only if

\[\begin{align} \phi_N(t) \to \phi(t) \text{ as } N \to \infty, \text{ for } t \in \mathbb{R}. \end{align}\]
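As a numerical illustration of the theorem (a minimal sketch assuming NumPy; the Binomial summands, the value \(p = 0.3\) and the chosen \(t\) values are arbitrary), take \(Z_n = (X_n - np)/\sqrt{np(1-p)}\) with \(X_n \sim \text{Binomial}(n, p)\). By the central limit theorem \(Z_n \implies N(0, 1)\), and correspondingly its characteristic function converges pointwise to \(e^{-\frac{1}{2}t^2}\):

```python
import numpy as np

p = 0.3
t = np.array([0.5, 1.0, 2.0])
phi_limit = np.exp(-t ** 2 / 2)  # characteristic function of N(0, 1)

for n in [10, 100, 1000, 10000]:
    s = np.sqrt(n * p * (1 - p))  # standard deviation of Binomial(n, p)
    # Characteristic function of Z_n = (X_n - n p) / s, where X_n ~ Binomial(n, p):
    # phi_n(t) = exp(-i t n p / s) * (1 - p + p exp(i t / s))^n.
    phi_n = np.exp(-1j * t * n * p / s) * (1 - p + p * np.exp(1j * t / s)) ** n
    print(f"n = {n:>5}:  max |phi_n(t) - phi(t)| = {np.max(np.abs(phi_n - phi_limit)):.4f}")
```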

References#

[GWW86]

G. Grimmett and D. J. A. Welsh. Probability: An Introduction. Clarendon Press, Oxford University Press, 1986.

[GS01]

G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.