18.6. The Metropolis-Hastings algorithm#
How can we collect random samples from an arbitrary probability distribution \(p(\pars)\)? In particular, how can we perform this task when we might not even have a closed-form expression for the PDF? It is common in Bayesian inference to find ourselves in the situation that we can evaluate \(p(\pars)\) at any position \(\pars\), but we don't know beforehand whether it will be large or small.
We will now show that it is possible to construct a Markov chain that has \(p(\pars)\) as its stationary distribution. Once this chain has converged it will provide us with samples from \(p(\pars)\).
As we have learned, the evolution of positions is described by the transition density \(T\). The generalization to continuous variables of the stationary-chain equilibrium described in Eq. (18.16) is
This equilibrium condition can be replaced with the more stringent detailed-balance condition
which can be understood from the integral relation
which describes the balance between the flow of probability density out of \(\pars\) and into \(\pars\). At equilibrium, these two integrals must be equal.
The trick to achieve this for a certain \(p(\pars)\) is to write \(T\) as a product of a simple step proposal function \(S(\pars,\pars')\), which provides probability densities for steps in \(\pars\)-space, and a function \(A(\pars,\pars')\) that we will call the acceptance probability
The step proposal function can be almost anything, but a key requirement is that it should be easy to draw samples from it. The probability of proposing a new position \(\pars'\) will in general depend on the current position \(\pars\)
For example, assuming that \(\pars\) is a vector in a \(D\)-dimensional parameter space we can take \(S(\pars,\pars')\) to be a uniform distribution within a \(D\)-dimensional hypercube with side length \(\Delta\)
Alternatively we can use a multivariate Gaussian distribution with a mean that corresponds to the current position \(\pars\) and a covariance matrix that we can specify.
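As a minimal sketch of these two common choices using NumPy (the step size `Delta` and the covariance matrix `cov` are tuning parameters assumed to be chosen by the user):

```python
import numpy as np

rng = np.random.default_rng()

def propose_uniform(pars, Delta):
    """Uniform proposal within a D-dimensional hypercube of side Delta
    centered on the current position pars."""
    return pars + Delta * (rng.uniform(size=pars.shape) - 0.5)

def propose_gaussian(pars, cov):
    """Multivariate Gaussian proposal centered on the current position pars."""
    return rng.multivariate_normal(mean=pars, cov=cov)
```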
The probability of actually accepting the proposed step from \(\pars\) to \(\pars'\) is then given by \(A(\pars,\pars')\), with \(0 \leq A(\pars,\pars') \leq 1\).
The detailed balance condition (18.25) can be fulfilled by placing the following condition on the acceptance probability
The expression for \(A\) that fulfils this condition is
and gives the probability with which we decide whether to accept the proposed position \(\pars'\) or remain at the old one \(\pars\). The ratio that appears in the acceptance probability is called the Metropolis-Hastings ratio and will here be denoted \(r\). We note that it simplifies further for symmetric proposal functions (in which case it is sometimes referred to as the Metropolis ratio)
Note how the proposal function \(S(\pars,\pars')\) in combination with the acceptance probability \(A(\pars,\pars') = \min(r,1)\) provides a simple process for generating samples in a Markov chain that has the required equilibrium distribution. It only requires simple draws from the proposal and the evaluation of \(p\) at the proposed positions. The process is stochastic both through the random proposal of steps and because the decision to accept a proposed step involves randomness via the acceptance probability.
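As a sketch of this accept/reject step (the functions `p` for the target density and `S` for the proposal density are hypothetical user-supplied callables):

```python
import numpy as np

rng = np.random.default_rng()

def accept(p, S, pars, pars_new):
    """Accept the proposed position with probability A = min(r, 1),
    where r is the Metropolis-Hastings ratio."""
    r = (p(pars_new) * S(pars_new, pars)) / (p(pars) * S(pars, pars_new))
    # A uniform draw u satisfies u <= r with probability min(r, 1).
    return rng.uniform() <= r
```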
Note
The proposal function must allow the chain to be irreducible, positive recurrent, and aperiodic.
The basic structure of the Metropolis (and Metropolis-Hastings) algorithm is the following:
Algorithm 18.1 (The Metropolis-Hastings algorithm)
1. Initialize the sampling by choosing a starting point \(\boldsymbol{\pars}_0\).

2. Collect samples by repeating the following:

   1. Given \(\boldsymbol{\pars}_i\), propose a new point \(\boldsymbol{\phi}\), sampled from the proposal distribution \(S( \boldsymbol{\pars}_i, \boldsymbol{\phi} )\). This proposal distribution could take many forms. However, for concreteness you can imagine it as a multivariate normal with mean vector given by \(\boldsymbol{\pars}_i\) and (position-independent) variances \(\boldsymbol{\sigma}^2\) specified by the user.

      - The proposal function will (usually) give a smaller probability for visiting positions that are far from the current position.
      - The width \(\boldsymbol{\sigma}\) determines the average step size and is known as the proposal width.

   2. Compute the Metropolis(-Hastings) ratio \(r\) given by Eq. (18.32). For symmetric proposal distributions, with the simpler expression for the acceptance probability, this algorithm is known as the Metropolis algorithm.

   3. Decide whether or not to accept candidate \(\boldsymbol{\phi}\) for \(\boldsymbol{\pars}_{i+1}\):

      - If \(r \geq 1\): accept the proposal and set \(\boldsymbol{\pars}_{i+1} = \boldsymbol{\phi}\).
      - If \(r < 1\): accept the proposal with probability \(r\) by drawing \(u \sim \mathcal{U}([0,1])\) (note that now \(0 \leq r < 1\)). If \(u \leq r\), set \(\boldsymbol{\pars}_{i+1} = \boldsymbol{\phi}\) (accept); else set \(\boldsymbol{\pars}_{i+1} = \boldsymbol{\pars}_i\) (reject).

   4. Note that the chain always grows by one element per iteration, since the current position is added again if the proposed step is rejected.

3. Iterate until the chain has reached a predetermined length or passes some convergence tests.
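As an illustration, here is a minimal sketch of this algorithm in Python, assuming a symmetric Gaussian proposal and a user-supplied function `log_p` that returns \(\log p(\pars)\) up to an additive constant (working with logarithms is a common safeguard against numerical under- and overflow):

```python
import numpy as np

def metropolis(log_p, pars0, sigma, n_samples, rng=None):
    """Minimal Metropolis sampler with a symmetric, isotropic Gaussian proposal."""
    rng = np.random.default_rng() if rng is None else rng
    pars = np.atleast_1d(np.asarray(pars0, dtype=float))
    log_p_current = log_p(pars)
    chain = [pars]
    for _ in range(n_samples - 1):
        # Propose a new point from a Gaussian centered on the current position.
        proposal = pars + sigma * rng.standard_normal(pars.shape)
        log_p_proposal = log_p(proposal)
        # Accept with probability min(r, 1), where r = p(proposal) / p(pars).
        if np.log(rng.uniform()) <= log_p_proposal - log_p_current:
            pars, log_p_current = proposal, log_p_proposal
        # If rejected, the current position is appended again.
        chain.append(pars)
    return np.array(chain)

# Example: sample a standard normal density (log p = -x^2/2 up to a constant).
chain = metropolis(lambda x: -0.5 * np.sum(x**2), pars0=0.0, sigma=1.0, n_samples=10_000)
print(chain.mean(), chain.std())
```

Note that the proposal width `sigma` controls a trade-off: small steps give a high acceptance rate but slow exploration of the parameter space, while large steps explore quickly but are often rejected.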
The Metropolis algorithm dates back to the 1950s in physics, but didn't become widespread in statistics until almost 1980.
Its adoption helped make Bayesian methods computationally feasible.
Note, however, that nowadays there are much more sophisticated samplers than the original Metropolis one.
Exercises#
Exercise 18.19 (Metropolis sampling of a uniform distribution)
Show that the update rule implemented in Example 18.6 has the uniform distribution \(\mathcal{U}(0,1)\) as its limiting distribution.
Exercise 18.20 (Power-law distributions)
Discrete power-law distributions have the general form \(p(i) \propto i^\alpha\) for some constant \(\alpha\). Unlike exponentially decaying distributions (such as the normal distribution and many others), power-law distributions have fat tails. They are therefore often used to model skewed data sets. Let
Implement a Metropolis algorithm to sample from \(\boldsymbol{\pi}\).
Exercise 18.21 (The Metropolis algorithm for a discrete distribution)
Use the Metropolis algorithm as outlined in Remark 18.1 to construct a transition matrix \(T\) for a Markov chain that gives the discrete limiting distribution
Assume that the step proposal matrix has \(S(i,j)=1/3\) for all \(i,j\).
Compute \(T\) by hand, or numerically using Python. Verify that it is a valid transition matrix and that the above distribution is stationary.
Verify that \(\pi\) and \(T\) fulfil detailed balance.
Note that this distribution is the limiting distribution found in Exercise 18.15. However, the transition matrices are not the same. Try to explain why. Can you construct other transition matrices with the same limiting distribution?
Simulate the Markov chain \(\{X_n\}\) numerically. Plot the convergence of the discrete distribution for different starting distributions and check whether the correct limiting distribution is obtained.
Solutions#
Solution to Exercise 18.19 (Metropolis sampling of a uniform distribution)
Check the acceptance criterion.
Solution to Exercise 18.20 (Power-law distributions)
It is suggested to use \(\{-1,1\}\) as the proposal step (with probability \(p=0.5\) for a step to the left). The acceptance fraction involves a ratio of probabilities, which removes the infinite sum in the denominator. Note also that the acceptance criterion must include the reflective boundary at \(i=1\).
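A minimal sketch along these lines, assuming the unnormalized target \(p(i) \propto i^\alpha\) for \(i \geq 1\) with the example value \(\alpha = -2\) (the exercise specifies the actual distribution):

```python
import numpy as np

rng = np.random.default_rng()
alpha = -2  # example exponent (assumed here; use the value given in the exercise)

def metropolis_power_law(n_samples, i0=1):
    """Sample p(i) proportional to i**alpha for i = 1, 2, ... with +-1 proposal steps."""
    i = i0
    chain = []
    for _ in range(n_samples):
        j = i + rng.choice([-1, 1])  # symmetric proposal: left or right with p = 0.5
        # Reflective boundary: p(j) = 0 for j < 1, so such proposals are never accepted.
        if j >= 1 and rng.uniform() <= (j / i) ** alpha:
            i = j  # the normalization (infinite sum) cancels in the probability ratio
        chain.append(i)
    return np.array(chain)

samples = metropolis_power_law(100_000)
```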
Solution to Exercise 18.21 (The Metropolis algorithm for a discrete distribution)
Construct the \(T\)-matrix
and note that all elements are positive and that the row sums equal one. You can verify by hand that \(\pi = \pi T\) with the given distribution.
Eq. (18.18) gives nine equalities that should be checked.
The stationary distribution \(\pi\) is a left eigenvector of \(T\) with eigenvalue \(+1\). However, there is no uniqueness theorem dictating that only one transition matrix can have this eigenvector and eigenvalue. It is easy to create other transition matrices with the same limiting distribution by simply changing the step proposal matrix.
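A sketch of how both checks can be done numerically with NumPy, using the \(T\)-matrix constructed above: the stationary distribution is the left eigenvector of \(T\) with eigenvalue one, and detailed balance amounts to the matrix with elements \(\pi_i T_{ij}\) being symmetric.

```python
import numpy as np

T = np.array([[0.33333333, 0.33333333, 0.33333333],
              [0.04938272, 0.88888889, 0.0617284 ],
              [0.26666667, 0.33333333, 0.4       ]])

# Left eigenvectors of T are right eigenvectors of T transposed;
# the eigenvalue closest to one is the largest for a stochastic matrix.
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()  # normalize to a probability distribution
print(pi)

# Detailed balance: pi_i T_ij = pi_j T_ji for all i, j.
flow = pi[:, None] * T
print(np.allclose(flow, flow.T, atol=1e-6))
```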
Generate the Markov chain and see that it converges
```python
import numpy as np

# Transition matrix constructed above
T = np.array([[0.33333333, 0.33333333, 0.33333333],
              [0.04938272, 0.88888889, 0.0617284 ],
              [0.26666667, 0.33333333, 0.4       ]])

N = 40                    # number of steps
x0 = np.array([1, 0, 0])  # starting distribution

X = [x0]
for i in range(1, N):
    # Evolve the distribution one step: X_i = X_{i-1} T
    X.append(X[i-1] @ T)
X = np.array(X)
print(X)
```