19.4. Best practices for MCMC sampling#

Summary points from Hogg and Foreman-Mackey#

We extract here some relevant observations from [“Data Analysis Recipes: Using Markov Chain Monte Carlo”](https://arxiv.org/abs/1710.06068) by David Hogg and Daniel Foreman-Mackey. Both authors are computational astrophysicists, and Foreman-Mackey wrote the widely used emcee sampler. They are highly experienced with physics applications and highly opinionated about sampling, but they are not statisticians themselves (although they interact with statisticians).

Their wisdom includes:

  • MCMC is good for sampling, but not optimizing. If you want to find the modes of distributions, use an optimizer instead.

  • For MCMC, you only need to calculate ratios of pdfs, as is clear from the acceptance step of the algorithm (a minimal sketch follows this list).

    • \(\Lra\) don’t need analytic normalized pdfs

    • \(\Lra\) great for sampling posterior pdfs

\[ p(\thetavec|D) = \frac{1}{Z} p(D|\thetavec)p(\thetavec) \]
  • Getting \(Z\) is really difficult because you need global information. (Cf. \(Z\) and partition function in statistical mechanics.)

  • MCMC is extremely easy to implement, without requiring derivatives or integrals of the function (but see later discussion of HMC).

  • Success means a histogram of the samples looks like the pdf.

  • Sampling for expectation values works even though we don’t know \(Z\); we just need the set of \(\{\thetavec_k\}\).

\[\begin{split}\begin{align} E_{p(\thetavec)}[\thetavec] &\approx \frac{1}{N}\sum_{k=1}^{N}\thetavec_k \\ E_{p(\thetavec)}[g(\thetavec)] &\approx \frac{1}{N}\sum_{k=1}^{N}g(\thetavec_k) \longrightarrow \frac{\int d\thetavec\, g(\thetavec)f(\thetavec)}{\int d\thetavec\, f(\thetavec)} \end{align}\end{split}\]

\(\qquad\) where \(f(\thetavec) = Z\, p(\thetavec)\) is unnormalized.

  • Nuisance parameters are very easy to marginalize over: just drop that column in every \(\thetavec_k\).

  • Autocorrelation is important to monitor and one can tune (e.g., Metropolis step size) to minimize it. More on this later.

  • How do you know when to stop? Heuristics and diagnostics to come!

  • See the Hogg/Foreman-Mackey paper for practical advice on initialization and burn-in.
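
To make the first few points concrete, here is a minimal sketch (not from the paper; the target `log_f`, step size, and chain length are arbitrary illustrative choices) of a random-walk Metropolis sampler that only ever evaluates differences of the log of an unnormalized pdf, followed by expectation values and marginalization computed directly from the samples.

```python
import numpy as np

def log_f(theta):
    """Log of an *unnormalized* 2D Gaussian target; the normalization Z is never needed."""
    mu = np.array([1.0, -0.5])
    sigma = np.array([1.0, 2.0])
    return -0.5 * np.sum(((theta - mu) / sigma) ** 2)

def metropolis(log_f, theta0, n_steps=50_000, step_size=1.0, rng=None):
    """Random-walk Metropolis: acceptance uses only differences of log_f,
    i.e., ratios of the unnormalized pdf."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    logf = log_f(theta)
    chain = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        proposal = theta + step_size * rng.standard_normal(theta.size)
        logf_prop = log_f(proposal)
        # Accept with probability min(1, f(proposal)/f(theta))
        if np.log(rng.uniform()) < logf_prop - logf:
            theta, logf = proposal, logf_prop
        chain[i] = theta
    return chain

chain = metropolis(log_f, theta0=[0.0, 0.0])
samples = chain[5000:]            # discard some burn-in

# Expectation values are simple averages over the samples
print("E[theta]   ~", samples.mean(axis=0))
print("E[theta^2] ~", (samples**2).mean(axis=0))

# Marginalizing over theta_1 (a nuisance parameter): just drop that column
theta0_marginal = samples[:, 0]
print("marginal mean of theta_0 ~", theta0_marginal.mean())
```
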

Figures to make every time you run MCMC (following Hogg and Foreman-Mackey, sect. 9)#

  • Trace plots

    • The burn-in length can be seen; trace plots let you identify problems with the model or sampler and give a qualitative judgment of convergence.

    • Use a convergence diagnostic such as Gelman-Rubin.

  • Corner plots

    • If you have a \(D\)-dimensional parameter space, plot all \(D\) diagonal and all \({D\choose 2}\) joint histograms to show low-level covariances and non-linearities.

    • “… they are remarkable for locating expected and unexpected parameter relationships, and often invaluable for suggesting re-parameterizations and transformations that simplify your problem.”

  • Posterior predictive plots

    • Take \(K\) random samples from your chain, plot the prediction each sample makes for the data and over-plot the observed data.

    • “This plot gives a qualitative sense of how well the model fits the data and it can identify problems with sampling or convergence.”
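
A hedged sketch of these three plots for a generic run: it assumes you already have a full chain `chain` of shape `(n_steps, n_walkers, ndim)`, a flattened set of posterior samples `samples` of shape `(n_samples, ndim)`, data arrays `x` and `y`, and a `model(theta, x)` function — all placeholder names, not objects defined in these notes — and that the `corner` package is installed.

```python
import numpy as np
import matplotlib.pyplot as plt
import corner  # pip install corner

# Assumed inputs (placeholders): chain (n_steps, n_walkers, ndim),
# samples (n_samples, ndim), data arrays x and y, and model(theta, x).

# --- Trace plots: one panel per parameter, all walkers over-plotted ---
n_steps, n_walkers, ndim = chain.shape
fig, axes = plt.subplots(ndim, 1, sharex=True, squeeze=False, figsize=(8, 2 * ndim))
for d in range(ndim):
    axes[d, 0].plot(chain[:, :, d], color="k", alpha=0.3)
    axes[d, 0].set_ylabel(rf"$\theta_{d}$")
axes[-1, 0].set_xlabel("step")

# --- Corner plot: all 1D marginals and all pairwise joint histograms ---
corner.corner(samples, labels=[rf"$\theta_{d}$" for d in range(ndim)])

# --- Posterior predictive plot: predictions from K random samples, over-plotted on the data ---
plt.figure()
K = 50
rng = np.random.default_rng()
for theta in samples[rng.integers(len(samples), size=K)]:
    plt.plot(x, model(theta, x), color="C0", alpha=0.1)
plt.plot(x, y, "ko", label="data")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```
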

Sampling from correlated distributions#

If our posterior has projections that are slanted, indicating correlations, and we are doing Metropolis-Hastings (MH) sampling, how do we decide on a step size? The problem is that we want to step differently in different directions: a long enough step size to explore the long axis will lead to many rejections in the orthogonal direction. So we do not want an isotropic step proposal!

If we propose steps by

\[ p(\xvec) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} e^{-\frac{1}{2}\xvec^{\intercal}\cdot \Sigma^{-1} \cdot \xvec} , \]

then we don’t need to take \(\Sigma \propto \sigma^2 \mathbb{1}_N\)! (A short sketch of such a correlated proposal follows the bullets below.)

  • A symmetric \(\Sigma\) has \(N(N+1)/2\) independent parameters to “tune” to reduce the correlation time.

  • However, this is increasingly difficult as \(N\) increases.
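
For illustration (the numbers in `Sigma` are made up), proposing correlated steps just means drawing from a multivariate normal with a full, non-diagonal covariance matrix:

```python
import numpy as np

rng = np.random.default_rng()

# A deliberately correlated proposal covariance (illustrative numbers):
# the off-diagonal entries let the steps follow the slanted ellipse.
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

theta = np.zeros(2)
proposal = theta + rng.multivariate_normal(mean=np.zeros(2), cov=Sigma)
```
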

One improvement is to make a linear transformation \(\thetavec \longrightarrow \thetavec' = A\thetavec + B\) such that the components of \(\thetavec'\) are uncorrelated, with similar \(\sigma_i\)’s in each direction; in effect, this rotates and rescales the slanted ellipse.
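
One concrete way to construct such a transformation, sketched here under the assumption that you have preliminary posterior samples `samples` from which to estimate the covariance: take the Cholesky factor \(L\) of the sample covariance, \(\Sigma \approx L L^{\intercal}\), subtract the sample mean, and apply \(L^{-1}\); the resulting parameters are approximately uncorrelated with unit variance.

```python
import numpy as np

# `samples`: array of shape (n_samples, N) from a preliminary run (assumed placeholder)
mean = samples.mean(axis=0)
cov = np.cov(samples, rowvar=False)      # estimated Sigma, shape (N, N)
L = np.linalg.cholesky(cov)              # Sigma = L @ L.T

# Whitened parameters: approximately uncorrelated, unit variance in each direction
theta_prime = np.linalg.solve(L, (samples - mean).T).T

# Map a whitened point back to the original parameters
theta = mean + theta_prime[0] @ L.T
```
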

Or one could use an “affine invariant” sampler such as emcee. An affine transformation is an invertible mapping from \(\mathbb{R}^N \rightarrow \mathbb{R}^N\), namely \(\yvec = A\xvec + B\), which is a combination of stretching, rotation, and translation. “Affine invariant” means that the sampler performs equally well on all affine transformations of a distribution, so emcee effectively figures out how to make optimal steps. It does this by using the many walkers at step \(t\), which have already sampled the space, to construct an appropriate affine-invariant update for step \(t+1\). This is one reason to make sure there are plenty of walkers.
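
A minimal emcee sketch (the target distribution and the walker/step counts are illustrative choices, not recommendations from these notes): the ensemble of walkers supplies the affine-invariant proposal, so even a strongly correlated Gaussian needs no hand-tuned step sizes.

```python
import numpy as np
import emcee

def log_prob(theta):
    """Unnormalized log posterior: a deliberately correlated 2D Gaussian."""
    Sigma = np.array([[1.0, 0.95],
                      [0.95, 1.0]])
    return -0.5 * theta @ np.linalg.solve(Sigma, theta)

ndim, nwalkers = 2, 32                        # plenty of walkers per dimension
p0 = 1e-3 * np.random.randn(nwalkers, ndim)   # small ball of starting points

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 5000, progress=True)

samples = sampler.get_chain(discard=500, flat=True)  # drop burn-in, flatten walkers
print(samples.shape, samples.mean(axis=0))
```
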