So far, we have been concerned with the problem of parameter estimation. In studying the linear relationship between two quantities, for example, we discussed how to infer the slope and the offset of the associated straight-line model. Often, however, there is a question as to whether a quadratic, or even cubic, function might be a more appropriate model. In this lecture, we will consider the broad class of scientific problems in which there is uncertainty as to which one of a set of alternative models is most suitable. In Bayesian terminology, these are known as Model Selection problems, and we will discuss them in some depth.
Dr. A has a theory; Prof. B also has a theory, but with an adjustable parameter \( \lambda \). Whose theory should we prefer on the basis of data \( D \)? — Jeffreys (1939), Gull (1988), Sivia (2006)

It is clear that we need to evaluate the posterior probabilities for A and B being correct to ascertain the relative merit of the two theories. If the ratio of the posterior probabilities, $$ \begin{equation} \text{posterior ratio} = \frac{p(A|D,I)}{p(B|D,I)} \label{eq:sivia_41} \end{equation} $$ is very much greater than one, then we will prefer A’s theory; if it is very much less than one, then we prefer that of B; and if it is of order unity, then the current data are insufficient to make an informed judgement.
To estimate the odds, let us start by applying Bayes’ theorem to both the numerator and the denominator; this gives $$ \begin{equation} \frac{p(A|D,I)}{p(B|D,I)} = \frac{p(D|A,I) p(A|I)}{p(D|B,I) p(B|I)} \label{eq:sivia_42} \end{equation} $$ because the term \( p(D|I) \) cancels out, top and bottom. As usual, probability theory warns us immediately that the answer to our question depends partly on what we thought about the two theories before the analysis of the data. To be fair, we might take the ratio of the prior terms, on the far right of Eq. \eqref{eq:sivia_42}, to be unity; a harsher assignment could be based on the past track records of the theorists! To assign the probabilities involving the experimental measurements, \( p(D|A,I) \) and \( p(D|B,I) \), we need to be able to compare the data with the predictions of A and B: the larger the mismatch, the lower the corresponding probability. This calculation is straightforward for Dr. A, but not for Prof. B; the latter cannot make predictions without a value for \( \lambda \).
To circumvent this difficulty, we can use the sum and product rule to relate the probability we require to other pdfs which might be easier to assign. In particular, marginalization and the product rule allow us to express \( p(D | B , I ) \) as $$ \begin{equation} p(D|B,I) = \int d\lambda p(D,\lambda|B,I) = \int d\lambda p(D|\lambda,B,I) p(\lambda|B,I). \label{eq:sivia_43} \end{equation} $$ The first term in the integral \( p(D | \lambda, B , I ) \), where the value of \( \lambda \) is given, is now just an ordinary likelihood function; as such, it is on a par with \( p(D|A,I) \). The second term is B’s prior pdf for \( \lambda \); the onus is, therefore, on the theorist to articulate his or her state of knowledge, or ignorance, before getting access to the data.
To proceed further analytically, let us make some simplifying approximations. Assume that, a priori, Prof. B is only prepared to say that \( \lambda \) must lie between the limits \( \lambda_\mathrm{min} \) and \( \lambda_\mathrm{max} \); we can then naively assign a uniform prior within this range: $$ \begin{equation} p(\lambda|B,I) = \frac{1}{\lambda_\mathrm{max}-\lambda_\mathrm{min}} \quad \text{for } \lambda_\mathrm{min} \leq \lambda \leq \lambda_\mathrm{max}, \label{eq:sivia_44} \end{equation} $$ and zero otherwise. Let us also take it that there is a value \( \lambda_0 \) which yields the closest agreement with the measurements; the corresponding probability \( p(D|\lambda_0,B,I) \) will be the maximum of B’s likelihood function. As long as this adjustable parameter lies in the neighbourhood of the optimal value, \( \lambda_0 \pm \delta\lambda \), we would expect a reasonable fit to the data; this can be represented by the Gaussian pdf $$ \begin{equation} p(D|\lambda,B,I) = p(D|\lambda_0,B,I) \exp \left[ - \frac{(\lambda-\lambda_0)^2}{2(\delta\lambda)^2} \right]. \label{eq:sivia_45} \end{equation} $$ The assignments of the prior \eqref{eq:sivia_44} and the likelihood \eqref{eq:sivia_45} are illustrated in Fig. 2. We may note that, unlike the prior pdf \( p(\lambda|B,I) \), B’s likelihood function need not be normalized with respect to \( \lambda \); in other words, \( p(D|\lambda_0,B,I) \) need not equal \( 1/(\delta\lambda \sqrt{2\pi}) \). This is because the \( \lambda \) in \( p(D|\lambda,B,I) \) appears in the conditioning statement, whereas the normalization requirement applies only to quantities to the left of the ‘|’ symbol.
In the evaluation of \( p(D|B,I) \), we can make use of the fact that the prior does not depend explicitly on \( \lambda \); this enables us to take \( p(\lambda|B,I) \) outside the integral in Eq. \eqref{eq:sivia_43}: $$ \begin{equation} p(D|B,I) = \frac{1}{\lambda_\mathrm{max} - \lambda_\mathrm{min}} \int_{\lambda_\mathrm{min}}^{\lambda_\mathrm{max}} d\lambda p(D|\lambda,B,I), \label{eq:sivia_46} \end{equation} $$ having set the limits according to the specified range. Assuming that the sharp cut-offs do not cause a significant truncation of the Gaussian pdf of the likelihood, its integral will be equal to \( \delta\lambda \sqrt{2\pi} \) times \( p(D|\lambda_0,B,I) \). The troublesome term then reduces to $$ p(D|B,I) = \frac{1}{\lambda_\mathrm{max} - \lambda_\mathrm{min}} p(D|\lambda_0,B,I) \delta\lambda \sqrt{2\pi}. $$ Substituting this into Eq. \eqref{eq:sivia_42}, we finally see that the ratio of the posteriors required to answer our original question decomposes into the product of three terms: $$ \begin{equation} \frac{p(A|D,I)}{p(B|D,I)} = \frac{p(A|I)}{p(B|I)} \times \frac{p(D|A,I)}{p(D|\lambda_0,B,I)} \times \frac{\lambda_\mathrm{max} - \lambda_\mathrm{min}}{\delta\lambda \sqrt{2\pi}}. \label{eq:sivia_48} \end{equation} $$ The first term on the right-hand side reflects our relative prior preference for the alternative theories; to be fair, we can take it to be unity. The second term is a measure of how well the best predictions from each of the models agree with the data; with the added flexibility of his adjustable parameter, this maximum likelihood ratio can only favour B. The goodness-of-fit, however, cannot be the only thing that matters; if it were, we would always prefer more complicated explanations. Probability theory tells us that there is, indeed, another term to be considered. As assumed earlier in the evaluation of the marginal integral of Eq. \eqref{eq:sivia_43}, the prior range \( \lambda_\mathrm{max} - \lambda_\mathrm{min} \) will generally be much larger than the uncertainty \( \pm\delta\lambda \) permitted by the data. As such, the final term in Eq. \eqref{eq:sivia_48} acts to penalize B for the additional parameter; for this reason, it is often called an Ockham factor. That is to say, we have naturally encompassed the spirit of Ockham’s Razor: ‘Frustra fit per plura quod potest fieri per pauciora’ or, in English, ‘it is vain to do with more what can be done with fewer’.
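To see these formulae in action, here is a minimal numerical sketch in Python (with hypothetical toy data and an assumed noise level and prior range; all variable names are ours). Model A predicts a signal of zero, while Model B allows a constant offset \( \lambda \); the posterior odds are computed both by brute-force marginalization, as in Eqs. \eqref{eq:sivia_43} and \eqref{eq:sivia_46}, and from the three-term product of Eq. \eqref{eq:sivia_48}, taking the prior odds to be unity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: N measurements with Gaussian noise of known sigma.
# Model A predicts zero signal; Model B predicts a constant level lambda.
N, sigma = 10, 1.0
true_level = 0.3                                   # illustrative choice only
y = true_level + sigma * rng.standard_normal(N)

def likelihood(lam):
    """p(D | lambda, B, I): product of Gaussian factors, one per datum."""
    return np.prod(np.exp(-0.5 * ((y - lam) / sigma) ** 2)
                   / (sigma * np.sqrt(2.0 * np.pi)))

# Model A has no adjustable parameter: its likelihood is that of lambda = 0.
p_D_A = likelihood(0.0)

# Prof. B's uniform prior range for lambda; an assumption of this sketch.
lam_min, lam_max = -10.0, 10.0

# Brute-force evidence for B: marginalize lambda numerically on a fine grid.
lam_grid = np.linspace(lam_min, lam_max, 20001)
like_vals = np.array([likelihood(l) for l in lam_grid])
p_D_B = like_vals.sum() * (lam_grid[1] - lam_grid[0]) / (lam_max - lam_min)

# Ingredients of the three-term decomposition for this toy model:
# best-fit lambda_0 is the sample mean, and the likelihood width in lambda
# is delta_lambda = sigma / sqrt(N).
lam0 = y.mean()
dlam = sigma / np.sqrt(N)
max_like_ratio = p_D_A / likelihood(lam0)
ockham_factor = (lam_max - lam_min) / (dlam * np.sqrt(2.0 * np.pi))

print("odds A:B, numerical marginalization:", p_D_A / p_D_B)
print("odds A:B, three-term decomposition :", max_like_ratio * ockham_factor)
```

When the prior range comfortably contains the well-fitting region, the two numbers agree closely; widening \( [\lambda_\mathrm{min}, \lambda_\mathrm{max}] \) increases the Ockham penalty in both.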
Although it is satisfying to quantify the everyday guiding principle attributed to the thirteenth-century Franciscan monk William of Ockham (or Occam, in Latin), that we should prefer the simplest theory which agrees with the empirical evidence, we should not get too carried away by it. After all, what do we mean by the simpler theory if alternative models have the same number of adjustable parameters? In the choice between Gaussian and Lorentzian peak shapes, for example, both are defined by the position of the maximum and their width. All that we are obliged to do, and have done, in addressing such questions is to adhere to the rules of probability.
While accepting the clear logic leading to Eq. \eqref{eq:sivia_48}, many people rightly worry about the question of the limits \( \lambda_\mathrm{min} \) and \( \lambda_\mathrm{max} \). Jeffreys (1939) himself was concerned and pointed out that there would be an infinite penalty for any new parameter if the range were allowed to go to \( \pm\infty \). Stated in the abstract, this would appear to be a severe limitation. In practice, however, it is not generally such a problem: since the analysis is always used in specific contexts, a suitable choice can usually be made on the basis of the relevant background information. Even in uncharted territory, a small amount of thought soon reveals that our state of ignorance is always far from the \( \pm\infty \) scenario. If \( \lambda \) were the coupling constant (or strength) of a possible fifth force, for example, then we could put an upper bound on its magnitude because everybody would have noticed it by now if it had been large enough! We should also not lose sight of the fact that the precise form of Eq. \eqref{eq:sivia_48} stems from our stated simplifying approximations; if these are not appropriate, then Eqs. \eqref{eq:sivia_42} and \eqref{eq:sivia_43} will lead us to a somewhat different formula.
In most cases, our relative preference for A or B is dominated by the goodness of the fit to the data; that is to say, the maximum likelihood ratio in Eq. \eqref{eq:sivia_48} tends to overwhelm the contributions of the other two terms. The Ockham factor can play a crucial role, however, when both theories give comparably good agreement with the measurements. Indeed, it becomes increasingly important if B’s theory fails to give a significantly better fit as the quality of the data improves. In that case, \( \delta\lambda \) continues to shrink while the ratio of best-fit likelihoods remains close to unity; according to Eq. \eqref{eq:sivia_48}, therefore, A’s theory is favoured ever more strongly. By the same token, the Ockham effect disappears if the data are few in number, of poor quality, or simply fail to shed new light on the problem at hand. This is simply because the posterior ratio of Eq. \eqref{eq:sivia_48} is then roughly equal to the corresponding prior ratio, since the empirical evidence is very weak; hence, there is no inherent preference for A’s theory unless it is explicitly encoded in \( p(A|I)/p(B|I) \). This property can be verified formally by going back to Eqs. \eqref{eq:sivia_42}, \eqref{eq:sivia_45} and \eqref{eq:sivia_46}, and considering the poor-data limit in which \( \delta\lambda \gg \lambda_\mathrm{max}-\lambda_\mathrm{min} \) and \( p(D|\lambda_0,B,I) \approx p(D|A,I) \).
A simple way to think about the difference between parameter estimation and model selection is to note that, to a good approximation, the former requires the location of the maximum of the likelihood function whereas the latter entails the calculation of its average value. As long as \( \lambda_\mathrm{min} \) and \( \lambda_\mathrm{max} \) encompass the significant region of \( p(D|\lambda,B,I) \) around \( \lambda_0 \), the precise bounds do not matter for estimating the optimal parameter and need not be specified. Since the prior range defines the domain over which the mean likelihood is computed, due thought is necessary when dealing with model selection. Indeed, it is precisely this act of comparing ‘average’ likelihoods rather than ‘maximum’ ones which introduces the desired Ockham balance to the goodness-of-fit criterion. Any likelihood gain from a better agreement with the data, allowed by the greater flexibility of a more complicated model, has to be weighed against the additional cost of averaging it over a larger parameter space.
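The point is easy to demonstrate numerically. In the following sketch (with an assumed Gaussian-shaped likelihood in \( \lambda \); the numbers are purely illustrative), the location of the maximum is unaffected by the choice of prior bounds, whereas the average likelihood over the prior range shrinks in proportion to the width of that range.

```python
import numpy as np

# A hypothetical likelihood for lambda: Gaussian around lam0 with width dlam.
lam0, dlam = 1.3, 0.1
def likelihood(lam):
    return np.exp(-0.5 * ((lam - lam0) / dlam) ** 2)

for lam_min, lam_max in [(0.0, 5.0), (-50.0, 50.0)]:
    grid = np.linspace(lam_min, lam_max, 200001)
    vals = likelihood(grid)
    best = grid[np.argmax(vals)]        # parameter estimation: peak location
    mean_like = vals.mean()             # model selection: average over prior range
    print(f"range [{lam_min:6.1f}, {lam_max:5.1f}]:"
          f"  argmax = {best:.3f},  mean likelihood = {mean_like:.3e}")
```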
A useful analytic tool for such calculations is the Laplace approximation. Suppose we have an unnormalized density \( P^*(\theta) \) over a \( K \)-dimensional parameter vector \( \theta \), sharply peaked at \( \theta_0 \), whose normalizing constant \( Z_P = \int d^K\theta \, P^*(\theta) \) we wish to estimate. We Taylor-expand the logarithm \( \log P^* \) around the peak: $$ \log P^*(\theta) = \log P^*(\theta_0) - \frac{1}{2} (\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0) + \ldots, $$ where \( \Sigma^{-1} = H \) is minus the Hessian (the matrix of second derivatives) of \( \log P^* \) at the maximum, $$ H_{ij} = - \left. \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log P^*(\theta)\right|_{\theta=\theta_0}. $$ We then approximate \( P^*(\theta) \) by an unnormalized Gaussian, $$ Q^*(\theta) \equiv P^*(\theta_0) \exp \left[ - \frac{1}{2}(\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0) \right], $$ and we approximate the normalizing constant \( Z_P \) by the normalizing constant of this Gaussian, $$ Z_P \approx Z_Q = P^*(\theta_0) \sqrt{\frac{(2\pi)^K}{\det\Sigma^{-1}}}. $$ Predictions can then be made using this approximation \( Q \). Physicists also call this widely-used approximation the saddle-point approximation.
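As a concrete illustration, the sketch below applies this recipe to a hypothetical two-dimensional \( P^*(\theta) \) of our own choosing (a correlated Gaussian times a smooth perturbation, so \( Z_P \) has no simple closed form): the peak is located numerically, \( H \) is estimated by finite differences (the helper `neg_hessian` is ours, not a library routine), and the resulting \( Z_Q \) is compared against a brute-force grid integral, which is feasible only because \( K = 2 \) here.

```python
import numpy as np
from scipy.optimize import minimize

# A hypothetical unnormalized density in K = 2 dimensions.
def log_Pstar(theta):
    x, y = theta
    return -0.5 * (x ** 2 + 2.0 * y ** 2 + x * y) + 0.1 * np.cos(x)

# 1) Locate the peak theta_0 by maximizing log P*.
theta0 = minimize(lambda t: -log_Pstar(t), x0=np.ones(2)).x
K = theta0.size

# 2) Estimate H_ij = -d^2 log P* / d theta_i d theta_j at theta_0 by central differences.
def neg_hessian(f, t0, eps=1e-4):
    H = np.zeros((len(t0), len(t0)))
    for i in range(len(t0)):
        for j in range(len(t0)):
            tpp = t0.copy(); tpp[i] += eps; tpp[j] += eps
            tpm = t0.copy(); tpm[i] += eps; tpm[j] -= eps
            tmp = t0.copy(); tmp[i] -= eps; tmp[j] += eps
            tmm = t0.copy(); tmm[i] -= eps; tmm[j] -= eps
            H[i, j] = -(f(tpp) - f(tpm) - f(tmp) + f(tmm)) / (4.0 * eps ** 2)
    return H

H = neg_hessian(log_Pstar, theta0)

# 3) Laplace estimate: Z_Q = P*(theta_0) * sqrt((2 pi)^K / det H).
Z_Q = np.exp(log_Pstar(theta0)) * np.sqrt((2.0 * np.pi) ** K / np.linalg.det(H))

# Brute-force grid value of Z_P for comparison (only practical for small K).
x = np.linspace(-8.0, 8.0, 801)
X, Y = np.meshgrid(x, x)
Z_P = np.exp(log_Pstar((X, Y))).sum() * (x[1] - x[0]) ** 2

print("Laplace Z_Q =", Z_Q, "  grid Z_P =", Z_P)
```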
Note, in particular, that if we consider a chi-squared form, \( P^*(\theta) = \exp \left( -\frac{1}{2} \chi^2(\theta)\right) \), and let \( \Sigma^{-1} \) now denote the matrix of second derivatives of \( \chi^2 \) at its minimum \( \theta_0 \), then we get $$ Z_P \approx \exp \left( -\frac{1}{2} \chi^2(\theta_0)\right) \sqrt{\frac{(4\pi)^K}{\det\Sigma^{-1}}}. $$ The extra factor of \( 2^{K/2} \), relative to the general expression for \( Z_Q \), arises because the factor \( 1/2 \) multiplying \( \chi^2 \) scales each of the \( K \) eigenvalues of the negative Hessian of \( \log P^* \) by \( 1/2 \).
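Spelled out, with \( \Sigma^{-1} \) denoting \( \nabla\nabla\chi^2 \) at \( \theta_0 \), the bookkeeping is $$ H = -\nabla\nabla \log P^* \big|_{\theta_0} = \frac{1}{2} \nabla\nabla \chi^2 \big|_{\theta_0} = \frac{1}{2} \Sigma^{-1} \quad \Longrightarrow \quad \det H = 2^{-K} \det\Sigma^{-1}, $$ so that $$ Z_P \approx P^*(\theta_0) \sqrt{\frac{(2\pi)^K}{\det H}} = \exp \left( -\frac{1}{2} \chi^2(\theta_0) \right) \sqrt{\frac{(2\pi)^K \, 2^K}{\det\Sigma^{-1}}} = \exp \left( -\frac{1}{2} \chi^2(\theta_0) \right) \sqrt{\frac{(4\pi)^K}{\det\Sigma^{-1}}}. $$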
The fact that the normalizing constant of a Gaussian is given by $$ \int d^K\theta \exp \left[ - \frac{1}{2}\theta^T \Sigma^{-1} \theta \right] = \sqrt{\frac{(2\pi)^K}{\det\Sigma^{-1}}} $$ can be proved by making an orthogonal transformation to the basis \( u \) in which \( \Sigma^{-1} \) is diagonal. The integral then separates into a product of one-dimensional integrals, each of the form $$ \int du_i \exp \left[ -\frac{1}{2} \lambda_i u_i^2 \right] = \sqrt{\frac{2\pi}{\lambda_i}}, $$ where the \( \lambda_i \) are the eigenvalues of \( \Sigma^{-1} \). Since the product of the eigenvalues is the determinant, \( \prod_i \lambda_i = \det\Sigma^{-1} \), the stated result follows.
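As a quick sanity check of this formula, the sketch below (with an arbitrarily chosen positive-definite \( \Sigma^{-1} \), and \( K = 2 \) so that grid integration is feasible) compares the closed form, the eigenvalue product, and a brute-force integral.

```python
import numpy as np

# An illustrative positive-definite inverse covariance matrix for K = 2.
Sigma_inv = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
K = Sigma_inv.shape[0]

# Closed-form normalizing constant: sqrt((2 pi)^K / det(Sigma^{-1})).
analytic = np.sqrt((2.0 * np.pi) ** K / np.linalg.det(Sigma_inv))

# Product of one-dimensional factors sqrt(2 pi / lambda_i) over the eigenvalues.
via_eigenvalues = np.prod(np.sqrt(2.0 * np.pi / np.linalg.eigvalsh(Sigma_inv)))

# Brute-force grid integration of exp(-1/2 theta^T Sigma^{-1} theta).
x = np.linspace(-10.0, 10.0, 1001)
X, Y = np.meshgrid(x, x)
quad = (Sigma_inv[0, 0] * X ** 2 + 2.0 * Sigma_inv[0, 1] * X * Y
        + Sigma_inv[1, 1] * Y ** 2)
numeric = np.exp(-0.5 * quad).sum() * (x[1] - x[0]) ** 2

print(analytic, via_eigenvalues, numeric)   # the three values agree closely
```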
Note that the Laplace approximation is basis-dependent: if \( \theta \) is transformed to a nonlinear function \( u(\theta) \) and the density is transformed to \( P(u) = P(\theta) |d\theta/du| \), then in general the approximate normalizing constant \( Z_Q \) will be different. This can be viewed as a defect, since the true value \( Z_P \) is basis-independent, or as an opportunity: we can hunt for a choice of basis in which the Laplace approximation is most accurate.
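A small worked example of this basis dependence (our own choice, not from the text): take \( P^*(\theta) = \theta^{a-1} e^{-\theta} \) on \( \theta > 0 \), whose true normalizing constant is \( \Gamma(a) \). Applying the Laplace approximation in the original basis and in the basis \( u = \log\theta \) (where the transformed density is \( \theta^a e^{-\theta} \)) gives different estimates; the log-basis result is Stirling’s approximation to \( \Gamma(a) \), which happens to be the more accurate of the two here.

```python
import numpy as np
from scipy.special import gamma

# Basis dependence of the Laplace approximation, illustrated on
# P*(theta) = theta^(a-1) * exp(-theta) for theta > 0; the true Z_P is Gamma(a).
a = 1.2

# Basis theta: peak at theta_0 = a - 1, and
#   -d^2/dtheta^2 log P* = (a - 1) / theta^2 = 1 / (a - 1) at the peak.
theta0 = a - 1.0
Z_theta = theta0 ** (a - 1.0) * np.exp(-theta0) * np.sqrt(2.0 * np.pi * (a - 1.0))

# Basis u = log(theta): transformed density theta^a * exp(-theta), peaked at
# theta = a, with -d^2/du^2 log P* = a at the peak.
Z_u = a ** a * np.exp(-a) * np.sqrt(2.0 * np.pi / a)

print("true Z_P       :", gamma(a))
print("Laplace (theta):", Z_theta)
print("Laplace (u)    :", Z_u)   # Stirling's approximation to Gamma(a)
```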