4.6. Data, models, and predictions#

“All models are wrong, but models that know when and how they are wrong, are useful…”

—revised version of George Box’s quote (see Overview of modeling)

How do we make statistical predictions?#

Let us consider a common scenario:

  • We have a model \(M\) with parameters \(\pars\).

  • We have some (noisy) data \(\data\) that we can use to calibrate the parameters. Generically label the input values by \(x\) (a vector) and the quantities of interest by \(y = y(x)\).

  • Then we want to predict at other input points \(x^\ast \rightarrow y^\ast\).

How shall we proceed? Let’s first say a bit more about the ingredients.

A statistical model for our data#

As in the Bayesian workflow introduced in Section 2.2, a key element of our inference is to formulate a statistical model for our data. Let us start with the data \(\data\) already obtained through a measurement process, e.g., an experiment in a laboratory or an observation of some astronomical event. All data come with uncertainties of various origins; let us denote these as \(\delta \data\). Given some data \(\data\), one might immediately ask what these data can tell us about data we have not yet collected or used in the inference. We call this future data \(\futuredata\). At present, we are uncertain about any future data, and we describe this uncertainty as a (conditional) probability \(\cprob{\futuredata}{\data,I}\). All we have said so far is that predictions are uncertain. The obvious and interesting question is: how uncertain is the prediction? To answer that, we must go from this abstract probability to something that we can evaluate quantitatively. The first step is to develop a theoretical model to analyze the relevant data.

A physical model \(M\) allows quantitative evaluation of the system under study. Any model we employ will always depend on model parameters \(\pars\) with uncertain numerical values. Moreover, all models are wrong, in the sense that there will always be some physics that we have neglected to include or are unaware of today. If we denote the mismatch between model predictions and real-world observations of the system, i.e., data, as \(\delta M\), we can write

(4.18)#\[ \data = M(\pars) + \delta \data + \delta M. \]

The mismatch term \(\delta M\) is often referred to as a model discrepancy. We are also uncertain about this term, so it must be represented as a random variable that is distributed in accordance with our beliefs about \(\delta M\). It is no trivial task to incorporate model discrepancies in the analysis of scientific models and data, yet it is crucial for avoiding overfitting of the model parameters \(\pars\) and overly confident model predictions. We will touch upon \(\delta M\) in two contexts in this text: in treating the truncation error in expansions such as those encountered in effective field theories, and with a prototypical example (the “Ball-Drop Experiment”) of using Gaussian processes to model \(\delta M\) (see Bayesian approach to model discrepancy and the 📥 Model discrepancy example: The ball-drop experiment notebook).

Note that the model discrepancy remains present even if there is no uncertainty about \(\pars\). In the following we subsume the choice of model and other decisions into the set of background knowledge \(I\).
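To make Eq. (4.18) concrete, the following is a minimal sketch that generates synthetic data as the sum of a model prediction, a measurement error, and a model discrepancy. Everything specific here is an assumption chosen for illustration and is not taken from the text: the straight-line `model`, the values in `true_pars`, the Gaussian noise level `sigma_data`, and the quadratic form of `delta_model`. In a real analysis \(\delta M\) is unknown and would be represented by a probability distribution rather than a fixed function.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def model(x, pars):
    """Hypothetical model M(pars): a straight line, intercept + slope * x."""
    intercept, slope = pars
    return intercept + slope * x


x = np.linspace(0.0, 1.0, 20)   # input points
true_pars = (1.0, 2.0)          # assumed parameter values used to generate data
sigma_data = 0.1                # assumed std. dev. of the measurement noise

# delta D: independent Gaussian measurement errors (an assumption).
delta_data = rng.normal(0.0, sigma_data, size=x.size)

# delta M: a smooth term that the straight-line model cannot capture
# (an assumed form, used only to generate the synthetic data).
delta_model = 0.3 * x**2

# Eq. (4.18): data = M(pars) + delta D + delta M
data = model(x, true_pars) + delta_data + delta_model

# Residuals with respect to the model alone contain both error terms,
# even though the parameters are known exactly here.
residuals = data - model(x, true_pars)
```

Note that `residuals` is nonzero on average even with the parameters fixed at their generating values, which illustrates that the discrepancy is present independently of the uncertainty in \(\pars\).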

The statistical model and underlying truth (with an alternative notation)

Let us consider the statistical model from the perspective of our (often implicit) belief as physicists that there is an underlying truth that we approach both by refining our theoretical descriptions and our experimental measurements. The spectacular agreement of theory and experiment for the anomalous magnetic dipole moment of a muon (the “muon \(g-2\)” measurement) is vivid testimony to the existence of this truth.

But at any given time we only approximate the truth from both theory and experiment, and statistically we seek to account for their deficiencies. Let us denote the underlying true theory (what statisticians call “truth”) as \(\ytrue(x)\), where \(x\) is a generic input (i.e., it could be a vector in the input space). The truth should be our model predictions \(\yth\) plus the theory error (model discrepancy):

\[ \ytrue(x_i) = \yth(x_i;\pars) + \delta\yth(x_i) , \label{eq:theory_truth} \]

where we have explicitly noted that the model predictions depend on parameters \(\pars\). At the same time, observation \(y_i\) at input \(x_i\) should be the truth plus the experimental error \(\delta \yexp\):

\[ y_i = \ytrue(x_i) + \delta\yexp(x_i) . \label{eq:expt_truth} \]

Eliminating \(\ytrue\) yields our statistical model for the observations:

\[ y_i = \yth(x_i;\pars) + \delta\yexp(x_i) + \delta\yth(x_i) . \label{eq:stat_model} \]

These equations encode the relationship between the random variables \(y_i\), \(\yth(x_i;\pars)\), \(\delta\yexp(x_i)\) and \(\delta\yth(x_i)\). The underlying \(\ytrue\) describing the observables used for parameter estimation and for new observations could be the same, but in general \(\ytrue\) may be completely different for the predicted observable.

The correspondence to the notation for the same statistical model (4.18) is \(y \rightarrow \data\), \(\delta\yexp(x_i) \rightarrow \delta\data\), \(\yth(x_i;\pars) \rightarrow M(\pars)\), and \(\delta\yth(x_i) \rightarrow \delta M\).

Bayesian parameter estimation#

Quantifying the posterior distribution \(\pdf{\pars}{\data,I}\) for the parameters of a model is called Bayesian parameter estimation, and is a staple of Bayesian inference. It is a probabilistic generalization of parameter optimization and maximum likelihood estimation, in which one seeks the parameter value that extremizes some objective function or the data likelihood, respectively. Instead of single values characterizing the distribution (“point estimates”), we seek the full distribution. We will see multiple examples of this in the coming chapters.

To evaluate the posterior for the model parameters we must employ Bayes’ theorem,

(4.19)#\[\pdf{\pars}{\data,I} = \frac{\pdf{\data}{\pars,I}\pdf{\pars}{I}}{\pdf{\data}{I}}.\]

Here, we must formulate a likelihood of the data \(\pdf{\data}{\pars,I}\) and a prior distribution of the model parameters \(\pdf{\pars}{I}\).

The denominator in Eq. (4.19) is sometimes referred to as the marginal likelihood or the evidence and normalizes the left-hand side such that it integrates to unity, i.e., we have

(4.20)#\[ \pdf{\data}{I} = \int_{\Omega} \pdf{\data}{\pars,I} \pdf{\pars}{I}\, {\rm d}\pars. \]

Often we do not need an absolutely normalized posterior distribution, so we can omit the denominator in Eq. (4.19). Indeed, the denominator does not depend on \(\pars\), since the parameters have been integrated out.
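For a single parameter, Eqs. (4.19) and (4.20) can be evaluated by brute force on a grid. The sketch below does this for an assumed toy problem that is not from the text: ten noisy measurements of a constant \(\pars\), independent Gaussian errors with known standard deviation `sigma`, and a Gaussian prior with standard deviation `prior_sd`. The evidence is approximated by a simple Riemann sum over a truncated parameter domain.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic data: noisy measurements of a constant (assumed setup).
sigma = 0.5                                  # known noise std. dev. (assumption)
data = rng.normal(1.0, sigma, size=10)

# Grid over the single parameter theta (a truncated version of Omega).
theta = np.linspace(-2.0, 4.0, 2001)
dtheta = theta[1] - theta[0]

# Log-likelihood log p(D | theta, I) for independent Gaussian errors.
sq_dev = ((data[:, None] - theta[None, :]) ** 2).sum(axis=0)
log_like = -0.5 * sq_dev / sigma**2 - data.size * np.log(sigma * np.sqrt(2 * np.pi))

# Log-prior log p(theta | I): Gaussian with mean 0 (assumption).
prior_sd = 2.0
log_prior = -0.5 * (theta / prior_sd) ** 2 - np.log(prior_sd * np.sqrt(2 * np.pi))

# Numerator of Bayes' theorem, Eq. (4.19), evaluated on the grid.
unnormalized = np.exp(log_like + log_prior)

# Evidence p(D | I), Eq. (4.20), approximated by a Riemann sum.
evidence = unnormalized.sum() * dtheta

# Normalized posterior and its mean.
posterior = unnormalized / evidence
posterior_mean = (theta * posterior).sum() * dtheta

# Closed-form posterior mean for this Gaussian-Gaussian case, as a check.
analytic_mean = (data.sum() / sigma**2) / (data.size / sigma**2 + 1 / prior_sd**2)
```

Dividing by `evidence` rescales the curve but leaves its shape and the location of its maximum unchanged, which is why the denominator can be dropped when only relative posterior probabilities are needed. Grid evaluation scales poorly with the number of parameters, so this approach is practical only in low dimensions.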

Bayesian parameter estimation can sometimes be very challenging. In the chapter on Bayesian Linear Regression (BLR) we will see an example in which all calculations can be performed analytically. However, in most realistic applications the posterior must be evaluated numerically, most often by sampling with Markov chain Monte Carlo (MCMC) methods. This is no silver bullet: quantifying (or characterizing) a multi-dimensional posterior, sometimes with a complicated geometry, for an intricate physical model is by no means guaranteed to succeed, at least not in finite time. Nevertheless, obtaining posterior distributions to represent uncertainties is the gold standard in any inferential analysis.
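As a rough illustration of what sampling a posterior involves, here is a minimal random-walk Metropolis sketch. It is one simple member of the MCMC family and is not necessarily the algorithm used elsewhere in this text. The function names `log_posterior` and `metropolis`, the step size, and the stand-in target (a standard normal in one dimension) are all assumptions made for the example. Only the unnormalized posterior is needed, since the evidence cancels in the acceptance ratio.

```python
import numpy as np


def log_posterior(theta):
    """Unnormalized log-posterior; a standard normal stands in for a real one."""
    return -0.5 * theta**2


def metropolis(log_post, theta0, n_steps, step_size, rng):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    samples = np.empty(n_steps)
    theta = theta0
    logp = log_post(theta)
    n_accept = 0
    for i in range(n_steps):
        # Symmetric Gaussian proposal centered on the current state.
        proposal = theta + step_size * rng.normal()
        logp_prop = log_post(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = proposal, logp_prop
            n_accept += 1
        # Record the current state, whether or not the proposal was accepted.
        samples[i] = theta
    return samples, n_accept / n_steps


rng = np.random.default_rng(seed=2)
samples, acceptance = metropolis(
    log_posterior, theta0=0.0, n_steps=20000, step_size=1.0, rng=rng
)
```

The resulting `samples` are correlated draws whose histogram approximates the target. In practice one would also discard an initial warm-up portion and check convergence diagnostics, which this sketch omits.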

The posterior predictive distribution#

The distribution of future data conditioned on past data and background information, i.e., \(\pdf{\futuredata}{\data,I}\), is called a posterior predictive distribution (ppd). Assuming that we have a model \(M(\pars)\) for the data-generating mechanism, we can express this distribution by marginalizing over the uncertain model parameters \(\pars \in \Omega\):

(4.21)#\[\pdf{\futuredata}{\data,I} = \int_{\Omega} \pdf{\futuredata}{\pars,\data, I}\pdf{\pars}{\data,I}\,{\rm d} \pars.\]

If \(\futuredata\) is conditionally independent of \(\data\), we can replace \(\pdf{\futuredata}{\pars,\data, I}\) by \(\pdf{\futuredata}{\pars, I}\), but there are cases of ppds where this is not true and we must be more careful (e.g., sometimes when we include \(\delta M\)). By performing this integral we account for the uncertainty in the model parameters \(\pars\) when making predictions. In fact, one can marginalize (average) predictions over anything and everything that we are uncertain about as long as we have access to the necessary probability distributions.
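When posterior samples are available, the integral in Eq. (4.21) can be approximated by Monte Carlo: draw \(\pars\) from the posterior, then draw a future datum given each \(\pars\). The sketch below assumes, purely for illustration, a Gaussian posterior with hypothetical mean `post_mean` and standard deviation `post_sd`, a future measurement with Gaussian noise `sigma`, and conditional independence of \(\futuredata\) and \(\data\) given \(\pars\).

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Assumed posterior p(theta | D, I): Gaussian with hypothetical values.
post_mean, post_sd = 1.0, 0.2
sigma = 0.5   # assumed noise std. dev. of a future measurement

# Step 1: draw parameter samples from the posterior.
theta_samples = rng.normal(post_mean, post_sd, size=100_000)

# Step 2: for each theta, draw one future datum from p(F | theta, I).
# Using p(F | theta, I) instead of p(F | theta, D, I) relies on the
# conditional-independence assumption stated above.
future_samples = rng.normal(theta_samples, sigma)

# Monte Carlo summaries of the ppd.
ppd_mean = future_samples.mean()
ppd_sd = future_samples.std()

# For this Gaussian case the ppd width is known in closed form.
exact_sd = np.sqrt(post_sd**2 + sigma**2)

# A plug-in prediction that fixes theta at the posterior mean would have
# width sigma only, i.e., it ignores the parameter uncertainty.
plugin_sd = sigma
```

Comparing `ppd_sd` with `plugin_sd` shows the effect of the marginalization: the predictive distribution is broader than a prediction made at a single best-fit parameter value.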

Checkpoint question

How would you notate the statement that \(\futuredata\) is conditionally independent of \(\data\) in (4.21)?

Exercises#

Exercise 4.3

Derive Eq. (4.21) using the rules of probability calculus and inference.

Exercise 4.4

Can you think of a situation where you would have to compute the denominator in Eq. (4.19)?

Exercise 4.5

In Gothenburg it rains on 60% of the days. The weather forecaster at SMHI attempts to predict whether or not it will rain tomorrow. Thus far, 75% of rainy days and 55% of dry days have been correctly predicted. Given that the forecast for tomorrow predicts rain, what is the probability that it will actually rain tomorrow?

Exercise 4.6

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, there are goats. You pick a door, say No. 1. This door remains closed. Instead, the game-show host, who knows what’s behind all three doors, opens another door where he knows there will be a goat, say No. 3, which indeed has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your initial choice of door? Motivate your answer using Bayes’ theorem. (This is a famous problem known as the Monty Hall problem.)

Exercise 4.7

Assume we have three coins in a bag. All three coins feel and look the same, but we are told that the first coin is biased towards heads, with a 0.75 probability of obtaining heads when flipped; the second coin is fair, i.e., it has a 0.5 probability of obtaining heads; and the third coin is biased towards tails, with a 0.25 probability of coming up heads.

Assume that you reach your hand into the bag and pick a coin randomly, then flip it and obtain heads. What is the probability of obtaining heads if you flip it once more?

Solutions#

Here are answers and solutions to selected exercises.
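As a numerical cross-check for Exercises 4.5, 4.6, and 4.7, the following sketch applies Bayes' theorem over a discrete set of hypotheses using exact fractions. The helper `bayes_update` is a hypothetical name introduced here. Two modeling assumptions are made explicit: in Exercise 4.5, "55% of dry days correctly predicted" is read as a 45% probability of a rain forecast on a dry day; in Exercise 4.6, the host is assumed to choose at random between the two goat doors when the car is behind the contestant's door.

```python
from fractions import Fraction as F


def bayes_update(prior, likelihood):
    """Return the normalized posterior over discrete hypotheses.

    prior and likelihood are dicts keyed by hypothesis; the normalization
    is the discrete analogue of the evidence in Eq. (4.20).
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: p / evidence for h, p in unnormalized.items()}


# Exercise 4.5: the datum is "the forecast says rain".
rain_posterior = bayes_update(
    prior={"rain": F(60, 100), "dry": F(40, 100)},
    likelihood={"rain": F(75, 100), "dry": F(45, 100)},
)
p_rain = rain_posterior["rain"]

# Exercise 4.6: hypotheses are the car's door; the datum is
# "host opens door 3" given that the contestant picked door 1.
door_posterior = bayes_update(
    prior={1: F(1, 3), 2: F(1, 3), 3: F(1, 3)},
    likelihood={1: F(1, 2), 2: F(1), 3: F(0)},
)
p_switch_wins = door_posterior[2]

# Exercise 4.7: hypotheses are the three coins; the datum is one head.
p_heads = {"biased_heads": F(3, 4), "fair": F(1, 2), "biased_tails": F(1, 4)}
coin_posterior = bayes_update(
    prior={coin: F(1, 3) for coin in p_heads},
    likelihood=p_heads,
)

# Discrete posterior predictive: average the heads probability of each
# coin over the posterior, in analogy with Eq. (4.21).
p_heads_again = sum(coin_posterior[coin] * p_heads[coin] for coin in p_heads)
```

Under the stated assumptions this gives `p_rain = 5/7`, `p_switch_wins = 2/3`, and `p_heads_again = 7/12`. The last quantity is a posterior predictive probability, with the sum over coins playing the role of the integral over \(\pars\).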