4.7. *Aside: Bayesian epistemology

Our presentation merits some philosophical remarks on the interpretation of probabilities.

Philosophical remarks on probabilities

The axioms of probability theory are very useful for manipulating probabilities and obtaining quantitative results. There is, however, an ongoing philosophical discussion centered on the question: what is the probability \(\prob\)? There exist several interpretations of probability, and the two main views are frequentist probability and Bayesian probability. There are numerous flavors of probability interpretations in each category, most of which are consistent with Kolmogorov’s axioms. Generally speaking, there is no difference in the calculus of Bayesian and frequentist probabilities. However, their interpretations differ on a fundamental level, and we will get to that shortly. But why even bother to dwell on this topic? Philosophy is sometimes (rightly) accused of dealing solely with abstractions separated from reality and therefore of being of no use to someone interested in real-world applications. There is, however, something fundamentally useful about the philosophy of probability, in the sense that all events, propositions, and outcomes in the social and natural sciences are bearers of probability. It is therefore of fundamental importance to learn how we can use the mathematical measure of probability to better understand real systems.

Frequentist probability

The frequentist interpretation of probability was developed by John Venn in the second half of the 19th century. He argued that the probability of an event \(A\) is the relative frequency of its actual occurrence in a series of \(N\) experiments, i.e.,

\[\begin{equation} \prob(A) = \frac{n_A}{N}, \tag{4.23} \end{equation}\]

assuming \(A\) happened \(n_A\) times. This interpretation is connected with the classical interpretation of probability proposed by Jacob Bernoulli and Pierre-Simon Laplace, which dictates an equal distribution of probability among all possible events in the absence of any evidence otherwise. Obviously, the classical interpretation is applicable only in circumstances where all events are equally probable. For example, the classical probability of a fair die landing on any of the numbers up to (and including) 4 is \(4/6\). However, the world consists of more than equally probable events, and the frequentist interpretation was designed to remedy this. At first glance, the classical and frequentist interpretations appear to be the same. Note, however, that in the classical interpretation we count all possible outcomes before any experiment is conducted, whereas in the frequentist interpretation we count actual outcomes, and only those. The frequentist interpretation strives to attain some level of objectivity in the quantification of the probability of real-world data.
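To make the contrast between the classical and the frequentist counting concrete, here is a minimal sketch that simulates rolls of a fair die and forms the relative frequency \(n_A/N\) for the event "the die shows 4 or less". The function name `relative_frequency`, the seed, and the chosen numbers of rolls are illustrative assumptions and not part of the original text.

```python
import random


def relative_frequency(n_rolls, seed=0):
    """Return n_A / N for the event A = 'a fair die shows 4 or less'."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Count the actual occurrences of A in a series of n_rolls experiments.
    n_a = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) <= 4)
    return n_a / n_rolls


# Classical probability: count favourable outcomes before any experiment.
classical = 4 / 6

# Frequentist probability: count actual outcomes, and only those.
for n in (10, 1_000, 100_000):
    print(f"N = {n:>7}: n_A/N = {relative_frequency(n):.4f} "
          f"(classical value {classical:.4f})")
```

For a small number of rolls the relative frequency can differ noticeably from \(4/6\), and the two values are expected to approach each other only as \(N\) grows.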

Consider a coin toss. Before the coin has been tossed, there is nothing we can say regarding the frequentist probability of the coin landing heads up, i.e., \(\prob(H)\), or the opposite, that the coin lands tails up, \(\prob(T)\). Indeed, there is no data from which to form a frequency ratio. Once the coin is tossed and it has landed, but we have not seen the answer (collected the data), the probability is either \(\prob(H)=1\) or \(\prob(H)=0\). The coin has landed either heads or tails up, and once we collect the data we will assign a 0/1 probability. We will contrast this example with the Bayesian interpretation of probability, discussed in the next section, which allows you to form a belief about, e.g., \(\prob(H)\) before the data is collected and even before any coin has ever been tossed; we will call this probability a prior. In summary, the frequentist interpretation is firmly grounded in the collection of data, and not much else. The reason for this is that the frequentist interpretation strives to measure an objective truth. It is also only possible to form probabilities that can be linked to some frequency of events in a series of some kind. This is not without problems. Try to use this interpretation to quantify the probability that the Sun will rise tomorrow morning or that the Universe is geometrically open. To us physicists, the limited scope of frequentist probabilities places a serious constraint on their usefulness.

Bayesian probability

Bayesian probability dates back to the mid-18th century, when Thomas Bayes derived a special form of what is nowadays known as Bayes’ theorem. After that, Laplace pioneered this branch of probability and established what was then referred to as inverse probability. He combined Bayes’ theorem with the principle of indifference, which can be seen as positing a flat prior probability on possible events. Nowadays, the Bayesian interpretation of probability amounts to assigning a graded belief to any proposition or hypothesis. This approach enables probabilities to be applied beyond situations where a frequency can be identified.

So how do we quantify the probability \(\prob(A|D,I)\) for some event \(A\) given data \(D\), and any other information \(I\), if not as a frequency? This is a longer discussion, and to get to the core of that question, let us begin by inspecting Bayes’ theorem. Doing so, we realize that we must also formulate a prior \(\prob(A|I)\). As a side remark, we mention that in the Bayesian view there is no such thing as an absolute probability, e.g., \(\prob(A)\), as in the frequentist case; instead, all probabilities are conditioned on \(I\) at least. Once the prior is formed, and we have collected some data, we can modulate the likelihood \(\prob(D|A,I)\) with the prior using Bayes’ theorem to obtain \(\prob(A|D,I)\). Note that the prior does not necessarily characterize information from the past. Indeed, there is no temporal dimension in probability theory. Bayesian inference can be viewed as a framework for making decisions under uncertainty or incomplete information. The Bayesian paradigm can be applied by a historian trying to infer events from the past based on incomplete records and archives, or by a court reaching a verdict in a legal process based on limited evidence and uncertain testimonies.
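As a minimal sketch of how the likelihood is modulated by the prior, consider Bayes’ theorem in the form \(\prob(A|D,I) = \prob(D|A,I)\,\prob(A|I)/\prob(D|I)\) applied to a coin. The setup is an assumption made purely for illustration: \(A\) is the proposition "the coin is fair", the only alternative considered is a coin with heads probability 0.8, and the data \(D\) are 8 heads in 10 tosses. The function names and all numbers are hypothetical.

```python
from math import comb


def binomial_likelihood(p_heads, heads, tosses):
    """Probability of observing `heads` heads in `tosses` independent tosses."""
    return comb(tosses, heads) * p_heads**heads * (1 - p_heads) ** (tosses - heads)


def posterior(prior_a, like_a, like_not_a):
    """Bayes' theorem for a proposition A and its single alternative not-A."""
    # The evidence P(D|I) sums over both hypotheses considered.
    evidence = like_a * prior_a + like_not_a * (1 - prior_a)
    return like_a * prior_a / evidence


# Illustrative data: 8 heads in 10 tosses.
heads, tosses = 8, 10
like_fair = binomial_likelihood(0.5, heads, tosses)    # P(D | A, I)
like_biased = binomial_likelihood(0.8, heads, tosses)  # P(D | not A, I)

# Two agents with different background information I, hence different priors.
for prior_fair in (0.5, 0.9):
    print(f"prior P(A|I) = {prior_fair:.1f} -> "
          f"posterior P(A|D,I) = {posterior(prior_fair, like_fair, like_biased):.3f}")
```

The same data lead to different posteriors for the two priors, which illustrates that every probability in this calculation is conditioned on the background information \(I\).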

Regarding the formulation of the prior, we encounter two schools of thought: the objective and the subjective interpretations of probability. The former interpretation expands on Laplace’s principle of indifference and defines probability as a formal system of logic and reasoning in the presence of uncertainty. In this objective approach to Bayesian probability, it is essential that the prior probability is assigned consistently with a logical analysis of all prior information in a minimally informative sense, i.e., as objectively as possible. The method of maximum entropy is put forward as one way to achieve this. Indeed, entropy measures the lack of ‘informativeness’ of a probability function. Maximizing the entropy, subject to our background knowledge, therefore enables one to logically arrive at a maximally objective prior probability density (or mass). One can also try to construct a prior density that is invariant with respect to re-parametrization of the model parameters. This is called a Jeffreys prior, and its construction follows a well-defined mathematical procedure. These approaches sometimes come with mathematical challenges or the necessity to violate some axioms of probability. Formalizing objective priors is an active field of research, and if a method for representing ignorance comes to fruition it will have important consequences for how we should analyze data.
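The link between entropy and informativeness can be illustrated with a small sketch. It assumes that entropy here means the Shannon entropy \(-\sum_i p_i \ln p_i\) of a discrete probability mass function, and it compares three hypothetical priors for the outcome of a six-sided die. With no constraint beyond normalization, the flat assignment of the principle of indifference has the largest entropy.

```python
from math import log


def shannon_entropy(probs):
    """Shannon entropy -sum(p * ln p) of a discrete probability mass function."""
    # Terms with p = 0 contribute nothing (the limit of p * ln p is 0).
    return -sum(p * log(p) for p in probs if p > 0)


# Three illustrative (hypothetical) priors over the six faces of a die.
uniform = [1 / 6] * 6                     # principle of indifference
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # some information about face 1
certain = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # complete information

for name, pmf in (("uniform", uniform), ("loaded", loaded), ("certain", certain)):
    print(f"{name:>8}: entropy = {shannon_entropy(pmf):.4f}")
print(f"ln(6)    = {log(6):.4f}")
```

The more a prior commits to particular outcomes, the lower its entropy, so maximizing the entropy under given constraints selects the least committal assignment compatible with them.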

The fundamental striving for objectivity in the prior can be criticized. Indeed, we are seldom in a position where we actually are objective about a scientific proposition, at least if we care about the proposition in the first place. The Bayesian analysis of data entails subjective modelling choices, and not everyone has access to the same information. As such, probabilities will always be personal to some extent. In an extreme situation, we can have as many probabilities of an event \(A\) as there are agents in the world. The subjective interpretation of probability accommodates this stance. This approach does, however, require that agents are rational in the sense that they obey the axioms of probability. This is sometimes referred to as coherence. The traditional approach to formulating coherent and subjective probabilities regarding some event \(A\) follows from a betting analysis. It basically boils down to the following: your degree of belief \(\prob(A|I)\), based on your background knowledge \(I\), is equal to \(p\) if you are willing to bet \(p\) dollars for a possible return of 1 dollar if \(A\) happens. For example, if you are willing to bet, say, 25 cents that it will rain on Thursday, then your probability that it will rain on Thursday is 0.25. One can argue against betting 0 or 1 dollars since it ruins the point of gambling and reflects positions of absolute certainty. It is pivotal to have a rational agent; otherwise we run the risk of having a series of bets bought and sold that collectively guarantee a loss regardless of outcomes. The betting analogy provides an intuitive and operational definition of subjective probability. Unfortunately, the real world does not include only rational agents, and the act of placing a bet on some event \(A\) could alter your expected belief of the same event.
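The guaranteed loss that threatens an incoherent agent can be sketched with a single pair of bets. In this illustration the agent buys one ticket that pays 1 dollar if \(A\) happens and one ticket that pays 1 dollar if \(A\) does not happen, each at a price equal to the agent's own stated degree of belief. The function name `net_gain` and the prices used are hypothetical.

```python
def net_gain(price_a, price_not_a, a_happens):
    """Agent's net gain (in dollars) after buying both 1-dollar tickets."""
    payout_a = 1.0 if a_happens else 0.0      # ticket on A
    payout_not_a = 0.0 if a_happens else 1.0  # ticket on not-A
    return (payout_a - price_a) + (payout_not_a - price_not_a)


# Incoherent beliefs: P(A|I) + P(not A|I) = 1.2 violates the sum rule.
incoherent = (0.6, 0.6)
# Coherent beliefs: the two probabilities sum to one.
coherent = (0.6, 0.4)

for label, (p_a, p_not_a) in (("incoherent", incoherent), ("coherent", coherent)):
    for a_happens in (True, False):
        gain = net_gain(p_a, p_not_a, a_happens)
        print(f"{label:>10}, A happens = {a_happens!s:>5}: net gain = {gain:+.2f}")
```

With the incoherent prices the agent loses 20 cents whichever way \(A\) turns out, whereas with the coherent prices this pair of bets breaks even. If the stated beliefs instead summed to less than one, the same argument would apply with the agent selling rather than buying the tickets.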

Summary

Probability calculus obeys a small set of reasonable axioms and rests on a well-founded mathematical theory of measures. The fundamental challenge in dealing with probabilities lies in mapping the mathematical measure of probability to events occurring in the real world. To make an analogy, let us consider the measure of length, i.e., the metre. The notion of length is somewhat trivial, although an expanded perspective was provided by Einstein. From an abstract point of view, you know how to define a coordinate system in some space with a well-defined inner product, etc. There is indeed very little challenge in representing it mathematically. Even so, mankind has refined the operational definition of the metre for centuries, and applying the measure of length to reality is not as trivial as one might think. The question ‘what is a metre?’ has been given three different answers in the 20th century alone. The measure of probability is far more important than the metre since it is the metre stick we use to quantify uncertainty, and as such it is a cornerstone of the scientific method. Yet, we have still not settled on a philosophical position that encompasses all uncertainty, nor do we know if such a position exists. Any progress in this direction will provide an important advance in our understanding of the world.

Discuss

How would you respond to the following statement: ‘As scientists, we should be concerned with objective knowledge rather than subjective belief.’

One view

Any scientific analysis contains subjective knowledge. Indeed, we always make assumptions during the analysis, and your assumptions will be based on your prior domain knowledge, which is not necessarily equal to others’ domain knowledge. When performing Bayesian inference we must always state these subjective beliefs and assumptions in the priors. This transparency is welcome, and we should always do our best to report to what extent the inferences are sensitive to a particular choice of prior.