6.3. Computing the posterior analytically#
First the likelihood#
Suppose we had a fair coin \(\Longrightarrow\) \(p_h = 0.5\)
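Then the probability of getting \(R\) heads in \(N\) tosses is

$$
  p(R,N \mid p_h{=}0.5) = {N \choose R} \left(\frac{1}{2}\right)^{R} \left(\frac{1}{2}\right)^{N-R} = {N \choose R} \frac{1}{2^N} .
$$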
Is the sum rule obeyed?
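That is, does summing over all possible outcomes give one?

$$
  \sum_{R=0}^N p(R,N \mid p_h{=}0.5) = \frac{1}{2^N} \sum_{R=0}^N {N \choose R} = \frac{2^N}{2^N} = 1 .
$$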
Proof of penultimate equality
\((x+y)^N = \sum_{R=0}^N {N \choose R} x^R y^{N-R} \overset{x=y=1}{\longrightarrow} \sum_{R=0}^N {N \choose R} = 2^N\). More generally, setting \(x = p_h\) and \(y = 1 - p_h\) shows that the sum rule holds for any \(p_h\).
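As a quick numerical sanity check of the sum rule, here is a minimal sketch using scipy.stats.binom (the values of \(N\) and \(p_h\) are arbitrary examples):

```python
from scipy.stats import binom

# Sum rule check: p(R, N | p_h) summed over all R = 0, ..., N should equal 1.
N = 10
for p_h in (0.5, 0.3):  # a fair coin and a biased coin
    total = sum(binom.pmf(R, N, p_h) for R in range(N + 1))
    print(f"p_h = {p_h}: sum over R = {total:.10f}")
```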
The likelihood for a more general \(p_h\) is the binomial distribution:
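$$
  p(R,N \mid p_h) = {N \choose R}\, p_h^{R}\, (1 - p_h)^{N-R} .
$$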
Maximum likelihood means: what value of \(p_h\) maximizes the likelihood? (The notation \(\mathcal{L}\) is often used for the likelihood.)
Exercise: Carry out the maximization
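One way to check your result (a sketch, not a substitute for doing the algebra): work with \(\log\mathcal{L}\), which has the same maximum,

$$
  \frac{\partial \log\mathcal{L}}{\partial p_h}
  = \frac{\partial}{\partial p_h}\Bigl[R \log p_h + (N - R)\log(1 - p_h) + \text{const}\Bigr]
  = \frac{R}{p_h} - \frac{N-R}{1-p_h} = 0
  \quad\Longrightarrow\quad
  \hat{p}_h = \frac{R}{N} .
$$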
But, as a Bayesian, we want to know about the PDF for \(p_h\), so we actually want the PDF the other way around: \(p(p_h|R,N)\). Bayes says
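$$
  p(p_h \mid R,N) = \frac{p(R,N \mid p_h)\, p(p_h)}{p(R,N)} .
$$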
Note that the denominator doesn’t depend on \(p_h\) (it is just a normalization).
So how do we actually calculate the updated posterior?
In this case we can do analytic calculations.
Case I: uniform (flat) prior#
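With a uniform prior, \(p(p_h|I) = 1\) for \(0 \leq p_h \leq 1\), Bayes' theorem reads

$$
  p(p_h \mid R,N,I) = \frac{p(R,N \mid p_h, I)\; p(p_h \mid I)}{p(R,N \mid I)} ,
$$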
where we will suppress the “\(I\)” going forward. But the normalization in the denominator is an integral we can do in closed form:
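$$
  p(R,N) = \int_0^1 dp_h\; p(R,N \mid p_h)\, p(p_h) = {N \choose R} \int_0^1 dp_h\; p_h^{R} (1-p_h)^{N-R} .
$$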
Recall Beta function
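$$
  B(a,b) = \int_0^1 dt\; t^{a-1} (1-t)^{b-1} = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}
$$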
and \(\Gamma(x) = (x-1)!\) for positive integer \(x\).
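Putting these together, with a flat prior the normalized posterior is

$$
  p(p_h \mid R,N) = \frac{p_h^{R} (1-p_h)^{N-R}}{B(R+1,\,N-R+1)} = \frac{(N+1)!}{R!\,(N-R)!}\; p_h^{R} (1-p_h)^{N-R} ,
$$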
and so evaluating the posterior for \(p_h\) at new values of \(R\) and \(N\) is direct: substitute (6.5) into (6.2). If we want the unnormalized result with a uniform prior (meaning we ignore the normalization constant \(\mathcal{N}\), which simply gives an overall scaling of the distribution), then we just use the likelihood (6.1), since \(p(p_h) = 1\) in this case.
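A minimal sketch of this evaluation (using scipy.stats.beta; the numbers and variable names here are illustrative, not those of the accompanying notebook):

```python
import numpy as np
from scipy.stats import beta

R, N = 7, 10                      # e.g., 7 heads observed in 10 tosses
p_h = np.linspace(0, 1, 201)

# Normalized posterior for a uniform prior: a beta distribution with a = 1 + R, b = 1 + N - R
posterior = beta.pdf(p_h, 1 + R, 1 + N - R)

# Unnormalized posterior: just the likelihood shape p_h^R (1 - p_h)^(N - R), since p(p_h) = 1
unnormalized = p_h**R * (1 - p_h)**(N - R)

# Both peak at the same place (the maximum-likelihood value R/N)
print(p_h[np.argmax(posterior)], p_h[np.argmax(unnormalized)])
```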
Case II: conjugate prior#
Choosing a conjugate prior (if possible) means that the posterior will have the same form as the prior. Here, if we pick a beta distribution as the prior, it is conjugate to the coin-flipping likelihood. From the scipy.stats.beta documentation, the beta distribution is (as a function of \(x\) with parameters \(a\) and \(b\)):
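$$
  f(x,a,b) = \frac{\Gamma(a+b)\, x^{a-1} (1-x)^{b-1}}{\Gamma(a)\,\Gamma(b)} ,
$$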
where \(0 \leq x \leq 1\) and \(a>0\), \(b>0\). So \(p(x|a,b) = f(x,a,b)\), and our likelihood, viewed as a function of \(p_h\), has the form of a beta distribution, \(p(R,N|p_h) \propto f(p_h,\,1+R,\,1+N-R)\), in agreement with (6.1).
If the prior is \(p(p_h|I) = f(p_h,\alpha,\beta)\), with \(\alpha\) and \(\beta\) chosen to reproduce our prior expectations (or knowledge), then by Bayes' theorem the normalized posterior is
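$$
  p(p_h \mid R,N) = \frac{p(R,N \mid p_h)\; f(p_h,\alpha,\beta)}{p(R,N)} = f(p_h,\; \alpha + R,\; \beta + N - R) ,
$$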
so we update the prior simply by changing the arguments of the beta distribution: \(\alpha \rightarrow \alpha + R\), \(\beta \rightarrow \beta + N-R\) because the (normalized) product of two beta distributions is another beta distribution. Really easy!
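A short sketch of this update in code (again with scipy.stats.beta; the values of \(\alpha\), \(\beta\), \(R\), and \(N\) are just examples):

```python
from scipy.stats import beta

alpha, beta_prior = 2, 2      # example prior parameters (gently peaked at p_h = 0.5)
R, N = 7, 10                  # example data: 7 heads in 10 tosses

prior = beta(alpha, beta_prior)
posterior = beta(alpha + R, beta_prior + N - R)   # conjugate update: just shift the parameters

print("prior mean    :", prior.mean())        # alpha / (alpha + beta) = 0.5
print("posterior mean:", posterior.mean())    # (alpha + R) / (alpha + beta + N) = 9/14 ≈ 0.64
```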
Warning
Check this against the code! Look in the code where the posterior is calculated and see how the beta distribution is used. Verify that (6.7) is evaluated. Try changing the values of \(\alpha\) and \(\beta\) used in defining the prior to see how the shape changes.