14.1. Frequentist hypothesis testing#
Recall that in frequentist statistics, probability statements are restricted to random variables. A hypothesis can not be considered a random variable, and therefore we are restricted to a much more indirect approach when trying to infer its truth, or rather when attempting to falsify it.
Basic idea#
The standard sampling theory approach to hypothesis testing is to construct a statistical test. The basic idea is described in the following.
Frequentist hypothesis testing. The sampling theory hypothesis test is designed to compare a selected statistic from the measured data with expected results from a very large number of hypothetical repeated measurements under the assumption that a chosen null hypothesis (\(\mathcal{H}_0\)) is true. The null hypothesis is accepted or rejected purely on the basis of how unexpected the data were to \(\mathcal{H}_0\), not on how much better the alternative hypothesis (\(\mathcal{H}_A\)) predicted the data.
The degree of ‘‘unexpectedness’’ is based on a statistic, such as the sample mean or the \(\chi^2\) statistic. The statistic is a random variable and it is chosen so that its distribution can be easily computed given the truth of the null hypothesis. In other words, this is the distribution of the chosen statistic for a very large number of hypothetical repeated measurements under the assumption that the null hypothesis is true. This statistic is then computed for the observed data set and its value is compared with the distribution that is associated with the truth of the null hypothesis. If the statistic from the observed data falls in a very unlikely spot on this distribution (the threshold is to be defined beforehand) we choose to reject the null hypothesis at some confidence level on the basis of the measured data set.
Hypothesis testing with the chi-squared statistic#
A very common statistic to use is the \(\chi^2\) measure. A good example is found in Gregory, ch 7.2.1, with the measurements of flux density from a distant galaxy over a period of 6000 days. The main steps of the presented analysis are the following:
Choose as a null hypothesis that the galaxy has an unknown, but constant, flux density. If we can reject this hypothesis at e.g. the 95% confidence level, then this provides indirect evidence(?) for the alternative hypothesis that the radio emission is variable.
In this example, it is assumed that the measurement errors are independent and identically distributed (iid) according to a normal distribution with a fixed standard deviation \(\sigma\) that is known beforehand.
The \(\chi^2\) statistic from the data set is evaluated (\(x_i\) is the data and \(\bar{x}\) is the average from the sample)
In our example we had 15 data points, but we are using them first to estimate the mean \(\mu\). Therefore, we lose one degree of freedom and are left with 14. This number will determine the form of the \(\chi^2\) distribution that will be used for comparison with our actual \(\chi^2\) statistic.
The question of how unlikely is this value of \(\chi^2\) is by convention interpreted in terms of the area in the tail of the \(\chi^2\) distribution beyond this line. This is called the \(P\)-value or significance.
In some cases, a two-sided statistic should be considered instead.
Fig. 14.1 The \(\chi^2\) distribution for 14 degrees of freedom. The value computed from the measurements of flux density from a galaxy is indicated by a vertical line. The shaded area is the \(P\)-value. It is 2% in this particular example so we would reject the null hypothesis with 98% confidence. (Gregory [Gre05], Fig. 7.2)#
At the point of performing this comparison, and making a final statement, the sampling theory school divides itself into two camps:
One camp uses the following protocol: first, before looking at the data, pick the significance level of the test (e.g. 5%), and determine the critical value of \(\chi^2\) above which the null hypothesis will be rejected. (The significance level is the fraction of times that the statistic \(\chi^2\) would exceed the critical value, if the null hypothesis were true.) Then, compare the actual \(\chi^2\) with the critical value, and declare the outcome of the test, and its significance level (which was fixed beforehand).
The second camp looks at the data, finds \(\chi^2\), then looks in the table of \(\chi^2\)-distributions for the significance level, \(P\), for which the observed value of \(\chi^2\) would be the critical value. The result of the test is then reported by giving this value of \(p\), which is the fraction of times that a result as extreme as the one observed, or more extreme, would be expected to arise if the null hypothesis were true.