39.2. Regression analysis with linear models
When performing a regression analysis with a linear model, i.e., doing linear regression, we have access to a set of data \(\mathcal{D}\) for the dependent variable, i.e.,

\[
\mathcal{D} = \left\{ y_1, y_2, \ldots, y_{N_d} \right\},
\]

where \(N_d\) is the number of data. For each datum there is an independent variable \(x_i\), and our model for each datum \(y_i\) is

\[
M_i = \sum_{j=0}^{N_p-1} \theta_j f_j(x_i),
\]

where the \(f_j\) are the (possibly non-linear) basis functions of the model and the \(\theta_j\) are its \(N_p\) parameters, which enter linearly. We can collect the basis functions evaluated at each independent variable \(x_i\) in a matrix \(\mathbf{X}\) of dimension \(N_d \times N_p\),

\[
\mathbf{X} = \begin{pmatrix}
f_0(x_1) & f_1(x_1) & \cdots & f_{N_p-1}(x_1) \\
f_0(x_2) & f_1(x_2) & \cdots & f_{N_p-1}(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
f_0(x_{N_d}) & f_1(x_{N_d}) & \cdots & f_{N_p-1}(x_{N_d})
\end{pmatrix},
\]

such that \(X_{ij} = f_j(x_i)\).
This matrix will be referred to as a design matrix.
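As a concrete illustration of this construction, the following sketch builds a design matrix column by column from a set of basis functions. It uses NumPy, and the particular basis functions and grid of \(x_i\) values are arbitrary choices made only for this example.

```python
import numpy as np

# Illustrative basis functions f_j(x); any callables could be used here.
basis_functions = [
    lambda x: np.ones_like(x),  # f_0(x) = 1 (intercept)
    lambda x: x,                # f_1(x) = x
    lambda x: np.sin(x),        # f_2(x) = sin(x)
]

# Arbitrary grid of N_d independent variables.
x = np.linspace(0.0, 5.0, num=20)

# Design matrix with elements X_ij = f_j(x_i), shape (N_d, N_p).
X = np.column_stack([f(x) for f in basis_functions])
print(X.shape)  # (20, 3)
```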
Example 39.3 (The design matrix for polynomial models)
The design matrix for a linear model with polynomial basis functions becomes

\[
\mathbf{X} = \begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^{p-1} \\
1 & x_2 & x_2^2 & \cdots & x_2^{p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N_d} & x_{N_d}^2 & \cdots & x_{N_d}^{p-1}
\end{pmatrix},
\]

where we are considering a polynomial of degree \(p-1\), which implies a model with \(p\) features (including the intercept). In linear algebra this matrix is also known as a Vandermonde matrix.
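For polynomial basis functions the same construction can be done directly with NumPy's `np.vander`, which with `increasing=True` produces the columns \(1, x, x^2, \ldots\) in the order shown above. This is a sketch with an arbitrary grid and degree:

```python
import numpy as np

# Arbitrary grid of N_d = 10 independent variables.
x = np.linspace(-1.0, 1.0, num=10)

# Vandermonde design matrix for a polynomial of degree p - 1 = 2,
# i.e. p = 3 features: columns 1, x, x^2.
p = 3
X = np.vander(x, N=p, increasing=True)
print(X.shape)  # (10, 3)
```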
Next, we introduce a column vector for the parameters,

\[
\pars = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{N_p-1} \end{pmatrix},
\]
and we arrive at the matrix equation

\[
\data = \dmat \pars + \boldsymbol{\epsilon}.
\]

The last term \(\boldsymbol{\epsilon}\) is a column vector of so-called residuals. This term expresses the part of the data for the dependent variable that we cannot describe using the linear model. Formally, we can therefore write \(\epsilon_i = y_i - M_i\) and define the vector as

\[
\boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{N_d} \end{pmatrix} = \data - \dmat\pars.
\]

It is important to realize that our model \(M\) provides an approximate description of the data. Indeed, all models are wrong, and in a realistic setting we have no guarantee that the data is generated by a linear process. Of course, based on physics insight, or other assumptions, there might exist very good reasons for using a linear model to explain the data.
The normal equation
A regression analysis often aims at finding the model parameters \(\pars\) of a model \(M\) such that the vector of residuals \(\boldsymbol{\epsilon}\) is minimized in the sense of its Euclidean norm (or 2-norm). You might ask the very relevant question why this particular goal is desirable. We will return to this consideration in Bayesian Linear Regression (BLR). Nevertheless, in order to find the “optimal” set of parameters \(\pars^*\) we seek to minimize

\[
C(\pars) \equiv \left| \boldsymbol{\epsilon} \right|_2^2 = \left( \data - \dmat\pars \right)^T \left( \data - \dmat\pars \right) = \sum_{i=1}^{N_d} \left( y_i - M_i \right)^2.
\]

The solution to this optimization problem turns out to be a solution of the normal equation, and the method is known as ordinary least squares or ordinary linear regression.
Theorem 39.1 (Ordinary least squares (the normal equation))
The ordinary least-squares method corresponds to finding the optimal parameter vector \(\pars^*\) that minimizes the Euclidean norm of the residual vector \(\boldsymbol{\epsilon} = \data - \dmat \pars\), where \(\data\) is a column vector of observations and \(\dmat\) is the design matrix (39.6).
Finding this optimum turns out to correspond to solving the normal equation

\[
\dmat^T \dmat \pars^* = \dmat^T \data.
\]

Given that the normal matrix \(\dmat^T\dmat\) is invertible, the solution to the normal equation is given by

\[
\pars^* = \left( \dmat^T \dmat \right)^{-1} \dmat^T \data.
\]
Proof. Due to its quadratic form, the squared Euclidean norm \(\left| \boldsymbol{\epsilon} \right|_2^2 = \left(\data-\dmat\pars\right)^T\left(\data-\dmat\pars\right) \equiv C(\pars)\) is bounded from below and we just need to find the single extremum. That is, we need to solve the problem

\[
\boldsymbol{\nabla}_{\pars} C(\pars) = 0.
\]

In practical terms this means that we require

\[
\frac{\partial C(\pars)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \sum_{i=1}^{N_d} \left( y_i - \sum_{k=0}^{N_p-1} f_k(x_i)\, \theta_k \right)^2 = 0, \qquad j = 0, 1, \ldots, N_p-1,
\]

where \(y_i\) and \(f_j(x_i)\) are the elements of \(\data\) and \(\dmat\), respectively. Performing the derivative results in

\[
\frac{\partial C(\pars)}{\partial \theta_j} = -2 \sum_{i=1}^{N_d} f_j(x_i) \left( y_i - \sum_{k=0}^{N_p-1} f_k(x_i)\, \theta_k \right),
\]
which is one element of the full gradient vector. This gradient vector can be succinctly expressed in matrix-vector form as

\[
\boldsymbol{\nabla}_{\pars} C(\pars) = -2\, \dmat^T \left( \data - \dmat\pars \right).
\]

The minimum of \(C\), where \(\boldsymbol{\nabla}_{\pars} C(\pars) = 0\), then corresponds to

\[
\dmat^T \dmat \pars^* = \dmat^T \data,
\]

which is the normal equation. Finally, if the matrix \(\dmat^T\dmat\) is invertible then we have the solution

\[
\pars^* = \left( \dmat^T \dmat \right)^{-1} \dmat^T \data.
\]

We note also that since our design matrix is defined as \(\dmat\in {\mathbb{R}}^{N_d\times N_p}\), the product \(\dmat^T\dmat \in {\mathbb{R}}^{N_p\times N_p}\). The product \(\left(\dmat^T\dmat\right)^{-1}\dmat^T\) is called the pseudo-inverse of the design matrix \(\dmat\). The pseudo-inverse is a generalization of the usual matrix inverse; it can be defined also for non-square matrices that are not necessarily of full rank. For full-rank, square matrices the pseudo-inverse is equal to the usual inverse.
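To make the result concrete, here is a minimal sketch in NumPy that fits a quadratic model to synthetic data and recovers \(\pars^*\) both by solving the normal equation and via the pseudo-inverse. The true parameters, noise level, and random seed are arbitrary choices for illustration, and `np.linalg.lstsq` is included only as a cross-check since it solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic data: a quadratic model plus Gaussian noise (illustrative choices).
theta_true = np.array([1.0, -2.0, 0.5])
x = np.linspace(0.0, 2.0, num=50)
X = np.vander(x, N=3, increasing=True)            # design matrix, shape (N_d, N_p)
y = X @ theta_true + rng.normal(scale=0.1, size=x.size)

# Ordinary least squares via the normal equation X^T X theta = X^T y.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent solution using the (Moore-Penrose) pseudo-inverse.
theta_pinv = np.linalg.pinv(X) @ y

# Cross-check with NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)
print(theta_pinv)
print(theta_lstsq)
```

In practice, forming \(\dmat^T\dmat\) explicitly can be numerically ill-conditioned for nearly degenerate design matrices; SVD-based routines such as the pseudo-inverse or a dedicated least-squares solver are then the safer choice.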
Checkpoint question
Here we have been minimizing the sum of squared residuals, see Eq. (39.11). An optimization metric that is less dependent on the number of data is the mean-squared error. The cost function is then

\[
C_\mathrm{MSE}(\pars) = \frac{1}{N_d} \left( \data - \dmat\pars \right)^T \left( \data - \dmat\pars \right).
\]
Would this choice of cost function modify the following, and if so, how:
The optimum \(\pars^*\)?
The expression for the gradient vector \(\boldsymbol{\nabla}_{\pars} C_\mathrm{MSE} (\pars)\)?
Answers
The only difference is a factor \(1/N_d\), which does not affect the position of the minimum; the optimum \(\pars^*\) is therefore unchanged. The gradient, however, will inherit this factor from the cost function.
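Written out with the matrix-vector gradient from the proof above, this scaling is explicit:

\[
\boldsymbol{\nabla}_{\pars} C_\mathrm{MSE}(\pars) = -\frac{2}{N_d}\, \dmat^T \left( \data - \dmat\pars \right) = \frac{1}{N_d}\, \boldsymbol{\nabla}_{\pars} C(\pars),
\]

so the gradient vanishes at the same \(\pars^*\) as before.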
The regression residuals \(\boldsymbol{\epsilon}^{*} = \data - \dmat \pars^{*}\) can be used to obtain an estimator \(s^2\) of the variance of the residuals,

\[
s^2 = \frac{\left(\boldsymbol{\epsilon}^{*}\right)^T \boldsymbol{\epsilon}^{*}}{N_d - N_p} = \frac{1}{N_d - N_p} \sum_{i=1}^{N_d} \left( \epsilon_i^{*} \right)^2,
\]
where \(N_p\) is the number of parameters in the model and \(N_d\) is the number of data.
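As a small, self-contained sketch (again with arbitrary synthetic data), this estimator can be computed directly from the fitted residuals:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Arbitrary synthetic data for a straight-line model (N_p = 2 features).
x = np.linspace(0.0, 1.0, num=30)
X = np.vander(x, N=2, increasing=True)       # columns: 1, x
y = X @ np.array([0.5, 2.0]) + rng.normal(scale=0.2, size=x.size)

# Ordinary least-squares fit and the corresponding residual vector.
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ theta_star

# Estimator of the residual variance: s^2 = (eps* . eps*) / (N_d - N_p).
N_d, N_p = X.shape
s2 = residuals @ residuals / (N_d - N_p)
print(s2)
```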