# Regression analysis with linear models

When performing a regression analysis with a linear model, i.e., doing linear regression, we have access to a set of data $\mathcal{D}$ for the dependent variable,

\begin{equation}
\data = [y_1, y_2,\dots, y_{N_d}]^T.
\end{equation}

For each datum there is an independent variable $x_i$, and our model for each datum $y_i$ is

\begin{equation}
M_i \equiv M(\pars;x_i) = \sum_{j=0}^{N_p-1} \para_j f_j(x_i).
\end{equation}

We can collect the basis functions evaluated at each independent variable $x_i$ in a matrix $\dmat$ of dimension $N_d \times N_p$

$$
\dmat = 
	\begin{bmatrix} 
        f_0(x_1) & \ldots & f_{N_p-1}(x_1) \\
        f_0(x_2) & \ldots & f_{N_p-1}(x_2) \\
        \vdots  & \ddots & \vdots \\
        f_0(x_{N_d}) & \ldots & f_{N_p-1}(x_{N_d})
    \end{bmatrix}
$$ (eq:LinearModels:design-matrix)

This matrix will be referred to as a **design matrix**. 

```{prf:example} The design matrix for polynomial models
:label: example:design-matrix-polynomial-models

The design matrix for a linear model with polynomial basis functions becomes

\begin{equation}
\dmat =
\begin{bmatrix} 
1 & x_{1} & x_{1}^2 & \dots & x_{1}^{p-1}\\
1 & x_{2} & x_{2}^2 & \dots & x_{2}^{p-1}\\
1 & x_{3} & x_{3}^2 & \dots & x_{3}^{p-1}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & x_{N_d} & x_{N_d}^2 & \dots & x_{N_d}^{p-1}
\end{bmatrix}, 
\end{equation}

where we consider a polynomial of degree $p-1$, which implies a model with $p$ features (including the intercept). This design matrix is known in linear algebra as a [Vandermonde matrix](https://en.wikipedia.org/wiki/Vandermonde_matrix).
```
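As an illustration, a polynomial design matrix can be constructed with NumPy's `np.vander`. This is a minimal sketch with arbitrarily chosen $x$ values; the `increasing=True` flag orders the columns as $1, x, x^2, \dots$, matching the convention $f_j(x) = x^j$ used above.

```python
import numpy as np

# Hypothetical independent-variable values; any 1D array works.
x = np.array([0.0, 0.5, 1.0, 1.5])

# Design matrix for a degree-2 polynomial (p = 3 features, including
# the intercept). Columns are ordered 1, x, x^2.
X = np.vander(x, N=3, increasing=True)

print(X.shape)  # (N_d, N_p) = (4, 3)
```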

Next, we introduce a column vector for the parameters 

\begin{equation}
\pars = [\para_0,\para_1, \para_2,\dots, \para_{N_p-1}]^T,
\end{equation}

and we arrive at the matrix equation

\begin{equation}
\data = \dmat \pars+\boldsymbol{\epsilon}.
\end{equation}

The last term $\boldsymbol{\epsilon}$ is a column vector of so-called **residuals**. It captures the part of the observed data that the linear model cannot describe. Formally, we can therefore write $\epsilon_i = y_i - M_i$ and define the vector as

\begin{equation}
\residuals = [\residual_1,\residual_2, \residual_3,\dots, \residual_{N_d}]^T.
\end{equation}

It is important to realize that our model $M$ provides an approximate description of the data. Indeed, *all models are wrong* and in a realistic setting we have no guarantee that the data is generated by a linear process. Of course, based on physics insight, or other assumptions, there might exist very good reasons for using a linear model to explain the data.
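To make the matrix equation concrete, here is a minimal sketch that simulates data from a linear model plus residuals. The true parameter values, the noise level, and the grid of $x$ values are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed "true" parameters of a quadratic model (illustrative values).
theta_true = np.array([1.0, -2.0, 0.5])

x = np.linspace(0.0, 2.0, 20)
X = np.vander(x, N=3, increasing=True)  # design matrix: 1, x, x^2

# Data = linear model + residuals (here modeled as Gaussian noise).
eps = rng.normal(scale=0.1, size=x.size)
y = X @ theta_true + eps
```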

## The normal equation

A regression analysis often aims at finding the model parameters $\pars$ of a model $M$ such that the vector of residuals $\boldsymbol{\epsilon}$ is minimized in the sense of its Euclidean norm (or 2-norm). You might ask the very relevant question of why this particular goal is desirable. We will return to this consideration in [](sec:BayesianLinearRegression). Nevertheless, in order to find the "optimal" set of parameters $\pars^*$ we seek to minimize

$$
C(\pars)\equiv \sum_{i=1}^{N_d} \epsilon_i^2 = \sum_{i=1}^{N_d}\left(y_i-M_i\right)^2 = \left\{\left(\data-\dmat \pars\right)^T\left(\data-\dmat \pars\right)\right\}.
$$ (eq:LinearRegression:cost-function)

The solution to this optimization problem turns out to satisfy the so-called normal equation, and the approach is known as ordinary least squares or ordinary linear regression.

````{prf:theorem} Ordinary least squares (the normal equation)
:label: theorem:LinearModels:normal-equation

The ordinary least-squares method corresponds to finding the optimal parameter vector $\pars^*$ that minimizes the Euclidean norm of the residual vector $\boldsymbol{\epsilon} = \data - \dmat \pars$, where $\data$ is a column vector of observations and $\dmat$ is the design matrix {eq}`eq:LinearModels:design-matrix`. 

Finding this optimum turns out to correspond to solving the **normal equation** 

$$
\dmat^T\data = \dmat^T\dmat\pars^*.  
$$ (eq:NormalEquation)

Given that the **normal matrix** $\dmat^T\dmat$ is invertible, the solution to the normal equation is given by 

$$
\pars^* =\left(\dmat^T\dmat\right)^{-1}\dmat^T\data.
$$ (eq:LinearModels:OLS_optimum)
````

````{prf:proof}
Due to its quadratic form, the squared Euclidean norm $\left\| \boldsymbol{\epsilon} \right\|_2^2 = \left(\data-\dmat\pars\right)^T\left(\data-\dmat\pars\right) \equiv C(\pars)$ is bounded from below and we just need to find the single extremum. That is, we need to solve the problem

\begin{equation}
\pars^* =
{\displaystyle \mathop{\mathrm{arg} \min}_{\pars\in
{\mathbb{R}}^{N_p}}} \left(\data-\dmat\pars\right)^T\left(\data-\dmat\pars\right).
\end{equation}

In practical terms, this means that we require

\begin{align}
\frac{\partial C(\pars)}{\partial \para_j} = \frac{\partial }{\partial \para_j} \Bigg[  \sum_{i=1}^{N_d}\Big(y_i &-\para_0 f_0(x_i)-\para_1f_1(x_i)-\para_2f_2(x_i)-\dots \\
&-  \para_{N_p-1}f_{N_p-1}(x_i)\Big)^2\Bigg] = 0, 
\end{align}

where $y_i$ and $f_j(x_i)$ are the elements of $\data$ and $\dmat$, respectively. Performing the derivative results in

$$
\begin{aligned}
\frac{\partial C(\pars)}{\partial \para_j} = -2\Bigg[ \sum_{i=1}^{N_d}f_j(x_i)\Big(y_i &-\para_0 f_0(x_i)-\para_1f_1(x_i)-\para_2f_2(x_i)-\dots \\
&-\para_{N_p-1}f_{N_p-1}(x_i)\Big)\Bigg]=0,
\end{aligned}
$$ (eq:LinearModels:gradient-elements)

which is one element of the full gradient vector. The gradient vector can be succinctly expressed in matrix-vector form as

$$
\boldsymbol{\nabla}_{\pars} C(\pars) = -2 \dmat^T\left( \data-\dmat\pars\right).  
$$ (eq:LinearRegression:gradient)

The minimum of $C$, where $\boldsymbol{\nabla}_{\pars} C(\pars) = 0$, then corresponds to 

$$
\dmat^T\data = \dmat^T\dmat\pars^*,  
$$

which is the normal equation. Finally, if the matrix $\dmat^T\dmat$ is invertible then we have the solution

$$
\pars^* =\left(\dmat^T\dmat\right)^{-1}\dmat^T\data.
$$
````
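The normal equation can be solved numerically without forming the explicit inverse. The sketch below uses synthetic data (arbitrary true parameters and noise level, chosen for illustration) and cross-checks the result against NumPy's SVD-based least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic example data: quadratic model plus Gaussian noise
# (all values are illustrative assumptions).
theta_true = np.array([1.0, -2.0, 0.5])
x = np.linspace(0.0, 2.0, 50)
X = np.vander(x, N=3, increasing=True)
y = X @ theta_true + rng.normal(scale=0.05, size=x.size)

# Solve the normal equation X^T y = X^T X theta*.
# np.linalg.solve is preferred over explicitly inverting X^T X.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against np.linalg.lstsq, which solves the same
# least-squares problem via an SVD-based routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_star, theta_lstsq))  # True
```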

We note also that since our design matrix is defined as $\dmat\in
{\mathbb{R}}^{N_d\times N_p}$, the product $\dmat^T\dmat \in
{\mathbb{R}}^{N_p\times N_p}$. The product $\left(\dmat^T\dmat\right)^{-1}\dmat^T$ is called the pseudo-inverse of the design matrix $\dmat$. The pseudo-inverse is a generalization of the usual matrix inverse. It can also be defined for non-square matrices that are not necessarily of full rank. For square matrices of full rank, the pseudo-inverse is equal to the usual inverse.
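The pseudo-inverse is available directly in NumPy as `np.linalg.pinv`, which computes it via a singular-value decomposition. A small sketch, assuming a full-rank "tall" design matrix, verifies that it agrees with the expression $\left(\dmat^T\dmat\right)^{-1}\dmat^T$:

```python
import numpy as np

# Full-rank design matrix for a degree-2 polynomial (illustrative values).
X = np.vander(np.linspace(0.0, 2.0, 6), N=3, increasing=True)

# Moore-Penrose pseudo-inverse via SVD; for a full-rank, tall X this
# equals (X^T X)^{-1} X^T.
X_pinv = np.linalg.pinv(X)
print(np.allclose(X_pinv, np.linalg.inv(X.T @ X) @ X.T))  # True
```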

::::{admonition} Checkpoint question
:class: my-checkpoint
Here we have been minimizing the sum of squared residuals, see Eq. {eq}`eq:LinearRegression:cost-function`. An optimization metric that is less dependent on the number of data is the **mean-squared error**. The cost function is then

$$
C_\mathrm{MSE}(\pars) \equiv \frac{1}{N_d} \left\{\left(\data-\dmat \pars\right)^T\left(\data-\dmat \pars\right)\right\}.
$$

How, if at all, would this choice of cost function modify:
- The optimum $\pars^*$?
- The expression for the gradient vector $\boldsymbol{\nabla}_{\pars} C_\mathrm{MSE} (\pars)$?
:::{admonition} Answers 
:class: dropdown, my-hint 
The only difference is a factor $1/N_d$, which does not affect the position of the minimum. The gradient, however, will inherit this factor from the cost function. 
:::
::::

The regression residuals $\boldsymbol{\epsilon}^{*} =  \data - \dmat \pars^{*}$ can be used to obtain an estimator $s^2$ of the variance of the residuals

$$
s^2 = \frac{(\boldsymbol{\epsilon}^*)^T\boldsymbol{\epsilon}^*}{N_d-N_p},
$$

where $N_p$ is the number of parameters in the model and $N_d$ is the number of data.
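A minimal sketch of this estimator, using synthetic straight-line data with a known noise standard deviation $\sigma = 0.3$ (an arbitrary choice for illustration), so that $s^2$ should land near $\sigma^2 = 0.09$:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Illustrative data: straight-line model with known noise level sigma.
sigma = 0.3
x = np.linspace(0.0, 1.0, 200)
X = np.vander(x, N=2, increasing=True)  # N_p = 2 (intercept + slope)
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=sigma, size=x.size)

# Ordinary least-squares fit via the normal equation.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
eps_star = y - X @ theta_star

# Estimator of the residual variance: divide by N_d - N_p.
N_d, N_p = X.shape
s2 = eps_star @ eps_star / (N_d - N_p)
```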
