Data Analysis and Machine Learning: Logistic Regression and Gradient Methods

In linear regression our main interest was centered on learning the coefficients of a functional fit (say a polynomial) in order to be able to predict the response of a continuous variable on some unseen data. The fit to the continuous variable $ y_i $ is based on some independent variables $ \boldsymbol{x}_i $. Linear regression resulted in analytical expressions for standard ordinary Least Squares or Ridge regression (in terms of matrices to invert) for several quantities, ranging from the variance and thereby the confidence intervals of the parameters $ \boldsymbol{\beta} $ to the mean squared error. If we can invert the product of the design matrices, linear regression gives then a simple recipe for fitting our data.

Logistic Regression and Classification Problems

Classification problems, however, are concerned with outcomes taking the form of discrete variables (i.e. categories). We may for example, on the basis of DNA sequencing for a number of patients, like to find out which mutations are important for a certain disease; or based on scans of various patients' brains, figure out if there is a tumor or not; or given a specific physical system, we'd like to identify its state, say whether it is an ordered or disordered system (typical situation in solid state physics); or classify the status of a patient, whether she/he has a stroke or not and many other similar situations.

The most common situation we encounter when we apply logistic regression is that of two possible outcomes, normally denoted as a binary outcome, true or false, positive or negative, success or failure etc.

Optimization and Deep learning

Logistic regression will also serve as our stepping stone towards neural network algorithms and supervised deep learning. For logistic regression, the minimization of the cost function leads to a non-linear equation in the parameters $ \boldsymbol{\beta} $. The optimization problem therefore calls for numerical minimization algorithms. This is the bottleneck of all machine learning algorithms, namely how to find reliable minima of a multi-variable function. This leads us to the family of gradient descent methods. The latter are the workhorses of basically all modern machine learning algorithms.

We note also that many of the topics discussed here on logistic regression are also commonly used in modern supervised Deep Learning models, as we will see later.

Basics

We consider the case where the dependent variables, also called the responses or the outcomes, $ y_i $ are discrete and only take values from $ k=0,\dots,K-1 $ (i.e. $ K $ classes).

The goal is to predict the output classes from the design matrix $ \boldsymbol{X}\in\mathbb{R}^{n\times p} $ made of $ n $ samples, each of which carries $ p $ features or predictors. The primary goal is to identify the classes to which new unseen samples belong.

Let us specialize to the case of two classes only, with outputs $ y_i=0 $ and $ y_i=1 $. Our outcomes could represent the status of a credit card user that could default or not on her/his credit card debt. That is $$ y_i = \begin{bmatrix} 0 & \mathrm{no}\\ 1 & \mathrm{yes} \end{bmatrix}. $$

Linear classifier

Before moving to the logistic model, let us try to use our linear regression model to classify these two outcomes. We could for example fit a linear model to the data and then assign the default case if the predicted value of $ y_i $ is larger than $ 0.5 $ and the no default case if it is smaller than or equal to $ 0.5 $.

We would then have our weighted linear combination, namely $$ \begin{equation} \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \label{_auto1} \end{equation} $$ where $ \boldsymbol{y} $ is a vector representing the possible outcomes, $ \boldsymbol{X} $ is our $ n\times p $ design matrix and $ \boldsymbol{\beta} $ represents the parameters we wish to estimate.
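As an illustration of this thresholding idea, the following minimal sketch fits ordinary least squares to binary labels and classifies with the $ 0.5 $ threshold. The synthetic data, the seed and all variable names are assumptions made for this example only. The printed range shows that the linear prediction is not confined to the interval $ [0,1] $.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical toy data: one feature, label 1 becomes more likely for large x
n = 200
x = rng.uniform(-3.0, 3.0, size=n)
y = (x + 0.5 * rng.normal(size=n) > 0.0).astype(float)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_tilde = X @ beta

# Threshold the continuous prediction at 0.5 to obtain a class label
y_pred = (y_tilde > 0.5).astype(float)
accuracy = np.mean(y_pred == y)

print("beta:", beta)
print("accuracy:", accuracy)
print("range of the linear prediction:", y_tilde.min(), y_tilde.max())
```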

Some selected properties

The main problem with our function is that it takes values on the entire real axis. In the case of logistic regression, however, the labels $ y_i $ are discrete variables. A typical example is the credit card data discussed earlier, where we can set the state of defaulting on the debt to $ y_i=1 $ and not defaulting to $ y_i=0 $ for one of the persons in the data set (see the full example below).

One simple way to get a discrete output is to apply a step function, similar to the sign function, that maps the output of a linear regressor to values in $ \{0,1\} $, that is $ f(s_i)=1 $ if $ s_i\ge 0 $ and $ 0 $ otherwise. We will encounter this model in our first demonstration of neural networks. Historically it is called the "perceptron" model in the machine learning literature. This model is extremely simple. However, in many cases it is more favorable to use a "soft" classifier that outputs the probability of a given category. This leads us to the logistic function.

The logistic function

The perceptron is an example of a "hard classification" model. We will encounter this model when we discuss neural networks as well. Each datapoint is deterministically assigned to a category (i.e. $ y_i=0 $ or $ y_i=1 $). In many cases, it is favorable to have a "soft" classifier that outputs the probability of a given category rather than a single value. For example, given $ x_i $, the classifier outputs the probability of being in a category $ k $. Logistic regression is the most common example of a so-called soft classifier. In logistic regression, the probability that a data point $ x_i $ belongs to a category $ y_i\in\{0,1\} $ is given by the so-called logistic function (also called the Sigmoid function; its inverse is the logit function), which is meant to represent the likelihood for a given event, $$ p(t) = \frac{1}{1+\exp{(-t)}}=\frac{\exp{(t)}}{1+\exp{(t)}}. $$ Note that $ 1-p(t)= p(-t) $.

Examples of likelihood functions used in logistic regression and neural networks

The following code plots the logistic function, the step function and other functions we will encounter from here on.

"""The sigmoid function (or the logistic curve) is a
function that takes any real number, z, and outputs a number in the interval (0,1).
It is useful in neural networks for assigning weights on a relative scale.
The value z is the weighted sum of parameters involved in the learning algorithm."""

import numpy
import matplotlib.pyplot as plt
import math as mt

z = numpy.arange(-5, 5, .1)
sigma_fn = numpy.vectorize(lambda z: 1/(1+numpy.exp(-z)))
sigma = sigma_fn(z)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, sigma)
ax.set_ylim([-0.1, 1.1])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('sigmoid function')

plt.show()

"""Step Function"""
z = numpy.arange(-5, 5, .02)
step_fn = numpy.vectorize(lambda z: 1.0 if z >= 0.0 else 0.0)
step = step_fn(z)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, step)
ax.set_ylim([-0.5, 1.5])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('step function')

plt.show()

"""tanh Function"""
z = numpy.arange(-2*mt.pi, 2*mt.pi, 0.1)
t = numpy.tanh(z)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, t)
ax.set_ylim([-1.0, 1.0])
ax.set_xlim([-2*mt.pi,2*mt.pi])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('tanh function')

plt.show()

Two parameters

We assume now that we have two classes with $ y_i $ either $ 0 $ or $ 1 $. Furthermore we assume also that we have only two parameters $ \beta $ in our fitting of the Sigmoid function, that is we define probabilities $$ \begin{align*} p(y_i=1|x_i,\boldsymbol{\beta}) &= \frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}},\nonumber\\ p(y_i=0|x_i,\boldsymbol{\beta}) &= 1 - p(y_i=1|x_i,\boldsymbol{\beta}), \end{align*} $$ where $ \boldsymbol{\beta} $ are the weights we wish to extract from data, in our case $ \beta_0 $ and $ \beta_1 $.

Note that we used $$ p(y_i=0\vert x_i, \boldsymbol{\beta}) = 1-p(y_i=1\vert x_i, \boldsymbol{\beta}). $$

Maximum likelihood

In order to define the total likelihood for all possible outcomes from a dataset $ \mathcal{D}=\{(y_i,x_i)\} $, with the binary labels $ y_i\in\{0,1\} $ and where the data points are drawn independently, we use the so-called Maximum Likelihood Estimation (MLE) principle. We aim thus at maximizing the probability of seeing the observed data. Since the data points are independent, we can write the likelihood as the product of the individual probabilities of a specific outcome $ y_i $, that is $$ P(\mathcal{D}|\boldsymbol{\beta}) = \prod_{i=1}^n \left[p(y_i=1|x_i,\boldsymbol{\beta})\right]^{y_i}\left[1-p(y_i=1|x_i,\boldsymbol{\beta})\right]^{1-y_i}, $$ from which we obtain the log-likelihood $$ \log{P(\mathcal{D}|\boldsymbol{\beta})} = \sum_{i=1}^n \left( y_i\log{p(y_i=1|x_i,\boldsymbol{\beta})} + (1-y_i)\log\left[1-p(y_i=1|x_i,\boldsymbol{\beta})\right]\right). $$

The cost function rewritten

Reordering the logarithms, we can rewrite the log-likelihood as $$ \log{P(\mathcal{D}|\boldsymbol{\beta})} = \sum_{i=1}^n \left(y_i(\beta_0+\beta_1x_i) -\log{(1+\exp{(\beta_0+\beta_1x_i)})}\right). $$
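The reordering can be spelled out in two steps. With the shorthands $ t_i=\beta_0+\beta_1x_i $ and $ p_i = p(y_i=1|x_i,\boldsymbol{\beta}) $ (both introduced here only for brevity), the definition of the Sigmoid function gives $$ \log{p_i} = t_i-\log{(1+\exp{(t_i)})}, \qquad \log{(1-p_i)} = -\log{(1+\exp{(t_i)})}, $$ so that each term in the sum becomes $$ y_i\log{p_i}+(1-y_i)\log{(1-p_i)} = y_it_i-\log{(1+\exp{(t_i)})}. $$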

The maximum likelihood estimator is defined as the set of parameters that maximize the log-likelihood where we maximize with respect to $ \beta $. Since the cost (error) function is just the negative log-likelihood, for logistic regression we have that $$ \mathcal{C}(\boldsymbol{\beta})=-\sum_{i=1}^n \left(y_i(\beta_0+\beta_1x_i) -\log{(1+\exp{(\beta_0+\beta_1x_i)})}\right). $$ This equation is known in statistics as the cross entropy. Finally, we note that just as in linear regression, in practice we often supplement the cross-entropy with additional regularization terms, usually $ L_1 $ and $ L_2 $ regularization as we did for Ridge and Lasso regression.
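A small numerical sketch of the cross entropy for the two-parameter model is given below. It evaluates the cost both in the reordered form and directly from the probabilities, and the two numbers should agree. The function names and the toy data are assumptions made for this example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(beta, x, y):
    """Negative log-likelihood in the reordered form."""
    t = beta[0] + beta[1] * x
    # log(1 + exp(t)) evaluated in a numerically stable way
    return -np.sum(y * t - np.logaddexp(0.0, t))

def cross_entropy_from_probabilities(beta, x, y):
    """Negative log-likelihood written in terms of p(y_i=1|x_i, beta)."""
    p = sigmoid(beta[0] + beta[1] * x)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Hypothetical toy data
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
beta = np.array([0.1, 0.8])

c1 = cross_entropy(beta, x, y)
c2 = cross_entropy_from_probabilities(beta, x, y)
print(c1, c2)
```

For $ \boldsymbol{\beta}=0 $ all probabilities equal $ 1/2 $ and the cost reduces to $ n\log{2} $, which is a convenient sanity check.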

Minimizing the cross entropy

The cross entropy is a convex function of the weights $ \boldsymbol{\beta} $ and, therefore, any local minimizer is a global minimizer.

Minimizing this cost function with respect to the two parameters $ \beta_0 $ and $ \beta_1 $ we obtain $$ \frac{\partial \mathcal{C}(\boldsymbol{\beta})}{\partial \beta_0} = -\sum_{i=1}^n \left(y_i -\frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}}\right), $$ and $$ \frac{\partial \mathcal{C}(\boldsymbol{\beta})}{\partial \beta_1} = -\sum_{i=1}^n \left(y_ix_i -x_i\frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}}\right). $$

A more compact expression

Let us now define a vector $ \boldsymbol{y} $ with $ n $ elements $ y_i $, an $ n\times p $ matrix $ \boldsymbol{X} $ which contains the $ x_i $ values (including a column of ones for the intercept $ \beta_0 $) and a vector $ \boldsymbol{p} $ of fitted probabilities $ p(y_i=1\vert x_i,\boldsymbol{\beta}) $. We can rewrite in a more compact form the first derivative of the cost function as $$ \frac{\partial \mathcal{C}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{p}\right). $$

If we in addition define a diagonal matrix $ \boldsymbol{W} $ with elements $ p(y_i=1\vert x_i,\boldsymbol{\beta})\left(1-p(y_i=1\vert x_i,\boldsymbol{\beta})\right) $, we can obtain a compact expression of the second derivative as $$ \frac{\partial^2 \mathcal{C}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^T} = \boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}. $$
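The compact expressions translate almost literally into NumPy. The sketch below (function names, seed and random data are assumptions for this example) checks the analytical gradient against central finite differences of the cost and prints the eigenvalues of the Hessian, which should be positive when $ \boldsymbol{X} $ has full column rank.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(beta, X, y):
    t = X @ beta
    return -np.sum(y * t - np.logaddexp(0.0, t))

def gradient(beta, X, y):
    p = sigmoid(X @ beta)
    return -X.T @ (y - p)

def hessian(beta, X):
    p = sigmoid(X @ beta)
    W = np.diag(p * (1.0 - p))
    return X.T @ W @ X

# Hypothetical data: intercept column plus one random feature
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.integers(0, 2, size=n).astype(float)
beta = np.array([0.3, -0.7])

# Central finite differences of the cost as an independent check
h = 1e-6
numerical = np.array([
    (cost(beta + h * e, X, y) - cost(beta - h * e, X, y)) / (2.0 * h)
    for e in np.eye(2)
])
analytic = gradient(beta, X, y)
H = hessian(beta, X)

print(analytic, numerical)
print(np.linalg.eigvalsh(H))
```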

Extending to more predictors

Within a binary classification problem, we can easily expand our model to include multiple predictors. The log-odds, that is the logarithm of the ratio between the probabilities of the two outcomes, is then with $ p $ predictors $$ \log{ \frac{p(\boldsymbol{\beta}\boldsymbol{x})}{1-p(\boldsymbol{\beta}\boldsymbol{x})}} = \beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_px_p. $$ Here we defined $ \boldsymbol{x}=[1,x_1,x_2,\dots,x_p] $ and $ \boldsymbol{\beta}=[\beta_0, \beta_1, \dots, \beta_p] $ leading to $$ p(\boldsymbol{\beta}\boldsymbol{x})=\frac{ \exp{(\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_px_p)}}{1+\exp{(\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_px_p)}}. $$

Including more classes

Till now we have mainly focused on two classes, the so-called binary system. Suppose we wish to extend to $ K $ classes. Let us for the sake of simplicity assume we have only one predictor $ x_1 $ and thereby two parameters per class. We have then the following model $$ \log{\frac{p(C=1\vert x)}{p(C=K\vert x)}} = \beta_{10}+\beta_{11}x_1, $$ $$ \log{\frac{p(C=2\vert x)}{p(C=K\vert x)}} = \beta_{20}+\beta_{21}x_1, $$ and so on till the class $ C=K-1 $ $$ \log{\frac{p(C=K-1\vert x)}{p(C=K\vert x)}} = \beta_{(K-1)0}+\beta_{(K-1)1}x_1, $$

and the model is specified in terms of $ K-1 $ so-called log-odds or logit transformations.

More classes

In our discussion of neural networks we will encounter the above again in terms of a slightly modified function, the so-called Softmax function.

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression), multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks. Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of $ K $ distinct linear functions, and the predicted probability for the $ k $-th class, with $ k=1,\dots,K-1 $, given a sample vector $ \boldsymbol{x} $ and a weighting vector $ \boldsymbol{\beta} $ is (with one predictor $ x_1 $ and two parameters per class): $$ p(C=k\vert \mathbf {x} )=\frac{\exp{(\beta_{k0}+\beta_{k1}x_1)}}{1+\sum_{l=1}^{K-1}\exp{(\beta_{l0}+\beta_{l1}x_1)}}. $$ It is easy to extend to more predictors. The final class is $$ p(C=K\vert \mathbf {x} )=\frac{1}{1+\sum_{l=1}^{K-1}\exp{(\beta_{l0}+\beta_{l1}x_1)}}, $$

and they sum to one. Our earlier discussions were all specialized to the case with two classes only. It is easy to see from the above that what we derived earlier is compatible with these equations.
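The following sketch evaluates the class probabilities above for a single predictor $ x_1 $, with class $ K $ as the reference class. The function name, the array layout of the parameters and the numerical values are assumptions for this example. It also checks that the case $ K=2 $ reproduces the Sigmoid function of the binary model.

```python
import numpy as np

def class_probabilities(x1, beta):
    """Probabilities for classes 1..K with class K as reference.

    beta has shape (K-1, 2); row k holds (beta_k0, beta_k1).
    """
    scores = beta[:, 0] + beta[:, 1] * x1          # K-1 linear functions
    denominator = 1.0 + np.sum(np.exp(scores))
    p_first = np.exp(scores) / denominator          # classes 1, ..., K-1
    p_last = 1.0 / denominator                      # reference class K
    return np.append(p_first, p_last)

# Hypothetical parameters for K = 3 classes
beta = np.array([[0.5, 1.0],
                 [-0.2, 0.3]])
probs = class_probabilities(0.7, beta)
print(probs, probs.sum())

# K = 2: the first probability is the Sigmoid of beta_0 + beta_1 * x
beta_binary = np.array([[0.1, 0.8]])
p_binary = class_probabilities(0.7, beta_binary)
t = 0.1 + 0.8 * 0.7
print(p_binary[0], 1.0 / (1.0 + np.exp(-t)))
```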

To find the optimal parameters we would typically use a gradient descent method. Newton's method and gradient descent methods are discussed in the material on optimization methods.

Preprocessing our data

We discuss here how to preprocess our data. Till now, in our previous examples, we have not met many cases where the results are sensitive to the scaling of our data. In general, however, the data may need a rescaling and/or may be sensitive to extreme values. Scaling the data renders our inputs much more suitable for the algorithms we want to employ.

Scikit-Learn has several functions which allow us to rescale the data, normally resulting in much better results in terms of various accuracy scores. The StandardScaler in Scikit-Learn ensures that for each feature/predictor we study the mean value is zero and the variance is one (every column in the design/feature matrix). This scaling has the drawback that it does not ensure that we have a particular maximum or minimum in our data set. Another function included in Scikit-Learn is the MinMaxScaler, which ensures that all features lie between $ 0 $ and $ 1 $.

More preprocessing

The Normalizer scales each data point such that the feature vector has a Euclidean length of one. In other words, it projects a data point onto the unit circle (or unit sphere in the case of higher dimensions). This means every data point is scaled by a different number (by the inverse of its length). This normalization is often used when only the direction (or angle) of the data matters, not the length of the feature vector.

The RobustScaler works similarly to the StandardScaler in that it ensures statistical properties for each feature that guarantee that they are on the same scale. However, the RobustScaler uses the median and quartiles, instead of mean and variance. This makes the RobustScaler ignore data points that are very different from the rest (like measurement errors). These odd data points are also called outliers, and might often lead to trouble for other scaling techniques.
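To make the four scalings concrete, here is a NumPy-only sketch of the arithmetic behind them. It is meant to mirror, as far as we are aware, what StandardScaler, MinMaxScaler, RobustScaler and Normalizer compute with their default settings, but it does not call Scikit-Learn. The small data matrix with one outlier is a made-up example.

```python
import numpy as np

# Hypothetical data: 5 samples, 2 features; the last sample is an outlier in feature 0
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [100.0, 50.0]])

# Standardization: zero mean and unit variance for every column
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: every column mapped to the interval [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Robust scaling: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(X, [25, 50, 75], axis=0)
X_robust = (X - median) / (q3 - q1)

# Normalization: every row (sample) scaled to unit Euclidean length
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_standard[:, 0])
print(X_robust[:, 0])
```

In the first column the outlier compresses the standardized values of the four ordinary samples into a narrow range, while the robust scaling keeps them spread out.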

Simple preprocessing examples, breast cancer data and classification

We show here how to apply logistic regression as a classification algorithm to the breast cancer data, first on the raw features and then on standardized features.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import  train_test_split 
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()

# Set up training data
X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(logreg.score(X_test,y_test)))

# Scale data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
logreg.fit(X_train_scaled, y_train)
print("Test set accuracy scaled data: {:.2f}".format(logreg.score(X_test_scaled,y_test)))

Covariance and Correlation

In addition to the plot of the features, we now also study the covariance (and the correlation matrix). We also use Pandas to compute the correlation matrix.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import  train_test_split 
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()
import pandas as pd
# Making a data frame
cancerpd = pd.DataFrame(cancer.data, columns=cancer.feature_names)

fig, axes = plt.subplots(15,2,figsize=(10,20))
malignant = cancer.data[cancer.target == 0]
benign = cancer.data[cancer.target == 1]
ax = axes.ravel()

for i in range(30):
    _, bins = np.histogram(cancer.data[:,i], bins =50)
    ax[i].hist(malignant[:,i], bins = bins, alpha = 0.5)
    ax[i].hist(benign[:,i], bins = bins, alpha = 0.5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["Malignant", "Benign"], loc ="best")
fig.tight_layout()
plt.show()

import seaborn as sns
sns.set(rc={'figure.figsize':(15.0,15.0)},font_scale=1)
correlation_matrix = cancerpd.corr().round(1)
# use the heatmap function from seaborn to plot the correlation matrix
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()

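# Print the eigenvalues of the correlation matrix.
# The matrix is symmetric, so eigh applies; eigenvalues come in ascending order.
EigValues, EigVectors = np.linalg.eigh(correlation_matrix.to_numpy())
print(EigValues)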

In the above example we note two things. In the first plot we display the overlap of benign and malignant tumors as functions of the various features in the Wisconsin breast cancer data set. We see that for some of the features we can distinguish clearly the benign and malignant cases while for other features we cannot. This indicates which features may be of greater interest when we wish to classify a tumor as benign or malignant.

In the second figure we have computed the so-called correlation matrix, which in our case with thirty features becomes a $ 30\times 30 $ matrix.

cancerpd = pd.DataFrame(cancer.data, columns=cancer.feature_names)

correlation_matrix = cancerpd.corr().round(1)

Diagonalizing this matrix we can in turn say something about which features are of relevance and which are not. This leads us to the classical Principal Component Analysis (PCA) theorem with applications. This topic is covered in the PCA material and additional topics on dimensionality reduction.

Optimization, the central part of any Machine Learning algorithm

Almost every problem in machine learning and data science starts with a dataset $ X $, a model $ g(\beta) $, which is a function of the parameters $ \beta $, and a cost function $ C(X, g(\beta)) $ that allows us to judge how well the model $ g(\beta) $ explains the observations $ X $. The model is fit by finding the values of $ \beta $ that minimize the cost function. Ideally we would be able to solve for $ \beta $ analytically. However, this is not possible in general and we must use some approximate/numerical method to compute the minimum.

Revisiting our Logistic Regression case

In our discussion on Logistic Regression we studied the case of two classes, with $ y_i $ either $ 0 $ or $ 1 $. Furthermore we assumed also that we have only two parameters $ \beta $ in our fitting, that is we defined probabilities $$ \begin{align*} p(y_i=1|x_i,\boldsymbol{\beta}) &= \frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}},\nonumber\\ p(y_i=0|x_i,\boldsymbol{\beta}) &= 1 - p(y_i=1|x_i,\boldsymbol{\beta}), \end{align*} $$ where $ \boldsymbol{\beta} $ are the weights we wish to extract from data, in our case $ \beta_0 $ and $ \beta_1 $.

The equations to solve

Our compact equations used a definition of a vector $ \boldsymbol{y} $ with $ n $ elements $ y_i $, an $ n\times p $ matrix $ \boldsymbol{X} $ which contains the $ x_i $ values and a vector $ \boldsymbol{p} $ of fitted probabilities $ p(y_i\vert x_i,\boldsymbol{\beta}) $. We rewrote in a more compact form the first derivative of the cost function as $$ \frac{\partial \mathcal{C}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{p}\right). $$

Solving using Newton-Raphson's method

If we can set up these equations, Newton-Raphson's iterative method is normally the method of choice. It requires however that we can compute in an efficient way the matrices that define the first and second derivatives.

Our iterative scheme is then given by $$ \boldsymbol{\beta}^{\mathrm{new}} = \boldsymbol{\beta}^{\mathrm{old}}-\left(\frac{\partial^2 \mathcal{C}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^T}\right)^{-1}_{\boldsymbol{\beta}^{\mathrm{old}}}\times \left(\frac{\partial \mathcal{C}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right)_{\boldsymbol{\beta}^{\mathrm{old}}}, $$ or in matrix form as $$ \boldsymbol{\beta}^{\mathrm{new}} = \boldsymbol{\beta}^{\mathrm{old}}-\left(\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X} \right)^{-1}\times \left(-\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{p}) \right)_{\boldsymbol{\beta}^{\mathrm{old}}}. $$ The right-hand side is computed with the old values of $ \beta $.

If we can compute these matrices, in particular the Hessian, the above is often the easiest method to implement.
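A minimal sketch of this Newton-Raphson iteration for logistic regression is shown below. The synthetic data, the starting point $ \boldsymbol{\beta}=0 $, the stopping criterion and all names are assumptions made for illustration. Instead of inverting $ \boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X} $ explicitly, the sketch solves the corresponding linear system, which gives the same step.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_logistic(X, y, n_iterations=25, tol=1e-10):
    """Newton-Raphson iterations for the cross-entropy cost."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        p = sigmoid(X @ beta)
        gradient = -X.T @ (y - p)
        # X^T W X with W = diag(p(1-p)), without forming W explicitly
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Hypothetical data drawn from a logistic model with known parameters
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.5, 2.0])
y = (rng.uniform(size=n) < sigmoid(X @ beta_true)).astype(float)

beta_hat = newton_logistic(X, y)
grad_at_solution = -X.T @ (y - sigmoid(X @ beta_hat))
print(beta_hat)
print(np.linalg.norm(grad_at_solution))
```

The norm of the gradient at the returned parameters should be close to zero, which is the condition for the minimum of the convex cost function.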

Brief reminder on Newton-Raphson's method

Perhaps the most celebrated of all one-dimensional root-finding routines is Newton's method, also called the Newton-Raphson method. This method requires the evaluation of both the function $ f $ and its derivative $ f' $ at arbitrary points. If you can only calculate the derivative numerically and/or your function is not of the smooth type, we normally discourage the use of this method.

The equations

The Newton-Raphson formula consists geometrically of extending the tangent line at a current point until it crosses zero, then setting the next guess to the abscissa of that zero-crossing. The mathematics behind this method is rather simple. Employing a Taylor expansion for $ x $ sufficiently close to the solution $ s $, we have $$ f(s)=0=f(x)+(s-x)f'(x)+\frac{(s-x)^2}{2}f''(x) +\dots. \label{eq:taylornr} $$

For small enough values of the function and for well-behaved functions, the terms beyond linear are unimportant, hence we obtain $$ f(x)+(s-x)f'(x)\approx 0, $$ yielding $$ s\approx x-\frac{f(x)}{f'(x)}. $$

Having in mind an iterative procedure, it is natural to start iterating with $$ x_{n+1}=x_n-\frac{f(x_n)}{f'(x_n)}. $$
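The iteration can be written in a few lines. The sketch below is a minimal version with an assumed tolerance and iteration limit, applied to the example $ f(x)=x^2-2 $ whose positive root is $ \sqrt{2} $. The function name and its arguments are chosen for this example only.

```python
def newton_raphson(f, df, x0, tol=1e-12, max_iterations=50):
    """Find a root of f starting from x0, using the derivative df."""
    x = x0
    for _ in range(max_iterations):
        dfx = df(x)
        if dfx == 0.0:
            raise ZeroDivisionError("derivative vanished, Newton step undefined")
        x_new = x - f(x) / dfx
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("no convergence within max_iterations")

# Example: root of f(x) = x^2 - 2, starting from x0 = 1
root = newton_raphson(lambda x: x**2 - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```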

Simple geometric interpretation

The above is Newton-Raphson's method. It has a simple geometric interpretation, namely $ x_{n+1} $ is the point where the tangent from $ (x_n,f(x_n)) $ crosses the $ x $-axis. Close to the solution, Newton-Raphson converges fast to the desired result. However, if we are far from a root, where the higher-order terms in the series are important, the Newton-Raphson formula can give grossly inaccurate results. For instance, the initial guess for the root might be so far from the true root as to let the search interval include a local maximum or minimum of the function. If an iteration places a trial guess near such a local extremum, so that the first derivative nearly vanishes, then Newton-Raphson may fail totally.

Extending to more than one variable

Newton's method can be generalized to systems of several non-linear equations and variables. Consider the case with two equations $$ \begin{array}{cc} f_1(x_1,x_2) &=0\\ f_2(x_1,x_2) &=0,\end{array} $$ which we Taylor expand to obtain $$ \begin{array}{cc} 0=f_1(x_1+h_1,x_2+h_2)=&f_1(x_1,x_2)+h_1 \partial f_1/\partial x_1+h_2 \partial f_1/\partial x_2+\dots\\ 0=f_2(x_1+h_1,x_2+h_2)=&f_2(x_1,x_2)+h_1 \partial f_2/\partial x_1+h_2 \partial f_2/\partial x_2+\dots \end{array}. $$ Defining the Jacobian matrix $ \boldsymbol{J} $ as $$ \boldsymbol{J}=\left( \begin{array}{cc} \partial f_1/\partial x_1 & \partial f_1/\partial x_2 \\ \partial f_2/\partial x_1 &\partial f_2/\partial x_2 \end{array} \right), $$ we can rephrase Newton's method as $$ \left(\begin{array}{c} x_1^{n+1} \\ x_2^{n+1} \end{array} \right)= \left(\begin{array}{c} x_1^{n} \\ x_2^{n} \end{array} \right)+ \left(\begin{array}{c} h_1^{n} \\ h_2^{n} \end{array} \right), $$ where we have defined $$ \left(\begin{array}{c} h_1^{n} \\ h_2^{n} \end{array} \right)= -\boldsymbol{J}^{-1} \left(\begin{array}{c} f_1(x_1^{n},x_2^{n}) \\ f_2(x_1^{n},x_2^{n}) \end{array} \right). $$ We thus need to compute the inverse of the Jacobian matrix, and it is easy to understand that difficulties may arise in case $ \boldsymbol{J} $ is nearly singular.

It is rather straightforward to extend the above scheme to systems of more than two non-linear equations. In our case, the functions $ f_i $ are the components of the gradient of the cost function, so the Jacobian matrix is given by the Hessian, which represents the second derivative of the cost function.
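The two-variable scheme can be sketched as follows. The example system ($ x_1^2+x_2^2=4 $ and $ x_1x_2=1 $), the starting point and the function names are assumptions for this illustration. The step $ \boldsymbol{h} $ is obtained by solving $ \boldsymbol{J}\boldsymbol{h}=-\boldsymbol{f} $ rather than by forming $ \boldsymbol{J}^{-1} $.

```python
import numpy as np

def newton_system(F, J, x0, tol=1e-12, max_iterations=50):
    """Newton's method for a system F(x) = 0 with Jacobian J(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iterations):
        h = np.linalg.solve(J(x), -F(x))   # Newton step
        x = x + h
        if np.linalg.norm(h) < tol:
            return x
    raise RuntimeError("no convergence within max_iterations")

# Example system: x1^2 + x2^2 = 4 and x1 * x2 = 1
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] * x[1] - 1.0])

def J(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [x[1], x[0]]])

solution = newton_system(F, J, x0=[2.0, 0.5])
print(solution, F(solution))
```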

Steepest descent

The basic idea of gradient descent is that a function $ F(\mathbf{x}) $, $ \mathbf{x} \equiv (x_1,\cdots,x_n) $, decreases fastest if one goes from $ \mathbf{x} $ in the direction of the negative gradient $ -\nabla F(\mathbf{x}) $.

It can be shown that if $$ \mathbf{x}_{k+1} = \mathbf{x}_k - \gamma_k \nabla F(\mathbf{x}_k), $$ with $ \gamma_k > 0 $ small enough, then $ F(\mathbf{x}_{k+1}) \leq F(\mathbf{x}_k) $. This means that for a sufficiently small $ \gamma_k $ we are always moving towards smaller function values, i.e. towards a minimum.
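A minimal sketch of steepest descent with a fixed step $ \gamma $ on a convex quadratic function $ F(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\boldsymbol{A}\mathbf{x}-\mathbf{b}^T\mathbf{x} $ is given below. The matrix, the vector, the step size and the number of iterations are assumptions chosen for this example. For this function the exact minimum solves $ \boldsymbol{A}\mathbf{x}=\mathbf{b} $, which gives us something to compare with.

```python
import numpy as np

# Hypothetical symmetric positive definite matrix and right-hand side
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def F(x):
    return 0.5 * x @ A @ x - b @ x

def grad_F(x):
    return A @ x - b

gamma = 0.1                    # fixed step, below 2 / (largest eigenvalue of A)
x = np.array([2.0, 2.0])
values = [F(x)]
for k in range(200):
    x = x - gamma * grad_F(x)  # move against the gradient
    values.append(F(x))

x_exact = np.linalg.solve(A, b)
print(x, x_exact)
```

The stored function values decrease monotonically, in line with the statement above for a small enough step.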

More on momentum based approaches

Let us try to get more intuition for gradient descent with momentum. It is helpful to consider a simple physical analogy with a particle of mass $ m $ moving in a viscous medium with drag coefficient $ \mu $ and potential $ E(\mathbf{w}) $. If we denote the particle's position by $ \mathbf{w} $, then its motion is described by $$ m {d^2 \mathbf{w} \over dt^2} + \mu {d \mathbf{w} \over dt }= -\nabla_w E(\mathbf{w}). $$

We can discretize this equation in the usual way to get $$ m { \mathbf{w}_{t+\Delta t}-2 \mathbf{w}_{t} +\mathbf{w}_{t-\Delta t} \over (\Delta t)^2}+\mu {\mathbf{w}_{t+\Delta t}- \mathbf{w}_{t} \over \Delta t} = -\nabla_w E(\mathbf{w}). $$

Rearranging this equation and defining $ \Delta \mathbf{w}_{t}=\mathbf{w}_{t}-\mathbf{w}_{t-\Delta t} $, we can rewrite this as $$ \Delta \mathbf{w}_{t +\Delta t}= - { (\Delta t)^2 \over m +\mu \Delta t} \nabla_w E(\mathbf{w})+ {m \over m +\mu \Delta t} \Delta \mathbf{w}_t. $$

Momentum parameter

Notice that this equation has the same form as the gradient descent update with momentum, $ \Delta \boldsymbol{\theta}_{t+1} = \gamma \Delta \boldsymbol{\theta}_{t} - \eta \nabla_\theta E(\boldsymbol{\theta}_t) $, if we identify the position of the particle, $ \mathbf{w} $, with the parameters $ \boldsymbol{\theta} $. This allows us to identify the momentum parameter and learning rate with the mass of the particle and the viscous drag as: $$ \gamma= {m \over m +\mu \Delta t }, \qquad \eta = {(\Delta t)^2 \over m +\mu \Delta t}. $$

Thus, as the name suggests, the momentum parameter is proportional to the mass of the particle and effectively provides inertia. Furthermore, in the large viscosity/small learning rate limit, our memory time scales as $ (1-\gamma)^{-1} \approx m/(\mu \Delta t) $.

Why is momentum useful? SGD with momentum helps the gradient descent algorithm gain speed in directions with persistent but small gradients even in the presence of stochasticity, while suppressing oscillations in high-curvature directions. This becomes especially important in situations where the landscape is shallow and flat in some directions and narrow and steep in others. It has been argued that first-order methods (with appropriate initial conditions) can perform comparably to more expensive second-order methods, especially in the context of complex deep learning models.
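The effect can be seen on a toy landscape that is shallow in one direction and steep in the other. The sketch below uses the common form of the momentum update, $ \mathbf{v}_t=\gamma\mathbf{v}_{t-1}+\eta\nabla_\theta E(\boldsymbol{\theta}_t) $ and $ \boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\mathbf{v}_t $. The quadratic energy, the curvatures, the values of $ \eta $ and $ \gamma $ and all names are assumptions made for this example. Setting $ \gamma=0 $ gives plain gradient descent.

```python
import numpy as np

# Hypothetical landscape: curvature 1 (shallow) and 100 (steep)
curvatures = np.array([1.0, 100.0])

def E(theta):
    return 0.5 * np.sum(curvatures * theta**2)

def grad_E(theta):
    return curvatures * theta

def gradient_descent(theta0, eta, gamma, n_steps):
    """Gradient descent with momentum; gamma = 0 is plain gradient descent."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = gamma * v + eta * grad_E(theta)
        theta = theta - v
    return theta

theta_plain = gradient_descent([1.0, 1.0], eta=0.01, gamma=0.0, n_steps=200)
theta_momentum = gradient_descent([1.0, 1.0], eta=0.01, gamma=0.9, n_steps=200)

print("plain:   ", theta_plain, E(theta_plain))
print("momentum:", theta_momentum, E(theta_momentum))
```

With the learning rate limited by the steep direction, plain gradient descent moves slowly along the shallow direction, while the run with momentum ends much closer to the minimum after the same number of steps.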

These beneficial properties of momentum can sometimes become even more pronounced by using a slight modification of the classical momentum algorithm called Nesterov Accelerated Gradient (NAG).

In the NAG algorithm, rather than calculating the gradient at the current parameters, $ \nabla_\theta E(\boldsymbol{\theta}_t) $, one calculates the gradient at the expected value of the parameters given our current momentum. Since the parameters are updated as $ \boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\mathbf{v}_t $, this look-ahead point is $ \boldsymbol{\theta}_t -\gamma \mathbf{v}_{t-1} $ and the gradient is $ \nabla_\theta E(\boldsymbol{\theta}_t -\gamma \mathbf{v}_{t-1}) $. This yields the NAG update rule $$ \begin{align} \mathbf{v}_{t}&=\gamma \mathbf{v}_{t-1}+\eta_{t}\nabla_\theta E(\boldsymbol{\theta}_t -\gamma \mathbf{v}_{t-1}) \nonumber \\ \boldsymbol{\theta}_{t+1}&= \boldsymbol{\theta}_t -\mathbf{v}_{t}. \label{_auto3} \end{align} $$

One of the major advantages of NAG is that it allows for the use of a larger learning rate than gradient descent with momentum (GDM) for the same choice of $ \gamma $.
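A minimal sketch of the NAG update on a toy quadratic landscape follows. With the sign convention $ \boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\mathbf{v}_t $ the look-ahead point used in the code is $ \boldsymbol{\theta}_t-\gamma\mathbf{v}_{t-1} $. The energy function, the hyperparameter values and the names are assumptions for this example, and the sketch only illustrates the update rule, not the claim about admissible learning rates.

```python
import numpy as np

# Hypothetical landscape: curvature 1 (shallow) and 100 (steep)
curvatures = np.array([1.0, 100.0])

def E(theta):
    return 0.5 * np.sum(curvatures * theta**2)

def grad_E(theta):
    return curvatures * theta

def nesterov(theta0, eta, gamma, n_steps):
    """Nesterov accelerated gradient with theta_{t+1} = theta_t - v_t."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        lookahead = theta - gamma * v          # expected next position
        v = gamma * v + eta * grad_E(lookahead)
        theta = theta - v
    return theta

theta_nag = nesterov([1.0, 1.0], eta=0.01, gamma=0.9, n_steps=200)
print(theta_nag, E(theta_nag))
```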

Second moment of the gradient

In stochastic gradient descent, with and without momentum, we still have to specify a schedule for tuning the learning rates $ \eta_t $ as a function of time. As discussed in the context of Newton's method, this presents a number of dilemmas. The learning rate is limited by the steepest direction which can change depending on the current position in the landscape. To circumvent this problem, ideally our algorithm would keep track of curvature and take large steps in shallow, flat directions and small steps in steep, narrow directions. Second-order methods accomplish this by calculating or approximating the Hessian and normalizing the learning rate by the curvature. However, this is very computationally expensive for extremely large models. Ideally, we would like to be able to adaptively change the step size to match the landscape without paying the steep computational price of calculating or approximating Hessians.

Recently, a number of methods have been introduced that accomplish this by tracking not only the gradient, but also the second moment of the gradient. These methods include AdaGrad, AdaDelta, RMS-Prop, and ADAM.

RMS prop

In RMS prop, in addition to the gradient itself, we also keep track of a running average of its second moment, denoted by $ \mathbf{s}_t=\mathbb{E}[\mathbf{g}_t^2] $. The update rule for RMS prop is given by $$ \begin{align} \mathbf{g}_t &= \nabla_\theta E(\boldsymbol{\theta}_t) \label{_auto4}\\ \mathbf{s}_t &=\beta \mathbf{s}_{t-1} +(1-\beta)\mathbf{g}_t^2 \nonumber \\ \boldsymbol{\theta}_{t+1}&=\boldsymbol{\theta}_t - \eta_t { \mathbf{g}_t \over \sqrt{\mathbf{s}_t +\epsilon}}, \nonumber \end{align} $$

where $ \beta $ controls the averaging time of the second moment and is typically taken to be about $ \beta=0.9 $, $ \eta_t $ is a learning rate typically chosen to be $ 10^{-3} $, and $ \epsilon\sim 10^{-8} $ is a small regularization constant to prevent divergences. Multiplication and division by vectors is understood as an element-wise operation. It is clear from this formula that the learning rate is reduced in directions where the norm of the gradient is consistently large. This greatly speeds up the convergence by allowing us to use a larger learning rate for flat directions.
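The RMS prop update can be sketched in a few lines, with all vector operations taken element-wise. The toy quadratic energy, the larger learning rate $ \eta=10^{-2} $ (chosen so that the short run gets close to the minimum), the number of steps and the names are assumptions for this example.

```python
import numpy as np

# Hypothetical landscape: curvature 1 (shallow) and 100 (steep)
curvatures = np.array([1.0, 100.0])

def grad_E(theta):
    return curvatures * theta

def rmsprop(theta0, eta=0.01, beta=0.9, epsilon=1e-8, n_steps=1000):
    theta = np.array(theta0, dtype=float)
    s = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_E(theta)
        s = beta * s + (1.0 - beta) * g**2          # running second moment
        theta = theta - eta * g / np.sqrt(s + epsilon)
    return theta

theta_rmsprop = rmsprop([1.0, 1.0])
print(theta_rmsprop)
```

Because the gradient is divided by the square root of its running second moment, both directions move with steps of roughly the same size $ \eta $, irrespective of the curvature. With a constant learning rate the iterates end up hovering near the minimum at a distance of the order of $ \eta $ or smaller.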

ADAM optimizer

A related algorithm is the ADAM optimizer. In ADAM, we keep a running average of both the first and second moment of the gradient and use this information to adaptively change the learning rate for different parameters. In addition to keeping a running average of the first and second moments of the gradient (i.e. $ \mathbf{m}_t=\mathbb{E}[\mathbf{g}_t] $ and $ \mathbf{s}_t=\mathbb{E}[\mathbf{g}^2_t] $, respectively), ADAM performs an additional bias correction to account for the fact that we are estimating the first two moments of the gradient using a running average (denoted by the hats in the update rule below). The update rule for ADAM is given by (where multiplication and division are once again understood to be element-wise operations below) $$ \begin{align} \mathbf{g}_t &= \nabla_\theta E(\boldsymbol{\theta}_t) \label{_auto5}\\ \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \mathbf{g}_t \nonumber \\ \mathbf{s}_t &=\beta_2 \mathbf{s}_{t-1} +(1-\beta_2)\mathbf{g}_t^2 \nonumber \\ \hat{\mathbf{m}}_t&={\mathbf{m}_t \over 1-\beta_1^t} \nonumber \\ \hat{\mathbf{s}}_t &={\mathbf{s}_t \over1-\beta_2^t} \nonumber \\ \boldsymbol{\theta}_{t+1}&=\boldsymbol{\theta}_t - \eta_t { \hat{\mathbf{m}}_t \over \sqrt{\hat{\mathbf{s}}_t} +\epsilon}, \nonumber \\ \label{_auto6} \end{align} $$

where $ \beta_1 $ and $ \beta_2 $ set the memory lifetime of the first and second moments and are typically taken to be $ 0.9 $ and $ 0.99 $ respectively, and $ \eta_t $ and $ \epsilon $ play the same role as in RMSprop.
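A minimal NumPy sketch of the ADAM update may help to see where the bias correction enters. The function name `adam`, the quadratic test function and the learning rate $ \eta=10^{-2} $ are assumptions made for illustration only; the values of $ \beta_1 $, $ \beta_2 $ and $ \epsilon $ are the ones quoted above.

```python
import numpy as np

def adam(grad, theta0, eta=1e-2, beta1=0.9, beta2=0.99, eps=1e-8, n_steps=2000):
    """Minimal ADAM loop following the update rule in the text."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)                 # running average of g
    s = np.zeros_like(theta)                 # running average of g**2
    for t in range(1, n_steps + 1):          # t starts at 1 so that 1 - beta**t > 0
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        s = beta2 * s + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)         # bias-corrected first moment
        s_hat = s / (1.0 - beta2**t)         # bias-corrected second moment
        theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)
    return theta

# Hypothetical test problem: E(theta) = 0.5*(theta_0**2 + 25*theta_1**2)
def grad_E(theta):
    return np.array([1.0, 25.0]) * theta

theta_adam = adam(grad_E, [2.0, 1.0])
print("Final parameters:", theta_adam)
```

Both running averages start at zero, so without the division by $ 1-\beta^t $ the first few estimates would be biased towards zero. The correction factor approaches one as $ t $ grows.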

Like in RMSprop, the effective step size of a parameter depends on the magnitude of its gradient squared. To understand this better, let us rewrite this expression in terms of the variance $ \boldsymbol{\sigma}_t^2 = \hat{\mathbf{s}}_t - (\hat{\mathbf{m}}_t)^2 $. Consider a single parameter $ \theta_t $. The update rule for this parameter is given by $$ \Delta \theta_{t+1}= -\eta_t { \hat{m}_t \over \sqrt{\sigma_t^2 + \hat{m}_t^2 }+\epsilon}. $$
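Two limiting cases, written here neglecting $ \epsilon $ (an assumption made only to keep the expressions short), make this update easier to read. With $ \hat{m}_t $ the bias-corrected first moment and $ \sigma_t^2 $ the variance defined above, we have $$ \Delta\theta_{t+1} \approx -\eta_t\,\mathrm{sign}(\hat{m}_t) \quad \mathrm{for}\quad \sigma_t \ll |\hat{m}_t|, \qquad \Delta\theta_{t+1} \approx -\eta_t {\hat{m}_t \over \sigma_t} \quad \mathrm{for}\quad \sigma_t \gg |\hat{m}_t|. $$ When the gradient fluctuates little, every parameter thus moves by roughly $ \eta_t $ per step regardless of the size of the gradient, while for strongly fluctuating gradients the step is scaled down by the signal-to-noise ratio $ \hat{m}_t/\sigma_t $.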

Practical tips

Automatic differentiation

Automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program.
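To make the idea of applying the chain rule to each elementary operation concrete, here is a small sketch of forward-mode AD based on so-called dual numbers. It is a toy illustration and not how any particular library is implemented; the class `Dual`, the helper `derivative` and the test function `h` are hypothetical names introduced here, and only addition, multiplication and the sine function are supported.

```python
import math

class Dual:
    """A value together with its derivative w.r.t. the input variable."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u*v)' = u'*v + u*v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def sin(x):
    # Chain rule for an elementary function: sin(u)' = cos(u)*u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

def derivative(func, x):
    """Evaluate func'(x) by seeding the input with derivative 1."""
    return func(Dual(x, 1.0)).deriv

def h(x):
    return sin(3.0 * x + x * x)

x0 = 0.7
ad_value = derivative(h, x0)
exact = math.cos(3.0 * x0 + x0**2) * (3.0 + 2.0 * x0)
print("AD:", ad_value, " analytic:", exact)
```

No finite-difference step is involved: every operation propagates an exact derivative alongside its value, which is why the result agrees with the analytical expression to working precision.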

Python has tools for so-called automatic differentiation. Consider the following example $$ f(x) = \sin\left(2\pi x + x^2\right) $$ which has the following derivative $$ f'(x) = \cos\left(2\pi x + x^2\right)\left(2\pi + 2x\right) $$ Using autograd we have

import autograd.numpy as np

# To do elementwise differentiation:
from autograd import elementwise_grad as egrad 

# To plot:
import matplotlib.pyplot as plt 


def f(x):
    return np.sin(2*np.pi*x + x**2)

def f_grad_analytic(x):
    return np.cos(2*np.pi*x + x**2)*(2*np.pi + 2*x)

# Do the comparison:
x = np.linspace(0,1,1000)

f_grad = egrad(f)

computed = f_grad(x)
analytic = f_grad_analytic(x)

plt.title('Derivative computed from Autograd compared with the analytical derivative')
plt.plot(x,computed,label='autograd')
plt.plot(x,analytic,label='analytic')

plt.xlabel('x')
plt.ylabel('y')
plt.legend()

plt.show()

print("The max absolute difference is: %g"%(np.max(np.abs(computed - analytic))))

Using autograd

Here we experiment with what kind of functions Autograd is capable of finding the gradient of. The following Python functions are just meant to illustrate what Autograd can do, but please feel free to experiment with other, possibly more complicated, functions as well.

import autograd.numpy as np
from autograd import grad

def f1(x):
    return x**3 + 1

f1_grad = grad(f1)

# Remember to send in float as argument to the computed gradient from Autograd!
a = 1.0

# See the evaluated gradient at a using autograd:
print("The gradient of f1 evaluated at a = %g using autograd is: %g"%(a,f1_grad(a)))

# Compare with the analytical derivative, that is f1'(x) = 3*x**2 
grad_analytical = 3*a**2
print("The gradient of f1 evaluated at a = %g by finding the analytic expression is: %g"%(a,grad_analytical))

Autograd with more complicated functions

To differentiate with respect to two (or more) arguments of a Python function, Autograd needs to know which variable the function is being differentiated with respect to.

import autograd.numpy as np
from autograd import grad
def f2(x1,x2):
    return 3*x1**3 + x2*(x1 - 5) + 1

# By sending the argument 0, Autograd will compute the derivative w.r.t the first variable, in this case x1
f2_grad_x1 = grad(f2,0)

# ... and differentiate w.r.t x2 by sending 1 as an additional argument to grad
f2_grad_x2 = grad(f2,1)

x1 = 1.0
x2 = 3.0 

print("Evaluating at x1 = %g, x2 = %g"%(x1,x2))
print("-"*30)

# Compare with the analytical derivatives:

# Derivative of f2 w.r.t x1 is: 9*x1**2 + x2:
f2_grad_x1_analytical = 9*x1**2 + x2

# Derivative of f2 w.r.t x2 is: x1 - 5:
f2_grad_x2_analytical = x1 - 5

# See the evaluated derivatives:
print("The derivative of f2 w.r.t x1: %g"%( f2_grad_x1(x1,x2) ))
print("The analytical derivative of f2 w.r.t x1: %g"%( f2_grad_x1_analytical ))

print()

print("The derivative of f2 w.r.t x2: %g"%( f2_grad_x2(x1,x2) ))
print("The analytical derivative of f2 w.r.t x2: %g"%( f2_grad_x2_analytical ))

Note that the grad function, as used above, does not produce the full gradient of the function; it returns only the derivative with respect to the selected argument. The full gradient of a function of two or more variables is a vector, where each element is the function differentiated w.r.t. one variable.
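As an independent check, the full gradient vector of f2 can be assembled from central finite differences, one component per argument. This sketch does not use Autograd; the helper `numerical_gradient` and the step size `h` are assumptions introduced for illustration.

```python
def f2(x1, x2):
    return 3*x1**3 + x2*(x1 - 5) + 1

def numerical_gradient(func, point, h=1e-6):
    """Central-difference estimate of the gradient of func at point."""
    grad_vec = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h       # shift only the i-th argument
        backward[i] -= h
        grad_vec.append((func(*forward) - func(*backward)) / (2*h))
    return grad_vec

x1, x2 = 1.0, 3.0
full_gradient = numerical_gradient(f2, [x1, x2])

# Analytical gradient from the text: (9*x1**2 + x2, x1 - 5)
analytical_gradient = [9*x1**2 + x2, x1 - 5]

print("Finite-difference gradient:", full_gradient)
print("Analytical gradient:       ", analytical_gradient)
```

Unlike automatic differentiation, this estimate carries a truncation error of order $ h^2 $, so the two results agree only approximately.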

More complicated functions using the elements of their arguments directly

import autograd.numpy as np
from autograd import grad
def f3(x): # Assumes x is an array of length 5 or higher
    return 2*x[0] + 3*x[1] + 5*x[2] + 7*x[3] + 11*x[4]**2

f3_grad = grad(f3)

x = np.linspace(0,4,5)

# Print the computed gradient:
print("The computed gradient of f3 is: ", f3_grad(x))

# The analytical gradient is: (2, 3, 5, 7, 22*x[4])
f3_grad_analytical = np.array([2, 3, 5, 7, 22*x[4]])

# Print the analytical gradient:
print("The analytical gradient of f3 is: ", f3_grad_analytical)

Note that in this case, when sending an array as input argument, the output from Autograd is another array. This is the true gradient of the function, as opposed to the output in the previous example. By using arrays to represent the variables, the output from Autograd might be easier to work with, as the output is closer to what one could expect from a gradient-evaluating function.

Functions using mathematical functions from Numpy

import autograd.numpy as np
from autograd import grad
def f4(x):
    return np.sqrt(1+x**2) + np.exp(x) + np.sin(2*np.pi*x)

f4_grad = grad(f4)

x = 2.7

# Print the computed derivative:
print("The computed derivative of f4 at x = %g is: %g"%(x,f4_grad(x)))

# The analytical derivative is: x/sqrt(1 + x**2) + exp(x) + cos(2*pi*x)*2*pi
f4_grad_analytical = x/np.sqrt(1 + x**2) + np.exp(x) + np.cos(2*np.pi*x)*2*np.pi

# Print the analytical gradient:
print("The analytical gradient of f4 at x = %g is: %g"%(x,f4_grad_analytical))

More autograd

import autograd.numpy as np
from autograd import grad
def f5(x):
    if x >= 0:
        return x**2
    else:
        return -3*x + 1

f5_grad = grad(f5)

x = 2.7

# Print the computed derivative:
print("The computed derivative of f5 at x = %g is: %g"%(x,f5_grad(x)))

And with loops

import autograd.numpy as np
from autograd import grad
def f6_for(x):
    val = 0
    for i in range(10):
        val = val + x**i
    return val

def f6_while(x):
    val = 0
    i = 0
    while i < 10:
        val = val + x**i
        i = i + 1
    return val

f6_for_grad = grad(f6_for)
f6_while_grad = grad(f6_while)

x = 0.5

# Print the computed derivatives of f6_for and f6_while
print("The computed derivative of f6_for at x = %g is: %g"%(x,f6_for_grad(x)))
print("The computed derivative of f6_while at x = %g is: %g"%(x,f6_while_grad(x)))

import autograd.numpy as np
from autograd import grad
# Both of the functions are implementation of the sum: sum(x**i) for i = 0, ..., 9
# The analytical derivative is: sum(i*x**(i-1)) 
f6_grad_analytical = 0
for i in range(10):
    f6_grad_analytical += i*x**(i-1)

print("The analytical derivative of f6 at x = %g is: %g"%(x,f6_grad_analytical))

Using recursion

import autograd.numpy as np
from autograd import grad

def f7(n): # Assume that n is an integer
    if n == 1 or n == 0:
        return 1
    else:
        return n*f7(n-1)

f7_grad = grad(f7)

n = 2.0

print("The computed derivative of f7 at n = %d is: %g"%(n,f7_grad(n)))

# The function f7 is an implementation of the factorial of n.
# By using the product rule, one can find that the derivative is:

f7_grad_analytical = 0
for i in range(int(n)-1):
    tmp = 1
    for k in range(int(n)-1):
        if k != i:
            tmp *= (n - k)
    f7_grad_analytical += tmp

print("The analytical derivative of f7 at n = %d is: %g"%(n,f7_grad_analytical))

Note that if n is equal to zero or one, Autograd will give an error message. This message appears when the output is independent of the input.

Unsupported functions

import autograd.numpy as np
from autograd import grad
def f8(x): # Assume x is an array
    x[2] = 3
    return x*2

f8_grad = grad(f8)

x = 8.4

print("The derivative of f8 is:",f8_grad(x))

Here, Autograd tells us that an 'ArrayBox' does not support item assignment. The item assignment is done when the program tries to assign x[2] to the value 3. However, Autograd has implemented the computation of the derivative such that this assignment is not possible.

The syntax a.dot(b) when finding the dot product

import autograd.numpy as np
from autograd import grad
def f9(a): # Assume a is an array with 2 elements
    b = np.array([1.0,2.0])
    return a.dot(b)

f9_grad = grad(f9)

x = np.array([1.0,0.0])

print("The derivative of f9 is:",f9_grad(x))

Here we are told that the 'dot' function does not belong to Autograd's version of a Numpy array. To overcome this, an alternative syntax which also computes the dot product can be used:

import autograd.numpy as np
from autograd import grad
def f9_alternative(x): # Assume x is an array with 2 elements
    b = np.array([1.0,2.0])
    return np.dot(x,b) # The same as x_1*b_1 + x_2*b_2

f9_alternative_grad = grad(f9_alternative)

x = np.array([3.0,0.0])

print("The gradient of f9 is:",f9_alternative_grad(x))

# The analytical gradient of the dot product of vectors x and b with two elements (x_1,x_2) and (b_1, b_2) respectively
# w.r.t x is (b_1, b_2).

Recommended to avoid

a += b
a -= b
a*= b
a /=b
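The usual alternative is to write these updates out of place, for example a = a + b instead of a += b. The plain NumPy sketch below (no Autograd involved) illustrates the difference between the two forms: the in-place operator overwrites the existing array, so every other reference to it changes as well, while the out-of-place form builds a new array. The variable names are ours.

```python
import numpy as np

b = np.array([0.5, 0.5])

# In place: the array that both names refer to is overwritten
a = np.array([1.0, 2.0])
alias = a
a += b
in_place_changed_alias = bool(np.all(alias == a))

# Out of place: a new array is created, the old one is left untouched
a = np.array([1.0, 2.0])
alias = a
a = a + b
out_of_place_changed_alias = bool(np.all(alias == a))

print(in_place_changed_alias, out_of_place_changed_alias)  # True False
```

Overwriting values that earlier steps of a computation have already used is what makes in-place updates awkward for a tool that records those steps in order to differentiate them.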

Standard steepest descent

Before we proceed, we would like to discuss the approach called standard steepest descent, which again requires that we are able to compute a matrix. It is closely related to the class of Conjugate Gradient (CG) methods.

The success of the CG method for finding solutions of non-linear problems is based on the theory of conjugate gradients for linear systems of equations. It belongs to the class of iterative methods for solving problems from linear algebra of the type $$ \begin{equation*} \boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}. \end{equation*} $$

In the iterative process we end up with a problem like $$ \begin{equation*} \boldsymbol{r}= \boldsymbol{b}-\boldsymbol{A}\boldsymbol{x}, \end{equation*} $$ where $ \boldsymbol{r} $ is the so-called residual or error in the iterative process.

Gradient method

The residual is zero when we reach the minimum of the quadratic equation $$ \begin{equation*} P(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x} - \boldsymbol{x}^T\boldsymbol{b}, \end{equation*} $$

with the constraint that the matrix $ \boldsymbol{A} $ is positive definite and symmetric. This defines also the Hessian and we want it to be positive definite.
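To see why the residual vanishes at the minimum, we can differentiate $ P $, using that $ \boldsymbol{A} $ is symmetric: $$ \begin{equation*} \nabla P(\boldsymbol{x}) = \boldsymbol{A}\boldsymbol{x}-\boldsymbol{b} = -\boldsymbol{r}, \qquad \nabla^2 P(\boldsymbol{x}) = \boldsymbol{A}. \end{equation*} $$ Setting the gradient to zero gives $ \boldsymbol{A}\boldsymbol{x}=\boldsymbol{b} $, and since the Hessian $ \boldsymbol{A} $ is positive definite this stationary point is the unique minimum.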

Steepest descent method

We denote the initial guess for $ \boldsymbol{x} $ as $ \boldsymbol{x}_0 $. We can assume without loss of generality that $$ \begin{equation*} \boldsymbol{x}_0=0, \end{equation*} $$ or consider the system $$ \begin{equation*} \boldsymbol{A}\boldsymbol{z} = \boldsymbol{b}-\boldsymbol{A}\boldsymbol{x}_0, \end{equation*} $$ instead.

Steepest descent method

One can show that the solution $ \boldsymbol{x} $ is also the unique minimizer of the quadratic form $$ \begin{equation*} f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x} - \boldsymbol{x}^T \boldsymbol{b} , \quad \boldsymbol{x}\in\mathbf{R}^n. \end{equation*} $$ This suggests taking the first basis vector $ \boldsymbol{r}_1 $ (see below for definition) to be the gradient of $ f $ at $ \boldsymbol{x}=\boldsymbol{x}_0 $, which equals $$ \begin{equation*} \boldsymbol{A}\boldsymbol{x}_0-\boldsymbol{b}, \end{equation*} $$ and for $ \boldsymbol{x}_0=0 $ it equals $ -\boldsymbol{b} $.

Final expressions

We can compute the residual iteratively as $$ \begin{equation*} \boldsymbol{r}_{k+1}=\boldsymbol{b}-\boldsymbol{A}\boldsymbol{x}_{k+1}, \end{equation*} $$ which equals $$ \begin{equation*} \boldsymbol{b}-\boldsymbol{A}(\boldsymbol{x}_k+\alpha_k\boldsymbol{r}_k), \end{equation*} $$ or $$ \begin{equation*} (\boldsymbol{b}-\boldsymbol{A}\boldsymbol{x}_k)-\alpha_k\boldsymbol{A}\boldsymbol{r}_k = \boldsymbol{r}_k-\alpha_k\boldsymbol{A}\boldsymbol{r}_k. \end{equation*} $$ Choosing $ \alpha_k $ to minimize $ f $ along the direction $ \boldsymbol{r}_k $ is equivalent to requiring that the new residual is orthogonal to the previous one, $ \boldsymbol{r}_k^T\boldsymbol{r}_{k+1}=0 $, which gives $$ \alpha_k = \frac{\boldsymbol{r}_k^T\boldsymbol{r}_k}{\boldsymbol{r}_k^T\boldsymbol{A}\boldsymbol{r}_k} $$ leading to the iterative scheme $$ \begin{equation*} \boldsymbol{x}_{k+1}=\boldsymbol{x}_k+\alpha_k\boldsymbol{r}_{k}. \end{equation*} $$

Code examples for steepest descent

Simple codes for steepest descent and conjugate gradient using a $ 2\times 2 $ matrix, in C++; Python code to come.

#include <cmath>
#include <iostream>
#include <fstream>
#include <iomanip>
#include "vectormatrixclass.h"
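using namespace  std;
// Forward declaration; the routine itself is listed below
Vector SteepestDescent(Matrix A, Vector b, Vector x0);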
//   Main function begins here
int main(int  argc, char * argv[]){
  int dim = 2;
  Vector x(dim),xsd(dim), b(dim),x0(dim);
  Matrix A(dim,dim);

  // Set our initial guess
  x0(0) = x0(1) = 0;
  // Set the matrix
  A(0,0) =  3;    A(1,0) =  2;   A(0,1) =  2;   A(1,1) =  6;
  b(0) = 2; b(1) = -8;
  cout << "The Matrix A that we are using: " << endl;
  A.Print();
  cout << endl;
  xsd = SteepestDescent(A,b,x0);
  cout << "The approximate solution using Steepest Descent is: " << endl;
  xsd.Print();
  cout << endl;
}

The routine for the steepest descent method

Vector SteepestDescent(Matrix A, Vector b, Vector x0){
  int IterMax, i;
  int dim = x0.Dimension();
  const double tolerance = 1.0e-14;
  Vector x(dim),r(dim),z(dim);  // r holds the gradient A*x-b, z holds A*r
  double c,alpha;
  IterMax = 30;
  x = x0;
  r = A*x-b;
  i = 0;
  while (i <= IterMax){
    z = A*r;
    c = dot(r,r);
    alpha = c/dot(r,z);
    x = x - alpha*r;
    r =  A*x-b;
    if(sqrt(dot(r,r)) < tolerance) break;
    i++;
  }
  return x;
}
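A Python counterpart of the C++ routine can be sketched with NumPy as follows. It uses the same $ 2\times 2 $ matrix and right-hand side as the C++ program, but the function name `steepest_descent`, the tolerance and the iteration cap are our own choices. Note also that we use the residual $ \boldsymbol{r}=\boldsymbol{b}-\boldsymbol{A}\boldsymbol{x} $ (the negative gradient), so the update carries a plus sign, whereas the C++ routine stores $ \boldsymbol{A}\boldsymbol{x}-\boldsymbol{b} $ and subtracts.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A by steepest descent."""
    x = np.array(x0, dtype=float)
    r = b - A @ x                      # residual = negative gradient
    for _ in range(max_iter):
        if np.sqrt(r @ r) < tol:       # stop when the residual is small
            break
        z = A @ r
        alpha = (r @ r) / (r @ z)      # exact line search along r
        x = x + alpha * r
        r = b - A @ x
    return x

# Same system as in the C++ example
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

x_sd = steepest_descent(A, b, np.zeros(2))
print("Steepest descent solution:", x_sd)
print("Direct solution:          ", np.linalg.solve(A, b))
```

For this system the exact solution is $ \boldsymbol{x}=(2,-2) $. Steepest descent approaches it geometrically, so a few tens of iterations are needed to reach the tolerance used here.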

Steepest descent example

import numpy as np
import numpy.linalg as la

import scipy.optimize as sopt

import matplotlib.pyplot as pt
from mpl_toolkits.mplot3d import axes3d

def f(x):
    return 0.5*x[0]**2 + 2.5*x[1]**2

def df(x):
    return np.array([x[0], 5*x[1]])

fig = pt.figure()
# fig.gca(projection="3d") is no longer accepted by recent matplotlib versions
ax = fig.add_subplot(projection="3d")

xmesh, ymesh = np.mgrid[-2:2:50j,-2:2:50j]
fmesh = f(np.array([xmesh, ymesh]))
ax.plot_surface(xmesh, ymesh, fmesh)

pt.axis("equal")
pt.contour(xmesh, ymesh, fmesh)
guesses = [np.array([2, 2./5])]

x = guesses[-1]
s = -df(x)

def f1d(alpha):
    return f(x + alpha*s)

alpha_opt = sopt.golden(f1d)
next_guess = x + alpha_opt * s
guesses.append(next_guess)
print(next_guess)

pt.axis("equal")
pt.contour(xmesh, ymesh, fmesh, 50)
it_array = np.array(guesses)
pt.plot(it_array.T[0], it_array.T[1], "x-")

Conjugate gradient method

In the CG method we define so-called conjugate directions and two vectors $ \boldsymbol{s} $ and $ \boldsymbol{t} $ are said to be conjugate if $$ \begin{equation*} \boldsymbol{s}^T\boldsymbol{A}\boldsymbol{t}= 0. \end{equation*} $$ The philosophy of the CG method is to perform searches in various conjugate directions of our vectors $ \boldsymbol{x}_i $ obeying the above criterion, namely $$ \begin{equation*} \boldsymbol{x}_i^T\boldsymbol{A}\boldsymbol{x}_j= 0 \quad \mathrm{for} \quad i\neq j. \end{equation*} $$ Two vectors are conjugate if they are orthogonal with respect to this inner product. Being conjugate is a symmetric relation: if $ \boldsymbol{s} $ is conjugate to $ \boldsymbol{t} $, then $ \boldsymbol{t} $ is conjugate to $ \boldsymbol{s} $.

Conjugate gradient method

Assume now that we have a symmetric positive-definite matrix $ \boldsymbol{A} $ of size $ n\times n $. At each iteration $ i+1 $ we obtain the conjugate direction of a vector $$ \begin{equation*} \boldsymbol{x}_{i+1}=\boldsymbol{x}_{i}+\alpha_i\boldsymbol{p}_{i}. \end{equation*} $$ We assume that $ \boldsymbol{p}_{i} $ is a sequence of $ n $ mutually conjugate directions. Then the $ \boldsymbol{p}_{i} $ form a basis of $ \mathbf{R}^n $ and we can expand the solution $ \boldsymbol{x} $ of $ \boldsymbol{A}\boldsymbol{x} = \boldsymbol{b} $ in this basis, namely $$ \begin{equation*} \boldsymbol{x} = \sum^{n}_{i=1} \alpha_i \boldsymbol{p}_i. \end{equation*} $$

Conjugate gradient method

The coefficients are given by $$ \begin{equation*} \mathbf{A}\mathbf{x} = \sum^{n}_{i=1} \alpha_i \mathbf{A} \mathbf{p}_i = \mathbf{b}. \end{equation*} $$ Multiplying with $ \boldsymbol{p}_k^T $ from the left gives $$ \begin{equation*} \boldsymbol{p}_k^T \boldsymbol{A}\boldsymbol{x} = \sum^{n}_{i=1} \alpha_i\boldsymbol{p}_k^T \boldsymbol{A}\boldsymbol{p}_i= \boldsymbol{p}_k^T \boldsymbol{b}, \end{equation*} $$ and we can define the coefficients $ \alpha_k $ as $$ \begin{equation*} \alpha_k = \frac{\boldsymbol{p}_k^T \boldsymbol{b}}{\boldsymbol{p}_k^T \boldsymbol{A} \boldsymbol{p}_k} \end{equation*} $$
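The expansion can be checked numerically on the small system used in the steepest descent example. In the sketch below the conjugate directions are generated by a Gram-Schmidt procedure in the inner product defined by $ \boldsymbol{A} $, applied to the standard basis. This construction and the name `conjugate_basis` are assumptions made for illustration; the text does not prescribe how the $ \boldsymbol{p}_i $ are obtained.

```python
import numpy as np

def conjugate_basis(A):
    """Build n mutually A-conjugate directions from the standard basis."""
    n = A.shape[0]
    directions = []
    for i in range(n):
        p = np.zeros(n)
        p[i] = 1.0
        for q in directions:
            # Remove the component of p along q in the A inner product
            p = p - (q @ A @ p) / (q @ A @ q) * q
        directions.append(p)
    return directions

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

P = conjugate_basis(A)

# Coefficients alpha_k = p_k^T b / (p_k^T A p_k)
alphas = [(p @ b) / (p @ A @ p) for p in P]
x = sum(alpha * p for alpha, p in zip(alphas, P))

print("p_1^T A p_2 =", P[0] @ A @ P[1])
print("Expanded solution:", x)
```

Because the directions are conjugate, each coefficient can be computed independently of the others, and the sum reproduces the solution of the linear system.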

Conjugate gradient method and iterations

If we choose the conjugate vectors $ \boldsymbol{p}_k $ carefully, then we may not need all of them to obtain a good approximation to the solution $ \boldsymbol{x} $. We want to regard the conjugate gradient method as an iterative method. This will allow us to solve systems where $ n $ is so large that the direct method would take too much time.

Conjugate gradient method

One can show that the solution $ \boldsymbol{x} $ is also the unique minimizer of the quadratic form $$ \begin{equation*} f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x} - \boldsymbol{x}^T \boldsymbol{b} , \quad \boldsymbol{x}\in\mathbf{R}^n. \end{equation*} $$ This suggests taking the first basis vector $ \boldsymbol{p}_1 $ to be the gradient of $ f $ at $ \boldsymbol{x}=\boldsymbol{x}_0 $, which equals $$ \begin{equation*} \boldsymbol{A}\boldsymbol{x}_0-\boldsymbol{b}, \end{equation*} $$ and for $ \boldsymbol{x}_0=0 $ it equals $ -\boldsymbol{b} $. The other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method.

Conjugate gradient method

Let $ \boldsymbol{r}_k $ be the residual at the $ k $-th step: $$ \begin{equation*} \boldsymbol{r}_k=\boldsymbol{b}-\boldsymbol{A}\boldsymbol{x}_k. \end{equation*} $$ Note that $ \boldsymbol{r}_k $ is the negative gradient of $ f $ at $ \boldsymbol{x}=\boldsymbol{x}_k $, so the gradient descent method would be to move in the direction $ \boldsymbol{r}_k $. Here, we insist that the directions $ \boldsymbol{p}_k $ are conjugate to each other, so we take the direction closest to the gradient $ \boldsymbol{r}_k $ under the conjugacy constraint. This gives the following expression $$ \begin{equation*} \boldsymbol{p}_{k+1}=\boldsymbol{r}_k-\frac{\boldsymbol{p}_k^T \boldsymbol{A}\boldsymbol{r}_k}{\boldsymbol{p}_k^T\boldsymbol{A}\boldsymbol{p}_k} \boldsymbol{p}_k. \end{equation*} $$

Conjugate gradient method

We can also compute the residual iteratively as $$ \begin{equation*} \boldsymbol{r}_{k+1}=\boldsymbol{b}-\boldsymbol{A}\boldsymbol{x}_{k+1}, \end{equation*} $$ which equals $$ \begin{equation*} \boldsymbol{b}-\boldsymbol{A}(\boldsymbol{x}_k+\alpha_k\boldsymbol{p}_k), \end{equation*} $$ or $$ \begin{equation*} (\boldsymbol{b}-\boldsymbol{A}\boldsymbol{x}_k)-\alpha_k\boldsymbol{A}\boldsymbol{p}_k, \end{equation*} $$ which gives $$ \begin{equation*} \boldsymbol{r}_{k+1}=\boldsymbol{r}_k-\alpha_k\boldsymbol{A}\boldsymbol{p}_{k}. \end{equation*} $$
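Putting the pieces together, the iteration can be sketched in NumPy as shown below. Two points are assumptions on our side. First, the step length is written as $ \alpha_k=\boldsymbol{p}_k^T\boldsymbol{r}_k/\boldsymbol{p}_k^T\boldsymbol{A}\boldsymbol{p}_k $, which coincides with $ \boldsymbol{p}_k^T\boldsymbol{b}/\boldsymbol{p}_k^T\boldsymbol{A}\boldsymbol{p}_k $ when $ \boldsymbol{x}_0=0 $ because $ \boldsymbol{x}_k $ is built from directions conjugate to $ \boldsymbol{p}_k $. Second, the first search direction is taken to be the initial residual. The function name `conjugate_gradient` and the tolerance are hypothetical.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-12):
    """Solve A x = b for symmetric positive definite A in at most n steps."""
    n = b.size
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x                       # initial residual
    p = r.copy()                        # first search direction
    for _ in range(n):
        if np.sqrt(r @ r) < tol:
            break
        Ap = A @ p
        pAp = p @ Ap
        alpha = (p @ r) / pAp           # step length along p
        x = x + alpha * p
        r = r - alpha * Ap              # iterative residual update
        # New direction: residual made conjugate to the previous direction
        # (p^T A r equals (A p)^T r since A is symmetric)
        p = r - (Ap @ r) / pAp * p
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

x_cg = conjugate_gradient(A, b)
print("Conjugate gradient solution:", x_cg)
```

For this $ 2\times 2 $ system two iterations suffice, up to rounding errors, in contrast to the many iterations steepest descent needs on the same problem.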

Broyden–Fletcher–Goldfarb–Shanno algorithm

The optimization problem is to minimize $ f(\mathbf {x} ) $ where $ \mathbf {x} $ is a vector in $ R^{n} $, and $ f $ is a differentiable scalar function. There are no constraints on the values that $ \mathbf {x} $ can take.

The algorithm begins at an initial estimate for the optimal value $ \mathbf {x}_{0} $ and proceeds iteratively to get a better estimate at each stage.

The search direction $ \mathbf {p}_{k} $ at stage $ k $ is given by the solution of the analogue of the Newton equation $$ B_{k}\mathbf {p} _{k}=-\nabla f(\mathbf {x}_{k}), $$

where $ B_{k} $ is an approximation to the Hessian matrix, which is updated iteratively at each stage, and $ \nabla f(\mathbf {x} _{k}) $ is the gradient of the function evaluated at $ \mathbf {x}_{k} $. A line search in the direction $ \mathbf {p}_{k} $ is then used to find the next point $ \mathbf {x}_{k+1} $ by minimising $$ f(\mathbf {x}_{k}+\alpha \mathbf {p}_{k}), $$ over the scalar $ \alpha > 0 $.
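The steps above can be sketched in NumPy as follows. Several ingredients are assumptions that go beyond what is stated in the text: the Hessian approximation starts from the identity, the line search is a simple backtracking (Armijo) search rather than an exact minimisation, and $ B_k $ is updated with the standard BFGS rank-two formula $ B_{k+1}=B_k+\mathbf{y}\mathbf{y}^T/(\mathbf{y}^T\mathbf{s})-B_k\mathbf{s}\mathbf{s}^TB_k/(\mathbf{s}^TB_k\mathbf{s}) $, with $ \mathbf{s}=\mathbf{x}_{k+1}-\mathbf{x}_k $ and $ \mathbf{y}=\nabla f(\mathbf{x}_{k+1})-\nabla f(\mathbf{x}_k) $. The test function is the quadratic from the steepest descent example.

```python
import numpy as np

def bfgs(f, grad_f, x0, tol=1e-8, max_iter=100):
    """Minimal BFGS sketch with a backtracking line search."""
    x = np.array(x0, dtype=float)
    B = np.eye(x.size)                       # initial Hessian approximation
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(B, -g)           # solve B_k p_k = -grad f(x_k)
        # Backtracking: halve alpha until a sufficient decrease is obtained
        alpha, c1 = 1.0, 1e-4
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p) and alpha > 1e-12:
            alpha *= 0.5
        s = alpha * p
        g_new = grad_f(x + s)
        y = g_new - g
        ys = y @ s
        # Update B only if the curvature y^T s is safely positive
        if ys > 1e-12 * np.linalg.norm(y) * np.linalg.norm(s):
            Bs = B @ s
            B = B + np.outer(y, y) / ys - np.outer(Bs, Bs) / (s @ Bs)
        x, g = x + s, g_new
    return x

def f(x):
    return 0.5*x[0]**2 + 2.5*x[1]**2

def df(x):
    return np.array([x[0], 5*x[1]])

x_min = bfgs(f, df, [2.0, 0.4])
print("BFGS minimum:", x_min)
```

Only gradients are evaluated; the second-derivative information is accumulated in $ B_k $ from the differences of successive gradients.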

Data Analysis and Machine Learning: Logistic Regression and Gradient Methods

Jun 26, 2020
