Foundation of Data Science

Xiaozhe Yao

2020-12-10

Preface


Linear Algebra

Scalar, Vectors, Matrix and Tensor

Operations

Products

We can define the inner product, also known as the dot product, on vectors.
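For concreteness, for vectors \(x,y\in\mathbb{R}^n\) the inner (dot) product is \[ \langle x,y\rangle=x\cdot y=x^Ty=\sum_{i=1}^nx_iy_i \]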

Norms

Determinant

Inversion

A matrix \(A\in\mathbb{R}^{n\times n}\) is called invertible if there is \(A^{-1}\in\mathbb{R}^{n\times n}\) such that \(AA^{-1}=A^{-1}A=I\).

Properties of Matrix

More concepts

Eigenvalues and Eigenvectors

Definition: \(v\in\mathbb{R}^n\) is an eigenvector of \(A\in\mathbb{R}^{n\times n}\) with eigenvalue \(\lambda\in\mathbb{R}\) if \(Av=\lambda v\).

If \(A\in\mathbb{R}^{n\times n}\) has eigenvalues \(\lambda_1,\cdots,\lambda_n\), then the determinant of \(A\) is \(\text{det}(A)=|A|=\prod \lambda_i\).
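As a quick numerical check of this identity (a sketch with NumPy; the matrix below is arbitrary and only for illustration):

```python
import numpy as np

# An arbitrary symmetric matrix for illustration.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# det(A) equals the product of the eigenvalues.
print(np.prod(eigenvalues))   # ~11.0
print(np.linalg.det(A))       # ~11.0

# Each column v of eigenvectors satisfies A v = lambda v.
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))  # True
```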

Steps to compute the eigenvalues and eigenvectors

Definite Matrix

A symmetric matrix \(A\) is:

We can characterise these properties using the eigenvalues:

We can similarly define negative semi-definite and negative definite.

Rayleigh Quotient

\[ R(M,v)=\frac{v^TMv}{v^Tv} \]

For PSD \(M\),

Singular Value Decomposition

We can decompose a matrix \(X\in\mathbb{R}^{N\times D}\) as \(X=U\Sigma V^T\).
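A minimal sketch of this decomposition with NumPy (full_matrices=False gives the thin SVD, which is enough to reconstruct \(X\)):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 3
X = rng.normal(size=(N, D))

# Thin SVD: U is N x D, S holds the D singular values, Vt is V^T (D x D).
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Check the reconstruction X = U Sigma V^T.
print(np.allclose(X, U @ np.diag(S) @ Vt))  # True
```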

Calculus

Finding Minimisation

Let \(f:\mathbb{R}^d\to\mathbb{R}\) be a function; then two kinds of extrema are defined:

In order to find the extrema, the general approach is to apply the first- and second-order derivative tests.

Finding the maximum is the same as minimising \(-f\), hence it is enough to focus on minimisation.

Functions of one variable \(f:\mathbb{R}\to\mathbb{R}\)

Continuous and Differentiable Functions

First Derivative Test

\(f'(x^\star)=0\) means that \(x^\star\) is a critical or stationary point for \(f\). It can be a local minimum, a local maximum or a saddle (minmax) point.

Second Derivative Test

Functions of multiple variable to a scalar \(f:\mathbb{R}^m\to\mathbb{R}\)

Partial Derivative and Gradient

Partial derivative of \(f(x_1,\cdots,x_m)\) in direction \(x_i\) at \(a=(a_1,\cdots,a_m)\): \[ \frac{\partial}{\partial x_i}f(a)=\lim_{h\to0}\frac{f(a_1,\cdots,a_i+h,\cdots,a_m)-f(a_1,\cdots,a_i,\cdots,a_m)}{h} \] Assume that \(f\) is differentiable everywhere, then the gradient is defined as: \[ \nabla_xf=[\frac{\partial f}{\partial x_1},\cdots,\frac{\partial f}{\partial x_m}]^T \] This means that \([\nabla_xf]_i=\frac{\partial f}{\partial x_i}\), and we know \(\nabla_xf\) points in direction of the steepest ascent. Hence, \(-\nabla_xf\) points in direction of steepest descent.

\(x^\star\) is a critical point if \(\nabla_xf(x^\star)=0\).

Hessian Matrix

The Hessian \(\nabla_x^2f\) is a matrix of second-order partial derivatives. \[ \nabla_x^2f=\begin{bmatrix} \frac{\partial^2f}{\partial x_1^2} & \cdots & \frac{\partial^2f}{\partial x_1\partial x_m}\\ \vdots & & \vdots \\ \frac{\partial^2f}{\partial x_m\partial x_1} & \cdots & \frac{\partial^2f}{\partial x_m^2} \end{bmatrix} \] which means \([\nabla_x^2f]_{ij}=\frac{\partial^2f}{\partial x_i\partial x_j}\).

If the partial derivatives are continuous, the order of differentiation does not matter and the Hessian matrix is symmetric.

Functions of multiple variable to a vector \(f:\mathbb{R}^m\to\mathbb{R}^n\)

\(f\) given as \(f=(f_1,\cdots,f_n)\) with \(f_i:\mathbb{R}^m\to\mathbb{R}\), then the Jacobian \(J\) of \(f\) is an \(n\times m\) matrix: \[ J_f=\begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_m} \\ \vdots & & \vdots \\ \frac{\partial f_n}{\partial x_1} & \cdots & \frac{\partial f_n}{\partial x_m} \end{bmatrix} \] which means \([J_f]_{i,j}=\frac{\partial f_i}{\partial x_j}\).

Useful Differentiation Rules

Chain Rule in Higher Dimensions

Let \(y=g(x)\) and \(z=f(y)\) for \(x\in\mathbb{R}^m\) and \(y\in\mathbb{R}^n\), then \[ \frac{\partial z}{\partial x_i}=\sum_{j\in[n]}\frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i} \] and \(\nabla_xz=J_g^T\nabla_yz=(\frac{\partial y}{\partial x})^T\nabla_yz\).

Optimization with conditions: Lagrange Multipliers

The constrained optimization problem: we want to maximize \(f(x)\) subject to \(g_i(x)=0\) for all \(i\in[n]\).

Probability Theory

Probability Space

The probability space consists of sample space \(S\) and a probability function \(p:P(S)\to[0,1]\) assigning a probability to every event.

It satisfies the axioms of probability:

With these axioms, we can infer the following properties:

Conditional Probability

Given events \(A,B\) with \(P(B)>0\), then the conditional probability of \(A\) given \(B\) is \[ P(A|B)=\frac{P(A\cap B)}{P(B)} \] which indicates that \(P(A\cap B)=P(A|B)P(B)=P(B|A)P(A)\).

Total Probability Law

Given a partition \(A_1,\cdots,A_n\) of \(S\) with \(P(A_i)>0\), then \[ P(B)=\sum_{i=1}^nP(B|A_i)P(A_i) \]

Bayes’ Rule

We have \[ P(A|B)=\frac{P(B|A)P(A)}{P(B)} \] in which we call \(P(A|B)\) posterior, \(P(B|A)\) the likelihood and \(P(A)\) the prior.

Chain Rule of Conditional Probability \[ P(X_1,\cdots,X_n)=P(X_1)\prod_{i=2}^nP(X_i|X_1,\cdots,X_{i-1}) \]

Random Variable

A random variable is understood as a measurable function defined on the sample space that maps from the sample space to some numeric domain (usually \(\mathbb{R}\)). We use \(P(X=x)\) to denote the probability of the event \(\{s\in S: X(s)=x\}\). We write \(X\sim P(x)\) to specify the probability distribution of \(X\).

Discrete Random Variables

Continuous Random Variables

Joint Probability Distributions

Expectation and Variance

Expectation

Variance

Variance captures how much values of a probability distribution vary on average when drawn at random. Formally, we have \[ \text{Var}(X)=\mathbb{E}[(X-\mathbb{E}[X])^2]=\mathbb{E}[X^2]-\mathbb{E}[X]^2 \] Some properties of variance include:

Standard Deviation

Standard deviation is the square root of the variance, i.e. \(\text{SD}(X)=\sqrt{\text{Var}(X)}\).

Covariance

Covariance generalises variance to two random variables: \[ \text{Cov}(X,Y)=\mathbb{E}[(X-\mathbb{E}(X))(Y-\mathbb{E}(Y))]=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y] \] The covariance matrix generalises covariance to multiple random variables: \[ \Sigma_{ij}=\text{Cov}(X_i,X_j) \]
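A small sketch of the empirical versions of these quantities (synthetic data; note that np.cov uses the unbiased \(\frac{1}{N-1}\) normalisation by default):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.5, size=1000)   # y is correlated with x

# Empirical covariance matrix: entry (i, j) estimates Cov(X_i, X_j).
Sigma = np.cov(np.stack([x, y]))   # rows are variables
print(Sigma[0, 0])                 # ~1.0, the variance of x
print(Sigma[0, 1])                 # ~2.0, the covariance of x and y
```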

Linear Regression

Linear Models

Suppose the input is a vector \(x\in\mathbb{R}^D\) and the output is \(y\in\mathbb{R}\), and we have the datapoints \((x_1,y_1),\cdots,(x_n,y_n)\). Then we define a linear model as \[ y=w_0+x_1w_1+\cdots+x_Dw_D+\epsilon \] where \(w_0\) is called the bias/intercept and \(\epsilon\) is the noise/uncertainty term.

Learning of Linear Models

Least Squares Objective Function

Assume we have \((x_1,y_1),\cdots,(x_N,y_N)\) where \(x_i,y_i\in\mathbb{R}\), and the predicted output is \(\hat{y}=w_0+x\cdot w\) (without the noise term). We define the least squares objective function as \[ \mathcal{L}(w)=\mathcal{L}(w_0,w)=\frac{1}{2N}\sum_{i=1}^{N}(\hat{y_i}-y_i)^2=\frac{1}{2N}\sum_{i=1}^{N}(w_0+x_i\cdot w-y_i)^2 \] This objective function is also called the loss function, cost function, energy function, etc. Up to the factor \(\frac{1}{2N}\), it is the residual sum of squares (RSS). The estimated parameters \((w_0,w)\) are known as the least squares estimate.

One Dimensional case

With the objective function defined, we have \[ \frac{\partial\mathcal{L}}{\partial w_0}=\frac{1}{N}\sum_{i=1}^N(w_0+wx_i-y_i) \\ \frac{\partial\mathcal{L}}{\partial w}=\frac{1}{N}\sum_{i=1}^N(w_0+wx_i-y_i)x_i \] We obtain the solution for \(w_0,w\) by setting the partial derivatives to \(0\) and solving the resulting system. \[ w_0+w\cdot\frac{\sum_ix_i}{N}=\frac{\sum_i y_i}{N} \\ w_0\cdot \frac{\sum_ix_i}{N}+w\frac{\sum_ix_i^2}{N}=\frac{\sum_ix_iy_i}{N} \] Since we know these four values:

Then we end up with \[ w=\frac{\widehat{\text{Cov}}(x,y)}{\widehat{\text{Var}}(x)} \\ w_0=\overline{y}-w\overline{x} \]

General Case

In the general case, we express everything in matrix notation. We have \(\hat{y}=Xw\), in which \(\hat{y}\in\mathbb{R}^{N\times 1}\), \(X\in\mathbb{R}^{N\times(D+1)}\) and \(w\in\mathbb{R}^{(D+1)\times 1}\). Then we define the loss function as \[ \mathcal{L}(w)=\frac{1}{2N}\sum_{i=1}^{N}(x_i^Tw-y_i)^2=\frac{1}{2N}(Xw-y)^T(Xw-y) \] Then we can calculate the gradient as \[ \nabla_w\mathcal{L}=\frac{1}{N}((X^TX)w-X^Ty) \] By letting the gradient \(\nabla_w\mathcal{L}=0\), we get \((X^TX)w=X^Ty\) and hence \(w=(X^TX)^{-1}X^Ty\).

The predictions made by this model on the data \(X\) is given by \[ \hat{y}=Xw=X(X^TX)^{-1}X^Ty \] where \(X(X^TX)^{-1}X^T\) is called the hat matrix.
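A minimal sketch of this closed-form solution (synthetic data; solving the normal equations with np.linalg.solve rather than forming the inverse explicitly, which is numerically safer):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # prepend the bias column
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=N)              # noisy linear data

# Solve (X^T X) w = X^T y for the least squares estimate.
w = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w                                               # predictions on the data
print(w)   # close to w_true
```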

Rank of the Matrix \(X^TX\)

The above derivation requires \(X^TX\) to be invertible. Formally, we have the rank defined as \[ \text{rank}(X^TX)=\text{rank}(X)\leq \text{min}\{D+1, N\} \] Then \(X^TX\) is invertible if \(\text{rank}(X)=D+1\).

If we use one-hot encoding, where the encoded columns sum to \(1\) (\(\sum_i X_i=1\) for every row), we introduce linear dependency among the columns of \(X\) and reduce the rank. In this case, we need to drop some features to restore full rank.

Complexity Analysis

There are several steps in calculating the weight matrix:

Hence the overall complexity is the sum of all these steps, i.e. \(O(D^2N+D^3+DN+D^2)=O(D^2N+D^3)\).

Joint of Several Relations

If \(X\) is defined by a join of several relations, then the number of rows \(N\) may be exponential in the number of relations, i.e. \(N=O(M^{\text{number of relations}})\).

If \(X\) is sparse, then it can be represented in \(O(M)\) space losslessly for acyclic joins, which are common in practice.

In this case, \(w\) can still be calculated in \(O(D^2M+D^3)\).

Expansion and Kernel Methods

We can perform basis expansions into higher dimensions. For example, we can expand the simple linear feature map \(\psi(x)=[1,x_1,x_2]^T\) into the quadratic form \(\psi(x)=[1,x_1,x_2,x_1^2,x_2^2,x_1x_2]^T\). By doing so, we are still fitting linear models, but with more, higher-dimensional features.

Using degree \(d\) polynomials in \(k\) dimensions results in \(O(k^d)\) features.

We can use kernels instead of explicit features. For some expansions \(\phi\), a kernel computes the dot product of the expanded features directly, i.e. \(\kappa(x',x):\mathcal{X}\times\mathcal{X}\to\mathbb{R}, \kappa(x',x)=\phi(x')\cdot\phi(x)\).

Some examples of kernels:

Polynomial Kernel

The polynomial kernel is defined as \[ \kappa_{\text{poly}}(x',x)=(x\cdot x'+\theta)^{d} \] We have the multinomial expansion \[ (z_1+\cdots+z_k)^d=\sum_{n_i\geq0, \sum_i n_i=d} \frac{d!}{n_1!\cdots n_k!}z_1^{n_1}\cdots z_k^{n_k} \] Let \(C(d;n_1,\cdots,n_k)\) denote the number of ways to distribute \(d\) balls into \(k\) bins, where the \(j\)-th bin holds \(n_j\geq 0\) balls. Substituting \(z_i=x_ix_i'\) into the expansion above gives \[ (x\cdot x')^d=\sum_{n_i\geq0, \sum_i n_i=d} \sqrt{C(d;n_1,\cdots,n_k)}x_1^{n_1}\cdots x_k^{n_k}\sqrt{C(d;n_1,\cdots,n_k)}(x_1')^{n_1}\cdots(x_k')^{n_k} \] where the number of terms in the sum is the dimension of \(\phi_{\text{poly}}\), \(\sqrt{C(d;n_1,\cdots,n_k)}x_1^{n_1}\cdots x_k^{n_k}\) is one entry of \(\phi_{\text{poly}}(x)\) and \(\sqrt{C(d;n_1,\cdots,n_k)}(x_1')^{n_1}\cdots(x_k')^{n_k}\) is the corresponding entry of \(\phi_{\text{poly}}(x')\).

The dimension of the vectors \(\phi_{\text{poly}}(x)\) and \(\phi_{\text{poly}}(x')\) is \(O(k^d)\).

The complexity of computing \(\kappa(x',x)\) is \(O(k^d)\) using \(\phi_{\text{poly}}\) vs. \(O(k\log d)\) using \((x\cdot x')^d\) directly.
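A small sketch contrasting the two ways of evaluating a degree-2 polynomial kernel with \(\theta=0\) and \(k=2\) (the explicit feature map below is written out by hand, purely for illustration):

```python
import numpy as np

def phi_poly2(x):
    # Explicit degree-2 feature map for k = 2:
    # multinomial coefficients C(2;2,0) = 1, C(2;0,2) = 1, C(2;1,1) = 2.
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

k_explicit = phi_poly2(x) @ phi_poly2(xp)   # via the O(k^d) feature vectors
k_direct = (x @ xp) ** 2                    # via the dot product, then a power
print(k_explicit, k_direct)                 # both equal (x . x')^2 = 1.0
```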

RBF Kernel

A radial basis function (RBF) kernel is defined as \[ \kappa_{\text{RBF}}(x',x)=\text{exp}(-\gamma||x-x'||^2)=\frac{1}{e^{\gamma||x-x'||^2}} \] The hyperparameters in RBF kernel are the width \(\gamma\) and the centres \(x'\).

RBF acts as a similarity measurement between \(x\) and \(x'\): \(\kappa(x',x)\in(0,1]\).

Since we have \[ ||x-x'||^2=\langle x-x',x-x'\rangle=\langle x,x\rangle+\langle x',x'\rangle-2\langle x,x'\rangle\\ =||x||^2+||x'||^2-2x\cdot x' \] and the Taylor expansion \[ \text{exp}(z)=\sum_{k=0}^\infty\frac{z^k}{k!} \] which gives \[ \text{exp}(x\cdot x')=\sum_{k=0}^{\infty}\frac{1}{k!}(x\cdot x')^k \] Without loss of generality, assume \(\gamma=1\); then we have \[ \text{exp}(-\gamma||x-x'||^2)=\sum_{k=0}^{\infty}(\sqrt{\frac{1}{k!}}\text{exp}(-||x||^2)\phi_\text{poly}(x))(\sqrt{\frac{1}{k!}}\text{exp}(-||x'||^2)\phi_\text{poly}(x')) \] in which \(\sqrt{\frac{1}{k!}}\text{exp}(-||x||^2)\phi_\text{poly}(x)\) is the \(k\)-th block of \(\phi_\text{RBF}(x)\) and \(\sqrt{\frac{1}{k!}}\text{exp}(-||x'||^2)\phi_\text{poly}(x')\) is the \(k\)-th block of \(\phi_\text{RBF}(x')\).

Note that \(\phi_\text{RBF}:\mathbb{R}^d\to\mathbb{R}^{\infty}\) projects vectors into an infinite dimensional space, and hence it is not feasible to compute \(\kappa_\text{RBF}(x',x)\) using \(\phi_\text{RBF}\).
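Since the feature map is infinite-dimensional, \(\kappa_\text{RBF}\) is always evaluated directly from its closed form; a minimal sketch:

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    # kappa_RBF(x', x) = exp(-gamma * ||x - x'||^2), a value in (0, 1].
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([1.0, 2.0])
xp = np.array([1.5, 1.0])
print(rbf_kernel(x, xp))            # close to 1 for nearby points
print(rbf_kernel(x, x))             # exactly 1 when x == x'
print(rbf_kernel(x, xp, gamma=10))  # larger gamma -> narrower kernel
```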

Linear Model with RBF Kernel

First we choose centres \(\mu_1,\cdots,\mu_M\), and then the feature map becomes \[ \psi_\text{RBF}(x)=[1,\kappa_\text{RBF}(\mu_1,x),\cdots,\kappa_\text{RBF}(\mu_M,x)] \] where we have \[ \kappa_\text{RBF}(x',x)=\text{exp}(-\gamma||x-x'||^2) \] Then the output from a linear model is \[ y=w_0+\sum_{i=1}^Mw_i\kappa_\text{RBF}(\mu_i,x)+\epsilon=w\cdot\psi(x)+\epsilon \] When choosing hyperparameters:

The choice of the width parameter is similar to the choice of degree in polynomial basis expansion.

Generalization of Machine Learning

The fundamental goal of machine learning is to generalize beyond the examples in the training set. Hence the test data is highly important, but the data alone is not enough.

Bias-Variance Tradeoff

Having high bias means that we are underfitting, while having high variance means we are overfitting.

To combat overfitting, or high variance, there are several approaches:

Avoiding overfitting may lead to high bias, i.e. underfitting.

Regularisation and Validation

Ridge Regression

Suppose we have data \(X\in\mathbb{R}^{N\times D}\), where \(D\gg N\). One idea to avoid overfitting is to add a penalty term for the weights. With a penalty term, our least squares objective defined as \[ \mathcal{L}(w)=(Xw-y)^T(Xw-y) \] changes to the ridge regression objective defined as \[ \mathcal{L}_\text{ridge}(w)=(Xw-y)^T(Xw-y)+\lambda\sum_{i=1}^Dw_i^2 \] By doing so, we add a penalty term for all weights but \(w_0\) to control model complexity. Varying \(w_0\) only translates the output and does not affect model complexity, hence we do not penalise \(w_0\).

In ridge regression, in addition to the residual sum of squares, we penalise the sum of squares of weights. This is also called \(\ell_2\)-regularization or weight-decay. Penalizing weights encourages fitting signals rather than just noise.

Weight from Ridge Regression

We first standardise the input data so that each feature has zero mean and unit variance. If we also centre the outputs, i.e. make their mean \(0\), then \(w_0\) becomes \(0\) as well. Now we only need to find the \(w\) that minimises the ridge regression objective, i.e. \[ \mathcal{L}_\text{ridge}(w)=(Xw-y)^T(Xw-y)+\lambda w^Tw \] The gradient of the objective with respect to \(w\) is \[ \nabla_w\mathcal{L}_\text{ridge}=2(X^TX)w-2X^Ty+2\lambda w\\ =2((X^TX+\lambda I_D)w-X^Ty) \] If \(y\) is not centred, the gradient of the objective with respect to the leading weight \(b\) is \[ \nabla_b\mathcal{L}_{\text{ridge}}=2Nb-2\sum_{i=1}^Ny_i \] Setting the gradient to \(0\) and solving for \(w\), we get \[ (X^TX+\lambda I_D)w=X^Ty \] hence, \[ w_\text{ridge}=(X^TX+\lambda I_D)^{-1}X^Ty \] \[ b_\text{ridge}=\frac{1}{N}\sum y_i \]
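A minimal sketch of this estimator (standardising \(X\), centring \(y\), then solving the regularised normal equations; lam plays the role of \(\lambda\) and the data is synthetic):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Standardise inputs and centre outputs, as assumed in the derivation above.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    b = y.mean()                         # b_ridge = mean of y
    D = X.shape[1]
    # Solve (X^T X + lambda I_D) w = X^T y.
    w = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ (y - b))
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))           # D >> N, where plain least squares overfits
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)
w, b = ridge_fit(X, y, lam=10.0)
print(w[:3], b)
```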

An alternative formulation of ridge regression is to view it as a constrained optimisation problem: \[ \text{Minimise}\:\:(Xw-y)^T(Xw-y)\:\:\text{subject to}\:\: w^Tw\leq R \] The former construction is the Lagrangian formulation of this approach.

Relationship between \(\lambda\) and \(R\)

With \(\lambda\) decreasing, the magnitudes of weights start increasing. The larger \(\lambda\), the smaller \(R\).

LASSO Regression

LASSO stands for Least Absolute Shrinkage and Selection Operator. The objective function of LASSO regression is defined as \[ \mathcal{L}_\text{Lasso}(w)=(Xw-y)^T(Xw-y)+\lambda\sum_{i=1}^D|w_i| \]

Hyper-Parameter Choice

For ridge regression or lasso or other regularizers, we need to choose \(\lambda\). To do so, we divide the data into training and validation set. Then we pick the value of \(\lambda\) that minimises the validation error.

If we perform basis expansion,

Grid search is the simplest and most common approach for selecting hyper-parameters. There are generally four steps:

\(K\)-Fold Cross Validation

When the data is scarce, we can divide the data into \(K\) folds (parts) instead of splitting it into a single training and validation set. Then we use \(K-1\) folds for training and \(1\) fold for validation, rotating the validation fold and averaging the validation error.
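A sketch of \(K\)-fold cross-validation used to pick \(\lambda\) for ridge regression (self-contained and illustrative; the split is a standard shuffled partition of the indices):

```python
import numpy as np

def ridge_fit(X, y, lam):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def kfold_cv_error(X, y, lam, K=5, seed=0):
    # Shuffle indices, split into K folds, train on K-1 folds, validate on the remaining one.
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)   # average validation error over the K folds

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
y = X[:, 0] + rng.normal(scale=0.3, size=60)

# Grid search: keep the lambda with the smallest cross-validation error.
best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda l: kfold_cv_error(X, y, l))
print(best_lam)
```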

Logistic Regression

Discriminative and Classification method

A discriminative model models the conditional distribution over the output \(y\) given the input \(x\) and the parameters \(w\): \[ P(y|w,x) \] In classification, the output \(y\) is categorical.

Logistic Regression

Logistic regression builds on a linear model composed with the sigmoid function; formally, we have \[ P(y|w,x)=\text{Bernoulli}(\text{sigmoid}(w\cdot x)) \] in which we use the convention \(x_0=1\), so we do not need to handle the bias term \(w_0\) separately. The sigmoid function \(\sigma\) is defined as \[ \sigma(t)=\frac{1}{1+e^{-t}}, \sigma:\mathbb{R}\to(0,1) \] and \(\sigma(t)\geq\frac{1}{2}\) if and only if \(t\geq 0\).

Suppose we have estimated the model parameters \(w\in\mathbb{R}^d\). For a new data point \(x_{new}\), the model gives us the probability \[ P(y_{new}=1|x_{new},w)=\sigma(w\cdot x_{new})=\frac{1}{1+\text{exp}(-x_{new}\cdot w)} \] In order to make a prediction, we can simply use a threshold at \(\frac{1}{2}\).
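A minimal sketch of prediction with an already-estimated weight vector (thresholding the probability at \(\frac{1}{2}\) is equivalent to thresholding \(w\cdot x\) at \(0\)); the weights and inputs here are made up for illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict(w, X_new, threshold=0.5):
    # Probability of class 1 for each row of X_new, then threshold at 1/2.
    probs = sigmoid(X_new @ w)
    return probs, (probs >= threshold).astype(int)

w = np.array([0.5, -1.0, 2.0])        # assumed to be already estimated
X_new = np.array([[1.0, 0.2, 0.3],    # first column is the constant x_0 = 1
                  [1.0, 2.0, -1.0]])
probs, labels = predict(w, X_new)
print(probs, labels)
```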

Multi-Class Logistic Regression

Consider now there are \(C>2\) classes and \(y\in\{1,\cdots,C\}\). Then

Summary

Feature Selection

In the feature selection phase, our goal is to select, from a set of possible features, the features that are relevant for the model to learn.

Premise in feature selection

Reasons for Feature Selection

Feature Selection Methods

Forward Stepwise Selection

For \(n\) features, we ask the following sequence of \(n\) questions:

The output is the best seen \(k\)-feature model for any \(1\leq k\leq n\).

Analysis: At each step \(i\), it trains and tests \(n-i+1\) new models, and hence \(O(n^2)\) models to train and test in total. For linear regression, it is the same complexity as building one model.

Feature Selection via Mutual Information

The mutual information for two random variables \(X\) and \(Y\) is defined as \[ I(X,Y)=\sum_x\sum_yP(X=x,Y=y)\text{log}\frac{P(X=x,Y=y)}{P(X=x)P(Y=y)} \]

Approach

Computational Considerations

The probabilities \(P(X=x), P(Y=y)\) and \(P(X=x,Y=y)\) can be empirically obtained from the training set.
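A sketch of the resulting empirical estimate for two discrete variables, with all probabilities read off from counts (this assumes both variables are already categorical):

```python
import numpy as np
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    p_x, p_y = Counter(xs), Counter(ys)
    p_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c_xy in p_xy.items():
        pxy = c_xy / n
        px, py = p_x[x] / n, p_y[y] / n
        mi += pxy * np.log(pxy / (px * py))   # natural log; use log2 for bits
    return mi

xs = [0, 0, 1, 1, 0, 1, 0, 1]
ys = [0, 0, 1, 1, 0, 1, 1, 0]              # mostly, but not perfectly, equal to xs
print(mutual_information(xs, ys))          # > 0: the variables are dependent
print(mutual_information(xs, [0] * 8))     # 0: a constant feature is uninformative
```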

Covariance vs. Correlation

Covariance for random variables \(X\) and \(Y\) measures how the random variables change jointly. It is defined as \[ \text{cov}(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] \] The (Pearson) correlation coefficient normalises the covariance to give a value between \(-1\) and \(+1\). It is defined as \[ \text{corr}(X,Y)=\frac{\text{cov}(X,Y)}{\sqrt{\text{cov}(X,X)\cdot\text{cov}(Y,Y)}} \] Note: independent variables are uncorrelated, but the converse is not always true.

Support Vector Machines

SVM is a popular discriminative model for classification.

Maximum Margin Principle

Background

Data is linearly separable if there exists a linear separator that classifies all points correctly.

For a data point, its distance to the separating hyperplane is called the margin. When picking the separating boundary, we wish to pick one that maximises the smallest margin (the least distance between the data and the boundary). That is, we want to maximise the distance of the closest point from the decision boundary. The points that are closest to the decision boundary are called support vectors.

Given a hyperplane \(H: w\cdot x+w_0=0\) and a point \(x\in\mathbb{R}^d\), the distance between \(x\) and \(H\) can be computed by \[ d=\frac{|w\cdot x+w_0|}{||w||_2} \]

Formulation as Optimisation Problem

Our goal as stated before can be formulated as

Find the distance \(||x-x^\star||_2\) between a point \(x^\star\) and the hyperplane \(w\cdot x+w_0=0\), where the distance is attained at the point \(x\) on the hyperplane that is closest to \(x^\star\).

Equivalently, we can formulate it as an optimisation problem. Formally, we want to find \(x\) that solves \[ \text{minimise}\:\: ||x-x^\star||_2^2 \:\: \text{subject to}\:\: w\cdot x+w_0=0 \] The Lagrangian can be formulated as \[ \Lambda(x,\lambda)=||x-x^\star||_2^2 -2\lambda(w\cdot x+w_0) \]

\[ =||x||_2^2-2(x^\star+\lambda w)\cdot x-2\lambda w_0+||x^\star||_2^2 \]

By setting the gradient of the Lagrangian to \(0\), we can find its critical point, i.e. \[ \nabla_x\Lambda(x,\lambda)=2x-2x^\star-2\lambda w=0 \] Then we have \[ x=x^\star+\lambda w \] By substituting \(x\) into the hyperplane equation, we have \[ w\cdot(x^\star+\lambda w)+w_0=0 \] hence \[ \lambda=-\frac{w\cdot x^\star+w_0}{w\cdot w}=-\frac{w\cdot x^\star+w_0}{||w||_2^2} \]

Linearly Separable Case

When the dataset \(D=\{(x_i,y_i)\}\) is linearly separable, we have \[ y_i(w\cdot x_i+w_0)>0 \] or equivalently, after rescaling by a suitable \(\epsilon>0\), \[ y_i(\frac{w}{\epsilon}\cdot x_i+\frac{w_0}{\epsilon})\geq 1 \]

Margin Maximisation

The margin of a data point \(x_i\) to the hyperplane is defined as \[ \frac{|wx_i+w_0|}{||w||_2}=\frac{y_i(wx_i+w_0)}{||w||_2}\geq \frac{1}{||w||_2} \] We want to pick a boundary that maximises the margin \[ \text{maximise}\:\: \frac{1}{||w||_2}\:\text{subject to}\: y_i(wx_i+w_0)\geq 1 \] Equivalently, we can minimise the squared \(\ell_2\) norm of \(w\) subject to the constraints. \[ \text{minimise}\:\: ||w||_2^2\:\text{subject to}\: y_i(wx_i+w_0)\geq 1 \]

Hence, our optimisation problem is convex quadratic and is solvable using generic convex optimisation methods.

Relaxation of the SVM Formulation

Sometimes the data is not linearly separable. In this case, we want to find a separator that makes the fewest mistakes on the training data. That is, we need to satisfy as many of the \(N\) constraints as possible. Alternatively, we allow each constraint to be satisfied up to some slack variable \(\xi_i\). Then the optimisation problem becomes \[ \text{minimise}\:\: ||w||_2^2+C\sum_{i=1}^N\xi_i \]

\[ \text{subject to}\: y_i(wx_i+w_0)\geq 1-\xi_i,\:\:\xi_i\geq 0 \]

With \(\xi_i\), a feasible solution always exists because we can make \(\xi_i\) arbitrarily large: the larger \(\xi_i\), the more the corresponding constraint can be violated.

With relaxation, the optimal solution must satisfy

Hinge Loss

The hinge loss is defined as \[ \ell_{\text{hinge}}:=\text{max}\{0, 1-y_i(wx_i+w_0)\} \] Our SVM formulation becomes equivalent to minimising the objective function \[ \ell_{\text{SVM}}(w,w_0|X,y)=C\sum_{i=1}^N\ell_{\text{hinge}}(w,w_0;x_i,y_i)+||w||_2^2 \]
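A small sketch of the hinge loss and the resulting SVM objective, vectorised over all \(N\) points (the data and the value of \(C\) are illustrative):

```python
import numpy as np

def svm_objective(w, w0, X, y, C=1.0):
    # Hinge loss per point: max(0, 1 - y_i (w . x_i + w_0)).
    margins = y * (X @ w + w0)
    hinge = np.maximum(0.0, 1.0 - margins)
    return C * hinge.sum() + w @ w      # add the ||w||_2^2 term

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, w0 = np.array([1.0, 0.5]), 0.0
print(svm_objective(w, w0, X, y))   # only the ||w||^2 term remains: all margins exceed 1
```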

Dual SVM Formulation

The Lagrange function of the SVM can be formulated as \[ \Lambda(w,w_0;\alpha)=||w||_2^2-\sum_{i=1}^N\alpha_i(y_i(wx_i+w_0)-1) \] in which \(w,w_0\) are the primal variables, \(\alpha_i\) are the dual variables and \((y_i(wx_i+w_0)-1)\) are the constraints.

The gradients w.r.t. the primal variables are \[ \frac{\partial \Lambda}{\partial w_0}=-\sum_{i=1}^N\alpha_iy_i \]

\[ \nabla_w\Lambda=w-\sum_{i=1}^N\alpha_iy_ix_i \]

Setting the gradients to zero, we get \[ \sum_{i=1}^N\alpha_iy_i=0,\:\: w=\sum_{i=1}^N\alpha_iy_ix_i \] For \(w_0\), we can calculate it from the constraints, and we will get \[ y_i^2((\sum_{j=1}^N\alpha_jy_jx_j)x_i+w_0)=y_i \] hence \[ w_0=y_i-\sum_{j=1}^N\alpha_jy_j(x_jx_i) \]

It is numerically more stable to take \(w_0\) as the average over all possible values \[ w_0=\frac{1}{N}\sum_{i=1}^N(y_i-\sum_{j=1}^N\alpha_jy_j(x_jx_i)) \] For \(\alpha\), we can plug the optimal \(w\) and the constraint \(\sum_{i=1}^N\alpha_iy_i=0\) into the Lagrangian, \[ g(\alpha)=\sum_{i=1}^N\alpha_i-\sum_{i=1}^N\sum_{j=1}^N\alpha_iy_i\alpha_jy_j(x_ix_j) \] To find the critical points of \(\Lambda\), it is sufficient to find the critical points of \(g\) that satisfy the constraints \[ \alpha_i\geq0\:\: \text{and}\:\: \sum_{i=1}^N\alpha_iy_i=0 \] Compared with the primal form, the dual form

Predictions with SVM Dual

Prediction on a new point \(x_\text{new}\) requires inner products with the support vectors \[ wx_\text{new}=\sum_{i=1}^N\alpha_iy_i(x_ix_\text{new}) \] We can also use black-box access to a function \(\kappa\) that maps two inputs \((x,x')\) to their (expanded) inner product. This is called a kernel function.

Naive Bayes

Abstractly, naive Bayes is a conditional model: given a problem instance to be classified, represented by a vector \(x=(x_1,\cdots,x_n)\) of \(n\) features (independent variables), it assigns to this instance the probabilities \[ P(C_k|x_1,\cdots,x_n) \] for each of \(K\) possible outcomes or classes \(C_k\).

The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes’ theorem, the conditional probability can be decomposed as \[ P(C_k|x)=\frac{P(C_k)P(x|C_k)}{P(x)} \] Note:

Maximum Likelihood

Model and Loss Function Choice

Optimization view of Machine Learning

Probabilistic View of Machine Learning

Linear regression can also be approached from a probabilistic perspective.

Maximum Likelihood

Suppose we are given observations \(x_1,\cdots,x_N\) drawn independently and identically from the same distribution \(P\), where \(P\) has a parameter \(\theta\).

The likelihood of observing \(x_1,\cdots,x_N\) is defined as the probability of making these observations assuming they are generated according to \(P\). Formally, the likelihood is defined as \(P(x_1,\cdots,x_N|\theta)\): the joint probability distribution of the observations given \(\theta\).

Maximum Likelihood Principle

We pick the parameter \(\theta\) that maximises the likelihood. Formally, we choose \[ \theta^\star=\text{argmax}_\theta P(x_1,\cdots,x_N|\theta) \] Since we know the observations are i.i.d., we have \[ P(x_1,\cdots,x_N|\theta)=\prod_{i=1}^NP(x_i|\theta) \]

MLE in Linear Regression

Recall that our linear model is \(y=wx+\epsilon\). Given \(x\) and \(w\), the output \(y\) can be regarded as a random variable with mean \(w^Tx\).

We specifically assume that given \(x,w\), then \(y\) is normal with mean \(w^Tx\) and variance \(\sigma^2\). Formally we have \[ P(y|w,x)=\mathcal{N}(w^Tx,\sigma^2)=w^Tx+\mathcal{N}(0,\sigma^2) \] Alternatively, we can view this model as \(\epsilon\sim\mathcal{N}(0,\sigma^2)\). This is called Gaussian noise.

This is a discriminative framework in which we do not model a distribution over \(X\) as the input is fixed.

Discriminative/Generative framework

Likelihood of Linear Regression

With the observed data \((X,y)\) made up of \((x_1,y_1),\cdots,(x_N,y_N)\), we want to know the likelihood of observing the data for model parameters \(w\) and \(\sigma^2\). To do so, there are two estimators available.

MLE Estimator

With the observed data \((x_1,y_1),\cdots,(x_N,y_N)\), the likelihood of observing the data for the parameters \(w,\sigma\) is given by \[ P(y|X,w,\sigma)=P(y_1,\cdots,y_N|x_1,\cdots,x_N,w,\sigma)=\prod_{i=1}^NP(y_i|x_i,w,\sigma) \] Since we assume the model is normal, we have \(y_i\sim\mathcal{N}(w^Tx_i,\sigma^2)\), we have \[ P(y_1,\cdots,y_N|x_1,\cdots,x_N,w,\sigma)=\prod_{i=1}^N\frac{1}{\sqrt{2\pi\sigma^2}}\text{exp}(-\frac{(y_i-w^Tx_i)^2}{2\sigma^2})\\ \] \[ =(\frac{1}{2\pi\sigma^2})^{N/2}\text{exp}(-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-w^Tx_i)^2)\\ \]

\[ =(\frac{1}{2\pi\sigma^2})^{N/2}\text{exp}(-\frac{1}{2\sigma^2}(Xw-y)^T(Xw-y)) \]

Then the negative log-likelihood becomes \[ \text{NLL}(y|X,w,\sigma)=-\text{LL}(y|X,w,\sigma)=\frac{N}{2}\text{log}(2\pi\sigma^2)+\frac{1}{2\sigma^2}(Xw-y)^T(Xw-y) \] Maximising the likelihood is equivalent to minimising the negative log-likelihood. As with the least squares estimator, we end up with \[ \hat{w}_{\text{ML}}=(X^TX)^{-1}X^Ty \] This result is the same as the least squares estimator.

Then for \(\sigma^2\), we have \[ \sigma^2_{\text{ML}}=\frac{1}{N}(Xw_{\text{ML}}-y)^T(Xw_\text{ML}-y) \] To make predictions, we have \[ \hat{y} = w_{\text{ML}}x \] and the predictive distribution \[ y\sim \hat{y}+\mathcal{N}(0,\sigma^2) \] from which confidence intervals can be derived.

Optimization

Most machine learning methods can be cast as optimisation problems. Beyond closed-form solutions, there are generally two approaches to solving such problems:

Convex

Convex Sets

A set \(C\subseteq\mathbb{R}^D\) is convex if for any \(x,y\in C\) and \(\lambda\in[0,1]\), we have \[ \lambda x+(1-\lambda)y\in C \] Some examples of convex sets include

Convex Functions

A function \(f:\mathbb{R}^D\to\mathbb{R}\) defined on a convex domain is convex if \[ \forall x,y\in\mathbb{R}^D, 0\leq\lambda\leq1 \] we have \[ f(\lambda x+(1-\lambda)y)\leq\lambda f(x)+(1-\lambda)f(y) \] Some examples of convex functions include

Convex Optimisation

Given convex functions \(f(x),g_1(x),\cdots,g_m(x)\) and affine functions \(h_1(x),\cdots,h_n(x)\), a convex optimisation problem is of the form: \[ \text{minimize}\: f(x)\:\: \text{subject to}\:\:g_i(x)\leq0\:\text{and}\:h_j(x)=0 \] where \(i\in\{1,\cdots,m\}\) and \(j\in\{1,\cdots,n\}\).

The goal is to find the optimal value of a convex optimisation problem: \[ V^\star=\text{min}\{f(x):g_i(x)\leq 0, i\in\{1,\cdots, m\}, h_j(x)=0, j\in\{1,\cdots,n\}\} \] Whenever \(f(x^\star)=V^\star\), then \(x^\star\) is a (not necessarily unique) optimal point.

Local Optima and Global Optima

\(x\) is locally optimal if

\(x\) is globally optimal if

Theorem: For a convex optimisation problem, all locally optimal points are globally optimal.

Convex Optimisation Problems

Linear Programming

\[ \text{minimize}\: c^Tx+d\: \text{subject to}\: Ax\leq e\: \text{and}\: Bx=f \]

There is no closed-form solution for this class of problems. However, efficient algorithms exist, both in theory and practice (for tens of thousands of variables).

Linear Model with Absolute Loss

Suppose we have data \((X,y)\) and that we want to minimise the objective \[ \mathcal{L}(w)=\sum_{i=1}^N|x_i^Tw-y_i| \] We want to transform this optimisation problem into a linear program. We introduce a \(\xi_i\) for each data point.

The linear program has \(D+N\) variables \(w_1,\cdots,w_D,\xi_1,\cdots,\xi_N\): \[ \text{minimize}\:\: \sum_{i=1}^N\xi_i \]

\[ \text{subject to}\:\: w^Tx_i-y_i\leq \xi_i\:\:\text{and}\:\: y_i-w^Tx_i\leq \xi_i \]

where \(i\in\{1,\cdots,N\}\).

The solution to this linear program gives \(w\) that minimises the objective \(\mathcal{L}\).
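A sketch of this reduction using scipy.optimize.linprog (synthetic data, purely illustrative; the variable vector stacks \(w\) and the slacks \(\xi\)):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.laplace(scale=0.1, size=N)

# Variables z = (w_1..w_D, xi_1..xi_N); minimise sum_i xi_i.
c = np.concatenate([np.zeros(D), np.ones(N)])
# Constraints  Xw - y <= xi  and  y - Xw <= xi, written as A_ub z <= b_ub.
A_ub = np.block([[X, -np.eye(N)], [-X, -np.eye(N)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * D + [(0, None)] * N   # w free, xi >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w = res.x[:D]
print(w)   # close to the coefficients used to generate y
```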

Maximise Likelihood

Computing the maximum likelihood estimate (for linear regression with Gaussian noise) is a convex quadratic optimisation problem with no constraints.

Minimising the Lasso Objective

For the Lasso objective, i.e. a linear model with \(\ell_1\)-regularisation, we have \[ \mathcal{L}(w)=\sum_{i=1}^N(w^Tx_i-y_i)^2+\lambda\sum_{i=1}^D|w_i| \]

\[ =w^TX^TXw-2y^TXw+y^Ty+\lambda\sum_{i=1}^D|w_i| \]

Quadratically constrained Quadratic Programming

\[ \text{minimize}\:\frac{1}{2}x^TBx+c^Tx+d \]

\[ \text{subject to}\:\frac{1}{2}x^TQ_ix+r_i^Tx+s_i\leq 0\: \text{and}\: Ax=b \]

where \(i\in\{1,\cdots,m\}\).

Semidefinite Programming

\[ \text{minimize}\: tr(CX)\:\text{subject to}\: tr(A_iX)=b_i \]

where \(X\) is positive semidefinite and \(i\in\{1,\cdots,m\}\).

Gradient Based Optimisation

Theorem of Gradient

If a function \(f\) is differentiable, the gradient of \(f\) at that point is either zero or perpendicular to the contour line of \(f\) at that point.

Background

Gradient Points in the Direction of Steepest Increase

Each component of the gradient says how fast the function changes w.r.t the standard basis \[ \frac{\partial f}{\partial x}(a)=\lim_{h\to0}\frac{f(a+h\textbf{i})-f(a)}{h} \] where \(\textbf{i}\) is the unit vector in the direction of \(x\), i.e. it captures information in which direction we move.

We can also consider the change w.r.t. the direction of some arbitrary vector \(\textbf{v}\). The derivative in the direction of \(\textbf{v}\) is called the directional derivative. \[ \nabla_vf(a)=\lim_{h\to0}\frac{f(a+h\textbf{v})-f(a)}{h}=\nabla f(a)\cdot\textbf{v} \] From a geometric perspective, we multiply \(||\nabla f(a)||\) by the scalar projection of \(\textbf{v}\): \(\nabla f(a)\cdot\textbf{v}=||\nabla f(a)||\cdot||v||\cos\theta\), where \(\theta\) is the angle between \(\nabla f(a)\) and \(\textbf{v}\). The maximal value is attained for \(\cos\theta=1\), i.e. \(\theta=0\), so the direction of steepest increase is that of \(\nabla f(a)\).

Hessian Matrix

Hessian matrix \(H\) is the matrix of all second-order partial derivatives of \(f\).

There are several cases regarding the eigenvalues of \(H\):

Gradient Descent

Gradient descent is one of the simplest, yet very general, optimisation algorithms for finding a local minimum of a differentiable function. It is iterative and produces a new vector \(w_{t+1}\) at each iteration \(t\): \[ w_{t+1}=w_t-\eta_tg_t=w_t-\eta_t\nabla f(w_t) \] where \(\eta_t>0\) is the learning rate or step size.

At each iteration, it moves in the direction of the steepest descent.

Gradient Descent for Least Squares Regression

Recall the objective function for least squares regression \[ \mathcal{L}(w)=(Xw-y)^T(Xw-y)=\sum_{i=1}^N(x_i^Tw-y_i)^2 \] We can compute the gradient of \(\mathcal{L}\) with respect to \(w\) as \[ \nabla_w\mathcal{L}=2(X^TXw-X^Ty) \] The complexity for gradient descent vs. closed-form solution for very large and wide datasets:
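As a concrete illustration of the update rule above, a minimal sketch of gradient descent on this objective with a constant step size (learning rate and iteration count are not tuned):

```python
import numpy as np

def gd_least_squares(X, y, eta=1e-3, iterations=500):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iterations):
        grad = 2 * (X.T @ (X @ w) - X.T @ y)   # gradient of (Xw - y)^T (Xw - y)
        w = w - eta * grad                     # step in the steepest-descent direction
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

w_gd = gd_least_squares(X, y)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(w_gd, w_closed)   # the two estimates should be close
```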

Learning Rate (Step Size) Choice

Choosing a good learning rate is key and we may want a time-varying step size

When choosing the learning rate, we have the following options:

Test for Convergence

Stochastic Gradient Descent (SGD)

We minimise the objective function over data points \((x_1,y_1),\cdots,(x_N,y_N)\), and our objective function is defined as \[ \mathcal{L}(w)=\frac{1}{N}\sum_{i=1}^N\ell(w;x_i,y_i)+\lambda\mathcal{R}(w) \] where \(\lambda\mathcal{R}(w)\) is the regularisation term.

The gradient of the objective function is \[ \nabla_w\mathcal{L}=\frac{1}{N}\sum_{i=1}^N\nabla_w\ell(w;x_i,y_i)+\lambda\nabla_w\mathcal{R}(w) \] For example, with Ridge Regression, we have \[ \mathcal{L}(w)=\frac{1}{N}\sum_{i=1}^N(w^Tx_i-y_i)^2+\lambda w^Tw \]

\[ \nabla_w\mathcal{L}=\frac{1}{N}\sum_{i=1}^N2(w^Tx_i-y_i)x_i+2\lambda w \]

As part of the learning algorithm, we calculate the gradient. Suppose we pick a random datapoint \((x_i,y_i)\) and evaluate \(g_i=\nabla_w\ell(w;x_i,y_i)\). Then we have \[ \mathbb{E}[g_i]=\frac{1}{N}\sum_{i=1}^N\nabla_w\ell(w;x_i,y_i) \] In expectation, \(g_i\) points in the same direction as the full gradient (up to the regularisation term).

In SGD, we compute the gradient at one data point instead of at all data points.

In practice, mini-batch gradient descent significantly improves performance: it reduces the variance of the gradient estimates and is hence more stable than pure SGD.
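A sketch of mini-batch SGD for the ridge objective above (batch size, step size and iteration count are illustrative, not tuned):

```python
import numpy as np

def minibatch_sgd_ridge(X, y, lam=0.1, eta=1e-2, batch_size=16, iterations=2000, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iterations):
        idx = rng.choice(N, size=batch_size, replace=False)   # random mini-batch
        Xb, yb = X[idx], y[idx]
        # Unbiased estimate of the gradient of (1/N) sum (w.x_i - y_i)^2 + lam w.w
        grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb) + 2 * lam * w
        w = w - eta * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(scale=0.1, size=1000)
print(minibatch_sgd_ridge(X, y))   # roughly recovers the generating weights (shrunk by lam)
```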

Sub-Gradient Descent

Minimising the Lasso Objective

Recall that for the Lasso objective, i.e. a linear model with \(\ell_1\)-regularisation, we have \[ \mathcal{L}(w)=\sum_{i=1}^N(w^Tx_i-y_i)^2+\lambda\sum_{i=1}^D|w_i| \]

\[ =w^TX^TXw-2y^TXw+y^Ty+\lambda\sum_{i=1}^D|w_i| \]

Sub-gradient descent focuses on the case when \(f\) is convex: \[ f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y), \forall x,y, \forall \alpha\in[0,1] \] At \(x_0\), any \(g\) satisfying the following inequality for all \(x\) is called a sub-gradient at \(x_0\): \[ f(x)\geq f(x_0)+g^T(x-x_0) \]

Constrained Convex Optimisation

With constraints on the parameters, we extend the approach from gradient descent to projected gradient descent.

Projected Gradient Descent

The goal of projected gradient descent is to minimise \(f(w)\) subject to the additional constraint \(w\in C\). Formally, we have \[ z_{t+1}=w_t-\eta_t\nabla f(w_t) \]

\[ w_{t+1}=\text{argmin}_{w_c\in C}||z_{t+1}-w_c|| \]

Each gradient step is followed by a projection step.

Other methods

Second Order Methods

In calculus, our goal is to find the roots of a differentiable function \(f\), i.e. solutions to \(f(x)=0\). In optimisation, the goal is to find the roots of \(f'\), i.e. solutions to \(f'(x)=0\). In this case,

Newton Methods in One Dimension

Multiclass Classification

Classifiers

In practice, there are two common approaches

Measuring Performance

In regression, we use the same loss function on test data as for training. In classification, we use the number of misclassified data points as the classification error.

Confusion Matrix

For binary classification, we can use the confusion matrix.

                  Actual: Yes        Actual: No
Predicted: Yes    True Positive      False Positive
Predicted: No     False Negative     True Negative

The confusion matrix can be easily extended into multiclass classification.

Receiver Operating Characteristic (ROC)

ROC space is defined by \(\text{FPR}\) (x-axis) and \(\text{TPR}\) (y-axis).

Area under the ROC Curve (AUC)

AUC makes it easy to compare ROC curves for different classifiers.

Precision-Recall Curves

Beyond ROC curves, precision-recall curves replace \(\text{FPR}\) with \(\text{Precision}\) (the other axis, \(\text{TPR}\), is the recall).

Principal Component Analysis (PCA)

Dimensionality Reduction

Real-world data can contain redundant information and correlated variables. Dimensionality reduction projects the data into a lower-dimensional subspace while still capturing the essence of the original data. It is usually considered a preprocessing step before applying other learning algorithms.

PCA is used for dimensionality reduction by identifying a small number of directions which explain most of the variation in the data.

PCA

First we find the first principal component: the line passing through the multidimensional mean that minimises the sum of squared errors (distances of the points to the line).

The second principal component is computed in the same way as the first, after the correlation with the first principal component has been subtracted from the points. The second principal component is orthogonal to the first principal component (PC).

Two Views of PCA

Maximum Variance: we want to find \(k\) directions that maximise the variance in the data.

Formally, our data matrix is \(X=[x_1^T,\cdots,x_N^T]^T\), and our goal is to find \(v_1\in\mathbb{R}^D\), \(||v_1||=1\), that maximises \(||Xv_1||^2\).

The solution can be derived by:

By applying the Rayleigh quotient, we get \[ v_1=\text{argmax}_{||v||=1}(v^TX^TXv)=\text{argmax}_v\frac{v^TX^TXv}{v^Tv} \] Since \(X^TX\) is PSD, the largest eigenvalue of \(X^TX\) is the maximum value attained by \(v^TX^TXv\) for \(||v||=1\), and \(v_1\) is the corresponding eigenvector.

Best Reconstruction: we want to find \(k\) dimensional subspace with least reconstruction error.

Formally, given i.i.d data matrix \(X=[x_1^T,\cdots,x_N^T]^T\), our goal is to find a \(k\)-dimensional linear projection that best represents the data.

In case of one PC, the solution is:

In case of \(k\) PCs, the solution is

Finding PCs with SVD

PCA can be computed via the SVD of \(X=U\Sigma V^T\): the first \(k\) PCs are the first \(k\) columns of \(V\).
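A sketch of this connection (centring the data first, since the PCs pass through the mean; k is the number of components kept and the data is synthetic):

```python
import numpy as np

def pca_svd(X, k):
    Xc = X - X.mean(axis=0)                 # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                     # first k rows of V^T = first k columns of V
    Z = Xc @ components.T                   # projection of the data onto the k PCs
    return components, Z

rng = np.random.default_rng(0)
# 2-D data that varies mostly along the direction (2, 1).
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(200, 2))
components, Z = pca_svd(X, k=1)
print(components)   # roughly proportional to (2, 1) / sqrt(5)
```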

Sidenotes

Ridge Regularizer \[ \mathcal{L}_\text{ridge}(w)=(Xw-y)^T(Xw-y)+\lambda\sum_{i=1}^Dw_i^2 \] Lasso Regularizer \[ \mathcal{L}_\text{Lasso}(w)=(Xw-y)^T(Xw-y)+\lambda\sum_{i=1}^D|w_i| \]

Generative Models and Discriminative Models

Generative \(P(X|Y=y)\): Naive Bayes; Linear Discriminant Analysis;

Discriminative \(P(Y|X=x)\): logistic regression; linear regression; K-NN; SVM; Neural Networks;

Newton Methods \[ x_{k+1}=x_k-\frac{f'(x_k)}{f''(x_k)} \] Naive Bayes \[ P(Y=c|X_i=x_i)=\frac{P(x_i|Y=c)P(Y=c)}{\sum_{y_i}P(x_i|Y=y_i)P(Y=y_i)} \] Example: exercise 4.4.

Linear Regression \[ \hat{y}=Xw=X(X^TX)^{-1}X^Ty \] For this to have a solution, the matrix \(X^TX\) must have full rank \(D\); since the rank is bounded by \(\text{min}\{D,N\}\), we need \(D\leq N\).

Distance from point to hyperplane

Hyperplane: \(w^Tx+b=0\), point \(x_0\); then the distance is \[ \frac{|w^Tx_0+b|}{||w||} \]

SVM in Dual Form