This note is based on the Spring 2019 COMPSCI 189/289A course at the University of California, Berkeley, taught by Jonathan Shewchuk.

# Supervised Learning

## Classification

### Example 1: classify the digits 1 and 7

- $N\times N$ pixel matrices
- flatten each matrix into a vector
- create a classifier in the resulting $N^2$-dimensional space

Note: the decision boundary is a hyperplane

### Example 2: bank loan default prediction

See the notes for M3S17 Quantitative Methods for Finance.

### Features / Independent Variables / Predictor Variables

The measured attributes of a sample point (the coordinates of its feature vector); the three names are synonyms.

### Overfitting

The model is shaped too specifically to one particular data set, so it does not predict well on new data.

The test error worsens because the classifier becomes too sensitive to outliers or to other spurious (untrue) patterns.

Sinuous decision boundaries that fit the sample points so well that they do not classify future points well.

#### Quantify Overfitting
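One way to see overfitting numerically is to compare training and test error as model complexity grows. Below is a minimal sketch using polynomial regression on synthetic data (the data, degrees, and noise level are all made up for illustration): a high-degree polynomial interpolates the training points, driving training error toward zero, while its error on fresh points is much larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: noisy samples from an underlying linear trend.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(0, 0.2, size=10)

def errors(degree):
    """Fit a polynomial of the given degree on the training set
    and return (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

# Degree 9 interpolates all 10 training points (near-zero training
# error) but wiggles between them: the signature of overfitting.
simple_train, simple_test = errors(1)
complex_train, complex_test = errors(9)
```

Training error alone is misleading: the degree-9 model "wins" on the training set while losing badly on new data.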

### Decision Boundary

The boundary chosen by the classifier to separate items in the class from those that are not.

### Decision Function

A function $f(x)$ that maps a sample point to a scalar value such that

$$
f(x) > 0 \quad \text{if } x \in \text{class } C, \qquad
f(x) \leq 0 \quad \text{if } x \notin \text{class } C.
$$

For these decision functions, the decision boundary is $\{x : f(x) = 0\}$, **usually** a $(d-1)$-dimensional surface in $\mathbb{R}^d$.

### Isosurface / Isocontour

An isosurface of a function $f$ is $\{x : f(x) = c\}$ for some constant $c$, the **isovalue**; the decision boundary is the isosurface with isovalue $0$.

Note: ‘iso-‘ prefix means ‘equal-‘

### Linear classifier

The decision boundary is a line/plane

Usually a linear decision function.

$$

x = (x_1, x_2, \ldots, x_d)^T

$$

### Conventions:

Uppercase Roman: matrices, random variables, sets

Lowercase Roman: vectors

Greek letters: scalars

Other scalars:

- $n$, the number of sample points
- $d$, the number of features
- $i, j, k$, integer indices

Functions: $f(\cdot)$, $s(\cdot)$, …

### Norms

- Euclidean norm (length): $|x| = \sqrt{x \cdot x} = \sqrt{\sum x_i^2}$
- normalize a vector: $\frac{x}{|x|}$
- dot product: $x \cdot y = \sum x_i y_i$
- angle: $\cos\theta = \frac{x \cdot y}{|x||y|}$
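These four quantities can be checked directly with NumPy; the vectors below are arbitrary examples:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 0.0])

length = np.sqrt(x @ x)         # Euclidean norm |x| = sqrt(sum x_i^2)
unit = x / length               # normalized vector, |unit| = 1
dot = x @ y                     # dot product sum x_i * y_i
cos_theta = dot / (length * np.sqrt(y @ y))   # cosine of the angle
```

For `x = (3, 4)` the norm is 5 (the classic 3-4-5 triangle), and the angle with the x-axis has cosine 3/5.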

### hyperplane

Given a decision function $f(x) = w \cdot x + \alpha$, the set $H = \{x : w \cdot x = -\alpha\}$ is a **hyperplane**.

**property**

For any points $x, y$ on $H$, $w \cdot (y - x) = 0$.

**normal vector**: $w$

**signed distance**: $w \cdot x + \alpha$, provided $w$ is a unit vector

i.e. positive on one side of $H$, negative on the other side

Note: the distance from $H$ to the origin is $|\alpha|$ (when $w$ is a unit vector).

Note 2: $\alpha = 0$ iff $H$ passes through the origin.
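The signed-distance formula is easy to sanity-check numerically; the unit normal and offset below are made-up values:

```python
import numpy as np

# Hyperplane H = {x : w·x = -alpha}, with |w| = 1 so that
# w·x + alpha is a true signed distance.
w = np.array([0.6, 0.8])   # unit vector: 0.36 + 0.64 = 1
alpha = -2.0

def signed_distance(x):
    """Positive on one side of H, negative on the other, zero on H."""
    return w @ x + alpha

# The origin sits at signed distance alpha; the point 2*w lies on H,
# since w·(2w) = 2 = -alpha.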

### weights

The coefficients in $w$, and $\alpha$, are called **weights** or **regression coefficients**.

### Linearly separable

The input data are **linearly separable** if there exists a hyperplane that separates all the sample points in class $C$ from those not in $C$.

### Centroid classifier

Compute the mean $\mu_C$ of all vectors in class $C$ and the mean $\mu_X$ of all vectors NOT in class $C$.

- Decision function:

$$
f(x) = (\mu_C - \mu_X) \cdot x - (\mu_C - \mu_X) \cdot \frac{\mu_C + \mu_X}{2}
$$

$(\mu_C - \mu_X)$ is the normal vector.

$\frac{\mu_C + \mu_X}{2}$ is the midpoint between $\mu_C$ and $\mu_X$.

So the decision boundary is the hyperplane that bisects the segment $\overline{\mu_C \mu_X}$.

- Good at: classifying samples drawn from two Gaussian (normal) distributions, especially when the sample size is large.
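The decision function above translates directly into code; the two toy point clouds below are made up for illustration:

```python
import numpy as np

def centroid_classifier(X_in, X_out):
    """Build the centroid decision function
    f(x) = (mu_C - mu_X)·x - (mu_C - mu_X)·(mu_C + mu_X)/2.
    X_in: points in class C; X_out: points not in C."""
    mu_c = X_in.mean(axis=0)
    mu_x = X_out.mean(axis=0)
    w = mu_c - mu_x                  # normal vector of the boundary
    midpoint = (mu_c + mu_x) / 2     # boundary bisects segment mu_C–mu_X
    return lambda x: w @ x - w @ midpoint

# Toy data: class C clustered near (2, 2), non-C near (0, 0).
X_in = np.array([[2.0, 2.0], [2.5, 1.5], [1.5, 2.5]])
X_out = np.array([[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
f = centroid_classifier(X_in, X_out)
```

Here $f$ is positive near the class-C centroid and negative near the other centroid, with the boundary halfway between them.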

### Perceptron Algorithm

Slow but correct for linearly separable points.

Uses a numerical optimization algorithm, namely gradient descent.

Consider sample points $X_1, X_2, \ldots, X_n$.

For each sample point, the label $y_i = 1$ (if $X_i \in C$) or $y_i = -1$ (if $X_i \notin C$).

For simplicity, assume the decision boundary passes through the origin.
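Under these assumptions (labels $y_i \in \{+1, -1\}$, boundary through the origin), the classic perceptron update rule can be sketched as follows; the toy data and learning rate are made up, and this is a minimal sketch rather than the lecture's exact formulation:

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=100):
    """Learn a weight vector w with y_i * (w·X_i) > 0 for all i,
    assuming the data are linearly separable through the origin."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:     # misclassified (or on boundary)
                w += lr * y_i * x_i      # step toward correcting x_i
                updated = True
        if not updated:                  # every point classified correctly
            break
    return w

# Toy data, separable by a line through the origin.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```

For linearly separable data the loop terminates with every point on the correct side, i.e. $y_i (w \cdot X_i) > 0$ for all $i$; on non-separable data it would run until `max_epochs`.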

## Regression

# Unsupervised Learning

## Clustering

## Dimensionality Reduction

# Validation

Hold back a subset of the training data for later testing: the **validation set**, used to tune hyperparameters and choose among models.

Test set: used only for the final evaluation.

- train a classifier multiple times, with different model/hyperparameter
- test it on NEW data (the validation set)
- choose the setting that gives the best validation result

Why: we want the model to work on data in general. In most cases (k-NN, for example), feeding in already-seen training data will always give the right output, which is not valuable for evaluating the model.
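The k-NN example makes this concrete: with $k=1$, every training point is its own nearest neighbor, so training error is zero by construction and tells us nothing. A minimal sketch with a hand-rolled 1-NN classifier on made-up toy data:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training points.
    Labels are assumed to be +1 / -1."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    return 1 if nearest.sum() > 0 else -1

def error_rate(X_train, y_train, X_eval, y_eval, k):
    """Fraction of evaluation points misclassified."""
    preds = np.array([knn_predict(X_train, y_train, x, k) for x in X_eval])
    return np.mean(preds != y_eval)

# Toy data: two small clusters with labels -1 and +1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])

# Evaluating on the training set itself: 1-NN is trivially perfect,
# because each point is its own nearest neighbor.
train_err = error_rate(X, y, X, y, k=1)

# Held-out validation points give a meaningful estimate instead.
X_val = np.array([[0.2, 0.1], [0.8, 0.9]])
y_val = np.array([-1, 1])
val_err = error_rate(X, y, X_val, y_val, k=1)
```

In practice one would compute `val_err` for several values of $k$ and keep the $k$ with the lowest validation error, reserving the test set untouched for the final number.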

## Types of error

- training set error: the fraction of training images not classified correctly
- test set error: the fraction of new data misclassified

# Outliers

Sample points that are atypical.

# Hyperparameters

Control overfitting/underfitting (e.g. $k$ in k-NN).