Introduction to Machine Learning

This note is based on the Spring 2019 COMPSCI 189/289A course at the University of California, Berkeley, taught by Jonathan Shewchuk.

# Supervised Learning

Classification

example 1: classify digits 1 and 7

  1. $N\times N$ pixel matrices
  2. flatten each matrix into a vector
  3. create a classifier in the resulting $N^2$-dimensional feature space (see the sketch below)
    Note: for a linear classifier, the decision boundary is a hyperplane
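
A minimal sketch of steps 1–2, assuming a NumPy array `img` holds the $N\times N$ pixel values (the array name, size and data are illustrative, not from the course):

```python
import numpy as np

N = 28
img = np.random.rand(N, N)   # placeholder N x N pixel matrix
x = img.flatten()            # flatten into a vector of length N * N
print(x.shape)               # (784,): one point in the N^2-dimensional feature space
```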

example 2: Bank loan default prediction

See notes for M3S17 Quantitative Methods for Finance

Features / Independent Variables / Predictor Variables

Different names for the same thing: the measured attributes that make up each sample point (the coordinates of the feature vector).

Overfitting

The model is shaped too specifically to one particular data set, so it is not predictive on new data.

The test error gets worse because the classifier becomes too sensitive to outliers or to other spurious patterns.

Sinuous decision boundaries fit the sample points so well that they do not classify future points well.

Quantify Overfitting

Decision Boundary

The boundary chosen by the classifier to separate items in the class from those that are not.

Decision Function

A function $f(x)$ that maps a sample point to a scalar value such that
$$
f(x) > 0 \quad \text{if } x \in \text{class } C,\\
f(x) \leq 0 \quad \text{if } x \notin \text{class } C.
$$

For such a decision function, the decision boundary is $\{x : f(x)=0\}$, usually a $(d-1)$-dimensional surface in $\mathbb{R}^d$.

Isosurface/iso contours

An isosurface of a function $f$ is $\{x : f(x) = c\}$; here $c = 0$ is the isovalue.
Note: ‘iso-‘ prefix means ‘equal-‘

Linear classifier

The decision boundary is a line/plane (a hyperplane), usually given by a linear decision function.

$$
x = (x_1, x_2, \ldots, x_d)^T
$$

Conventions:

Uppercase Roman: matrix, random variable, set
Lowercase Roman: vector
Greek: scalar
Other scalars:

  • n, number of sample points
  • d, number of features
  • i, j, k, integer indices

Functions: f(), s(), …

Norms

  • Euclidean norm: $|x| = \sqrt{\sum_i x_i^2}$
  • Normalize a vector: $\frac{x}{|x|}$
  • dot product: $x \cdot y = \sum_i x_i y_i$
    • length: $|x| = \sqrt{x \cdot x}$
    • angle: $\cos(\theta) = \frac{x \cdot y}{|x||y|}$

hyperplane

Given a decision function $f(x) = w \cdot x + \alpha$, the set $H = \{x : w \cdot x = -\alpha\}$ is the decision boundary.
The set H is a hyperplane.

property
For any $x, y$ on H, $w \cdot (y - x) = 0$.

normal vector w

If $w$ is a unit vector, $w \cdot x + \alpha$ is the signed distance from $x$ to H,
i.e. positive on one side of H, negative on the other side.

Note: the distance from H to the origin is $\alpha$ (when $w$ is a unit vector).
Note2: $\alpha = 0$ iff H passes through the origin.
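
A small sketch of the signed-distance idea, with a made-up weight vector and offset (the numbers are illustrative only):

```python
import numpy as np

w = np.array([3.0, 4.0])
alpha = -10.0
w_unit = w / np.linalg.norm(w)           # normalize so f(x) is a signed distance
alpha_unit = alpha / np.linalg.norm(w)

def f(x):
    # signed distance from x to the hyperplane {x : w . x = -alpha}
    return w_unit @ x + alpha_unit

print(f(np.array([4.0, 2.0])))   # positive: one side of H
print(f(np.array([0.0, 0.0])))   # equals alpha_unit: signed distance of the origin
```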

weights

The coefficients in $w$, together with $\alpha$, are called weights or regression coefficients.

Linearly separable

The input data is linearly separable if there exists a hyperplane that separates all the sample points in C from those not in C.

Centroid classifier

Compute the mean $\mu_C$ of all vectors in class C and the mean $\mu_X$ of all vectors NOT in class C.

  • Decision function (a code sketch follows this list):
    $f(x)=(\mu_C-\mu_X)\cdot x-(\mu_C-\mu_X)\cdot \frac{\mu_C+\mu_X}{2}$
    $(\mu_C-\mu_X)$ is the normal vector
    $\frac{\mu_C+\mu_X}{2}$ is the midpoint between $\mu_C$ and $\mu_X$
    so the decision boundary is the hyperplane that perpendicularly bisects $\overline{\mu_C\mu_X}$
  • good at: classifying samples drawn from two Gaussian/normal distributions, especially when the sample size is large
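
A minimal NumPy sketch of this centroid rule; the data matrix `X`, labels `y` and the synthetic Gaussian data are illustrative, not from the course:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, size=(50, 2)),    # class C
               rng.normal(-2.0, 1.0, size=(50, 2))])  # not in class C
y = np.array([1] * 50 + [-1] * 50)

mu_c = X[y == 1].mean(axis=0)      # centroid of class C
mu_x = X[y == -1].mean(axis=0)     # centroid of the rest
w = mu_c - mu_x                    # normal vector
mid = (mu_c + mu_x) / 2            # midpoint of the two centroids

def predict(x):
    # positive side of the bisecting hyperplane -> class C
    return 1 if w @ (x - mid) > 0 else -1

print(predict(np.array([1.5, 1.0])))
```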

Perceptron Algorithm

Slow but correct for linearly separable points.
Uses a numerical optimisation algorithm, namely gradient descent.

Sample points $X_1, X_2, \ldots, X_n$.
For each sample point, $y_i = 1$ (if $X_i \in C$) or $y_i = -1$ (if $X_i \notin C$).

For simplicity, assume the decision boundary passes through the origin. A small sketch of the update rule follows.
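
A minimal sketch of the classic perceptron update for the through-the-origin case (the learning rate, stopping rule and names are illustrative; this is not necessarily the exact variant presented in lecture):

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=100):
    """X: (n, d) sample points, y: labels in {+1, -1}. Returns weight vector w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:      # X[i] is misclassified (or on the boundary)
                w += lr * y[i] * X[i]       # nudge w toward correctly classifying X[i]
                mistakes += 1
        if mistakes == 0:                   # converged: all points classified correctly
            break
    return w
```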

Regression

# Unsupervised Learning

Clustering

Dimensionality Reduction

Validation

Hold back a subset of the training data as a validation set, used to tune hyperparameters and choose the model.

test set: final evaluation

  1. train a classifier multiple times, with different models/hyperparameters
  2. test each one on NEW data (the validation set)
  3. choose the setting that gives the best validation result (a small sketch follows this list)
    Why? We want the model to work well on data in general. In many cases (k-NN, for example), a point already used in training will always give the right output when fed back in, which is not valuable for our evaluation of the model.
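
A short sketch of this tuning loop with scikit-learn; the synthetic data, split size and candidate k values are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# hold back a validation set from the training data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_k, best_acc = None, 0.0
for k in (1, 3, 5, 9, 15):                        # candidate hyperparameter values
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)                 # evaluate on data NOT used for training
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, best_acc)
```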

Types of error

  1. training set error: the fraction of training images not classified correctly
  2. test set error: the fraction of new data that is misclassified

outlier:

points that are atypical, far from the pattern of the rest of the data

hyperparameters

Parameters that control overfitting/underfitting (e.g. k in k-NN).


Quantitative Methods in Retail Finance Note 4

These notes follow material from the slides for M3S17 at the Imperial College London Mathematics Department by Dr. Tony Bellotti. Questions mentioned refer to past exam papers of M3S17 at Imperial College London.

# Chapter 3: Markov Transition Model

Markov Transition Models

Why Markov Chain Model?

This model allows us to track changes in the state of an account over time.
e.g. Tracking credit card use.

First-order Markov Transition Model

  • Let $(X_1, X_2, \ldots, X_n)$ be a sequence of discrete random variables that take values from $\{1,2,\ldots,K\}$ with $K$ fixed.
  • first-order finite value Markov chain
    The sequence is said to be a first-order finite value Markov chain if
    $$
    P(X_{n+1}=j|X_0=x_0,X_1=x_1,X_2=x_2,…,X_{n-1}=x_{n-1},X_n = i) = P(X_{n+1}=j|X_n = i)
    $$
    for all $n, x_0,x_1, …, x_{n-1}$ and $i,j$ such that $1\leq i,j \leq K$.
  • transition probability
    Transition probability $p_n$ is defined as
    $$
    p_{n}(i,j)=P(X_n=j|X_{n-1}=i)
    $$
  • transition matrix
    Transition matrix $P_n$ is defined as a $K \times K$ matrix such that
    $$
    P_n[i,j] = p_{n}(i,j)
    $$
    $$
  • structural zeroes
    If it is certain that no transition will be made from i to j in the $n$th period, then set $p_n(i,j)=0$, this is called a structural zero.

### Propagation of Markov Chain
If we know the transition matrices from period $n-r$ to $n$, i.e. $P_{n-r+1}, P_{n-r+2}, \ldots, P_{n}$ are known, then we can also obtain the distribution of the state in the $n$th period:
$$
P(X_{n}=j|X_{n-r}=i) = [P_{n-r+1}P_{n-r+2}\cdots P_{n}][i,j]
$$
for any $1\leq r \leq n$.

Forecasting from state 0:
$$
P(X_{n}=j|X_{0}=i) = [P_{1}P_{2}\cdots P_{n}][i,j]
$$
Let $\pi_{n}$ be the marginal distribution for $X_n$:
$$\pi_n = (P(X_n = 1),P(X_n = 2),\ldots,P(X_n = K))^T$$

Then,
$$\pi_n^T = \pi_0^T(P_1 P_2 \cdots P_n)$$
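
A tiny NumPy sketch of this propagation for a chain with a fixed transition matrix (the matrix and initial distribution are made up for illustration):

```python
import numpy as np

# made-up 3-state transition matrix: P[i, j] = P(X_n = j | X_{n-1} = i)
P = np.array([[0.8, 0.15, 0.05],
              [0.2, 0.70, 0.10],
              [0.0, 0.30, 0.70]])
pi0 = np.array([1.0, 0.0, 0.0])     # start in state 1 with certainty

pi_n = pi0.copy()
for _ in range(5):                  # propagate 5 periods: pi_5^T = pi_0^T P^5
    pi_n = pi_n @ P

print(pi_n, pi_n.sum())             # marginal distribution of X_5 (sums to 1)
```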

Stationary Markov Chain

  • stationary
    A Markov chain is stationary if $p_n(i,j)=p(i,j)$ for all $n$ and all $i,j$ with $1\leq i, j \leq K$, for some transition probability $p$.
  • stationary distribution
    A stationary distribution for a transition matrix $P$ is a distribution $\pi^\star$ such that $\pi^\star = \pi^\star P$ (treating $\pi^\star$ as a row vector).

    In practice, most Markov chains converge (as $n$ grows) to a stationary distribution.
    (Markov chains which have a periodicity in state change do not necessarily converge, but we do not cover this material in this course.)

Estimation

The aim of this section, given some data, is to estimate $\mathbf{\theta}=(\theta_{ij})_{i,j=1}^{K}$ where $\theta_{ij}=p(i,j)$.
Maximum likelihood estimator for $\mathbf{\theta}$:
$$
\hat\theta_{ij}=\hat p(i,j)=\frac{n_{ij}}{\sum_{l=1}^{K}n_{il}}
$$
where $n_{ij}=|\{t : x_{t-1}=i,\ x_{t}=j,\ \text{for } t\in \{1,2,\ldots,n\}\}|$, i.e. the total number of observed transitions from state $i$ to state $j$ during periods 1 to $n$.
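
A quick NumPy sketch of this count-based MLE from an observed state sequence (the sequence is synthetic and the state labels are 0-based for convenience):

```python
import numpy as np

K = 3
x = np.array([0, 0, 1, 2, 1, 1, 0, 2, 2, 1])    # made-up observed sequence x_0, ..., x_n

counts = np.zeros((K, K))                        # counts[i, j] = n_ij
for t in range(1, len(x)):
    counts[x[t - 1], x[t]] += 1

row_sums = counts.sum(axis=1, keepdims=True)
theta_hat = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
print(theta_hat)                                 # each nonzero row sums to 1: estimated p(i, .)
```
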
proof

2017Q1
(c) Derive the maximum likelihood estimator for the first order finite-valued Markov transition model, assuming stationarity.

  1. likelihood function

$P(X_0=x_0,X_1=x_1,\ldots,X_n=x_n)$
$=P(X_0=x_0)\prod_{t=1}^{n} P(X_{t}=x_t|X_{t-1}=x_{t-1},X_{t-2}=x_{t-2},\ldots,X_{0}=x_{0})$ (chain rule)
$=P(X_0=x_0)\prod_{t=1}^{n} P(X_{t}=x_t|X_{t-1}=x_{t-1})$ (Markov property)
$=P(X_0=x_0)\prod_{i=1}^{K}\prod_{j=1}^{K}[p(i,j)]^{n_{ij}}$
$=P(X_0=x_0)\prod_{i=1}^{K}\prod_{j=1}^{K}[\theta_{ij}]^{n_{ij}}$
$$
\text{where } n_{ij}=|\{t : x_{t-1}=i,\ x_{t}=j,\ \text{for}\ t\in \{1,\ldots,n\}\}|
$$

  2. log likelihood function
  3. differentiate and find MLE
  4. adjust to constraints
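
A brief sketch of how steps 2–4 complete the derivation, written out here for convenience (this is the standard Lagrange-multiplier argument, not copied from the slides):
$$
\ell(\theta) = \sum_{i=1}^{K}\sum_{j=1}^{K} n_{ij}\log\theta_{ij}, \quad \text{subject to } \sum_{j=1}^{K}\theta_{ij}=1 \ \text{for each } i
$$
Maximise each row separately with a Lagrange multiplier $\lambda_i$:
$$
\frac{\partial}{\partial \theta_{ij}}\Big(\sum_{j} n_{ij}\log\theta_{ij} - \lambda_i\big(\sum_{j}\theta_{ij}-1\big)\Big) = \frac{n_{ij}}{\theta_{ij}} - \lambda_i = 0 \;\Rightarrow\; \theta_{ij} = \frac{n_{ij}}{\lambda_i}
$$
Imposing the constraint gives $\lambda_i=\sum_{l=1}^{K} n_{il}$, hence $\hat\theta_{ij}=\frac{n_{ij}}{\sum_{l=1}^{K}n_{il}}$.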

Testing First-order Assumptions

To test whether a first-order Markov chain is a suitable assumption in a given situation,
we use a chi-square hypothesis test.

second order Markov chain

A sequence $(X_1,\ldots,X_n)$ is a second-order Markov chain if
$P(X_{n+1}=j|X_0=x_0,\ldots,X_{n-1}=k,X_{n}=i)=P(X_{n+1}=j|X_{n-1}=k,X_{n}=i)$

The transition probability:
$p_n(k,i,j)=P(X_n=j|X_{n-1}=i, X_{n-2}=k)$

The MLE for a second-order Markov chain can be derived similarly:
$\hat p(k,i,j)=\hat\theta_{kij}=\frac{n_{kij}}{m_{ki}}$
where $n_{kij}=|\{t : x_{t-2}=k,\ x_{t-1}=i,\ x_{t}=j\ \text{for}\ t \in \{2,\ldots,n\}\}|$,
$m_{ki}=|\{t : x_{t-2}=k,\ x_{t-1}=i\ \text{for}\ t \in \{2,\ldots,n\}\}|=\sum_{l=1}^{K}n_{kil}$

  1. $H_{0}$
    Null hypothesis:
    $p(1,i,j)=p(2,i,j)=\ldots=p(K,i,j)=p(i,j)$
    where $p(k,i,j)$ is the second-order Markov transition probability of moving from state $k$ to $i$ and then to $j$.

  2. $n_k$
    $n_k=|\{t : x_t=k\ \text{for}\ t \in \{1,\ldots,n\}\}|$,
    i.e. the observed number of times each state $k$ occurs

  3. $O_{kj}$
    The observed number of times state $k$ is followed by $i$ and then $j$:
    $O_{kj}=n_{kij}$

  4. $E_{kj}$
    The expected number of times state $k$ is followed by state $i$ and then $j$, given the null hypothesis, is:
    $E_{kj}=n_{k}\hat p(k,i)\hat p(i,j)\approx m_{ki}\hat p(i,j)$
    because $\hat p(k,i)=\frac{n_{ki}}{\sum_{l=1}^{K}n_{kl}}\approx \frac{m_{ki}}{n_k}$

  5. Pearson chi-square test
    $$
    \chi^2 = \sum_{k\in S_1}\sum_{j\in S_2}\frac{(O_{kj}-E_{kj})^2}{E_{kj}}
    $$

  • $S_1$ is the set of states which do not have a structural zero moving to state $i$
  • $S_2$ is the set of states which do not have a structural zero moving from state $i$

  6. Degrees of freedom
    $df = (K-1-z_{1})(K-1-z_{2})$
  • $z_1$: the number of structural zeroes in the transition from $k$ to $i$.
  • $z_2$: the number of structural zeroes in the transition from $i$ to $j$.
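
A rough NumPy sketch of this test for one fixed middle state $i$, using a synthetic sequence and ignoring structural zeroes (so $z_1=z_2=0$); it is only meant to make the counting concrete:

```python
import numpy as np

K, i = 3, 1                                  # number of states; the fixed middle state
rng = np.random.default_rng(0)
x = rng.integers(0, K, size=500)             # synthetic observed state sequence

# first-order counts n_ij and second-order counts n_kij
n1 = np.zeros((K, K))
n2 = np.zeros((K, K, K))
for t in range(1, len(x)):
    n1[x[t - 1], x[t]] += 1
for t in range(2, len(x)):
    n2[x[t - 2], x[t - 1], x[t]] += 1

p_hat = n1[i] / n1[i].sum()                  # first-order estimates p_hat(i, j)
m_ki = n2[:, i, :].sum(axis=1)               # m_ki: number of times k is followed by i

O = n2[:, i, :]                              # O_kj = n_kij
E = np.outer(m_ki, p_hat)                    # E_kj ~ m_ki * p_hat(i, j)
chi2 = ((O - E) ** 2 / E)[E > 0].sum()
df = (K - 1) * (K - 1)                       # (K-1-z1)(K-1-z2) with no structural zeroes
print(chi2, df)
```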

Roll-rate Model

Quantitative Method in Retail Finance Note 1

These notes follow material from the slides for M3S17 at the Imperial College London Mathematics Department by Dr. Tony Bellotti. Questions mentioned refer to past exam papers of M3S17 at Imperial College London.

Chapter 7: Fraud Detection in Retail Credit

Types of Fraud:

  • Theft
  • Counterfeit
  • Application
  • Card-not-present/online

Automated Fraud Detection

This is essentially a classification problem: legitimate transaction Y=1, fraudulent transaction Y=0.

Special problems for fraud detection (that distinguish it from other classification problems)

2017 Q3(a) What special characteristics does fraud detection have, considered as a classification problem?
∗ Need to process millions of transactions in real time.
∗ Highly imbalanced classification problem: the ratio of fraudulent to legitimate transactions is typically very small.
∗ Nature of fraud is reflexive. That is, fraudsters adapt to the detection methods applied by banks to stop them.

  1. High volume data.
  2. Data highly imbalanced.
  3. Reflexive nature.

How does the fraud detection problem differ from an application scoring model?
There is no need to explain the model, so complicated nonlinear models are allowed.

Business Rule method

For example: the card is used abroad,
and it has not been used in that country in the past year,
and the card holder did not tell the bank they would be abroad.

Predictive Model method

Construct a two-class classifier - a fraud scorecard.

  • the higher the score, the more trustworthy the account is.
  • classifier model with function $f$: $\hat{y} = f(x,\theta)$, and estimate $\theta$ from the data $x$.
  • filter out some legitimate transactions to balance the training data set, e.g. discard small-amount transactions, regular transactions and inactive accounts.
    Past research has shown that a linear model is not enough; non-linear classifiers such as ANNs work better.
  • This method works well at detecting known types of fraudulent transaction, but it is not able to deal with new fraud types.

Anomaly Detection method

Another method is to model only the legitimate transactions and flag any abnormal new transaction.

Comparing with predictive modelling:

Pros:

  1. it can detect new types of fraud
  2. it does not suffer from the imbalanced-data problem (because of 1)

Con:

  1. it is not as sensitive to fraudulent transactions that are similar to legitimate ones

A one-class classifier models only the pdf of legitimate transactions over the predictor variables.

2017Q3(b) What are the key differences between one-class and two-class classifiers for fraud?
∗ Two-class classifiers are modelling the difference between fraudulent and legitimate transactions.
∗ One-class classifiers just model distribution of legitimate transactions.
∗ One-class classifiers are well-suited to deal with the reflexivity of fraud, since they do not explicitly model the existing pattern of fraud.

Steps for Anomaly Detection:

  1. Drop abnormal transactions from the training data: fraudulent transactions, errors, and genuine but outlying transactions.
  2. Let $S = (\mathbf{x_{1}},\mathbf{x_{2}},…,\mathbf{x_{n}})$ be a training sequence of legitimate transactions.
  3. Estimate the density $f(\mathbf{x}|S,\gamma)$, where $\gamma$ is an estimation parameter.
  4. A classification decision is made for a new observation $\mathbf{x_{new}}$ by:
    $$
    \hat{y} = I(f(\mathbf{x_{new}}|S,\gamma)>\theta)
    $$
    where $\theta$ is a threshold on the density.

    2017Q3(e) Which of these will register a fraud alert, based on your result in part (d)?
    ANS: T2 and T3 have density less than θ = 0.05 and so will register a fraud alert.

The threshold $\theta$ is set manually, based on the (sensible) strategy of controlling the fraction $\epsilon$ of legitimate cases that are classified as anomalous, using the training data.
—- This is constrained by how many false alerts can be tolerated / how much business resource can be put into follow-ups.

2017 Q3(d) Given a false alarm rate ε = 1/8, and based only on this data, what is the value of θ?
ANS:
Use the formula
$$
\max \theta \ \text{ s.t. } \sum^{n}_{i=1}I(f(x_i) > \theta) \geq n(1 - \epsilon)
$$
where $n(1 - \epsilon) = 24 \times 7/8 = 21$.
Therefore, counting off the lowest $(24 - 21) = 3$ densities from the table gives $\theta = 0.05$.

i.e.
$$
\min \theta \ \text{ s.t. } \sum_{i=1}^{n}I(f(\mathbf{x}_{i}|S,\gamma)\leq\theta)=\lfloor n\epsilon\rfloor
$$
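
A tiny sketch of this threshold rule in code, assuming we already have the estimated densities $f(x_i)$ of the training points (the numbers are made up, not the exam table):

```python
import numpy as np

dens = np.array([0.02, 0.31, 0.04, 0.27, 0.05, 0.18, 0.22, 0.40])  # made-up f(x_i) values
eps = 1 / 8                                                        # target false alarm rate

n_flag = int(np.floor(len(dens) * eps))   # number of training points allowed below theta
theta = np.sort(dens)[n_flag - 1]         # smallest theta with exactly n_flag densities <= theta
print(theta)                              # here: 0.02
```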

Kernel Density Estimator

Crude empirical distribution function:
$$
\hat{F}_{\mathrm{EMP}}(x)= \frac{1}{n}\sum_{i=1}^{n}I(x_{i}\leqslant x)
$$

Kernel (Parzen) density estimator:

2017Q3(c) Describe Parzen density estimation.

$$
\hat{f}(x) = \frac{1}{nh^m}\sum_{i=1}^{n} K\Big(\frac{x-x_{i}}{h}\Big)
$$
subject to:

  1. $K(z) = K(-z)$
  2. $\int K(z)\, dz = 1$
  3. $\mathbf{x_1}, \mathbf{x_2}, …, \mathbf{x_n}$ are the observations, each of dimension $m$
  4. $h$ is a bandwidth constant
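
A small NumPy sketch of a Parzen estimator with a Gaussian kernel (the kernel choice, bandwidth and synthetic data are illustrative; the slides may use a different kernel):

```python
import numpy as np

def parzen_density(x_new, X, h):
    """Parzen estimate of the density at x_new given training points X (n x m)."""
    n, m = X.shape
    z = (x_new - X) / h                                   # scaled differences (x - x_i) / h
    K = (2 * np.pi) ** (-m / 2) * np.exp(-0.5 * np.sum(z ** 2, axis=1))  # Gaussian kernel
    return K.sum() / (n * h ** m)

rng = np.random.default_rng(0)
X_legit = rng.normal(size=(200, 2))                          # made-up legitimate transactions
print(parzen_density(np.array([0.0, 0.0]), X_legit, h=0.5))  # high density: looks legitimate
print(parzen_density(np.array([5.0, 5.0]), X_legit, h=0.5))  # low density: possible fraud alert
```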

Available data

  • Account Data
  • Transaction Data
  • Personal Data
  • Location data

Social Network Analysis

Use weights $w_{ij}$ to estimate an association feature $F_{i}$, the degree of connectedness to people committing fraud, and hence derive a fraud propensity from it.

2017Q3(f) Describe in words what Fi is measuring?
(f) Variable $F_{i}$ measures the degree of connectedness to other people committing fraud.

Formula:

$$
F_{i} = \frac{1}{\sum_{j=1}^{n}w_{ij}} \sum_{j=1}^{n} w_{ij}p_{j}
$$
where $p_{j}$ is the fraud indicator (or propensity) of person $j$.

2017Q3(g) What range of values can Fi take?
$0\leq F_i \leq 1$

2017Q3(h) Calculate the value of Fi for T4 (node 3)?
Use formula in the question to get (4 + 1)/(2 + 4 + 1 + 1) = 5/8 = 0.625.
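
The exam arithmetic above can be reproduced with a tiny script; the weights and fraud indicators below are simply ones consistent with that calculation, written out for illustration:

```python
import numpy as np

w = np.array([2.0, 4.0, 1.0, 1.0])   # weights w_ij from node i to its neighbours j
p = np.array([0.0, 1.0, 0.0, 1.0])   # p_j = 1 if neighbour j is known to commit fraud

F_i = (w * p).sum() / w.sum()
print(F_i)                           # (4 + 1) / (2 + 4 + 1 + 1) = 0.625
```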

2017Q3(i) How are values of Fi broadly affecting the density?
Low values of $F_i$ induce a small increase in density.
High values of $F_i$ induce a large decrease in density.
A low density means the transaction is flagged as fraudulent.

Model Evaluation

Special performance considerations for this problem:

  1. the proportion of fraud detected
  2. keeping the false alarm rate low, so as not to upset customers
  3. the cost of monitoring automated fraud alerts

Receiver-operating characteristics (ROC)

Two CDF measures, for $y \in \{0,1\}$:
$$F_{y}(c)=P(S\leq c|Y=y)$$

  • y=0: given a threshold c, the probability that a fraudulent transaction is identified as fraud
  • y=1: given a threshold c, the probability that a legitimate transaction generates a false alert

Area under ROC curve(AUC)

$$
A = \int F_{1}(c)F_{0}'(c)\,dc
$$

Precision, Recall and alarm rate

  • $F(c) = P(S\leq c)$
  • $p_{0} = P(Y=0)$
  • Recall: $F_{0}(c)$, the proportion of fraud transactions that are detected
  • Precision: the proportion of alerts that are actually fraud,
    $P(Y=0|S\leq c)=\frac{F_{0}(c)p_0}{F(c)}$
  • Alarm rate: $F(c)= p_0 \frac{\text{recall}}{\text{precision}}$, the unconditional probability of an alarm (a small numeric check follows this list).
    Monitoring cost increases linearly with the alarm rate.
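
A quick sanity check of the alarm-rate relation on synthetic scores and labels (nothing here comes from the course data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
y = (rng.random(n) > 0.02).astype(int)             # Y = 0 is fraud (about 2% of cases)
# fraud tends to receive lower scores than legitimate transactions
s = np.where(y == 0, rng.normal(-1, 1, n), rng.normal(1, 1, n))

c = -0.5                                           # alert when score S <= c
p0 = (y == 0).mean()
recall = (s[y == 0] <= c).mean()                   # F_0(c)
alarm_rate = (s <= c).mean()                       # F(c)
precision = ((y == 0) & (s <= c)).sum() / (s <= c).sum()

print(alarm_rate, p0 * recall / precision)         # the two quantities agree
```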

Precision-recall (PR) curve

properties:

  • typically decreasing
  • best point: (recall, precision) = (1, 1)
  • worst case: $F(c)=F_0 (c)$, a horizontal line with precision $=p_0$

DS100&More - Data Manipulation and Cleaning

This is the first section of my study note DS100&More, which records some useful points in data ETL (Extract, Transform, Load), visualisation, and modelling. The motivation is to keep a record of my second pass through the UC Berkeley DS100 course: Principles and Techniques of Data Science. By working through this material, the reader should become familiar with some Python tools that are in use now, as well as with making some inferences from data by hand.

Please contact me at zr116@ic.ac.uk for any correction and improvement, general discussion on data science is also welcome.

Today’s Objectives

  1. EDA (Exploratory Data Analysis)
  2. Use Python pandas to do EDA

Tabular Data

DS100 Textbook: Tabular Data

The first thing to do when we have a new data set in hand is to look inside it. Data can be structured in different ways:
1.1 Comma-Separated Values (CSV)
[example]
1.2 Tab-Separated Values (TSV)
2.1 JavaScript Object Format (JSON)
[curly brackets]
3.1 eXtensible Markup Language (XML)
[how websites store info]
3.2 HyperText Markup Language (HTML)

CSV and TSV data is tabular: it is structured in rows and columns.
JSON, XML, HTML data has a hierarchical structure, like a tree.
(see Hokodo intern note for JSON processing methods)

Check file format:

  1. right click and view the file properties
  2. fancier way: use command-line interface (CLI) tools in a terminal:

```bash
ls
```

or, if you are in a Jupyter notebook:

```bash
!ls
```

(`!` followed by a command runs it as a shell command in Jupyter.)

Although all these data formats are often used, CSV is often the easiest and most frequently used structure.

Before reading in the data, we need to figure out how much memory it is going to take.

```bash
!ls -l -h folder
!du -h folder
# specific file usage
!du -sh folder/*
```

As a rule of thumb, reading in a file of x GB requires about 4x GB of memory. Pandas itself needs about 2x, but it has to share memory with other programs.

Memory Overhead
Note that memory is shared by all programs running on a computer, including the operating system, web browsers, and yes, Jupyter notebook itself. A computer with 4 GiB total RAM might have only 1 GiB available RAM with many applications running. With 1 GiB available RAM, it is unlikely that pandas will be able to read in a 1 GiB file.
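
A minimal sketch of that check, using only the standard library; the path `data/file.csv` and the 4x multiplier follow the rule of thumb above and are illustrative:

```python
import os

path = "data/file.csv"                        # hypothetical file to be read
size_gb = os.path.getsize(path) / 1024 ** 3   # file size in GiB
print(f"file size: {size_gb:.2f} GiB, rough memory needed: {4 * size_gb:.2f} GiB")
```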

  • To view CSV data in Python:

```python
import pandas as pd
data = pd.read_csv('')   # path to the CSV file goes here
```

  • How many rows and columns:

```python
data.shape
```

Check the size of the data before inspecting it, to avoid printing too many columns, which is not helpful for getting to know the data.

  • Total size:

```python
data.size
```

  • Inspect important statistical values of the numerical variables in a data set:

```python
data.describe()
```

This returns count, mean, standard deviation, min, max, and quartiles. Details

  • Inspect the first/last several rows (default 5):

```python
data.head(5)  # first 5 rows
data.tail(3)  # last 3 rows
```
  • Other common commands:

```python
# set the 'col_name' column as the index column; inplace=True replaces the old index in place
data.set_index('col_name', inplace=True)

# list all column names
data.columns

# selection
## select entries in column 'col1'
data['col1']
## select two columns: col1, col2
data[['col1', 'col2']]
## turn a selected column (a Series) into a DataFrame
data['col1'].to_frame()
## select rows that have val1 in col1
data[data['col1'] == val1]

# column manipulation
## count the occurrences of each value in a column
data['col1'].value_counts()
## print the unique values in a column
data['col1'].unique()
## statistical values of a column
data['col1'].min()
data['col1'].max()
data['col1'].median()

# locators
data.loc[row_list, col_list]
data.iloc[[row_index], [column_index]]
```

Lagrangian Mechanics and Rigid Body Motion

This is a group research project I did in my second year at Imperial College London as a math undergraduate. I worked with the team on explaining and proving basic Lagrangian Mechanics and on the investigation into the heavy-top problem.

Here is an abstract of what this project is about. The full report reveals more about the proof, examples and Matlab code.

In our report we will discuss Lagrangian Mechanics and the Motion of Rigid Bodies. Lagrangian Mechanics is a reformulation of Classical Mechanics, first introduced by the famous mathematician Joseph-Louis Lagrange, in 1788. We shall discuss the uses of Lagrangian Mechanics and include two examples - the Spherical Pendulum and the Double Pendulum. In each case we will derive the equations of motion, and then try to solve these numerically and/or analytically. We will investigate the effect of removing the gravitational field (in the case of the Spherical Pendulum) and discuss any links between the two, as well as any implications of the solutions.

A rigid body is a collection of N points such that the distance between any two of them is fixed regardless of any external forces they are subject to. We shall look at the kinematics, the Inertia Tensor and Euler’s Equation and use these to explain the dynamical stability of rigid bodies. Symmetric tops are the main example that we will investigate and discuss. We will look into the precession rate and the spinning rate and discuss two examples, Feynman’s wobbling plate and the hula hoop. A more complicated rigid body we shall then explore is the heavy symmetric top, in which we take into account the forces exerted by a gravitational field.

Full report is here.
