Introduction to Data Science: A Comp-Math-Stat Approach

1MS041, 2021

©2021 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

15. Supervised learning continued

Topics

We have seen the probabilistic viewpoint of machine learning, going from the likelihood to the so-called loss function. Finding the MLE amounts to finding the minimum of the loss function.

Traditionally, however, machine learning started in principle with a single algorithm, the perceptron algorithm. The ideas come from computer science, and as such the focus and terminology are different, but let us stick to the terminology used in the book "Foundations of Data Science", Chapter 5.

Let us say that we are trying to devise a decision rule based on input data in $\mathbb{R}^d$; the decision could be binary or otherwise. The input could for instance encode the words used in an email, where we have some form of dictionary and each word is represented by a dimension. The simplest form of decision problem is a binary decision, as in the case of logistic regression (where the decision could be the most likely output). A commonly chosen example is email spam classification.

Goal: find a "simple" rule that performs well on training data

The perceptron algorithm

The perceptron algorithm tries to find a linear separator, i.e. a hyperplane in $\mathbb{R}^d$ that separates the two classes. The training data $S$ consists of pairs $(x_i,l_i)$, where $x_i$ represents our features and $l_i \in \{-1,+1\}$ our labels (or targets). The task is thus to find $w$ and $t$ such that

$$ \begin{aligned} w \cdot x_i > t \quad \text{for each $x_i$ labeled $+1$} \\ w \cdot x_i < t \quad \text{for each $x_i$ labeled $-1$} \end{aligned} $$

Adding a new coordinate to our space allows us to consider $\hat x_i = (x_i,1)$ and $\hat w = (w,t)$, which lets us rewrite the inequalities above as

$$ (\hat w \cdot \hat x_i) l_i > 0. $$

The algorithm

  1. $w = 0$
  2. while there exists $x_i$ with $(w \cdot x_i) l_i \leq 0$, update $w := w+x_il_i$
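
A minimal sketch of the algorithm in Python (the function name and the cap on the number of updates are our own choices, not from the book):

```python
import numpy as np

def perceptron(X, l, max_updates=10_000):
    """Minimal perceptron: X is (n, d), l is (n,) with labels in {-1, +1}.
    Appends the constant coordinate 1 so that w encodes (w, t)."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])  # hat{x}_i = (x_i, 1)
    w = np.zeros(Xh.shape[1])                      # start with w = 0
    for _ in range(max_updates):
        margins = (Xh @ w) * l                     # (hat w . hat x_i) l_i
        bad = np.where(margins <= 0)[0]            # misclassified points
        if bad.size == 0:
            return w                               # separator found
        i = bad[0]
        w = w + Xh[i] * l[i]                       # update w := w + x_i l_i
    raise RuntimeError("no separator found within max_updates")
```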

Theorem: Perceptron for linearly separable data

If there exists $w^\ast$ such that $w^\ast \cdot x_i l_i \geq 1$ for all $i$, then the perceptron algorithm finds a $w$ satisfying $w \cdot x_i l_i > 0$ for all $i$ in at most $r^2|w^\ast|^2$ updates, where $r = \max_i |x_i|$.

So this theorem guarantees that if the two classes can be linearly separated, then the perceptron will find a separator in finite time.

Kernels

What about non-linearly separable data? Take for instance $$ X = (B_4 \setminus B_3) \cup B_1 $$ where $B_r$ denotes the ball of radius $r$ around the origin in $\mathbb{R}^2$, and let $c^\ast = B_1$. We cannot separate these sets using a linear classifier.

We can, however, separate the image of $X$ under a suitable mapping. Namely, in $\mathbb R^2$ we can use $$ \phi(x) = (x_1,x_2,x_1^2+x_2^2) \in \mathbb R^3. $$ The image is clearly linearly separable, as we can see in the following 3d plot.
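
The plot itself is not reproduced here, but a sketch that generates comparable data and the 3d view might look as follows (the exact sampling scheme is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def sample_annulus(n, r_in, r_out):
    """Points with uniform angle and radius in [r_in, r_out]."""
    theta = rng.uniform(0, 2 * np.pi, n)
    r = rng.uniform(r_in, r_out, n)
    return np.c_[r * np.cos(theta), r * np.sin(theta)]

inner = sample_annulus(200, 0.0, 1.0)   # B_1, labeled +1
outer = sample_annulus(200, 3.0, 4.0)   # B_4 minus B_3, labeled -1

def phi(X):
    """phi(x) = (x_1, x_2, x_1^2 + x_2^2)."""
    return np.c_[X, (X ** 2).sum(axis=1)]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(*phi(inner).T, label="B_1")
ax.scatter(*phi(outer).T, label="B_4 minus B_3")
ax.legend()
plt.show()
```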

Remember the extra dimension that we always add to simplify notation. The full $\phi$ in the above example is therefore $\hat \phi(x) = (x_1,x_2,x_1^2+x_2^2,1)$.

So if we transform $x \to \phi(x)$ for some good transformation $\phi$, then our perceptron will try to solve $$ w \cdot \phi(x_i)l_i > 0. $$ Furthermore, remember how we constructed $w$ in the perceptron algorithm: starting from $w=0$ and adding terms $x_i l_i$, which here become $\phi(x_i)l_i$. The weight therefore has the form $$ w = \sum_{i=1}^n c_i \phi(x_i) $$ for numbers $c_i$, and the perceptron algorithm becomes just the addition and subtraction of 1 to certain $c_i$'s.

Furthermore $$ w \cdot \phi(x_i) = \sum_{j=1}^n c_j \phi(x_j) \cdot \phi(x_i) = \sum_{j=1}^n c_j k_{ij} $$ where $k_{ij} = \phi(x_i) \cdot \phi(x_j)$.
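
This means the perceptron can be run entirely in terms of the Gram matrix. A minimal sketch of such a kernelized perceptron (our own illustration, not the book's code):

```python
import numpy as np

def kernel_perceptron(K, l, max_updates=10_000):
    """Perceptron in the coefficients c: K is the (n, n) Gram matrix
    with K[i, j] = phi(x_i) . phi(x_j), l has labels in {-1, +1}."""
    n = K.shape[0]
    c = np.zeros(n)
    for _ in range(max_updates):
        margins = (K @ c) * l          # (w . phi(x_i)) l_i = sum_j c_j k_ij l_i
        bad = np.where(margins <= 0)[0]
        if bad.size == 0:
            return c
        i = bad[0]
        c[i] += l[i]                   # w := w + phi(x_i) l_i  <=>  c_i += l_i
    raise RuntimeError("no separator found within max_updates")
```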

Is it easy to find such a mapping $\phi$? No, it is actually quite difficult. Furthermore, if the mapping $\phi$ is high dimensional we might need to do a lot of computation, which is not efficient. If we instead had a function $k(x,y)$ that could be written as $$ k(x,y) = \phi(x) \cdot \phi(y) $$ for some $\phi$, with $k$ easier to compute, then our life would be simpler. Also, given a function $k(x,y)$, how would we know whether it is a "kernel function"?

Lemma

If the matrix $k_{ij}$ is symmetric and positive semidefinite, then there is a mapping $\phi$ such that $k_{ij} = \phi(x_i)\cdot\phi(x_j)$.

Proof

  1. $k = Q \Lambda Q^T$ (eigendecomposition)
  2. $k$ is positive semidefinite, so all eigenvalues are $\geq 0$, and we can define $B = Q \Lambda^{1/2}$.
  3. $k = B B^T$
  4. Define $\phi(x_i) = B_{i\cdot}$, i.e. the $i$:th row of $B$; then $k_{ij} = \phi(x_i)\cdot \phi(x_j)$.
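
To see the construction concretely, here is a minimal numerical check following the proof, using the linear kernel on three made-up points:

```python
import numpy as np

# A small Gram matrix from the linear kernel, k_ij = x_i . x_j.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
K = X @ X.T

# Follow the proof: eigendecompose, clip tiny negative eigenvalues
# caused by round-off, and set B = Q Lambda^{1/2}.
lam, Q = np.linalg.eigh(K)
B = Q @ np.diag(np.sqrt(np.clip(lam, 0.0, None)))

# The rows of B are the feature vectors phi(x_i): B B^T recovers K.
assert np.allclose(B @ B.T, K)
```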

We now have a way to identify when a matrix $k$ is a kernel matrix. There are some standard choices of kernel functions one could try, which produce positive semidefinite matrices whenever all points $x_i$ are distinct (sketched in code after the list):

  1. $k(x,y) = e^{-\gamma |x-y|^2}$, called the Radial Basis Function (RBF) kernel
  2. $k(x,y) = (\gamma x \cdot y + r)^d$, polynomial
  3. $k(x,y) = x \cdot y$, linear
  4. $k(x,y) = \tanh(\gamma x \cdot y + r)$, sigmoidal (not positive semidefinite for all choices of $\gamma$ and $r$)
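
A sketch of the first three as functions on data matrices, with a quick numerical check of positive semidefiniteness (the parameter defaults are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf(X, Y, gamma=1.0):
    """k(x, y) = exp(-gamma |x - y|^2)."""
    return np.exp(-gamma * cdist(X, Y, "sqeuclidean"))

def polynomial(X, Y, gamma=1.0, r=1.0, d=3):
    """k(x, y) = (gamma x . y + r)^d."""
    return (gamma * X @ Y.T + r) ** d

def linear(X, Y):
    """k(x, y) = x . y."""
    return X @ Y.T

# Positive semidefiniteness check on a Gram matrix of distinct points.
X = np.random.default_rng(1).normal(size=(20, 2))
K = rbf(X, X)
assert np.linalg.eigvalsh(K).min() > -1e-10
```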

Definition

We call a function $k(x,y)$ a kernel function if there is a mapping $\phi$ such that $k(x,y) = \phi(x) \cdot \phi(y)$.

Theorem (properties)

Suppose $k_1,k_2$ are kernel functions. Then

  1. For any constant $c \geq 0$, $c k_1$ is a kernel function.
  2. For any scalar function $f$, $k(x,y) = f(x)f(y)k_1(x,y)$ is a kernel function.
  3. $k_1 + k_2$ is a kernel function.
  4. $k_1k_2$ is a kernel function.

Let me cheat a bit

As we have noted, the perceptron converges in finite time only if the set is linearly separable; otherwise it never terminates. The way to handle non-separable data is to introduce a "cost function" that penalizes misclassifications; the goal is then to minimize the total cost. The perceptron becomes the Support Vector Machine in this case, with the so-called "hinge loss" $$ \max(0,1-w \cdot x_i l_i). $$ This means that if $w \cdot x_i l_i \geq 1$ (which is the requirement in the perceptron theorem) we have 0 cost, but if we are closer to the plane $w \cdot x = 0$, or on the wrong side of it, there is a cost growing linearly in the margin violation $1 - w \cdot x_i l_i$.
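
As an illustration, a bare-bones subgradient descent on the average hinge loss might look as follows (a minimal sketch with no regularization term, which a real SVM would include):

```python
import numpy as np

def hinge_loss(w, X, l):
    """Average hinge loss max(0, 1 - (w . x_i) l_i) over the sample."""
    return np.maximum(0.0, 1.0 - (X @ w) * l).mean()

def svm_subgradient(X, l, lr=0.1, epochs=100):
    """Plain subgradient descent on the average hinge loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = (X @ w) * l
        active = margins < 1.0                    # points with nonzero loss
        grad = -(X[active] * l[active][:, None]).sum(axis=0) / len(l)
        w -= lr * grad
    return w
```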

Consider the problem of differentiating between the following handwritten digits, where $c^\ast$ is the set of digits greater than or equal to 5.

This is really interesting, right? With just a linear classifier we can distinguish digits less than 5 from digits greater than or equal to 5 at up to 88% accuracy...

But, as we discussed, different kernels might improve things. A famous one is the Radial Basis Function kernel from the list above. Let's try it.
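
The original cells and figures are not reproduced here; a sketch along the same lines, using scikit-learn's small 8x8 digits dataset as a stand-in (so the exact accuracies will differ from those quoted here), could be:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, (digits.target >= 5).astype(int)  # c* = digits >= 5
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Compare a linear separator with the RBF kernel.
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))
```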

The Radial Basis Function Kernel

This is borderline crazy: how can we distinguish between these digits at 98% accuracy? I will leave that for you to think about. But the fact is that it does work well.

The underlying problem

Consider a probability space $(\Omega,P,F)$, where $\Omega$ is the sample space, $P$ is the probability measure, and $F$ is the $\sigma$-algebra of events. In the theory of classifiers we often use the idea that $\Omega$ is split into two parts: one part corresponding to class $1$ and one part corresponding to class $-1$ (or $0$). Let us call the subset of $\Omega$ that is labeled as class $1$ the target concept.

The target concept $c^\ast \subset \Omega$ is the set of outcomes which are labeled $1$; all others are labeled $-1$ (or zero).

Often the sample space $\Omega \subset \mathbb{R}^d$, and as such each choice of weights $\hat w$ produces a split of $\Omega$ into two sets: the part of $\Omega$ on one side of the plane and the part on the other side. When we work with kernel functions we can have a non-linear separating boundary. The point is that we can abstract our idea of a model as being a set.

A model in our view, i.e. a choice of weights for instance, is the same as a set in $\Omega$; we denote this set by $h \subset \Omega$.

If $x \in h$ we say that $h$ predicts $x$ as being in $c^\ast$, but it can of course be wrong. The error we are interested in is the misclassification rate, with two kinds of mistakes:

False Positive (FP): $x \in h$ but $x \not \in c^\ast$; False Negative (FN): $x \not \in h$ but $x \in c^\ast$.

The set of misclassified samples can be written as the symmetric difference of $h$ and $c^\ast$, i.e.

$$ h \Delta c^\ast = (h \setminus c^\ast) \cup (c^\ast \setminus h) $$

The true error rate that we are interested in is the probability of a misclassification, i.e.

$$ P(h \Delta c^\ast) = P(\text{"h classifying x wrongly, i.e. FP or FN"}) $$

This is what we want, but remember that we build $h$ on a training set $S$. Perhaps we can find a small training error, which would be

$$ err_S(h) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{h \Delta c^\ast}(x_i), \quad S = \{x_1,\ldots,x_n\}. $$

Understanding model fit and generalization

So far we have only considered quite simple models; we did however touch upon polynomial regression back in 12.ipynb. The point to make here is that we can make the ansatz for the regression function $r(x) = E(Y \mid X=x)$ arbitrarily complicated, think for instance of polynomials of arbitrarily high order.

Let us highlight the problem with a simple example of polynomial regression using scikit-learn.

Let us assume that $r(x) = -x^2$ and that $Y \mid X \sim N(r(x),0.1^2)$.

Ok, so our goal is to recover the function $r(x)=-x^2$, but we do not know beforehand which power it is, so we have to make an ansatz. Perhaps our ansatz is that it is a linear, second, or third order polynomial. Let's try all three and see what happens.
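
The original cell is not shown; a sketch that reproduces the experiment might look like this (sample sizes, the train-test split, and the random seed are our own choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(20, 1))
y = -x[:, 0] ** 2 + rng.normal(0, 0.1, size=20)   # Y | X ~ N(-x^2, 0.1^2)

x_train, y_train = x[:10], y[:10]                 # train on half the data
x_test, y_test = x[10:], y[10:]                   # hold out the rest

grid = np.linspace(-1, 1, 200).reshape(-1, 1)
plt.scatter(x_train, y_train, label="train")
plt.scatter(x_test, y_test, color="red", label="test")
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    plt.plot(grid, model.predict(grid), label=f"degree {degree}")
plt.legend()
plt.show()
```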

Since our dataset is not very large (we trained on only 10 points), the estimation problem will be noisy. But what we can tell from the picture above is that the green curve does best on the red points, which are the test data. This is called the problem of generalization. Let us formalize it as follows.

Formalizing the problem

  1. Our training set $S$ consists of points drawn IID from a distribution $D$
  2. Our objective is to predict well on new points that are also drawn IID from $D$.

With this statement of the problem it becomes clear that if we can simulate the training-test split, train a model on the training data and test on the testing data, then we can get a measure of the generalization performance of the model itself on the distribution $D$.

The issue of overfitting and what to do about it?

To analyze overfitting we have to dig into the ideas of learning theory. As such we need some definitions.

A hypothesis class $\mathcal H$ over $\Omega$ is a collection of subsets of $\Omega$, called hypotheses or concepts.

Example

In the case of a linear classifier we have $$\mathcal{H} = \{\{x \in \Omega: w \cdot x \geq 0\}: w \in \mathbb R^d\}.$$ That is, the set of all half-spaces in $\Omega \subset \mathbb R^d$ (with boundary through the origin, thanks to the augmented coordinate).

Uniform convergence

The goal of our algorithm can now be stated as the problem of finding an $h \in \mathcal H$ such that the true error is small, i.e. $P(h \Delta c^\ast)$ is small.

The idea for us today is to start with a finite $\mathcal H$, why?

Three points in general position can be split in every possible way, so for 3 points a linear classifier realizes all $2^3$ labelings.

For 4 points we can no longer realize every possible labeling; for instance, a labeling of four points in convex position where the two diagonally opposite pairs get opposite labels cannot be produced. Therefore fewer than $2^4$ labelings are realizable. In fact there is a lemma, the Sauer–Shelah lemma, which states (for our linear classifiers in $\mathbb R^2$) that the number $|\mathcal H_n|$ of realizable labelings of $n$ points satisfies $$ |\mathcal H_n| \leq n^3 + 1 $$

So, with this motivation, let us continue with a finite $\mathcal H$.

We start with Theorem 5.4 from the book, rephrased in our language.

Theorem [Finite hypothesis class]

Let $(\Omega,P,F)$ be a probability space, and let $\mathcal{H}$ be a finite hypothesis class. Fix $\epsilon > 0$ and let $S = \{X_1,\ldots,X_n\}$ be IID sampled from $P$. Define $\mathcal{H}_\epsilon = \{h \in \mathcal{H} : err(h) \geq \epsilon\}$. Then $$P(\{\exists h \in \mathcal{H}_\epsilon : err_S(h) = 0\}) \leq |\mathcal{H}|(1-\epsilon)^n$$ or equivalently $$P(\{err_S(h) > 0, \forall h \in \mathcal{H}_\epsilon\}) \geq 1-|\mathcal{H}|(1-\epsilon)^n.$$ If $$n \geq \frac{1}{\epsilon}(\ln |\mathcal H| + \ln(1/\delta)) $$ then $$P(\{err_S(h) > 0, \forall h \in \mathcal{H}_\epsilon\}) \geq 1-\delta.$$

In words: for a hypothesis $h$ with large true error, if the training set is large enough, there is a high chance that the training set will contain some misclassified items.

Proof

The idea is as follows. If we know that $err(h) = P(h \Delta c^\ast) \geq \epsilon$, then define the random variable $Y(X) = \mathbf{1}_{h \Delta c^\ast}(X)$ for $X \sim P$. This r.v. $Y$ is either 1 or 0, and we know that $$ P(Y = 1) = P(X \in h\Delta c^\ast) = P(h \Delta c^\ast) \geq \epsilon, $$ thus $Y \sim Bernoulli(\theta)$ where $\theta \geq \epsilon$. We can do the same for the entire sample $S$ and define a $Y_i$ for each $X_i$; since the $X_i$ are IID, the $Y_i$ are IID, and thus $$ P(err_S(h) = 0) = P(\{Y_i = 0: \forall i = 1,\ldots,n\}) = P(Y_1 = 0)^n \leq (1-\epsilon)^n. $$ We can now do the following union bound: $$ P(\cup_{h \in \mathcal{H}_\epsilon} \{err_S(h) = 0\}) \leq \sum_{h \in \mathcal{H}_\epsilon}P(err_S(h) = 0) \leq |\mathcal{H}_\epsilon|(1-\epsilon)^n \leq |\mathcal{H}|(1-\epsilon)^n. $$ This proves the first statement; the second follows by taking the complement of the first. The last statement follows from plugging in the bound for $n$ and doing some simple estimation (see the book).
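
As a sanity check of the single-hypothesis step $P(err_S(h) = 0) \leq (1-\epsilon)^n$, here is a quick simulation (the parameter values are arbitrary, and we take $\theta = \epsilon$ exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n, trials = 0.1, 50, 100_000

# Y[t, i] = 1 when h misclassifies X_i in trial t; theta = err(h) = eps.
Y = rng.random((trials, n)) < eps
frac_zero_training_error = (Y.sum(axis=1) == 0).mean()
print(frac_zero_training_error, (1 - eps) ** n)   # simulated vs bound
```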

Suppose we had a way of finding, for each $S$, an $h_S$ such that $err_S(h_S) = 0$. Think of the perceptron, for instance.

Then we can say that $$P(err(h_S) > \epsilon) \leq P(\{\exists h \in \mathcal{H}_\epsilon : err_S(h) = 0\}).$$ Note that we have inequality because the event $\{err(h_S) > \epsilon\} \subset \{\exists h \in \mathcal{H}_\epsilon : err_S(h) = 0\}$. The conclusion is the following.

Theorem [PAC Learnability of Empirical Risk Minimization]

Let $(\Omega,P,F)$ be a probability space, and let $\mathcal{H}$ be a finite hypothesis class. Fix $\epsilon > 0$ and let $S = \{X_1,\ldots,X_n\}$ be IID sampled from $P$. Further assume that we have a way to construct $h_S \in \mathcal{H}$ such that $err_S(h_S) = 0$. Then $$P(err(h_S) > \epsilon) \leq |\mathcal{H}|(1-\epsilon)^n.$$ Again, if $$n \geq \frac{1}{\epsilon}(\ln |\mathcal H| + \ln(1/\delta)) $$ then $$P(err(h_S) \leq \epsilon) \geq 1-\delta.$$

That is, with high probability the method which minimizes the training error has good true error. This of course assumes that there is an $h \in \mathcal{H}$ such that $err(h) = 0$, which is not always the case; take the example of a curved decision boundary when our hypothesis space consists of linear boundaries. This can happen if our kernel choice does not match the true $c^\ast$. In that case the above argument does not work and we have to use other tools.

Theorem (Hoeffding bounds)

Let $X_1,\ldots, X_n$ be IID Bernoulli$(\theta)$ r.v.'s and let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then for any $0 \leq \alpha \leq 1$ $$P(\overline{X}_n > \theta+\alpha) \leq e^{-2n\alpha^2}, $$ $$P(\overline{X}_n < \theta-\alpha) \leq e^{-2n\alpha^2}. $$

We will not prove this now, but using this we can get the following result concerning uniform convergence of the error.
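
First, though, a quick simulation check of the Hoeffding bound itself (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, alpha, trials = 0.5, 100, 0.1, 100_000

X_bar = rng.binomial(n, theta, trials) / n        # mean of n Bernoullis
print((X_bar > theta + alpha).mean())             # simulated tail probability
print(np.exp(-2 * n * alpha ** 2))                # Hoeffding bound
```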

Theorem (Uniform convergence)

Let $(\Omega,P,F)$ be a probability space, and let $\mathcal{H}$ be a finite hypothesis class. Fix $\epsilon > 0$ and let $S = \{X_1,\ldots,X_n\}$ be IID sampled from $P$. Then $$P(\{|err(h) - err_S(h)| \leq \epsilon, \forall h \in \mathcal{H}\}) > 1-2|\mathcal{H}|e^{-2n\epsilon^2}$$

This tells us that the convergence in probability has a uniform rate!!!

Proof

Fix a hypothesis $h \in \mathcal{H}$ and define $Y_i$ as in the proof of the finite hypothesis class theorem; then $Y_i \sim$ Bernoulli$(\theta)$ with $\theta = err(h)$, being $1$ if $h$ makes a mistake on $X_i$ and $0$ if not. For a single hypothesis we thus have $err_S(h) = \overline{Y}_n$, and the Hoeffding bound gives us $$ P(\overline{Y}_n > err(h) + \epsilon) \leq e^{-2n\epsilon^2}, $$ $$ P(\overline{Y}_n < err(h) - \epsilon) \leq e^{-2n\epsilon^2}, $$ i.e. $$ P(|err_S(h) - err(h)| > \epsilon) \leq 2 e^{-2n\epsilon^2}. $$ Since this holds for each single hypothesis we can again use the union bound to get $$ P(\{\exists h \in \mathcal{H} : |err_S(h) - err(h)| > \epsilon\}) \leq 2 |\mathcal{H}| e^{-2n\epsilon^2}, $$ and taking the complement event we get the result of the theorem.

The case of infinite hypothesis spaces

How do we generalize this to infinite hypothesis classes?

The first thing we are going to do is to define data-dependent equivalence classes.

We start with a symmetrization lemma by Vapnik and Chervonenkis, in the book you can find this hidden in the proof of Theorem 5.14.

Symmetrization Lemma (Vapnik, Chervonenkis, 1971): Let $S$ and $S'$ be two independent samples of size $n$ from $P$. Then for $n \epsilon^2 \geq 2$ $$ \mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|err(h) - err_S(h)\right| > \epsilon\right] \leq 2\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|err_S(h) - err_{S'}(h)\right| > \frac{\epsilon}{2}\right] $$

Now by the union bound we get $$ \begin{aligned} \mathbb{P}\left[\sup_{h \in \mathcal{H}_{S\cup S'}}\left|err_S(h) - err_{S'}(h)\right|>\frac{\epsilon}{2}\right] \leq s(\mathcal{H},N) \sup_{h \in \mathcal{H}} \mathbb{P}\left[\left|err_S(h) - err_{S'}(h)\right|>\frac{\epsilon}{2}\right] \end{aligned} $$ where $\mathcal{H}_D$ denotes the set of distinct labelings of a point set $D$ realized by hypotheses in $\mathcal{H}$, and $s(\mathcal{H},N)$ is an upper bound on $|\mathcal{H}_{S \cup S'}|$, called $\pi_{\mathcal H}(n)$ in the book.

Growth function (shattering number)

The largest size of $\mathcal{H}_D$ over point sets $D$ of size $N$ is called the shattering number of $\mathcal{H}$ given $N$: $$s(\mathcal{H},N) = \sup_{x_1,\ldots, x_N} |\mathcal{H}_{\{x_1,\ldots,x_N\}}|$$

Note: this is a combinatorial quantity and does not depend on $P$.

Can we bound $s(\mathcal{H},N)$?

Vapnik Chervonenkis dimension

Def: We say that $\mathcal{H}$ shatters a set $D = \{x_i,i=1,\ldots, N\}$ if for any disjoint split $D_1,D_{-1}$ of $D$ we can find $h \in \mathcal{H}$ such that $h(D_1) = 1$, $h(D_{-1})=-1$.

Def: The VC-dimension of $\mathcal{H}$, denoted by $\text{VC-dim}(\mathcal{H})$ , equals the largest integer $n$ such that there exists a set of cardinality $n$ that is shattered by $\mathcal{H}$.

Sauer–Shelah lemma (1972)

Let $k = \text{VC-dim}(\mathcal{H})$. Then for $N > 0$ we have $$ s(\mathcal{H},N) \leq \sum_{i=0}^{k} {N \choose i} $$

This is polynomial in $N$: $$s(\mathcal{H},N) \leq \left ( \frac{Ne}{k}\right )^k \quad \text{for } N \geq k. $$

Proof: Let $\lambda = k/N$ and assume $\lambda \leq 1/2$ (for $\lambda > 1/2$ the bound is immediate, since then $(eN/k)^k \geq 2^N$). Then $$ \begin{aligned} 1 &= (\lambda + (1-\lambda))^N \\ &\geq \sum_{i=0}^{\lambda N} {N \choose i} \lambda^i(1-\lambda)^{N-i} \\ &\geq \sum_{i=0}^{\lambda N} {N \choose i} \left (\frac{\lambda}{1-\lambda} \right )^{\lambda N} (1-\lambda)^N, \end{aligned} $$ where the last step uses that $\lambda/(1-\lambda) \leq 1$ and $i \leq \lambda N$. Thus $$ \begin{aligned} \sum_{i=0}^{\lambda N} {N \choose i} &\leq e^{-N(\lambda \log \lambda + (1-\lambda)\log(1-\lambda))} \\ &\leq e^{N(\lambda-\lambda \log \lambda)} \\ &= \left ( \frac{e N}{\lambda N} \right )^{\lambda N}, \end{aligned} $$ where the second inequality uses $-(1-\lambda)\log(1-\lambda) \leq \lambda$. With $k = \lambda N$ we have our result.

Putting it all together

$$ \begin{aligned} \mathbb{P}&\left[\sup_{h \in \mathcal{H}}\left|err(h) - err_{S}(h)\right| > \epsilon\right] \\ &\leq 2\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|err_{S}(h) - err_{S'}(h)\right| > \frac{\epsilon}{2}\right] \quad \text{(symmetrization lemma)}\\ &\leq 2 s(\mathcal{H},2 N) \sup_{h \in \mathcal{H}} \mathbb{P}\left[\left|err_{S}(h) - err_{S'}(h)\right|>\frac{\epsilon}{2}\right] \quad \text{(union bound over } \mathcal{H}_{S \cup S'})\\ &\leq 2 s(\mathcal{H},2 N) \sup_{h \in \mathcal{H}} \left ( \mathbb{P}\left[\left|err(h) - err_{S}(h)\right|>\frac{\epsilon}{4}\right] + \mathbb{P}\left[\left|err(h) - err_{S'}(h)\right|>\frac{\epsilon}{4}\right] \right ) \quad \text{(triangle inequality)}\\ &\leq 8 s(\mathcal{H}, 2N)\exp \left (-\frac{N\epsilon^2}{8} \right ) \quad \text{(Hoeffding's inequality)} \end{aligned} $$

Finally we have proved the VC inequality

VC inequality (1971)

$$ \begin{aligned} \mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|err(h) - err_S(h)\right| > \epsilon\right] \leq 8 s(\mathcal{H},2N) \exp \left (-\frac{N\epsilon^2}{8} \right ) \end{aligned} $$

Choosing the value of $\epsilon$ wisely, we get the following.

VC generalization bound

Thus we can for any $\delta$ find that with probability $(1-\delta)$ the following estimate holds

$$ err(h) \leq err_{S} + \sqrt{\frac{8 \ln(s(\mathcal{H},2N))+8\ln\frac{8}{\delta}}{N}} $$

for any $h \in \mathcal{H}$.

Using the bound on the growth function when we have a fixed VC dimension, the above bound becomes

VC generalization bound

We can for any $\delta$ find that with probability $(1-\delta)$ the following estimate holds, with $k = \text{VC-dim}(\mathcal{H})$: $$ err(h) \leq err_{S} + \sqrt{\frac{8 k\ln\left ( \frac{2Ne}{k}\right )+8\ln\frac{8}{\delta}}{N}} $$ We need $k \lesssim N$ in order for the estimate to be useful.

We can plot how many data-points are needed just to make the quantity on the right hand side less than $1$.
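
The original plot is not shown, but a sketch computing the required $N$ for a few VC dimensions might look like this (with $\delta = 0.05$ as an arbitrary choice):

```python
import numpy as np

def vc_bound(N, k, delta=0.05):
    """Second term of the VC generalization bound for VC-dimension k."""
    return np.sqrt((8 * k * np.log(2 * N * np.e / k)
                    + 8 * np.log(8 / delta)) / N)

# Smallest N (on a log grid) making the bound drop below 1.
for k in (3, 10, 100):
    N = next(n for n in np.logspace(1, 8, 200).astype(int)
             if vc_bound(n, k) < 1)
    print(k, N)
```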

The above implies that for a model with many parameters we need a vast amount of data points. Thus these theoretical guarantees are quite loose.

Testing true error

In the context of the perceptron and our ideas of train-test splits, we came across the general problem that, given a dataset, we find the best-fitting model to reduce an error rate. The question on our minds: how well does this model perform in general, i.e. what is the true error rate? We showed that if the class of models is simple, then with high probability the training error rate is close to the true error rate.

Question: What happens if the model is complicated?

Question: The inequalities that we used can be quite imprecise due to the general nature of the problem. Is there a way to perform statistical tests instead?

General problem

Call the training set $\mathcal{T}$ and the test set $\mathcal{V}$. We use the training data and some learning algorithm to obtain our estimated classifier $\hat h$. Then we use the test data to estimate the true error of $\hat h$, i.e. we compute $$ \hat L(\hat h) = \frac{1}{m} \sum_{(X_i,l_i) \in \mathcal{V}} I(\hat h(X_i) \neq l_i) $$ where $m = |\mathcal{V}|$.

Note that this is just another way to write what we had before, where we looked at the symmetric difference between sets instead. This just measures the fraction of labels $\hat h$ gets wrong over the test set $\mathcal{V}$.

It should be noted that this is the proper way of estimating the true error, since we are not testing on any samples that we trained on. The reason we separate them completely is that we care about prediction, so we should measure the error on "new" data.

Question: how do we estimate $err(h_S)$?

k-fold cross validation

  1. Randomly divide the data into $K$ chunks of approximately equal size.
  2. For $k=1$ to $K$, do the following:
    1. Delete chunk $k$ from the data
    2. Compute the "classifier" $\hat h_{(k)}$ from the rest of the data
    3. Use $\hat h_{(k)}$ to predict the data in chunk $k$. Let $\hat L_{(k)}$ denote the observed error rate
  3. Let $$ \hat L = \frac{1}{K}\sum_{k=1}^K \hat L_{(k)} $$

Let's try it!

The following dataset comes from the Coronary Risk-Factor Study (CORIS). It involves 462 males between the ages of 15 and 64 from three rural areas in South Africa (Coronary risk factor screening in three rural communities. The CORIS baseline study, J E Rossouw, J P Du Plessis, A J Benadé, P C Jordaan, J P Kotzé, P L Jooste, J J Ferreira). The outcome variable is the presence ($Y=1$) or absence ($Y=0$) of coronary heart disease. There are 9 covariates.
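
The original analysis cell is not shown; a sketch of 10-fold cross-validation with logistic regression as the classifier might look like this (the file name coris.csv and its column layout are hypothetical, and logistic regression is our stand-in for whatever classifier the notebook used):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical file layout: 9 covariate columns plus the response
# column "chd" (presence/absence of coronary heart disease).
df = pd.read_csv("coris.csv")
X, y = df.drop(columns="chd").values, df["chd"].values

def cv_error(X, y, K=10):
    """K-fold cross-validation estimate of the misclassification rate."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True,
                                     random_state=0).split(X):
        h = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        errors.append(1 - h.score(X[test_idx], y[test_idx]))  # hat L_(k)
    return np.mean(errors)

print(cv_error(X, y))
```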

Big whoop! How does this help me?

Good question. You can use this in different ways

  1. Provide an estimate of the true error of our model
  2. Use as a way to select between different models
    1. We could in each fold train many types of models and compute the mean of all the errors per type
    2. Choose the model class / type that has the smallest mean error.

Question: how does this differ from the earlier setting, where we considered the model as fixed and tested it on held-out test data?

Using it to select models

Let's revisit the problem above with the CORIS dataset and see how many features we should include.
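
The original cell is not shown; reusing cv_error and the data from the sketch above, one hypothetical version is the following (adding covariates in the order they appear in the dataset, which is a simplification of a proper feature-selection procedure):

```python
import matplotlib.pyplot as plt

# CV error as a function of the number of included covariates.
n_features = range(1, X.shape[1] + 1)
errors = [cv_error(X[:, :p], y) for p in n_features]

plt.plot(n_features, errors, marker="o")
plt.xlabel("number of features")
plt.ylabel("CV error")
plt.show()
```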

We can now use the above plot to find that the best choice is 7 features.

Question: Can we use the resulting smallest error as an estimate for the total error for that model?

It should be fairly clear from the construction that this gives a downward-biased estimate of the true error, so if we do model selection as above, we need new data to perform the final error estimate.

This is basically a combination of the initial train-test split and the cross-validation method used to select the model.