Introduction to Data Science: A Comp-Math-Stat Approach

1MS041, 2021

©2021 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

11. Non-parametric Estimation and Testing

Topics

Inference and Estimation: The Big Picture

The Big Picture is about inference and estimation, and especially inference and estimation problems where computational techniques are helpful.

                         Point estimation            Set estimation                 Hypothesis Testing

Parametric               MLE of finitely many        Asymptotically Normal          Wald Test from
                         parameters (done)           Confidence Intervals (done)    Confidence Interval (done)

Non-parametric           about to see ...            about to see ...               about to see ...
(infinite-dimensional
parameter space)

So far we have seen parametric models, for example

In all these cases the parameter space (the space within which the parameter(s) can take values) is finite dimensional:

For parametric experiments, we can use the maximum likelihood principle and estimate the parameters using the Maximum Likelihood Estimator (MLE), for instance.

Non-parametric estimation

What if we don't know what the distribution function (DF) is? Then we are not trying to estimate some fixed but unknown parameter $\theta^*$ for an RV we assume to be $Bernoulli(\theta^*)$; we are trying to estimate the DF itself. In real life, data does not come neatly labeled "I am a realisation of a $Bernoulli$ RV" or "I am a realisation of an $Exponential$ RV": an important part of inference and estimation is to make inferences about the DF itself from our observations.

Observations from some unknown process

Consider the following non-parametric product experiment:

$$X_1, X_2, \ldots, X_n\ \overset{IID}{\sim} F^* \in \{\text{all DFs}\}$$

We want to produce a point estimate for $F^*$, which is allowed to be any DF ("lives in the set of all DFs"), i.e., $F^* \in \{\text{all DFs}\}$

Crucially, $\{\text{all DFs}\}$, i.e., the set of all distribution functions over $\mathbb{R}$ is infinite dimensional.

We have already seen an estimate, made using the data, of a distribution function: the empirical or data-based distribution function (or empirical cumulative distribution function). This can be formalized as the following process of adding indicator functions of the half-lines beginning at the data points $[X_1,+\infty),[X_2,+\infty),\ldots,[X_n,+\infty)$:

$$\widehat{F}_n (x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x)$$

where,

$$\mathbf{1}_{[X_i,+\infty)}(x) := \begin{cases} 1 & \text{ if } X_i \leq x \\ 0 & \text{ if } X_i > x \end{cases}$$
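As a quick sanity check, the indicator-sum definition of $\widehat{F}_n$ translates directly into plain Python (the sample xs below is made up for illustration):

```python
def edf(sample, x):
    """Empirical distribution function at x: the fraction of sample points <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

xs = [3, 1, 4, 1, 5]     # a made-up sample
print(edf(xs, 1))        # two of the five points are <= 1, so this is 0.4
print(edf(xs, 4.5))      # four of the five points are <= 4.5, so this is 0.8
```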

First let us evaluate a set of functions that will help us conceptualize faster:

Let us continue with the concepts

We can remind ourselves of this for a small sample of $de\,Moivre(k=5)$ RVs:

We can use the empirical cumulative distribution function $\widehat{F}_n$ as our non-parametric estimate, because this kind of estimation is possible in infinite-dimensional contexts due to the following two theorems:

Glivenko-Cantelli Theorem

Let $X_1, X_2, \ldots, X_n \overset{IID}{\sim} F^* \in \{\text{all DFs}\}$

and the empirical distribution function (EDF) is $\widehat{F}_n(x) := \displaystyle\frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x)$, then

$$\sup_x { | \widehat{F}_n(x) - F^*(x) | } \overset{P}{\rightarrow} 0$$

Remember that the EDF is a statistic of the data, a statistic is an RV, and (from our work on the convergence of random variables) $\overset{P}{\rightarrow}$ means "converges in probability". The proof is beyond the scope of this course, but we can gain an appreciation of what it means by looking at what happens to the ECDF for $n$ simulations from:

It is clear that, as $n$ increases, the ECDF $\widehat{F}_n$ gets closer and closer to the true DF $F^*$: $\displaystyle\sup_x { | \widehat{F}_n(x) - F^*(x) | } \overset{P}{\rightarrow} 0$.

This will hold no matter what the (possibly unknown) $F^*$ is. Thus, $\widehat{F}_n$ is a point estimate of $F^*$.

We need to add the DKW Inequality to be able to get confidence sets, or a 'confidence band', that traps $F^*$ with high probability.

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality

Let $X_1, X_2, \ldots, X_n \overset{IID}{\sim} F^* \in \{\text{all DFs}\}$

and the empirical distribution function (EDF) is $\widehat{F}_n(x) := \displaystyle\frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x)$,

then, for any $\varepsilon > 0$,

$$P\left( \sup_x { | \widehat{F}_n(x) - F^*(x) | } > \varepsilon \right) \leq 2 \exp(-2n\varepsilon^2) $$

We can use this inequality to get a $1-\alpha$ confidence band $C_n(x) := \left[\underline{C}_n(x), \overline{C}_n(x)\right]$ about our point estimate $\widehat{F}_n$ of our possibly unknown $F^*$, such that $F^*$ is 'trapped' by the band with probability at least $1-\alpha$.

$$\begin{aligned} \underline{C}_{\, n}(x) &=& \max \{ \widehat{F}_n(x)-\varepsilon_n, 0 \}, \notag \\ \overline{C}_{\, n}(x) &=& \min \{ \widehat{F}_n(x)+\varepsilon_n, 1 \}, \notag \\ \varepsilon_n &=& \sqrt{ \frac{1}{2n} \log \left( \frac{2}{\alpha}\right)} \\ \end{aligned}$$

and

$$P\left(\underline{C}_n(x) \leq F^*(x) \leq \overline{C}_n(x)\right) \geq 1-\alpha$$
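The band above is easy to compute. Here is a minimal sketch (our own helper, not the course's hidden code) that returns the band limits at a point where the EDF takes a given value:

```python
from math import sqrt, log

def dkw_band(edf_value, n, alpha=0.05):
    """1 - alpha DKW confidence band limits at a point where the EDF equals edf_value."""
    eps = sqrt(log(2.0 / alpha) / (2.0 * n))   # epsilon_n from the DKW inequality
    return max(edf_value - eps, 0.0), min(edf_value + eps, 1.0)

lo, hi = dkw_band(0.5, n=100)   # 95% band about an EDF value of 0.5 when n = 100
```

Note that the band is clipped to $[0,1]$, since a DF cannot leave that range.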

YouTry in class

Try this out for a simple sample from the $Uniform(0,1)$, which you can generate using random. First we will just make the point estimate for $F^*$, the EDF $\widehat{F}_n$

In one of the assessments, you did a question that took you through the steps for getting the list of points that you would plot for an empirical distribution function (EDF). We will do exactly the same thing here.

First we find the unique values in the sample, in order from smallest to largest, and get the frequency with which each unique value occurs:

Then we accumulate the frequencies to get the cumulative frequencies:

And then the relative cumulative frequencies:

And finally zip these up with the sorted unique values to get a list of points we can plot:
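The four steps above can be sketched in plain Python as follows (the variable names echo those used in this notebook, and sampleValues is a made-up sample):

```python
from itertools import accumulate

sampleValues = [0.3, 0.1, 0.3, 0.7, 0.1]      # a made-up sample
n = len(sampleValues)

# 1. sorted unique values and the frequency of each
sortedUniqueValues = sorted(set(sampleValues))
freqs = [sampleValues.count(v) for v in sortedUniqueValues]

# 2. accumulate the frequencies to get cumulative frequencies
cumFreqs = list(accumulate(freqs))

# 3. relative cumulative frequencies
cumRelFreqs = [cf / n for cf in cumFreqs]

# 4. zip these up with the sorted unique values to get plottable points
ecdfPoints = list(zip(sortedUniqueValues, cumRelFreqs))
print(ecdfPoints)    # [(0.1, 0.4), (0.3, 0.8), (0.7, 1.0)]
```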

Here is a function that you can just use to do an ECDF plot:

This makes the plot of the $\widehat{F}_{10}$, the point estimate for $F^*$ for these $n=10$ simulated samples.

What about adding those confidence bands? You will do essentially the same thing, but adjusting for the required $\varepsilon$. First we need to decide on an $\alpha$ and calculate the $\varepsilon$ corresponding to this alpha. Here is some of our code to calculate the $\varepsilon$ corresponding to $\alpha=0.05$ (95% confidence bands), using a hidden function calcEpsilon:
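The function calcEpsilon itself is hidden, but from the formula for $\varepsilon_n$ it might look something like this (a sketch only; the hidden implementation may differ):

```python
from math import sqrt, log

def calcEpsilon(alpha, n):
    """epsilon_n = sqrt((1/(2n)) * log(2/alpha)) from the DKW inequality."""
    return sqrt(log(2.0 / alpha) / (2.0 * n))

epsilon = calcEpsilon(0.05, 10)   # about 0.43 for alpha = 0.05 and n = 10
```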

See if you can write your own code to do this calculation, $\varepsilon_n = \sqrt{ \frac{1}{2n} \log \left( \frac{2}{\alpha}\right)}$. For completeness, do the whole thing: assign the value 0.05 to a variable named alpha, and then use this and the variable called n that we have already declared to calculate a value for $\varepsilon$. Call the variable to which you assign the value for $\varepsilon$ epsilon, so that it replaces the value we calculated in the cell above (you should get the same value as us!).

Now we need to use this to adjust the EDF plot. In the two cells below we first of all do the adjustment for $\underline{C}_{\,n}(x) =\max \{ \widehat{F}_n(x)-\varepsilon_n, 0 \}$, and then use zip again to get the points to actually plot for the lower boundary of the 95% confidence band.

Similarly, in the two cells below we do the adjustment for $\overline{C}_{\,n}(x) =\min \{ \widehat{F}_n(x)+\varepsilon_n, 1 \}$, and then use zip again to get the points to actually plot for the upper boundary of the 95% confidence band.

We carefully gave our ecdfPointsPlot function the flexibility to be able to plot bands, by having a colour parameter (which defaults to 'grey') and a lines_only parameter (which defaults to false). Here we can plot the lower bound of the confidence interval by adding ecdfPointsPlot(ecdfPointsUniformLower, colour='green', lines_only=true) to the previous plot:

YouTry

You try writing the code to create the list of points needed for plotting the upper band $\overline{C}_{\,n}(x) =\min \{ \widehat{F}_n(x)+\varepsilon_n, 1 \}$. You will need to first of all get the upper heights (call them say cumRelFreqsUniformUpper) and then zip them up with the sortedUniqueValuesUniform to get the points to plot.

Once you have done this you can add them to the plot by altering the code below:

(end of YouTry)


If we are doing lots of collections of EDF points we may as well define a function to do it, rather than repeating the same code again and again. We use an offset parameter to give us the flexibility to use this to make points for confidence bands as well.

NZ Earthquakes

Now we will try looking at the Earthquakes data we have used before to get a confidence band around an EDF for that. We start by bringing in the data and the function we wrote earlier to parse that data.

First check whether you have already unzipped the data/earthquakes.csv.zip file by dropping into a shell via %%sh.

There is a lot of data here, so let's use an interactive plot to do the non-parametric DF estimation just for some of the most recent data:

Plug-in Estimator: A nonparametric Point Estimator

A function whose argument is itself another function is called a functional. A statistical functional $T(F)$ is any function of the distribution function $F$.

Suppose we are not directly interested in estimating the unknown $F^*$ in the experiment with $n$ IID samples: $$X_1,\ldots,X_n \overset{IID}{\sim}F^* \in \{ \text{ all DFs } \}$$ But rather in estimating a statistical functional of interest, say, $\theta^* := T(F^*)$.

NOTE: Here $\theta^*$ is not a parameter, but just some unknown quantity of interest that can be obtained as a statistical functional $T(F^*)$ of the unknown $F^*$.

Some examples of statistical functionals you have already seen include:

By substituting in the point estimate $\widehat{F}_n$ for the unknown $F^*$, we can obtain the plug-in estimator as: $$ \boxed{ \widehat{\Theta}_n = T(\widehat{F}_n) \quad \text{ of the quantity of interest } \, \theta^* := T(F^*) } $$ In other words, just plug-in $\widehat{F}_n$ for the unknown $F^*$ in $T(F^*) =: \theta^*$.

If $T(F)=\int r(x) \, d F(x)$ for some function $r(x)$ then $T$ is called a linear functional since $T$ is linear in its arguments: $T(aF+bG)=aT(F)+bT(G)$, where $(a,b) \in \mathbb{R}^2$.

Thus, the plug-in estimator for any linear statistical functional $\theta^* = T(F^*)=\int r(x) dF^*(x)$ is:

$$ \boxed{\widehat{\Theta}_n = T(\widehat{F}_n) = \frac{1}{n} \sum_{i=1}^n r(x_i)} $$

because: $$ \widehat{\Theta}_n = T(\widehat{F}_n) = \int r(x) d \widehat{F}_n(x) = \sum_{i=1}^n r(x_i) \, P(X=x_i) = \sum_{i=1}^n r(x_i) \, f(x_i) = \sum_{i=1}^n r(x_i) \, \frac{1}{n} = \frac{1}{n} \sum_{i=1}^n r(x_i) $$
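For instance, with $r(x)=x$ the linear functional $T(F)=\int x\,dF(x)$ is the expectation, and its plug-in estimate is the sample mean. A minimal sketch with made-up data:

```python
def plug_in_linear(data, r):
    """Plug-in estimate of the linear functional T(F) = integral of r(x) dF(x)."""
    return sum(r(x) for x in data) / len(data)

data = [1.0, 2.0, 3.0, 4.0]                               # made-up sample
mean_hat = plug_in_linear(data, lambda x: x)              # sample mean: 2.5
second_moment_hat = plug_in_linear(data, lambda x: x**2)  # sample second moment: 7.5
```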

Sample mean and sample variance are actually examples of plug-in estimators of the statistical functionals $E(X)$ and $V(X)$ (the mean $E(X)$ is a linear functional, while the variance $V(X) = E(X^2) - (E(X))^2$ is a function of linear functionals)

Quantiles generalise the notion of a median and are very useful statistics when we want to summarise the values taken by a random variable. For a given $q \in (0,1)$, the $q$-th quantile of a random variable with DF $F$ is the statistical functional: $$T(F)={F}^{[-1]}(q) := \inf\{x: F(x) \geq q\}$$ We can use the plug-in estimator: $$\widehat{F}_n^{[-1]}(q) := \inf\{x: \widehat{F}_n(x) \geq q\}$$ to estimate $T(F)={F}^{[-1]}(q)$, and $\widehat{F}_n^{[-1]}(q)$ is called the $q$-th sample quantile
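The definition $\widehat{F}_n^{[-1]}(q) := \inf\{x: \widehat{F}_n(x) \geq q\}$ translates directly into code: walk the sorted data until the EDF first reaches $q$. A minimal sketch (made-up data):

```python
def sample_quantile(data, q):
    """q-th sample quantile: inf{x : EDF(x) >= q}."""
    xs = sorted(data)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if i / n >= q:        # the EDF jumps to i/n at the i-th order statistic
            return x
    return xs[-1]

sample_quantile([5, 1, 3, 2, 4], 0.5)   # returns 3, the sample median
```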

Example: Estimating quantiles of inter-earthquake times

We can simply call the quantile function knowing that it is just a plug-in estimator of the $q$-th quantile.

Quantiles and Percentiles

If $q \in [0,100]$ then we call the $(\frac{q}{100})$-th quantile the $q$-th percentile.

So only $5\%=0.05$ of the inter-EQ times are longer than 75 minutes. If we assume this non-parametric model, which treats the times between EQs as IID RVs, then the probability that two consecutive inter-EQ times each exceed 75 minutes, so that there are no earthquakes in New Zealand for longer than 150 minutes (during the observation period), is $0.05 \times 0.05 = 0.0025 = 1/400$, a pretty unlikely event!

Quantiles as plug-in estimates can give us a lot of basic insights into the data under the more general nonparametric view.

Bootstrap for Confidence Sets in Nonparametric Experiments

This is an adaptation of All of Statistics (2003) by Wasserman of Carnegie Mellon University, Pittsburgh, PA, USA (one of the first universities that started presenting an integrated research and educational view of mathematical statistics and machine learning, along both theoretical and applied fronts).

Plug-in estimators are point estimators. Next we will see a very general way to obtain set estimators of $\theta^* = T(F^*)$, some statistical functional of interest, when we do not make any parametric assumptions about the underlying unknown distribution function $F^*$ in our IID experiment:

$$X_1,\ldots,X_n \overset{IID}{\sim}F^* \in \{ \text{ all DFs } \}$$

NOTE: To get confidence sets (or intervals in 1D) we need to obtain estimated standard error $\widehat{se}_n$ of our point estimator, such as the plug-in estimator, $\widehat{\Theta}_n$ of $\theta^*=T(F^*)$. This is generally not possible. However, the bootstrap due to Efron gives us a helping hand.

The bootstrap is a statistical method for estimating standard errors and confidence sets of statistics, such as estimators.

Let $T_n := T_n((X_1,X_2,\ldots,X_n))$ be a statistic, i.e., any function of the data $X_1,X_2,\ldots,X_n \overset{IID}{\sim} F^*$.

Suppose we want to know its variance $V_{F^*}(T_n)$, which clearly depends on the fixed and possibly unknown DF $F^*$ (hence the subscripting by $F^*$ to emphasise this dependence).

If our statistic $T_n$ is one with an analytically unknown variance, then we can use the bootstrap to estimate it.

The bootstrap idea has the following two basic steps with their underlying assumptions:

For example, if $T_n=\overline{X}_n$, in $\mathsf{Step~1}$, $V_{\widehat{F}_n}(T_n) = s_n^2/n$, where $s_n^2=n^{-1} \sum_{i=1}^n (x_i-\overline{x}_n)^2$ is the sample variance and $\overline{x}_n$ is the sample mean. In this case, $\mathsf{Step~1}$ is enough. However, when the statistic $T_n$ is more complicated (e.g. the sample median $T_n=\widetilde{X}_n = \widehat{F}_n^{[-1]}(0.5)$), then we may not be able to find a simple expression for $V_{\widehat{F}_n}(T_n)$ and may need $\mathsf{Step~2}$ of the bootstrap.

$$ \boxed{ \begin{aligned} \text{Real World Data come from} & F^* \quad \implies X_1,X_2,\ldots,X_n & \implies T_n((X_1,X_2,\ldots,X_n))=t_n \notag \\ \text{Bootstrap World Data come from} & \widehat{F}_n \quad \implies X^{\bullet}_1,X^{\bullet}_2,\ldots,X^{\bullet}_n & \implies T_n((X^{\bullet}_1,X^{\bullet}_2,\ldots,X^{\bullet}_n))=t^{\bullet}_n \notag \end{aligned} } $$

Observe that drawing an observation from the empirical DF $\widehat{F}_n$ is equivalent to drawing one point at random from the original data (think of the indices $[n] := \{ 1,2,\ldots,n \}$ of the original data $X_1,X_2,\ldots,X_n$ being drawn according to the equi-probable $de~Moivre(1/n,1/n,\ldots,1/n)$ RV on $[n]$). Thus, to simulate $X^{\bullet}_1,X^{\bullet}_2,\ldots,X^{\bullet}_n$ from $\widehat{F}_n$, it is enough to draw $n$ observations with replacement from the dataset $\{X_1,X_2,\ldots,X_n\}$.

In summary, the algorithm for Bootstrap Variance Estimation is:

  • $\mathsf{Step~1}$: Draw $X^{\bullet}_1,X^{\bullet}_2,\ldots,X^{\bullet}_n \sim \widehat{F}_n$
  • $\mathsf{Step~2}$: Compute $t_n^{\bullet} = T_n((X^{\bullet}_1,X^{\bullet}_2,\ldots,X^{\bullet}_n))$
  • $\mathsf{Step~3}$: Repeat $\mathsf{Step~1}$ and $\mathsf{Step~2}$ $B$ times, for some large $B$, say $B>1000$, to get $t_{n,1}^{\bullet}, t_{n,2}^{\bullet},\ldots,t_{n,B}^{\bullet}$
  • $\mathsf{Step~4}$: Several ways of estimating the bootstrap confidence intervals are possible, we use the one based on percentiles:
    • The $1-\alpha$ percentile-based bootstrap confidence interval is: $$ C_n=[\widehat{G^{\bullet}}_{n}^{-1}(\alpha/2),\widehat{G^{\bullet}}_{n}^{-1}(1-\alpha/2)] , $$ where $\widehat{G^{\bullet}}_{n}$ is the empirical DF of the bootstrapped $t_{n,1}^{\bullet}, t_{n,2}^{\bullet},\ldots,t_{n,B}^{\bullet}$ and $\widehat{G^{\bullet}}_{n}^{-1}(q)$ is the $q^{\text{th}}$ sample quantile of $t_{n,1}^{\bullet}, t_{n,2}^{\bullet},\ldots,t_{n,B}^{\bullet}$.
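The four steps above can be sketched in plain Python (a minimal illustration with made-up data; the crude index-based quantiles here stand in for a proper sample-quantile routine):

```python
import random

def bootstrap_percentile_ci(data, statistic, B=1000, alpha=0.05, seed=0):
    """Percentile-based bootstrap 1 - alpha confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Steps 1-3: B resamples with replacement, computing the statistic for each
    boot_stats = sorted(
        statistic([rng.choice(data) for _ in range(n)]) for _ in range(B)
    )
    # Step 4: the alpha/2 and 1 - alpha/2 sample quantiles of the bootstrapped values
    return boot_stats[int(B * alpha / 2)], boot_stats[int(B * (1 - alpha / 2)) - 1]

data = [2.1, 0.4, 1.7, 3.3, 0.9, 2.8, 1.1, 0.6]   # made-up sample
sample_mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_percentile_ci(data, sample_mean)
```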

Confidence Interval via Bootstrap - Median of inter-EQ Times in Minutes

Use the bootstrap to obtain a 95% confidence interval for the median inter-earthquake time in minutes. Use this interval to test the null hypothesis that the median inter-earthquake time is 10 minutes.

Using numpy to bootstrap, a faster method

Testing via Bootstrap

We can use a bootstrap-based confidence interval in conjunction with the Wald test in nonparametric experiments, just as we used the confidence interval from an asymptotically normal estimator, such as the MLE, in parametric experiments.

The $\mathsf{size}$-$\alpha$ Wald test from the bootstrapped $1-\alpha$ Confidence Interval $C_n$ for a statistic $t$

The $\mathsf{size}$ $\alpha$ Wald test rejects:

$$ \boxed{ \text{ $H_0: t^*=t_0$ versus $H_1: t^* \neq t_0$ if and only if $t_0 \notin C_n := [\widehat{G^{\bullet}}_{n}^{-1}(\alpha/2),\widehat{G^{\bullet}}_{n}^{-1}(1-\alpha/2)]$. }} $$$$\boxed{\text{Therefore, testing the hypothesis is equivalent to verifying whether the null value $t_0$ is in the bootstrapped confidence interval.}}$$

Example: Wald Test that median of inter-EQ times is 10 minutes

We reject the null hypothesis that the median is $10$ minutes, if $10$ is not inside the 95% confidence interval for the median we just obtained via bootstrap. This is just like the Wald test except that we use the bootstrap instead of the asymptotic normality of the maximum likelihood estimator.

Let us see this more explicitly in the next cell.

Making a Generic Bootstrap Function for $1-\alpha$ Confidence Interval of any Statistic

Making a generic function makeBootstrappedConfidenceIntervalOfStatisticT as done below can be helpful in other problems.
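The notebook's own implementation lives in a code cell; a numpy-based sketch of such a generic function might look like this (our choices of parameter names and defaults; the original may differ):

```python
import numpy as np

def makeBootstrappedConfidenceIntervalOfStatisticT(dataset, statT, alpha=0.05, B=1000, seed=0):
    """Generic percentile-bootstrap 1 - alpha confidence interval for the statistic statT."""
    rng = np.random.default_rng(seed)
    data = np.asarray(dataset)
    # draw all B resamples at once: a (B, n) array of indices sampled with replacement
    resamples = data[rng.integers(0, data.size, size=(B, data.size))]
    bootStats = np.array([statT(row) for row in resamples])
    return (np.percentile(bootStats, 100 * alpha / 2),
            np.percentile(bootStats, 100 * (1 - alpha / 2)))

statTMedian = lambda dataset: np.percentile(dataset, 50.0)   # the statistic of interest
iq = np.array([3.0, 12.5, 7.1, 0.4, 22.8, 5.5, 9.9, 1.2])    # made-up inter-EQ times
lo, hi = makeBootstrappedConfidenceIntervalOfStatisticT(iq, statTMedian)
```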

Demonstrating the use of bootstrap for different statistics of interest

We can obtain the bootstrap-based $1-\alpha$ confidence intervals quite easily for any statistic and any dataset as demonstrated below.

Bootstrapped $1-\alpha$ Confidence Interval for Sample Median of inter-EQ Time in Minutes

All we need to do is follow these three steps to perform a bootstrap:

  1. define a lambda expression for the statistic of interest: statTMedian = lambda dataset : np.percentile(dataset,50.0)

CAUTION: Bootstrap is justified if two of its assumptions are satisfied:

Recall assumptions in the two basic steps in the Bootstrap from above (notebook 11.ipynb):

  1. $\mathsf{Step~1}$: $n$ is large enough, so that the empirical DF $\widehat{F}_n$ is a good approximation for the unknown $F^*$ in the model $X_1,X_2,\ldots,X_n \overset{IID}{\sim} F^*$, and

So it is a good idea to increase $B$ to $1000$ or perhaps more (depending on the problem and available computational resources) to see how the results are affected. Remember, the confidence intervals will change slightly due to the random seed in the simulation and your values for $n$ and $B$.

NOTE: When the dataset is much larger in size (order of billions or hundreds of billions) or not IID real-valued but perhaps dependent (eg. time series models) then there are other sub-sampling schemes that one may need to use under further assumptions on the model.

Sample Exam Problem 7

Obtain the plug-in estimate of the population median of inter-EQ times in minutes stored in the array iQMinutes. Using $B = 1000$ bootstrap replicates, obtain the 95% confidence interval for the median inter-EQ time in minutes. Using a Wald-like Test based on the bootstrapped 95% confidence interval, test the null hypothesis that the median inter-EQ time is 20 minutes (just report the finding of your test through the boolean RejectedH0_sampleMedianIs20min).

NOTE: Make Sure you run the REQUIRED-CELL before trying the Problem.

Sample Exam Problem 7 Solution

Correlation: A Bivariate Nonparametric Bootstrap

Here is a classical data set used by Bradley Efron at Stanford University's Statistics Department (the inventor of bootstrap) to illustrate the method.

The data are LSAT (Law School Admission Test in the USA) scores and GPA (grade point average) of just fifteen individuals. The inference task involved assisting the Admissions Office with their evidence-based investigations on their admissions policies.

Thus, we have bivariate data of the form $(Y_i,Z_i)$, where $Y_i={\rm LSAT}_i$ and $Z_i={\rm GPA}_i$. For example, the first individual had an LSAT score of $y_1=576$ and a GPA of $z_1=3.39$ while the fifteenth individual had an LSAT score of $y_{15}=594$ and a GPA of $z_{15}=3.96$.

We suppose that the bivariate data has the following IID bivariate model:

$$\boxed{(Y_1,Z_1),(Y_2,Z_2),\ldots,(Y_{15},Z_{15}) \overset{IID}{\sim} F^* \in \{ \text{all bivariate DFs} \}}$$

This is a bivariate nonparametric experiment and its bivariate data is plotted below.

Plug-in Point Estimate of the Correlation

The law school was interested in the correlation between the GPA and LSAT scores:

$$ \boxed{ \theta^* = \frac{\int \int (y-E(Y))(z-E(Z))dF(y,z)}{\sqrt{\int (y-E(Y))^2 dF(y) \int (z-E(Z))^2 dF(z)}} } $$

The plug-in estimate of the population correlation $\theta^*$ is the sample correlation: $$ \boxed{ \widehat{\Theta}_n = \frac{\sum_{i=1}^n(Y_i-\overline{Y}_n)(Z_i-\overline{Z}_n)}{\sqrt{\sum_{i=1}^n(Y_i-\overline{Y}_n)^2 \sum_{i=1}^n(Z_i-\overline{Z}_n)^2}} } $$
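The sample correlation above is straightforward to compute with numpy; here is a small sketch with made-up paired data (not the LSAT/GPA values):

```python
import numpy as np

def sample_correlation(y, z):
    """Plug-in (sample) correlation between paired samples y and z."""
    y, z = np.asarray(y, dtype=float), np.asarray(z, dtype=float)
    dy, dz = y - y.mean(), z - z.mean()
    return float((dy * dz).sum() / np.sqrt((dy**2).sum() * (dz**2).sum()))

y = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up paired data
z = [2.0, 1.0, 4.0, 3.0, 5.0]
r_hat = sample_correlation(y, z)  # agrees with np.corrcoef(y, z)[0, 1]
```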

The Bootstrapped $1-\alpha$ Confidence Interval of the Sample Correlation and a Test

In order to obtain the $1-\alpha$ confidence interval of the plug-in estimate of the population correlation, we can quickly tap into our generic function with a thousand bootstrapped data sets.

Then we can reject (or fail to reject) the null hypothesis that the true correlation coefficient is $0$ if $0$ is not contained in the bootstrapped 95% confidence interval (akin to a Wald Test, but with the nonparametric model).

Why is testing whether the correlation is 0 of interest?

Correlation Versus Causation - Proceed with Extreme Caution!

The following image from https://en.wikipedia.org/wiki/Correlation_and_dependence shows:

Several sets of (x, y) points, with the (Pearson's) correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.


Nonparametric Hypothesis Testing

Are two samples from the same distribution or not?

What if we are not interested in estimating $F^*$ itself, but instead in scientifically investigating whether two distributions are the same? For example, whether the distribution of earthquake magnitudes was the same in April as it was in March, or whether human subjects in a clinical trial are reacting differently to a new drug to treat some disease (typically you have a control group that does not get the treatment and a treated group).

You already have the means to do this by using a Wald test of, say, the difference in sample means of the two samples, or the correlation coefficient, etc., for parametric or nonparametric experiments.

We will next see a useful nonparametric approach to this problem.

Permutation Testing

A Permutation Test is a non-parametric exact method for testing whether two distributions are the same, based on samples from each of them. Analogues and variants of permutation testing are known as A/B Testing in industry today.

What do we mean by "non-parametric exact"? It is non-parametric because we do not impose any parametric assumptions. It is exact because it works for any sample size.

Formally, we suppose that: $$ X_1,X_2,\ldots,X_m \overset{IID}{\sim} F^* \quad \text{and} \quad X_{m+1}, X_{m+2},\ldots,X_{m+n} \overset{IID}{\sim} G^* \enspace , $$ are two sets of independent samples where the possibly unknown DFs $F^*,\,G^* \in \{ \text{all DFs} \}$.

(Notice that we have written it so that the subscripts on the $X$s run from 1 to $m+n$.)

Now, consider the following hypothesis test: $$H_0: F^*=G^* \quad \text{versus} \quad H_1: F^* \neq G^* \enspace . $$

Our test statistic uses the observations in both samples. We want a test statistic that is sensible for this test, i.e., one that will be large when $F^*$ is 'too different' from $G^*$

So, let our test statistic $T(X_1,\ldots,X_m,X_{m+1},\ldots,X_{m+n})$ be say: $$ T:=T(X_1,\ldots,X_m,X_{m+1},\ldots,X_{m+n})= \text{abs} \left( \frac{1}{m} \sum_{i=1}^m X_i - \frac{1}{n} \sum_{i=m+1}^{m+n} X_i \right) \enspace . $$

(In words, we have chosen a test statistic that is the absolute value of the difference in the sample means. Note the limitation of this: if $F^*$ and $G^*$ have the same mean but different variances, our test statistic $T$ will not be large.)

Then the idea of a permutation test is as follows:

$$ \text{P-value} = \mathbf{P}_0 \left( T \geq t_{obs} \right) = \frac{1}{N!} \left( \sum_{j=1}^{N!} \mathbf{1} (t_j \geq t_{obs}) \right), \qquad \mathbf{1} (t_j \geq t_{obs}) = \begin{cases} 1 & \text{if } \quad t_j \geq t_{obs} \\ 0 & \text{otherwise} \end{cases} $$

This will make more sense if we look at some real data.

Permutation Testing with Shell Data

In 2008, Guo Yaozong and Chen Shun collected data on the diameters of coarse venus shells from New Brighton beach for a course project. They recorded the diameters for two samples of shells, one from each side of the New Brighton Pier. The data is given in the following two cells.

NOTE - Real Data for Really Applied Statistics: Under my guidance and assistance, students collected the shells on either side of the pier, cleaned the shells with warm soap water, rinsed and treated with a diluted chlorine solution, used a taxonomy guide to identify the mollusc species and classified them manually. Then they focused on Dosinia anus or coarse venus shells. After the diluted chlorine treatment to disinfect microflora/fauna, the shells were washed, dried, placed on a graph paper, etched around the base with a mechanical pencil, and finally measured with a scale for the largest diameter within the outline.

$(115 + 139)!$ is a very big number. Let's start small, and take a subselection of the shell data to demonstrate the permutation test concept: the first two shells from the left of the pier and the first one from the right:

So now we are testing the hypotheses

$$\begin{array}{lcl}H_0&:& X_1,X_2,X_3 \overset{IID}{\sim} F^*=G^* \\H_1&:&X_1, X_2 \overset{IID}{\sim} F^*, \,\,X_3 \overset{IID}{\sim} G^*, F^* \neq G^*\end{array}$$

With the test statistic $$\begin{array}{lcl}T(X_1,X_2,X_3) &=& \text{abs} \left(\displaystyle\frac{1}{2}\displaystyle\sum_{i=1}^2X_i - \displaystyle\frac{1}{1}\displaystyle\sum_{i=2+1}^3X_i\right) \\ &=&\text{abs}\left(\displaystyle\frac{X_1+ X_2}{2} - \displaystyle\frac{X_3}{1}\right)\end{array}$$

Our observed data $x_{obs} = (x_1, x_2, x_3) = (52, 54, 58)$

and the realisation of the test statistic for this data is $t_{obs} = \text{abs}\left(\displaystyle\frac{52+54}{2} - \frac{58}{1}\right) = \text{abs}\left(53 - 58\right) = \text{abs}(-5) = 5$

Now we need to tabulate the permutations and their probabilities. There are $3! = 6$ possible permutations of three items. For larger samples, you could use the factorial function to calculate this:

We said that under the null hypothesis (the samples have the same DF) each permutation is equally likely, so each permutation has probability $\displaystyle\frac{1}{6}$.

There is a way in Python (the language under the hood in Sage) to get all the permutations of a sequence:

We can tabulate the permutations, their probabilities, and the value of the test statistic that would be associated with each permutation:

Permutation     $t$   $\mathbf{P}_0(T=t)$ (probability under null)
(52, 54, 58)     5    $\frac{1}{6}$
(52, 58, 54)     1    $\frac{1}{6}$
(54, 52, 58)     5    $\frac{1}{6}$
(54, 58, 52)     4    $\frac{1}{6}$
(58, 52, 54)     1    $\frac{1}{6}$
(58, 54, 52)     4    $\frac{1}{6}$

To calculate the P-value for our test statistic $t_{obs} = 5$, we need to look at how many permutations would give rise to test statistics that are at least as big, and add up their probabilities.

$$ \begin{array}{lcl}\text{P-value} &=& \mathbf{P}_0(T \geq t_{obs}) \\&=&\mathbf{P}_0(T \geq 5)\\&=&\frac{1}{6} + \frac {1}{6} \\&=&\frac{2}{6}\\ &=&\frac{1}{3} \\ &\approx & 0.333\end{array} $$

We could write ourselves a little bit of code to do this in SageMath. As you can see, we could easily improve this to make it more flexible so that we could use it for different numbers of samples, but it will do for now.
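Such a little bit of code might look like the following sketch in plain Python (the notebook's own version lives in a code cell and may differ):

```python
from itertools import permutations

def permutation_test_pvalue(left, right):
    """Exact permutation-test P-value for T = abs(mean of first m - mean of last n)."""
    m, n = len(left), len(right)
    pooled = list(left) + list(right)
    t_obs = abs(sum(left) / m - sum(right) / n)
    perms = list(permutations(pooled))               # all (m + n)! orderings
    count = sum(1 for p in perms
                if abs(sum(p[:m]) / m - sum(p[m:]) / n) >= t_obs)
    return count / len(perms)

permutation_test_pvalue((52, 54), (58,))   # 2/6 = 1/3, as in the tabulation above
```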

This means that there is little or no evidence against the null hypothesis (that the shell diameter observations are from the same DF).

Pooled sample size

The lowest possible P-value for a pooled sample of size $N=m+n$ is $\displaystyle\frac{1}{N!}$. Can you see why this is?

So with our small sub-samples the smallest possible P-value would be $\frac{1}{6} \approx 0.167$. If we are looking for P-value $\leq 0.01$ to constitute very strong evidence against $H_0$, then we have to have a large enough pooled sample for this to be possible. Since $5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$, it is good to have $N \geq 5$

YouTry in class

Try copying and pasting our code and then adapting it to deal with a sub-sample (52, 54, 60) from the left of the pier and (58, 54) from the right side of the pier.

You will have to think about:

(add more cells if you need them)

(end of You Try)


We can use the sample function and the Python method for making permutations to experiment with a larger sample, say 5 of each.

We have met sample briefly already: it is part of the Python random module and it does exactly what you would expect from the name: it samples a specified number of elements randomly from a sequence.

As you can see from the length of time it takes to do the calculation for $(5+5)! = 10!$ permutations, we will be here a long time if we try to do this on all of the two shell data sets.

Monte Carlo methods to the rescue!

We can use Monte Carlo integration to calculate an approximate P-value, and this will be our next topic.

You try

Try working out the P-value for a sub-sample (58, 63) from the left of the pier and (61) from the right (the two last values in the left-side data set and the last value in the right-side one). Do it as you would if given a similar question in the exam: you choose how much you want to use Sage to help and how much you do just with pen and paper.

Example: Permutation test for shell diameters using the bootstrapped/permuted confidence interval

So do we reject the Null Hypothesis that the diameters were the same on either side of the New Brighton Pier?

To answer this question we simply need to see whether the observed Test Statistic of the absolute difference between the sample means between the left and right side of the pier lies:

We fail to reject the null hypothesis because $0.164216452925 \in [0.05, 0.82]$: there is no evidence that the individuals of Dosinia anus have different diameter distributions on either side of the Pier.