Introduction to Data Science: A Comp-Math-Stat Approach

1MS041, 2021

©2021 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

05. Random Variables, Expectations, Data, Statistics, Arrays and Tuples, Iterators and Generators

Topics

Random Variables

A random variable is a mapping from the sample space $\Omega$ to the set of real numbers $\mathbb{R}$. In other words, it is a numerical value determined by the outcome of the experiment.

We already saw discrete random variables that take values in a discrete set, of two types:

Now, we will see the other main type of real-valued random variable.

Continuous random variable

When a random variable takes on values in the continuum we call it a continuous RV.

Examples

Probability Density Function

A RV $X$ with DF $F$ is called continuous if there exists a piece-wise continuous function $f$, called the probability density function (PDF) $f$ of $X$, such that, for any $a$, $b \in \mathbb{R}$ with $a < b$,

$$ P(a < X \le b) = F(b)-F(a) = \int_a^b f(x) \ dx \ . $$

The following hold for a continuous RV $X$ with PDF $f$:

For any $x \in \mathbb{R}$, $P(X=x)=0$. Consequentially, for any $a,b \in \mathbb{R}$ with $a \le b$ $$P(a < X < b ) = P(a < X \le b) = P(a \leq X \le b) = P(a \le X < b)$$ By the fundamental theorem of calculus, except possibly at finitely many points (where the continuous pieces come together in the piecewise-continuous $f$): $$f(x) = \frac{d}{dx} F(x)$$ And of course $f$ must satisfy: $$\int_{-\infty}^{\infty} f(x) \ dx = P(-\infty < X < \infty) = 1$$

You try at home

Watch the Khan Academy video about probability density functions to warm-up to the meaning behind the maths above. Consider the continuous random variable $Y$ that measures the exact amount of rain tomorrow in inches. Think of the probability space $(\Omega,\mathcal{F},P)$ underpinning this random variable $Y:\Omega \to \mathbb{Y}$. Here the sample space, range or support of the random variable $Y$ denoted by $\mathbb{Y} = [0,\infty) =\{y : 0 \leq y < \infty\}$.

The Uniform$(0,1)$ RV

The Uniform$(0,1)$ RV is a continuous RV with a probability density function (PDF) that takes the value 1 if $x \in [0,1]$ and $0$ otherwise. Formally, this is written

\begin{equation} f(x) = \mathbf{1}_{[0,1]}(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 ,\\ 0 & \text{otherwise} \end{cases} \end{equation}

and its distribution function (DF) or cumulative distribution function (CDF) is:

\begin{equation} F(x) := \int_{- \infty}^x f(y) \ dy = \begin{cases} 0 & \text{if } x < 0 , \\ x & \text{if } 0 \le x \leq 1 ,\\ 1 & \text{if } x > 1 \end{cases} \end{equation}

Note that the DF is the identity map in $[0,1]$.

The PDF, CDF and inverse CDF for a Uniform$(0,1)$ RV are shown below

Uniform01ThreeCharts

The Uniform$(0,1)$ is sometimes called the Fundamental Model.

The Uniform$(0,1)$ distribution comes from the Uniform$(a,b)$ family.

\begin{equation} f(x) = \mathbf{1}_{[a,b]}(x) = \begin{cases} \frac{1}{(b-a)} & \text{if } a \le x \le b,\\ 0 & \text{otherwise} \end{cases} \end{equation}

This is saying that if $X$ is a Uniform$(a,b)$ RV, then all values of $x$ between $a$ and $b$, i.e., $a \le x \le b$, are equally probable. The Uniform$(0,1)$ RV is the member of the family where $a=0$, $b=1$.

The PDF and CDF for a Uniform$(a,b)$ RV are shown from wikipedia below

500px-Uniform_Distribution_PDF_SVG.svg.png wikipedia image 500px-Uniform_cdf.svg.png

You can dive deeper into this family of random vaiables here.

SageMath has a function for simulating samples from a Uniform$(a,b)$ distribution. We will learn more about this later in the course. Let's go ahead and use it to simulate samples from it below.

Expectations

The expectation of $X$ is also known as the population mean, first moment, or expected value of $X$.

\begin{equation} E\left(X\right) := \int x \, dF(x) = \begin{cases} \sum_x x \, f(x) & \qquad \text{if }X \text{ is discrete} \\ \int x \, f(x)\,dx & \qquad \text{if } X \text{ is continuous} \end{cases} \end{equation}

Sometimes, we denote $E(X)$ by $E X$ for brevity. Thus, the expectation is a single-number summary of the RV $X$ and may be thought of as the average.

In general though, we can talk about the Expectation of a function $g$ of a RV $X$.

The Expectation of a function $g$ of a RV $X$ with DF $F$ is:

\begin{equation} E\left(g(X)\right) := \int g(x)\,dF(x) = \begin{cases} \sum_x g(x) f(x) & \qquad \text{if }X \text{ is discrete} \\ \int g(x) f(x)\,dx & \qquad \text{if } X \text{ is continuous} \end{cases} \end{equation}

provided the sum or integral is well-defined. We say the expectation exists if

\begin{equation} \int \left|g(x)\right|\,dF(x) < \infty \ . \end{equation}

When we are looking at the Expectation of $X$ itself, we have $g(x) = x$

Thinking about the Expectations like this, can you see that the familiar Variance of X is in fact the Expection of $g(x) = (x - E(X))^2$?

The variance of $X$ (a.k.a. second moment)

Let $X$ be a RV with mean or expectation $E(X)$. The variance of $X$ denoted by $V(X)$ or $VX$ is

$$ V(X) := E\left((X-E(X))^2\right) = \int (x-E(X))^2 \,d F(x) $$

provided this expectation exists. The standard deviation denoted by $\sigma(X) := \sqrt{V(X)}$.

Thus variance is a measure of ``spread'' of a distribution.

The $k$-th moment of a RV comes from the Expectation of $g(x) = x^k$.

We call

$$ E(X^k) = \int x^k\,dF(x) $$

the $k$-th moment of the RV $X$ and say that the $k$-th moment exists when $E(|X|^k) < \infty$.

Properties of Expectations

  1. If the $k$-th moment exists and if $j<k$ then the $j$-th moment exists.

You try at home

Watch the Khan Academy videos about probability density functions and expected value if you want to get another angle on the material more slowly step-by-step:

The population mean and variance of the Bernoulli$(\theta)$ RV

We have already met the discrete Bernoulli$(\theta)$ RV. Remember, that if we have an event $A$ with $P(A) = \theta$, then a Bernoulli$(\theta)$ RV $X$ takes the value $1$ if "$A$ occurs" with probability $\theta$ and $0$ if "$A$ does not occur" with probability $1-\theta$.

In other words, the indicator function $\mathbf{1}_A$ of "$A$ occurs" with probability $\theta$ is the Bernoulli$(\theta)$ RV.

For example, flip a fair coin.
Consider the event that it turns up heads. Since the coin is fair, the probability of this event $\theta$ is $\frac{1}{2}$. If we define an RV $X$ that takes the value 1 if the coin turns up heads ("event coin turns up heads occurs") and 0 otherwise, then we have a Bernoulli$(\theta = \frac{1}{2})$ RV.

We all saw that given a parameter $\theta \in [0,1]$, the probability mass function (PMF) for the Bernoulli$(\theta)$ RV $X$ is:

$$ f(x;\theta) = \theta^x (1-\theta)^{1-x} \mathbf{1}_{\{0,1\}}(x) = \begin{cases} \theta & \text{if } x=1 \ ,\\ 1-\theta & \text{if } x=0 \ ,\\ 0 & \text{otherwise} \end{cases} $$

and its DF is:

\begin{equation} F(x;\theta) = \begin{cases} 1 & \text{if } 1 \le x \ ,\\ 1-\theta & \text{if } 0 \le x < 1 \ ,\\ 0 & \text{otherwise} \end{cases} \end{equation}

Now let's look at some expectations: the population mean and variance of an RV $X \thicksim$ Bernoulli$(\theta)$.

Because $X$ is a discrete RV, our expectations use sums rather than integrals.

The first moment or expectation is:

$$ \begin{array}{lcl} E(X) & = & \displaystyle\sum_{x=0}^{1}xf(x;\theta) \\ &=& (0 \times (1-\theta)) + (1 \times \theta)\\ &=& 0 + \theta\\ &=& \theta\\ \end{array} $$

The second moment is:

$$ \begin{array}{lcl} E(X^2) &=& \displaystyle\sum_{x=0}^{1}x^2f(x;\theta) \\ &=& (0^2 \times (1-\theta)) + (1^2 \times \theta)\\ &=& 0 + \theta\\ &=& \theta \end{array} $$

The variance is:

$$ \begin{array}{lcl} V(X) &=& E(X^2) - \left(E(X)\right)^2\\ &=& \theta - \theta^2\\ &=& \theta(1-\theta) \end{array} $$

We can see that $E(X)$ and $V(X)$ will vary with the parameter $\theta$. This is why we subscript $E$ and $V$ with $\theta$, to emphasise that the values depend on the parameter.

$$E_{\theta}(X) = \theta$$$$V_{\theta}(X) = \theta(1-\theta)$$

We can use Sage to do a simple plot to see how $E_{\theta}(X)$ and $V_{\theta}(X)$ vary with $\theta$.

Thus,

Feel free and brave to take the code apart to figure it out piece by piece if something is not clear... this is the whole point of these .ipynb notebooks!

Note how the variance is maximized at $\theta=\frac{1}{2}$.

The population mean and variance of the Uniform$(0,1)$ RV

Now let's look at the the population mean and variance of a continuous RV $X \thicksim$ Uniform$(0,1)$.

Because $X$ is a continuous RV, our expectations use integrals.

$$ \begin{array}{lcl} E(X) &=&\int_{x=0}^1 x f(x)\, dx\\ &=& \int_{x=0}^1 x \ 1 \, dx\\ &=& \frac{1}{2} \left( x^2 \right]_{x=0}^{x=1}\\ &=& \frac{1}{2} \left( 1-0 \right)\\ &=& \frac{1}{2} \end{array} $$$$ \begin{array}{lcl} E(X^2) &=& \int_{x=0}^1 x^2 f(x)\, dx \\ &=& \int_{x=0}^1 x^2 \ 1 \, dx\\ &=& \frac{1}{3} \left[ x^3 \right]_{x=0}^{x=1}\\ &=& \frac{1}{3} \left( 1-0 \right)\\ &=& \frac{1}{3}\\ \end{array} $$$$ \begin{array}{lcl} V(X) &=& E(X^2) - \left(E(X)\right)^2\\ &=&\frac{1}{3} - \left( \frac{1}{2} \right)^2\\ &=& \frac{1}{3} - \frac{1}{4}\\ &=& \frac{1}{12} \end{array} $$

Winnings on Average (Law of the unconscious statistician)

Think about playing a game where we draw $x \thicksim X$ and I pay you $r(x)$ ($r$ is some reward function, a function of $x$ that says what your reward is when $x$ is drawn). Then, your average winnings from the game is the sum (or integral), over all the possible values of $x$, of $r(x) \times$ the chance that $X=x$.

Put formally, if $Y= r(X)$, then

$$E(Y) = E(r(X)) = \int r(x) \,dF(x)$$

Probability is an Expectation

Remember when we first talked about the probability of some event $A$, we talked about the idea of the probability of $A$ as the long term relative frequency of $A$?

Now consider some event $A$ and a reward function $r(x) = \mathbf{1}_A(x)$.

Recall that $\mathbf{1}_A(x) = 1$ if $x \in A$ and $\mathbf{1}_A(x) = 0$ if $x \notin A$: the reward is 1 if $x \in A$ and 0 otherwise.

$$ \begin{array}{lcl} \text{If } X \text{ is continuous } E(\mathbf{1}_A(X)) &=& \int \mathbf{1}_A(x)\, dF(x)\\ &=& \int_A f(x)\, dx\\ &=& P(X \in A) = P(A)\\ \text{If } X \text{ is discrete } E(\mathbf{1}_A(X)) &=& \mathbf{1}_A(x)\, f(x)\\ &=& \sum_{x \in A} f(x)\\ &=& P(X \in A) = P(A) \\ \end{array} $$

This says that probability is a special case of expectation: the probability of $A$ is the expectation that $A$ will occur.

Take a Uniform$(0,1)$ RV $X$. What would you say the probability that an observation of this random variable is $\le 0.5$ is, ie what is $P(X \le 0.5)$?

Let's use SageMath to simulate some observations for us and look at the relative frequency of observations $\le 0.5$:

Or, we could look at a similar simulation for a discrete RV, say a Bernoulli$(\frac{1}{2})$ RV.

Another way of thinking about the Bernoulli$(\frac{1}{2})$ RV is that it has a discrete uniform distribution over $\{0,1\}$. It can take on a finite number of values (0 and 1 only) and the probabilities of observing either of these two values are are equal.

This could be modelling the event that we get a head when we throw a fair coin. For this we'll use the randint(0,1) function to simulate the observed value of our RV $X$.

The $de~Moivre(\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k})$ RV

We have seen that a $Bernoulli(\theta)$ RV has two outcomes (0 and 1). What if we are interested in modelling situations where there are more than two outcomes of interest? For example, we could use a $Bernoulli(\frac{1}{2})$ RV to model whether the outcome of the flip of a fair coin is a head, but we can't use it for modelling a RV which is the number we get when we toss a six-sided die.

So, now, we will consider a natural generalization of the $Bernoulli(\theta)$ RV with more than two outcomes. This is called the $de~Moivre(\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k})$ RV (after Abraham de Moivre, 1667-1754), one of the first great analytical probabalists).

A $de~Moivre(\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k})$ RV $X$ has a discrete uniform distribution over $\{1, 2, ..., k\}$: there are $k$ possible equally probable or equiprobable values that the RV can take.

If we are rolling a die that is a cube with six faces and $X$ is the number on the face of the die that touches the floor upon coming to a stop, then $k=6$.

Or think of the New Zealand Lotto game. There are 40 balls in the machine, numbered $1, 2, \ldots, 40$. The number on the first ball out of the machine can be modelled as a de Moivre$(\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$ RV.

We say that an RV $X$ is de Moivre$(\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k})$ distributed if its probability mass function PMF is:

$$ f \left(x; \left( \frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k} \right) \right) = \begin{cases} 0 & \quad \text{if } x \notin \{1,2,\ldots,k\}\\ \frac{1}{k} & \quad \text{if } x \in \{1,2,\ldots,k\} \end{cases} $$

We can find the expectation:

$$ \begin{array}{lcl} E(X) & = & \sum_{x=1}^k xP(X=x)\\ &=& (1 \times \frac{1}{k}) + (2 \times \frac{1}{k}) + \ldots + (k \times \frac{1}{k})\\ &=& (1 + 2 + \dots + k)\frac{1}{k}\\ &=& \frac{k(k+1)}{2}\frac{1}{k}\\ &=& \frac{k+1}{2} \, , \end{array} $$

the second moment:

$$ \begin{array}{lcl} E(X^2) & =& \sum_{x=1}^k x^2P(X=x)\\ & =& (1^2 \times \frac{1}{k}) + (2^2 \times \frac{1}{k}) + \ldots + (k^2 \times \frac{1}{k})\\ &=& (1^2 + 2^2 + \dots + k^2)\frac{1}{k}\\ &=& \frac{k(k+1)(2k+1)}{6}\frac{1}{k}\\ &=& \frac{2k^2+3k+1}{6}\, , \end{array} $$

and finally the variance:

$$ \begin{array}{lcl} V(X) &=& E(X^2) - \left(E(X)\right)^2\\ &=& \frac{2k^2+3k+1}{6} - \left( \frac{k+1}{2} \right)^2\\ &=&\frac{2k^2+3k+1}{6} - \frac{k^2 + 2k +1}{4}\\ &=& \frac{4(2k^2 + 3k + 1) - 6(k^2 + 2k + 1) }{24}\\ &=& \frac{8k^2 + 12k + 4 - 6k^2 - 12k - 6 }{24}\\ &=& \frac{2k^2-2}{24} \\ &=& \frac{k^2-1}{12} \, . \end{array} $$

For a physical analog for a $k$-sided die consider the generalisation

We coud use the Sage randint function to simulate the number on the first ball in a Lotto draw:

Statistics

The official NZ Government site for statistics about New Zealand is http://www.stats.govt.nz/ and that for Sweden is http://www.scb.se/en/About-us/official-statistics-of-sweden/

Take a tour through it!

Let's see what is in our data/ directory in as/jp folder for the course.

Should we play a bit with data/ live?

What exactly do we mean by data and statistics?

Data and statistics

In general, given some probability triple $(\Omega, \mathcal{F}, P)$, let the function $X$ measure the outcome $\omega$ from the sample space $\Omega$.

$$X(\omega): \Omega \rightarrow \mathbb{X}$$

The RV $X$ is called data, if $X$ is the finest available measurement of $\Omega$.

$\mathbb{X}$ is called the data space (sample space of the data $X$).

$X(\omega)=x$ is the outcome $\omega$ measured by $X$ and is called the observed data or the realisation of $X$.

Often the measurements made by $X$ are real numbers or structures of real numbers or structures that are a mixture of combinatorial objects and real numbers (they typically arise from more general data objects/structures, like addresses, images, videos, trajectories, networks, etc.).

Recall the definition of a real-valued or $\mathbb{R}$-valued random variable we have already seen (an $\mathbb{R}$-valued random variable $X$ is a specific function or map from the sample space to the real line $\mathbb{R}$). We have also seen that $X$ can in fact be a real-vector-valued or $\mathbb{R}^n$-valued random variable $X=(X_1,X_2,\ldots,X_n)$, i.e. a vector of $\mathbb{R}$-valued random variables or simply a random vector. A random variable can be much more general as we will see in the sequel - it can be image-valued, movie-valued, mp3-song-valued, graph-valued, etc.

When we talked about an experiment involving two IID Bernoulli$(\frac{1}{2})$ RVs above (tossing two coins), we listed the different results we might get as (0,0), (1,0), (0,1), (1,1).

Say we observe the outcome $\omega$ = (H, H) (two heads). Our $Bernoulli$ random vector $X$ measures this as (1,1). Thus, (1,1) is the observed data or the realisation of $X$.

So what is a statistic?

A statistic is any measureable function of the data: $T(x): \mathbb{X} \rightarrow \mathbb{T}$ or simply that $\mathbb{X} \overset{T}{\rightarrow} \mathbb{T}$.

Thus, a statistic $T$ is also an RV that takes values in the space $\mathbb{T}$.

When $x \in \mathbb{X}$ is the observed data, $T(x)=t$ is the observed statistic of the observed data $x$, i.e., $x \mapsto t$.

Question: Can Statistic be the Data?



Consider the identity map... with $T(x)=x$ and $\mathbb{X} = \mathbb{T}$.

Remember $\Omega$ is always under the hood!
$$ \Omega \overset{X}{\rightarrow} \mathbb{X} \overset{T}{\rightarrow} \mathbb{T}, \quad \text{with} \quad \omega \mapsto x \mapsto t. $$

Generally, one cannot explicity work with $\Omega$ (recall $\Omega$ for the indicator random variable of rain on Southern Alps tomorrow!) but can work with the following axiomatic cascade towards more concretely measurable events:

Example 2: New Zealand Lotto Data and a Probability Model of it as IID de Moivre RVs

New Zealand lotto data

Let's look at our pre-made New Zealand lotto data. This is the winning first ball drawn from a NZ Lotto machine for several years.

Data Space

The data space is every possible sequence of ball numbers that we could have got in these 1114 draws. $\mathbb{X} = \{1, 2, \ldots, 40\}^{1114}$. There are $40^{1114}$ possible sequences and

Probability model

We can think of this list listBallOne of sample size $n=1114$ as $x = (x_1,x_2,\ldots,x_{1114})$, the realisation of a random vector $X = (X_1, X_2,\ldots, X_{1114})$ where $X_1, X_2,\ldots, X_{1114} \overset{IID}{\thicksim} de~Moivre(\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$.

$P \left( (X_1, X_2, \ldots, X_{1114}) = (x_1, x_2, \ldots, x_{1114}) \right) = \frac{1}{40} \times \frac{1}{40} \times \ldots \times \frac{1}{40} = \left(\frac{1}{40}\right)^{1114}$ if $(x_1, x_2, \ldots, x_{1114}) \in \mathbb{X} = \{1, 2, \ldots, 40\}^{1114}$

Our data is just one of the $40^{1114}$ possible points in this data space.

Some statistics

Sample mean

From a given sequence of iid RVs $X_1, X_2, \ldots, X_n$, or a random vector $X = (X_1, X_2, \ldots, X_n)$, we can obtain another RV called the sample mean (technically, the $n$-sample mean):

$$T_n((X_1, X_2, \ldots, X_n)) = \bar{X_n}((X_1, X_2, \ldots, X_n)) :=\frac{1}{n}\displaystyle\sum_{i=1}^nX_i$$

We write $\bar{X_n}((X_1, X_2, \ldots, X_n))$ as $\bar{X}_n$,

and its realisation $\bar{X_n}((x_1, x_2, \ldots, x_n))$ as $\bar{x_n}$.

By the properties of expectations that we have seen before,

$$E(\bar{X_n}) = E \left(\frac{1}{n}\sum_{i=1}^nX_i \right) = \frac{1}{n}E\left(\sum_{i=1}^nX_i\right) = \frac{1}{n}\sum_{i=1}^nE(X_i)$$

And because every $X_i$ is identically distributed with the same expectation, say $E(X_1)$, then

$$E(\bar{X_n}) = \frac{1}{n}\sum_{i=1}^nE(X_i)= \frac{1}{n} \times n \times E(X_1) = E(X_1)$$

Similarly, we can show that,

$$V(\bar{X_n}) = V\left(\frac{1}{n}\sum_{i=1}^nX_i\right) = \frac{1}{n^2}V\left(\sum_{i=1}^nX_i\right)$$

And because every $X_i$ is independently and identically distributed with the same variance, say $V(X_1)$, then

$$V(\bar{X_n}) = \frac{1}{n^2}V\left(\sum_{i=1}^nX_i \right) = \frac{1}{n^2} \times n \times V(X_1) = \frac{1}{n} V(X_1)$$

Sample variance

Sample variance is given by $$ \frac{1}{n}\sum_{i=1}^n \left( X_i - \bar{X}_n \right)^2$$ Sometimes, we divide by $n-1$ instead of $n$. It is a measure of spread from the sample.

Similarly, we can look at a sample standard deviation = $\sqrt{\text{sample variance}}$

numpy for numerical Python

Python has a nice module called numpy for numerical computing akin to MATLAB. We will import numpy to use some of its basic statistical capabilities for our Lotto data. To be able to use these capabilities, it is easiest to convert our Lotto Data into a type called a numpy.array or np.array if we import numpy as np. (You will be looking more closely at numpy arrays in the You Try sections below).

Now we can get some sample statistics for the lotto ball one data.

Sample mean of NZ Lotto data

How does one get a deeper understanding of numpy and its methods?

We are relying on this powerful organization's libraries when computing in Python:

Sample vairance and sample standard deviation of NZ Lotto data

Order statistics

The $k$th order statistic of a sample is the $k$th smallest value in the sample. Order statistics that may be of particular interest include the smallest (minimum) and largest (maximum) values in the sample.

Frequencies

The next most interesting statistics we will visit is the frequencies of the ball numbers.

With lots of discrete data like this, we may be interested in the frequency of each value, or the number of times we get a particular value. This can be formalized as the following process of adding indicator functions of the data points $(X_1,X_2,\ldots,X_n)$: $$ \sum_{i=1}^n \mathbf{1}_{\{X_i\}}(x) $$

We can use the SageMath/Python dictionary to give us a mapping from ball numbers in the list to the count of the number of times each ball number comes up.

A Function to Count Elements in a Sequence

Although we are doing this with the Lotto data, mapping a list like this seems like a generally useful thing to be able to do, and so we write it as a function makeFreqDict(myDataSeq) and then use the function on our Lotto data. This is a good example of reusing useful code through a function. We will rely on this function in the sequel for other input data.

Also Python's integer types are supposed to get as big (up to available address space in memory) as one needs, as discussed in detail here:

In Python 3: sys.maxsize is the largest positive integer supported by the platform’s Py_ssize_t type, and therefore it is the maximum size for lists, strings, dicts, and many other containers.

We will still pass in the additional argument for one into our makeFreqDict function for pedagogical reasons and to avoid the more heavy-duty SageMath's type of 'sage.rings.integer.Integer'.

Note that the function makeFreqDict also works on other sequence types like numpy.arrays, and str or sequence of bytes.

Let us use the makeFreqDict to find the frequencies of every character in Jane Austen's Pride and Prejudice next.

Let us use the makeFreqDict to find the frequencies of the balls that popped out first in the 1114 NZ Lotto draws.

Thus, balls labelled by the number 1 come up 29 times in the first ball drawn (Ball One) out of 1,114 Lotto trials. Similarly, the number 2 comes up 28 times, ... 40 comes up 37 times. Of course, these numbers would be different if you have downloaded a more recent data file with additional trials!

So what? Well, we'd hope that the lotto is fair, i.e. that the probability of each ball coming up with any of the available numbers is the same for each number: the probability that Ball One is a 1 is the same as the probability that it is 2, is the same as the probability that it is 3,..... , is the same as the probability that it is 40. If the lotto is fair, the number that comes up on each ball should be a discrete uniform random variable. More formally, we would expect it to be the de Moivre$(\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$ RV as lectured. Over the long term, we'd expect the number of times each number comes up on a given trial to be about the same as the number of times any other number comes up on that trial.

We have data from 1987 to 2008, and a first step to assessing the fairness of the lotto (for Ball One, anyway) could be to just visualise the data. We can use the list of points we created above and the SageMath plotting function points to plot a simple graphic like this.

Here we are plotting the frequency with which each number comes up against the numbers themselves, but it is a bit hard to see what number on the ball each red point relates to. To deal with this we add dotted lines going up from the number on the horizontal axis to the corresponding (number, frequency) tuple plotted as a red point.

For these plots let us use SageMath Graphics primitives: point, points and line, and their + operations for superimpositions.

Let us plot the probability mass function (PMF) of the hypothesized probability model for a fair NZ Lotto, i.e., the PMF of the RV $X$ ~ de Moivre$(1/40,1/40, ... , 1/40)$ and overlay it on the empirical mass function of the observed data. We will meet list comprehension properly in a future worksheet.

Empirical Distribution Function

Another extremely important statistics of the observed data is called the empirical distribution function (EDF). The EDF is the empirical or data-based distribution function (DF) just like the empirical mass function (EMF) is the empirical or data-based probability mass function (PMF). This can be formalized as the following process of adding indicator functions of the half-lines beginning at the data points $[X_1,+\infty),[X_2,+\infty),\ldots,[X_n,+\infty)$: $$\widehat{F}_n (x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x) $$

Next, let's make a function called makeEDF to return the empirical distribution function from a list of data points.

Pride and Prejidice he-she counter - Revisited

Recall our little snippets of code that extracted the number of he and she words in each Chapter of the entire text of Pride and Prejudice.

Now you should know all the different SageMath/Python components that went into this job. Let's see those snippets again below.

We have also used regular expressions or regexs here by leveraging the re library when we did import re. Besides regexes everything else should make sense to you now, as we have covered loops and conditional list and set comprehensions as well as the basics of lists, sets, strings and unicodes already. We briefly look at what regexes are below.

Pride and Prejidice big_Words_Set counter - Revisited

You should be able to understand the various components of the next code snippet that finds all the big words in the text of Pride and Prejudice using the Python set.

Tapping into libraries

Instead of rolling our own step functions using the primitive graphics objects of points and line we can just tap into the powerful visualisation library called matplotlib.

Live You&ITry now!

Turn the heList into a cumulative plot as above, i.e., make a plot of the cummulative frequency of the occurrence of he words by Chapter.

Start editing below. You will need heList of course. Try to use np.array and its .cumsum() method instead of doing it from more elementary list comprehensions.

Let's do this live now...

Real-world Data Ingestion

Let us download the New York Power Ball winning lottery numbers since 2010 from this US government site:

using this url as a comma-separated-variable or csv file:

This is a live fetch!

So we have to figure it out on the fly as new records are added to the winning New York power ball data.

There are two methods to download the file. Try the first one with urllib2 and if this does not work try the second method with curl where you are downloading the file to your data/ folder.

First method to downlod NYpowerball data with urllib2
live and directly from https://data.ny.gov/ using https = secure hyper-text transfer protocol

If there are errors due to ssl certificates when evaluating the next cell then try the second method below using http = hyper-text transfer protocol from the backed-up data at lamastex.org

Second method to downlod NYpowerball data with curl from http://lamastex.org

You need to evaluate the next two cells for this method. Only do this if you have errors from the first method.

That was not too hard. But what we really want is the first winning number across all the draws.

We can simply apply the expression in the last cell and loop it through each item say Line in NYpowerballLines:

So let's do it.

Opps, we got a ValueError: invalid literal for int() with base 10: 'Winning'.

We need to handle such error carefully, as data downloaded from public sources will generally come woth such caveats and more complicated issues in general.

A proper way to handle such errors safely is called 'Exception Handling' using try and catch. Briefly from the docs:

The try statement works as follows.

  • First, the try clause (the statement(s) between the try and except keywords) is executed.
  • If no exception occurs, the except clause is skipped and execution of the try statement is finished.
  • If an exception occurs during execution of the try clause, the rest of the clause is skipped. Then if its type matches the exception named after the except keyword, the except clause is executed, and then execution continues after the try statement.
  • If an exception occurs which does not match the exception named in the except clause, it is passed on to outer try statements; if no handler is found, it is an unhandled exception and execution stops with a message as shown above.

Now we have a new error IndexError: list index out of range. But let's see how many numbers have been appended in our list firstWins.

So, we need to handle this new exception.

From the docs we note that:

A try statement may have more than one except clause, to specify handlers for different exceptions. At most one handler will be executed....

Let's have an except clause for IndexError next.

Congratulations! You have taken your first step towards being a data scientist who knows to obtain data available in the internet from first principles!

Data scientists need to be strong in data wrangling and are expected to directly work with data and not merely load data from pre-made libraries.

More crucially, mathematical statistical data scientists are expected to be comfortable with axiomatic probability and mathematical statistical theory (we will learn more in the sequel).

Finally, computational mathematical statistical data scientists, students being trained here, can apply (and even develop as needed, in cooperation with domain experts) the right model, methods and theory for a given data science problem and solve it using the most appropriate computational tools.

Arrays

YouTrys!

We have already talked about lists in SageMath/Python. Lists are great but are also general: lists are designed to be able to cope with any sort of data but that also means they don't have some of the specific functionality we might like to be able to analyse numerical data sets. Arrays provide a way of representing and manipulating numerical data known an a matrix in maths lingo. Matrices are particularly useful for data science.

The lists we have been dealing with were one-dimensional. An array and the matrix it represents in a computer can be multidimensional. The most common forms of array that we will meet are one-dimensional and two-dimensional. You can imagine a two-dimensional array as a grid, which has rows and columns. Take for instance a 3-by-6 array (i.e., 3 rows and 6 columns).

0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17

The array type is not available in the basic SageMath package, so we import a library package called numpy which includes arrays.

Because the arrays are part of numpy, not the basic SageMath package, we don't just say array, but instead qualify array with the name of the module it is in. So, we say np.array since we did import numpy as np.

The cell below makes a two-dimensional 3-by-6 (3 rows by 6 columns) array by specifying the rows, and each element in each row, individually.

We can use the array's shape method to find out its shape. The shape method for a two-dimensional array returns an ordered pair in the form (rows,columns). In this case it tells us that we have 3 rows and 6 columns.

Another way to make an array is to start with a one-dimensional array and resize it to give the shape we want, as we do in the two cells below.

We start by making a one-dimensional array using the range function we have met earlier.

The array's resize allows us to specify a new shape for the same array. We are going to make a two-dimensional array with 3 rows and 6 columns. Note that the shape you specify must be compatible with the number of elements in the array.

In this case, array2 has 18 elements and we specify a new shape of 3 rows and 6 columns, which still makes 18 = 3*6 elements.

If the new shape you specify won't work with the array, you'll get an error message.

If you check the type for array2, you might be surprised to find that it is a numpy.ndarray. What happened to numpy.array? The array we are using is the NumPy N-dimensional array, or numpy.ndarray.

Tensors and general multidimensional arrays are possible via resize method.

The range function will only give us ranges of integers. The numpy module includes a more flexible function arange (that's arange with one r -- think of it as something like a-for-array-range -- not 'arrange' with two r's). The arange function takes parameters specifying the start_argument, stop_argument and step_argument values (similar to range), but is not restricted to integers and returns a one-dimensional array rather than a list.

Here we use arange to make an array of numbers from 0.0 (start_argument) going up in steps of 0.1 (step_argument) to 0.18 - 0.1 and just as with range, the last number we get in the array is stop_argument - step_argument = 0.18 - 0.01 = 0.17.

Resize and Reshape

The resize method resizes the array it is applied to.

The reshape method will leave the original array unchanged but return a new array, based on the original one, of the required new shape.

So the reshape does leave the original array unchanged.

Arrays: indexing, slicing and copying

Remember indexing into lists to find the elements at particular positions? We can do this with arrays, but we need to specify the index we want for each dimension of the array. For our two-dimensional arrays, we need to specify the row and column index, i.e. use a format like [row_index,column_index]. For exampe, [0,0] gives us the element in the first column of the first row (as with lists, array indices start from 0 for the first element).

array4[0,0] gives the element in the first row and first column while array4[5,2] gives us the element in the third column (index 2) of the sixth row (index 5).

We can use the colon to specify a range of columns or rows. For example, [0:4,0] gives us the elements in the first column (index 0) of rows with indices from 0 through 3. Note that the row index range 0:4 gives rows with indices starting from index 0 and ending at index 3=4-1, i.e., indices 0 through 3.

Similarly we could get all the elements in the second column (column index 1) of rows with indices 2 through 4.

The colon on its own gives everything, so a column index of : gives all the columns. Thus we can get, for example, all elements of a particular row -- in this case, the third row.

Or all the elements of a specified column, in this case the first column.

Or all the elements of a range of rows (think of this as like slicing the array horizontally to obtain row indices 1 through 4=5-1).

Or all the elements of a range of columns (think of this as slicing the array vertically to obtain column indices 1 through 2).

Naturally, we can slice both horizontally and vertically to obtain row indices 2 through 4 and column indices 0 through 1, as follows:

Finally, [:] gives a copy of the whole array. This copy of the original array can be assigned to a new name for furter manipulation without affecting the original array.

Useful arrays

NumPy provides quick ways of making some useful kinds of arrays. Some of these are shown in the cells below.

An array of zeros or ones of a particular shape. Here we ask for shape (2,3), i.e., 2 rows and 3 columns.

An array for the identity matrix, i.e., square (same number of elements on each dimension) matrix with 1's along the diagonal and 0's everywhere else.

Useful functions for sequences

Now we are going to demonstrate some useful methods of sequences. A list is a sequence, and so is a tuple.

There are differences between them (recall that tuples are immutable) but in many cases we can use the same methods on them because they are both sequences.

We will demonstrate with a list and a tuple containing the same data values. The data could be the results of three IID Bernoulli trials..

A useful operation we can perform on a tuple is to count the number of times a particular element, say 0 or 1 in our ObsDataTuple, occurs. This is a statistic of our data called the sample frequency of the element of interest.

Another useful operation we can perform on a tuple is to sum all elements in our obsDataTuple or further scale the sum by the sample size. These are also statistics of our data called the sample sum and the sample mean, respectively. Try the same things with obsDataList.

We used a lot of the techniques we have seen so far when we wanted to get the relative frequency associated with each ball in the lotto data.

Let's revist the code that made relative frequencies and fully understand each part of it now.

We could also do it with a list comprehension as follows:

More on tuples and sequences in general

First, we can make an empty tuple like this:

Secondly, if we want a tuple with only one element we have to use a trailing comma. If you think about it from the computer's point of view, if it can't see the trailing comma, how is it to tell that you want a tuple not just a single number?

You can make tuples out almost any object. It's as easy as putting the things you want in ( ) parentheses and separating them by commas.

Actually, you don't even need the ( ). If you present SageMath (Python) with a sequence of object separated by commas, the default result is a tuple.

A statement like the one in the cell above is known as tuple packing (more generally, sequence packing).

When we work with tuples we also often use the opposite of packing - Python's very useful sequence unpacking capability. This is, literally, taking a sequence and unpacking or extracting its elements.

The statement below is an example of multiple assignment, which you can see now is really just a combination of tuple packing followed by unpacking.

Let's try to implement a for loop with a tuple -- it will work just like the for loop with a list.

Next let's try a list comprehension on the tuple. This creates a list as expected.

But if we try the same comprehension with tuples, i.e., with () insteqad of [] we get a generator object. This will be covered next by first easing into functional programming in Python.

Functional Programming in Python

Let us take at the basics of functional programming and what Python has to offer on this front at:

Iterators

From https://docs.python.org/2/howto/functional.html#iterators:

An iterator is an object representing a stream of data; this object returns the data one element at a time. A Python iterator must support a method called next() that takes no arguments and always returns the next element of the stream. If there are no more elements in the stream, next() must raise the StopIteration exception. Iterators don’t have to be finite, though; it’s perfectly reasonable to write an iterator that produces an infinite stream of data.

The built-in iter() function takes an arbitrary object and tries to return an iterator that will return the object’s contents or elements, raising TypeError if the object doesn’t support iteration. Several of Python’s built-in data types support iteration, the most common being lists and dictionaries. An object is called an iterable object if you can get an iterator for it.

You can experiment with the iteration interface manually:

Iterators can be materialized as lists or tuples by using the list() or tuple() constructor functions:

Generators

"Naive Tuple Comprehension" creates a generator in SageMath/Python. From https://docs.python.org/2/glossary.html#term-generator:

generator A function which returns an iterator. It looks like a normal function except that it contains yield statements for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator resumes, it picks-up where it left-off (in contrast to functions which start fresh on every invocation).

generator expression An expression that returns an iterator. It looks like a normal expression followed by a for expression defining a loop variable, range, and an optional if expression. The combined expression generates values for an enclosing function:

List comprehensions (listcomps) and Generator Expressions (genexps) from

With a list comprehension, you get back a Python list; stripped_list is a list containing the resulting lines, not an iterator. Generator expressions return an iterator that computes the values as necessary, not needing to materialize all the values at once. This means that list comprehensions aren’t useful if you’re working with iterators that return an infinite stream or a very large amount of data. Generator expressions are preferable in these situations. Generator expressions are surrounded by parentheses (“()”) and list comprehensions are surrounded by square brackets (“[]”). Generator expressions have the form:

( expression for expr in sequence1
             if condition1
             for expr2 in sequence2
             if condition2
             for expr3 in sequence3 ...
             if condition3
             for exprN in sequenceN
             if conditionN )

Two common operations on an iterator’s output are 1) performing some operation for every element, 2) selecting a subset of elements that meet some condition. For example, given a list of strings, you might want to strip off trailing whitespace from each line or extract all the strings containing a given substring.

List comprehensions and generator expressions (short form: “listcomps” and “genexps”) are a concise notation for such operations, borrowed from the functional programming language Haskell (https://www.haskell.org/). You can strip all the whitespace from a stream of strings with the following code:

Once again we have emptied the stripped_iter so we only get an empty list when we materilize it into a list.

We can pass a generator to a function like sum to evaluate it:

See https://docs.python.org/2/howto/functional.html#generators for more details:

Generators are a special class of functions that simplify the task of writing iterators. Regular functions compute a value and return it, but generators return an iterator that returns a stream of values.

You’re doubtless familiar with how regular function calls work in Python or C. When you call a function, it gets a private namespace where its local variables are created. When the function reaches a return statement, the local variables are destroyed and the value is returned to the caller. A later call to the same function creates a new private namespace and a fresh set of local variables. But, what if the local variables weren’t thrown away on exiting a function? What if you could later resume the function where it left off? This is what generators provide; they can be thought of as resumable functions.

Here’s the simplest example of a generator function:

Any function containing a yield keyword is a generator function; this is detected by Python’s bytecode compiler which compiles the function specially as a result.

When you call a generator function, it doesn’t return a single value; instead it returns a generator object that supports the iterator protocol. On executing the yield expression, the generator outputs the value of i, similar to a return statement. The big difference between yield and a return statement is that on reaching a yield the generator’s state of execution is suspended and local variables are preserved. On the next call to the generator’s .next() method, the function will resume executing.

Here’s a sample usage of the generate_ints() generator:

Using list comprehensions and generator expressions one can do functional programming in Python. But this is not the same as doing pure functional programming in a language such as haskell (see https://www.haskell.org/) that is designed specifically for pure funcitonal programming.

Keep learning

We have taken several exercises above from http://docs.python.org/tutorial/datastructures.html.

This is a very useful site to look for user-friendly information about Python in general (it can't help you with anything that Sage overlays on top of Python though, or with many of the specialised modules that you may want to import into your Sage notebook).