Introduction to Data Science: A Comp-Math-Stat Approach

1MS041, 2021

©2021 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

13. Markov Chains and Random Structures

A Markov chain, named after Andrey Markov, is a mathematical model for a possibly dependent sequence of random variables.

Intuitively, a Markov Chain is a system which "jumps" among several states, with the next state depending (probabilistically) only on the current state. A useful heuristic is that of a frog jumping among several lily-pads, where the frog's memory is short enough that it doesn't remember what lily-pad it was last on, and so its next jump can only be influenced by where it is now.

Formally, the Markov property states that the conditional probability distribution for the system at the next step (and in fact at all future steps) given its current state depends only on the current state of the system, and not additionally on the state of the system at previous steps:

$$P(X_{n+1} \ | \, X_1,X_2,\dots,X_n) = P(X_{n+1}|X_n). \,$$

Since the system changes randomly, it is generally impossible to predict the exact state of the system in the future. However, the statistical and probabilistic properties of the system's future can be predicted. In many applications it is these statistical properties that are important.

Formal definition and terms

A Markov chain is a sequence of random variables $X_1, X_2, X_3, \ldots$ with the Markov property, namely that, given the present state, the future and past states are independent. Formally,

$$P(X_{n+1}=x \ | \ X_1=x_1, X_2=x_2 \ldots, X_n=x_n) = P(X_{n+1}=x|X_n=x_n).\,$$

The possible values of $X_i$ or the set of all states of the system form a countable set $\mathbb{X}$ called the state space of the chain.

The changes of state of the system are called transitions, and the probabilities associated with various state-changes are called transition probabilities.

Markov chains are often depicted by a weighted directed graph, where the edges are labeled by the probabilities of going from one state to the other states. This is called the flow diagram or transition probability diagram. The transition probability matrix $\mathbf{P}$ encodes the probabilities associated with state-changes or "jumps" from one state to another in the state-space $\mathbb{X}$. If $\mathbb{X}$ is labelled by $\{0,1,2,\ldots\}$ then the $i,j$-th entry in the matrix $\mathbf{P}$ corresponds to the probability of going from state $i$ to state $j$ in one time-step.

$$\mathbf{P} = \begin{bmatrix} p_{0,0} & p_{0,1} & p_{0,2} & \ldots \\ p_{1,0} & p_{1,1} & p_{1,2} & \ldots \\ p_{2,0} & p_{2,1} & p_{2,2} & \ldots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

The state of the system at the $n$-th time step is described by a state probability vector $$\mathbf{p}^{(n)} = \left( \mathbf{p}^{(n)}_0, \mathbf{p}^{(n)}_1, \mathbf{p}^{(n)}_2,\ldots \right)$$ Thus, $\mathbf{p}^{(n)}_i$ is the probability that the Markov chain is in state $i \in \mathbb{X}$ at time-step $n$, and $\mathbf{p}^{(0)}$ is called the initial probability vector at the convenient initial time $0$.

The state space $\mathbb{X}$ and the transition probability matrix $\mathbf{P}$, together with the initial probability vector $\mathbf{p}^{(0)}$, completely characterize a Markov chain.
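To make the "jumping" description concrete, here is a minimal plain-Python sketch (not a SageMath cell from these notes) that simulates a path of a chain from a transition matrix. The function name `simulate_chain`, the three-state matrix `P_example` and the seed are all assumptions made for illustration.

```python
import random

def simulate_chain(P, x0, n_steps, rng):
    """Simulate n_steps transitions of a chain with transition matrix P, starting at x0."""
    path = [x0]
    for _ in range(n_steps):
        current = path[-1]
        # Row `current` of P is the distribution of the next state.
        next_state = rng.choices(range(len(P)), weights=P[current])[0]
        path.append(next_state)
    return path

# Hypothetical 3-state transition matrix (not from the text); each row sums to 1.
P_example = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]

rng = random.Random(0)  # fixed seed so the run is reproducible
path = simulate_chain(P_example, x0=0, n_steps=20, rng=rng)
print(path)
```

Note that the next state is drawn using only `path[-1]`, which is exactly the Markov property: earlier entries of `path` play no role.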

Example 1: Dry-Wet chain, a toy weather model

A SageMath break-down of the Wiki Example

We can coarsely describe the weather of a given day by a toy model that states if it is "wet" or "dry". Each day the weather in our toy model is an element of our state space

$$\mathbb{X} = \{\text{"wet"}, \text{"dry"}\}.$$

We can make a picture of our toy probability model with a flow diagram or transition probability diagram as follows:

The probabilities of weather conditions, given the weather on the preceding day, can be represented by a transition probability matrix:

$$\mathbf{P} = \begin{bmatrix} 0.9 & 0.1\\ 0.5 & 0.5 \end{bmatrix}$$

The matrix $\mathbf{P}$ represents our toy weather model in which a dry day is 90% likely to be followed by another dry day, and a wet or rainy day is 50% likely to be followed by another wet day. The columns can be labelled "dry" and "wet" respectively, and the rows can be labelled in the same manner. For convenience, we will use integer labels $0$ and $1$ for "dry" and "wet", respectively.

$(\mathbf{P})_{i j}=p_{i,j}$ is the probability that, if a given day is of type $i$, it will be followed by a day of type $j$.

Since the transition probability matrix $\mathbf{P}$ is a stochastic matrix:

The rows of $\mathbf{P}$ sum to $1$. The probabilities in each row can be thought of as a "current-state"-specific $de~Moivre(p_{i,j}'s)$ distribution. Basically, you toss a current-state-specific many-faced weighted die to determine the next state. How can we think of our toy weather model in terms of two Bernoulli trials, $Bernoulli(0.9)$ and $Bernoulli(0.5)$? Recall the analogy from lectures of a fair coin in the wet pocket and a biased coin in the dry pocket.

How do we represent a matrix $\mathbf{P}$ in Sage?

Let's construct and assign the matrix to P.

We could have used QQ instead of RR below if the probabilities are rational and if we had a need for exact rational arithmetic.
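As a plain-Python stand-in for the Sage construction, the following sketch stores $\mathbf{P}$ as a list of rows and checks that it is a stochastic matrix. It uses `fractions.Fraction` for exact rational arithmetic, in the spirit of the QQ option mentioned above; the helper name `is_stochastic` is an assumption.

```python
from fractions import Fraction

# Transition matrix of the Dry-Wet chain: row i is the distribution of
# tomorrow's state given today's state i (0 = dry, 1 = wet).
P = [
    [Fraction(9, 10), Fraction(1, 10)],
    [Fraction(1, 2), Fraction(1, 2)],
]

def is_stochastic(M):
    """Return True if all entries are non-negative and every row sums to 1."""
    return all(all(x >= 0 for x in row) and sum(row) == 1 for row in M)

print(is_stochastic(P))
```

Exact fractions avoid the small rounding errors that would make a floating-point check such as `sum(row) == 1` unreliable.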

Predicting the weather with our Dry-Wet chain

The weather on day 0 is known to be dry. This is represented by a probability vector in which the "dry" entry is 100%, and the "wet" entry is 0%:

$$ \mathbf{p}^{(0)} = \begin{bmatrix} 1 & 0 \end{bmatrix} $$

The weather on day 1 can be predicted by:

$$ \mathbf{p}^{(1)} = \mathbf{p}^{(0)} \ \mathbf{P} = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.9 & 0.1 \end{bmatrix} $$

Thus, there is a 90% chance that day 1 will also be dry.

The weather on day 2 can be predicted in the same way:

$$\mathbf{p}^{(2)} =\mathbf{p}^{(1)} \ \mathbf{P} = \mathbf{p}^{(0)} \ \mathbf{P}^2 = \begin{bmatrix} 1 & 0 \end{bmatrix} \ \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix}^2 = \begin{bmatrix} 0.86 & 0.14 \end{bmatrix}$$

or, equivalently,

$$ \mathbf{p}^{(2)} =\mathbf{p}^{(1)} \ \mathbf{P} = \begin{bmatrix} 0.9 & 0.1 \end{bmatrix} \ \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.86 & 0.14 \end{bmatrix} $$

How do we do this in Sage?

or, equivalently,
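Both routes to $\mathbf{p}^{(2)}$ can be sketched in plain Python with exact fractions. This is an illustrative sketch rather than the Sage cells of the notes, and the helper names `vec_mat` and `mat_mat` are assumptions.

```python
from fractions import Fraction

P = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(1, 2), Fraction(1, 2)]]

def vec_mat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i M_ij."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def mat_mat(A, B):
    """Matrix product A B, computed one row of A at a time."""
    return [vec_mat(row, B) for row in A]

p0 = [Fraction(1), Fraction(0)]          # day 0 is known to be dry
p1 = vec_mat(p0, P)                      # p^(1) = p^(0) P
p2_stepwise = vec_mat(p1, P)             # p^(2) = p^(1) P
p2_power = vec_mat(p0, mat_mat(P, P))    # p^(2) = p^(0) P^2

print(p1, p2_stepwise, p2_power)
```

The two values of $\mathbf{p}^{(2)}$ agree because matrix multiplication is associative: $(\mathbf{p}^{(0)}\mathbf{P})\mathbf{P} = \mathbf{p}^{(0)}(\mathbf{P}\mathbf{P})$.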

General rules for day $n$ follow from mathematical induction as follows:

$$ \mathbf{p}^{(n)} = \mathbf{p}^{(n-1)} \ \mathbf{P} $$

$$ \mathbf{p}^{(n)} = \mathbf{p}^{(0)} \ \mathbf{P}^n $$

How do we operate with a matrix in SageMath to do this for any given $n$?

Let's get the state probability vector at time n = 0,1,...,nmax
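A minimal plain-Python sketch of this computation applies the recursion $\mathbf{p}^{(n)} = \mathbf{p}^{(n-1)}\mathbf{P}$ repeatedly. The function name `state_vectors` and the choice `nmax=10` are assumptions for illustration.

```python
def state_vectors(p0, P, nmax):
    """Return [p^(0), p^(1), ..., p^(nmax)] using p^(n) = p^(n-1) P."""
    vectors = [list(p0)]
    for _ in range(nmax):
        prev = vectors[-1]
        nxt = [sum(prev[i] * P[i][j] for i in range(len(prev)))
               for j in range(len(P[0]))]
        vectors.append(nxt)
    return vectors

P = [[0.9, 0.1],
     [0.5, 0.5]]
vectors = state_vectors([1.0, 0.0], P, nmax=10)

for n, p in enumerate(vectors):
    print(n, p)
```

Printing the list for a larger `nmax` is a quick way to see how the successive vectors behave before plotting them.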

In the next interact let's visualize the $n$-step state probability vector $\mathbf{p}^{(n)}=(\mathbf{p}^{(n)}_0,\mathbf{p}^{(n)}_1)$, $n=0,1,\ldots, {\tt nmax}$ steps

See what's going on here...?

Try to increase nmax and see where the state prob vector is going geometrically.

Steady state of the weather in our Dry-Wet chain

In this example, predictions for the weather on more distant days are increasingly inaccurate and tend towards a steady state vector. This vector represents the probabilities of dry and wet weather on all days, and is independent of the initial weather.

The steady state vector is defined as:

$$\mathbf{s} = \lim_{n \to \infty} \mathbf{p}^{(n)}$$

This limit is guaranteed to exist, to be strictly positive and to be independent of $\mathbf{p}^{(0)}$ when $\mathbf{P}$ is a regular transition matrix (that is, there is at least one $n$ for which all entries of $\mathbf{P}^n$ are non-zero, which makes the Markov chain irreducible and aperiodic).

Since $\mathbf{s}$ is independent of the initial conditions, it must be unchanged when transformed by $\mathbf{P}$. This makes it a left eigenvector (with eigenvalue $1$), and means it can be derived from $\mathbf{P}$. For our toy model of weather:

$$\begin{matrix} \mathbf{P} & = & \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} \\ \mathbf{s} \ \mathbf{P} & = & \mathbf{s} & \mbox{(} \mathbf{s} \mbox{ is unchanged by } \mathbf{P} \mbox{.)} \\ & = & \mathbf{s} \ \mathbf{I} \\ \mathbf{s} \ (\mathbf{P} - \mathbf{I}) & = & \mathbf{0} \\ & = & \mathbf{s} \left( \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \\ & = & \mathbf{s} \begin{bmatrix} -0.1 & 0.1 \\ 0.5 & -0.5 \end{bmatrix} \end{matrix}$$

$$\begin{bmatrix} \mathbf{s}_0 & \mathbf{s}_1 \end{bmatrix} \begin{bmatrix} -0.1 & 0.1 \\ 0.5 & -0.5 \end{bmatrix} = \begin{bmatrix} 0 & 0 \end{bmatrix}$$

So $- 0.1 \mathbf{s}_0 + 0.5 \mathbf{s}_1 = 0$ and, since $\mathbf{s}$ is a probability vector, we know that $\mathbf{s}_0 + \mathbf{s}_1 = 1$.

Solving this pair of simultaneous equations gives the steady state distribution:

$$\left( \mathbf{s}_0 , \mathbf{s}_1 \right) = \left( 5/6 , 1/6 \right) = \left( 0.833 , 0.167 \right)$$

In conclusion, in the long term, 83% of days are dry.

How do we carry out the above to solve for $\mathbf{s}$ in SageMath? There are two "methods": we can either solve the system of linear equations (Method 1) or use an eigen decomposition (Method 2).

You are not expected to follow Method 2 if you have not been introduced to eigenvalues and eigenvectors.

Method 1: Solving a system of equations to obtain $\mathbf{s}$.
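For a two-state chain the pair of equations $-p_{0,1}\mathbf{s}_0 + p_{1,0}\mathbf{s}_1 = 0$ and $\mathbf{s}_0+\mathbf{s}_1=1$ can be solved by hand, giving $\mathbf{s}_0 = p_{1,0}/(p_{0,1}+p_{1,0})$ and $\mathbf{s}_1 = p_{0,1}/(p_{0,1}+p_{1,0})$. The following plain-Python sketch (an illustration, not the Sage `solve` route) encodes that closed form and checks $\mathbf{s}\mathbf{P}=\mathbf{s}$; the function name `two_state_steady_state` is an assumption.

```python
from fractions import Fraction

def two_state_steady_state(P):
    """Solve -p01*s0 + p10*s1 = 0 together with s0 + s1 = 1 for a 2x2 chain."""
    p01, p10 = P[0][1], P[1][0]
    total = p01 + p10
    if total == 0:
        # Both states are absorbing, so every probability vector is stationary.
        raise ValueError("steady state is not unique")
    return [p10 / total, p01 / total]

P = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(1, 2), Fraction(1, 2)]]
s = two_state_steady_state(P)

# Check the defining property s P = s.
sP = [sum(s[i] * P[i][j] for i in range(2)) for j in range(2)]
print(s, sP == s)
```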

(End of Method 1 to solve for the steady state vector $\mathbf{s}$.)


Method 2:

Alternatively use eigen decomposition over rationals in SageMath to solve for $\mathbf{s}$.
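As a hedged illustration of the eigenvector view, without calling Sage's eigen routines, note that a $2\times 2$ stochastic matrix has left eigenpairs in closed form: eigenvalue $1$ with left eigenvector $(p_{1,0}, p_{0,1})$, and eigenvalue $p_{0,0}-p_{1,0}$ with left eigenvector $(1,-1)$. The sketch below verifies these for the Dry-Wet chain and normalises the first one. This shortcut is specific to two states and is an assumption of the sketch, not the general method.

```python
from fractions import Fraction

P = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(1, 2), Fraction(1, 2)]]

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

# Closed-form left eigenpairs (eigenvalue, eigenvector) of a 2x2 stochastic matrix.
eigenpairs = [
    (Fraction(1), [P[1][0], P[0][1]]),
    (P[0][0] - P[1][0], [Fraction(1), Fraction(-1)]),
]

# Verify v P = lambda v for each pair.
for lam, v in eigenpairs:
    assert vec_mat(v, P) == [lam * x for x in v]

# Normalise the eigenvalue-1 eigenvector so that its entries sum to 1.
lam1, v1 = eigenpairs[0]
s = [x / sum(v1) for x in v1]
print(s)
```

The second eigenvalue has absolute value less than $1$ here, so its contribution to $\mathbf{p}^{(0)}\mathbf{P}^n$ shrinks geometrically as $n$ grows.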

If your linear algebra is rusty

To follow Method 2 you need to know a bit more about eigen values, eigen vectors and eigen spaces.

Learn more slowly from Khan Academy: https://www.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors

Also, ensure you really understand the visual interactive exploration of low-dimensional eigenvalues and eigenvectors here: https://setosa.io/ev/eigenvectors-and-eigenvalues/

(End of Method 2.)


Grab all the days data from start to end:

Interactive cell to allow you to select some specific data and turn it into the list of 0 or 1 tuples (this list will then be available in sel_daysdata in later cells in the worksheet).

Daily Precipitation at Christchurch, fed from NIWA data (NZ's equivalent of US's NOAA)

Maximum likelihood estimation of the unknown transition probabilities

for the Dry-Wet Markov chain model of Christchurch weather

In the example we have been working with earlier, the transition probabilities were given by the matrix $$\mathbf{P}=\begin{bmatrix} 0.9 & 0.1\\0.5&0.5\end{bmatrix}$$ and we simply used the given $\mathbf{P}$ to understand the properties and utilities of the probability model for a possibly dependent sequence of $\{0,1\}$-valued random variables encoding the $\{\text{Dry},\text{Wet}\}$ days, respectively.

What we want to do now is use the data from Christchurch's Aeroclub obtained from NIWA to estimate Christchurch's unknown transition probability matrix $$\mathbf{P}= \begin{bmatrix} p_{0,0} & p_{0,1}\\ p_{1,0} & p_{1,1} \end{bmatrix}.$$ Let us use the principle of maximum likelihood and derive the maximum likelihood estimator $$\widehat{\mathbf{P}}=\begin{bmatrix} \hat{p}_{0,0} & \hat{p}_{0,1}\\ \hat{p}_{1,0} & \hat{p}_{1,1}\end{bmatrix}.$$

Recall that the likelihood function $$L(\text{unknown parameters} \ ; \ \text{Data})$$ is essentially the joint density of the data $X_0,X_1,X_2,\ldots,X_n$ as a function of the parameters. The data gives $n+1$ consecutive days of Dry or Wet recordings as $0$ or $1$ at Christchurch's Aeroclub. What are the unknown parameters here? Well, they are the four entries $(p_{0,0}, p_{0,1}, p_{1,0}, p_{1,1})$ of the unknown $\mathbf{P}$. But, due to the fact that $\mathbf{P}$ is not any old matrix of real numbers, but rather a stochastic matrix, it is constrained so that the entries are non-negative and the entries of each row sum to $1$. For instance, we can write the off-diagonal entries in terms of the diagonal entries, $p_{0,1}=1-p_{0,0}$ and $p_{1,0}=1-p_{1,1}$, and merely treat the two parameters $(p_{0,0},p_{1,1})$ as the true unknowns, which can take any value in the unit square $[0,1] \times [0,1]$ parameter space.

$$ \begin{aligned} L(p_{0,0},p_{1,1}) &=& L(p_{0,0},p_{1,1}; x_0,x_1,\ldots,x_n)\\ &=& P\left( X_0=x_0,X_1=x_1,\ldots,X_{n-1}=x_{n-1},X_n=x_n \right) \end{aligned} $$

In the above equation, we are given that the transition probabilities are $p_{0,0},p_{1,1}$.

Now, by the definition of conditional probability and the Markov property,

$$ \scriptsize{ \begin{aligned} L(p_{0,0},p_{1,1}) &=& P\left( X_n=x_n \ | \ X_{n-1}=x_{n-1},\ldots,X_2=x_2,X_1=x_1,X_0=x_0 \right) \ P\left( X_{n-1}=x_{n-1},\ldots,X_2=x_2,X_1=x_1,X_0=x_0 \right)\\ &=& P\left( X_n=x_n \ | \ X_{n-1}=x_{n-1}\right) \ P\left( X_{n-1}=x_{n-1},\ldots,X_2=x_2,X_1=x_1,X_0=x_0 \right) \\ &=& P\left( X_n=x_n \ | \ X_{n-1}=x_{n-1}\right) \ P\left( X_{n-1}=x_{n-1} \ | \ X_{n-2}=x_{n-2}, X_{n-3}=x_{n-3},\ldots,X_2=x_2,X_1=x_1,X_0=x_0 \right) \\ & & \qquad \qquad \qquad P\left(X_{n-2}=x_{n-2}, X_{n-3}=x_{n-3},\ldots,X_2=x_2,X_1=x_1,X_0=x_0 \right) \\ &=& P\left( X_n=x_n \ | \ X_{n-1}=x_{n-1}\right) \ P\left( X_{n-1}=x_{n-1} \ | \ X_{n-2}=x_{n-2} \right) \ P\left(X_{n-2}=x_{n-2}, X_{n-3}=x_{n-3},\ldots,X_2=x_2,X_1=x_1,X_0=x_0 \right) \\ &\vdots&\\ &=& P\left( X_n=x_n \ | \ X_{n-1}=x_{n-1}\right) \ P\left( X_{n-1}=x_{n-1} \ | \ X_{n-2}=x_{n-2} \right) \ \cdots \ P\left( X_{2}=x_{2} \ | \ X_{1}=x_{1} \right) \ P\left( X_{1}=x_{1} \ | \ X_{0}=x_{0} \right) P(X_{0}=x_{0})\\ &=& P(X_{0}=x_{0}) \ \prod_{i=1}^n P \left( X_i=x_i \ | \ X_{i-1}=x_{i-1} \right) \\ &=& \left( p_{0,0} \right)^{n_{0,0}} \ \left( 1-p_{0,0} \right)^{n_{0,1}} \ \left( p_{1,1} \right)^{n_{1,1}} \ \left( 1-p_{1,1} \right)^{n_{1,0}} ,\\ \end{aligned} } $$

where $n_{i,j}$ is the number of transitions from state $i$ to state $j$ in the observed data sequence $x_0,x_1,\ldots,x_n$, and where in the last step we condition on the observed initial state, that is, we take $P(X_0=x_0)=1$, since this factor does not involve $(p_{0,0},p_{1,1})$. Therefore the log likelihood function is $$ \begin{aligned} l(p_{0,0},p_{1,1}) &=& \log \left( L(p_{0,0},p_{1,1}) \right) \\ &=& \log \left( \left( p_{0,0} \right)^{n_{0,0}} \ \left( 1-p_{0,0} \right)^{n_{0,1}} \ \left( p_{1,1} \right)^{n_{1,1}} \ \left( 1-p_{1,1} \right)^{n_{1,0}} \right) \\ &=& {n_{0,0}} \log \left( p_{0,0} \right) + {n_{0,1}} \log \left( 1 - p_{0,0} \right) + {n_{1,1}} \log \left( p_{1,1} \right) + {n_{1,0}} \log \left( 1-p_{1,1} \right) \end{aligned} $$

Finally, we can find the maximum likelihood estimates (MLEs) $\widehat{p}_{0,0}$ and $\widehat{p}_{1,1}$ for the unknown transition probabilities $p_{0,0}$ and $p_{1,1}$ by differentiating the log-likelihood function with respect to $p_{0,0}$ and $p_{1,1}$, respectively, setting each derivative to zero, and solving the resulting equations for the variable of differentiation. This will yield the following, which we can also obtain symbolically in Sage. $$ \begin{aligned} \frac{d}{d p_{0,0}} \left( l(p_{0,0},p_{1,1}) \right) &=& \frac{d}{d p_{0,0}} \left( {n_{0,0}} \log \left( p_{0,0} \right) + {n_{0,1}} \log \left( 1 - p_{0,0} \right) + {n_{1,1}} \log \left( p_{1,1} \right) + {n_{1,0}} \log \left( 1-p_{1,1} \right) \right) \\ &=& \cdots \\ &=& \frac{n_{0,0}}{p_{0,0}} - \frac{n_{0,1}}{1-p_{0,0}} \end{aligned} $$

Similarly, $$ \begin{aligned} \frac{d}{d p_{1,1}} \left( l(p_{0,0},p_{1,1}) \right) &=& \frac{d}{d p_{1,1}} \left( {n_{0,0}} \log \left( p_{0,0} \right) + {n_{0,1}} \log \left( 1 - p_{0,0} \right) + {n_{1,1}} \log \left( p_{1,1} \right) + {n_{1,0}} \log \left( 1-p_{1,1} \right) \right) \\ &=& \cdots \\ &=& \frac{n_{1,1}}{p_{1,1}} - \frac{n_{1,0}}{1-p_{1,1}} \end{aligned} $$

Finally, solving the above equations in terms of $p_{0,0}$ and $p_{1,1}$ gives the MLEs that $$\widehat{p}_{0,0} = \frac{n_{0,0}}{n_{0,0}+n_{0,1}} \quad \text{and} \quad \widehat{p}_{1,1} = \frac{n_{1,1}}{n_{1,0}+n_{1,1}} $$ as follows: $$ \begin{aligned} \frac{d}{d p_{0,0}} \left( l(p_{0,0},p_{1,1}) \right) &=& 0 \\ &\Leftrightarrow& \frac{n_{0,0}}{p_{0,0}} - \frac{n_{0,1}}{1-p_{0,0}} = 0\\ &\Leftrightarrow& p_{0,0} = \frac{n_{0,0}}{n_{0,0}+n_{0,1}} \end{aligned} $$ and $$ \begin{aligned} \frac{d}{d p_{1,1}} \left( l(p_{0,0},p_{1,1}) \right) &=& 0 \\ &\Leftrightarrow& \frac{n_{1,1}}{p_{1,1}} - \frac{n_{1,0}}{1-p_{1,1}} = 0\\ &\Leftrightarrow& p_{1,1} = \frac{n_{1,1}}{n_{1,0}+n_{1,1}} \end{aligned} $$

Let's make symbolic expressions for this log likelihood function:

Make a function to make a transition counts matrix from any list of 0/1 tuples passed in.
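A minimal plain-Python sketch of such a counting function follows. It assumes the data has already been reduced to a flat sequence of 0/1 values (the exact tuple layout used in the worksheet is not shown here), and the name `transition_counts` and the toy record are made up for illustration.

```python
def transition_counts(xs):
    """Return the 2x2 matrix N with N[i][j] = number of i -> j transitions in xs."""
    counts = [[0, 0], [0, 0]]
    # Pair each observation with its successor: (x_0,x_1), (x_1,x_2), ...
    for a, b in zip(xs, xs[1:]):
        counts[a][b] += 1
    return counts

# Made-up toy record (0 = dry, 1 = wet); this is NOT the Christchurch data.
toy_days = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
toy_counts = transition_counts(toy_days)
print(toy_counts)
```

A sequence of $n+1$ observations always yields $n$ transitions, so the four counts should sum to `len(xs) - 1`.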

Get the transition counts matrix for all the data (we will get the same as we had before, but as a matrix which echoes the layout of our transition probabilities matrix, and we would also be able to use our function for other lists of tuples):

Make a function to turn transitions counts into a matrix of values for:

$$\widehat{\mathbf{P}}=\begin{bmatrix} \hat{p}_{0,0} & \hat{p}_{0,1}\\ \hat{p}_{1,0} & \hat{p}_{1,1}\end{bmatrix}.$$

Look at this for all the data:

As we said before, we can concentrate just on two unknowns $(\hat{p}_{0,0},\hat{p}_{1,1})$, so we can make a function just to return this tuple:
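The MLE formulas $\widehat{p}_{i,j} = n_{i,j} / (n_{i,0}+n_{i,1})$ amount to normalising each row of the counts matrix. A hedged plain-Python sketch, with hypothetical function names `mle_transition_matrix` and `mle_p00_p11` and made-up counts, could look like this:

```python
from fractions import Fraction

def mle_transition_matrix(counts):
    """Row-normalise transition counts: p_hat[i][j] = n_ij / (n_i0 + n_i1)."""
    P_hat = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # No transitions out of this state were observed.
            raise ValueError("MLE undefined for a state with no outgoing transitions")
        P_hat.append([Fraction(n, total) for n in row])
    return P_hat

def mle_p00_p11(counts):
    """Return only the two free parameters (p00_hat, p11_hat)."""
    P_hat = mle_transition_matrix(counts)
    return (P_hat[0][0], P_hat[1][1])

toy_counts = [[4, 2], [2, 1]]  # made-up counts, not the Christchurch data
P_hat = mle_transition_matrix(toy_counts)
p00_hat, p11_hat = mle_p00_p11(toy_counts)
print(P_hat, p00_hat, p11_hat)
```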

What is $(\hat{p}_{0,0},\hat{p}_{1,1})$ using all our data?

We can use our log-likelihood function in the form of a Sage symbolic function with symbolic variables n00, n01, n10, n11, p00, p11, and substitute in the values we have just found, using all our data, for all of these variables, to find the maximum of the log-likelihood function (i.e. the value of the function evaluated at the MLE).
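As a numerical counterpart to the symbolic substitution, the sketch below evaluates the log-likelihood $l(p_{0,0},p_{1,1})$ in plain Python and compares its value at the MLE with values on a coarse grid. The counts are made up, and the function name `log_likelihood` and the grid are assumptions.

```python
import math

def log_likelihood(p00, p11, counts):
    """l(p00, p11) = n00 log p00 + n01 log(1-p00) + n11 log p11 + n10 log(1-p11)."""
    (n00, n01), (n10, n11) = counts
    return (n00 * math.log(p00) + n01 * math.log(1 - p00)
            + n11 * math.log(p11) + n10 * math.log(1 - p11))

toy_counts = [[4, 2], [2, 1]]            # made-up counts, not the Christchurch data
p00_hat = 4 / (4 + 2)                    # n00 / (n00 + n01)
p11_hat = 1 / (2 + 1)                    # n11 / (n10 + n11)
l_max = log_likelihood(p00_hat, p11_hat, toy_counts)

# Evaluate on a coarse interior grid; none of these values should exceed l_max.
grid = [k / 20 for k in range(1, 20)]
l_grid = [log_likelihood(p, q, toy_counts) for p in grid for q in grid]
print(l_max, max(l_grid))
```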

Animation of the MLE and contour of log-likelihood function as the amount of data increases

Here is a plot of $(\hat{p}_{00},\hat{p}_{11})$, moving as the amount of data increases. It loops continually so if it looks like it is not moving, it is towards the end when the MLE has settled down - just wait a short while and the loop will start again:

Figure panels: MLE as the amount of data increases; log-likelihood contour with the MLE (black dot).

You Think: What is the MLE of $\theta$ in a product $Bernoulli(\theta)$ experiment for this problem, i.e., we now model $$X_0,X_1,\ldots,X_n \overset{IID}{\thicksim} Bernoulli(\theta^*)$$

So, what is the MLE $\hat{\theta}$ of the unknown $\theta^*$ in the IID $Bernoulli(\theta^*)$ model?

Questions:

Here is a nice trick to make a flow diagram fast and dirty in Sage. For our Christchurch Dry-Wet chain with MLE $\widehat{\mathbf{P}}$ we can do the following flow diagram.

YouTry

Consider the Markov chain describing the mode of transport used by a lazy professor. He has only two modes of transport, namely Walk or Drive. Label Walk by $0$ and Drive by $1$. If he walks today then he will definitely drive tomorrow. But, if he drives today then he flips a fair coin to decide whether he will Walk or Drive tomorrow. His decision to get to work is the same on each day. In the cells below try to:

Also do by hand!


Random Walks and Random Graphs in SageMath

We get introduced to simple random walks in SageMath.

Random walks on graphs are one of the most useful applications of Markov chains.

First let us familiarize ourselves with some basic definitions in Graph Theory.

Definitions in Graph Theory

Here we take a brief tour of the most basic definitions in graph theory. A Graph $\mathbb{G} := (\mathbb{V},\mathbb{E})$ consists of a vertex set $\mathbb{V} := \{v_1,v_2,\ldots,v_k\}$ together with an edge set $\mathbb{E} := \{e_1,e_2,\ldots,e_l\}$.

Thus, $\mathbb{E} \subset \mathbb{V}^2$. A graph $\mathbb{G}$ with directed edges in $\mathbb{E}$ is said to be a directed graph, and a graph $\mathbb{G}$ with undirected edges is said to be an undirected graph.

Note that $\deg(v_i) = \# \mathrm{nbhd}(v_i)$. If there is a sequence of edges, or a path, from every vertex to every other vertex, then the undirected graph is said to be connected.

We can represent a directed graph by its adjacency matrix given by:

$$ A := \left( A(v_i,v_j) \right)_{(v_i,v_j) \in \mathbb{V} \times \mathbb{V}}, \quad A(v_i,v_j) = \begin{cases} 1 & \text{if } \ ( v_i, v_j ) \in \mathbb{E} \\ 0 & \text{otherwise} \enspace . \end{cases} $$
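To connect the adjacency matrix with Markov chains, here is a hedged plain-Python sketch that builds $A$ from an edge list and then forms the transition matrix of the simple random walk, in which the walker moves to a uniformly chosen neighbour, so that $p_{i,j} = A(v_i,v_j)/\deg(v_i)$. The function names and the small path graph are assumptions for illustration.

```python
from fractions import Fraction

def adjacency_matrix(num_vertices, edges, directed=False):
    """Build the 0/1 adjacency matrix from a list of (i, j) vertex-index pairs."""
    A = [[0] * num_vertices for _ in range(num_vertices)]
    for i, j in edges:
        A[i][j] = 1
        if not directed:
            A[j][i] = 1  # an undirected edge is stored in both directions
    return A

def random_walk_matrix(A):
    """Transition matrix of the simple random walk: P[i][j] = A[i][j] / deg(i)."""
    P = []
    for row in A:
        deg = sum(row)
        if deg == 0:
            raise ValueError("isolated vertex: the walk has nowhere to go")
        P.append([Fraction(a, deg) for a in row])
    return P

# Hypothetical undirected path graph 0 - 1 - 2.
A = adjacency_matrix(3, [(0, 1), (1, 2)])
P = random_walk_matrix(A)
print(A)
print(P)
```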

Problem:

Try to make simple Markov chain models as random walks on a lattice or a graph by modifying the following scripts.
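As an independent illustration (not one of the scripts mentioned above), a simple symmetric random walk on the integer lattice $\mathbb{Z}^d$ can be sketched in plain Python as follows. The function name `lattice_walk`, the seed and the step rule (pick one coordinate uniformly, then move $\pm 1$ along it) are assumptions of this sketch.

```python
import random

def lattice_walk(dim, n_steps, rng):
    """Simple symmetric random walk on Z^dim, started at the origin."""
    position = [0] * dim
    path = [tuple(position)]
    for _ in range(n_steps):
        axis = rng.randrange(dim)               # choose a coordinate uniformly
        position[axis] += rng.choice((-1, 1))   # step +1 or -1 along it
        path.append(tuple(position))
    return path

rng = random.Random(1)  # fixed seed so the run is reproducible
path = lattice_walk(dim=3, n_steps=1000, rng=rng)
print(path[-1])
```

The state space here is the countable set $\mathbb{Z}^d$, and each of the $2d$ neighbouring lattice points is reached with probability $1/(2d)$.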


A 3D Random Walk

This was done originally by William Stein.

Random Graphs

Graphs are a rich source of random discrete structures. Let us get a prelude into them next.

Graph Browser, by Marshall Hampton

There is so much more to explore. If you are into stochastic processes on discrete structures with continuous time aspects then check out:

If you are interested in simulating an epidemic model on a random graph model of a possibly evolving host network, check out Simulating Transmission Processes on Networks by Simon Godskesen, who completed his Bachelor thesis in the Mathematics Department at Uppsala University, and the links to GitHub in the description of the video. The simulation uses the SageMath kernel. Feel free to check out Simon's .ipynb notebook code: