Introduction to Data Science

1MS041, 2023

©2023 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

Probability

Let's recall some basics from the notes, but in simpler language:

An experiment is an activity or procedure that produces distinct, well-defined outcomes. The set of such outcomes is called the sample space of the experiment. The sample space is usually denoted by the symbol $\Omega$. Let's look at some examples of experiments.

Roll a die experiment

If our experiment is to roll a die with faces painted $red$, $green$, $yellow$, $pink$, $blue$ and $black$, and to observe the colour of the top face at the end of each roll, then the sample space is $\Omega = \{red, green, yellow, pink, blue, black\}$.

Flip a coin experiment

If our experiment is to flip a coin where the two faces of the coin can be identified as 'heads' ($H$) and 'tails' ($T$), and we are interested in what face lands uppermost, then the sample space is $\Omega = \{H, T\}$.

Draw a fruit from a fruit bowl

Suppose we have a well-mixed fruit bowl that contains:

  • 2 oranges
  • 3 apples
  • 1 lemon

If our experiment is to take a single fruit from the bowl and the outcome is the type of fruit we take, then what is the sample space for this experiment?

Recall that the sample space is the set of all possible outcomes of the experiment. If we take a single fruit, we can get only one of three types of fruit in each draw: an orange, an apple, or a lemon. The sample space is $\Omega = \{orange, apple, lemon\}$.

An event is a subset of the sample space. For example, we could take the event $\{orange, lemon\} \subset \Omega$ in the fruit bowl experiment.

Probability maps a set of events to a set of numbers in a certain axiomatic manner. Abstractly, probability is a function that assigns numbers in the range 0 to 1 to events

$$P : \text{set of events } \rightarrow [0,1]$$

which satisfies the following axioms:

  1. For any event $A$, we have $0 \leq P(A) \leq 1$.
  2. If $\Omega$ is the sample space, $P(\Omega) = 1$.
  3. If $A$ and $B$ are disjoint (i.e., $A \cap B = \emptyset$), then $P(A \cup B) = P(A) + P(B)$.
  4. If $A_1, A_2, \ldots$ is an infinite sequence of pair-wise disjoint events (i.e., $A_i \cap A_j = \emptyset$ when $i \ne j$), then
$$P\left(\bigcup_{i=1}^{\infty}A_i\right) \;=\; \sum_{i=1}^{\infty}P\left(A_i\right), \qquad \text{i.e.,}\quad P(A_1 \cup A_2 \cup A_3 \cup \ldots) = P(A_1) + P(A_2) + P(A_3) + \ldots$$

The Long Term Relative Frequency (LTRF) interpretation of probability is the following:

$$ \begin{array}{llcl} \text{let } & N(A, n) & = & \text{ the number of times } A \text{ occurs in the first } n \text{ trials,} \\ \text{then } & P(A) & = & \lim_{n \rightarrow \infty} \frac{N(A, n)}{n} \end{array} $$

To think about this, consider what $\lim_{n \rightarrow \infty} \frac{N(A, n)}{n}$ is:

$$ \begin{array}{c} \frac{N(A, 1)}{1},\, \frac{N(A, 2)}{2},\,\frac{N(A, 3)}{3},\,\ldots \text{where is this fraction going?} \end{array} $$
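
To see where this fraction is going, here is a minimal simulation sketch (our own illustrative code, not from the notes): we flip a fair coin 10,000 times and watch the running relative frequency of the event $A = \{H\}$ approach $\frac{1}{2}$.

In [ ]:
# a minimal sketch: watch N(A, n)/n settle as n grows, for A = {H} and a fair coin
import random
random.seed(0)               # fix the seed so the run is reproducible
N_A = 0                      # running count N(A, n)
for n in range(1, 10001):
    if random.random() < 0.5:    # this flip landed heads
        N_A += 1
    if n in (10, 100, 1000, 10000):
        print(n, N_A / n)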

Let's look at axioms 1, 2, and 3 above more closely.

Axiom 1 says that for any event $A$, $0 \leq P(A) \leq 1$. Well, clearly $0 \leq \frac{N(A, n)}{n} \leq 1$ for every $n$, so the limit also lies in $[0, 1]$.

Axiom 2 says that if $\Omega$ is the sample space, $P(\Omega) = 1$. This essentially says "something must happen": $N(\Omega, n) = n$ for every $n$, so $\frac{N(\Omega, n)}{n} = \frac{n}{n} = 1$.

Axiom 3 says that if $A$ and $B$ are disjoint (i.e., $A \cap B = \emptyset$), then $N(A \cup B, n) = N(A, n) + N(B, n)$, since $A \cup B$ occurs if either $A$ or $B$ occurs, but it is impossible for both $A$ and $B$ to happen (the intersection is the empty set). Dividing by $n$ and letting $n \to \infty$ gives $P(A \cup B) = P(A) + P(B)$. This extends to infinitely many disjoint events.

Axiom 4 is a bit more controversial, and you can live without it if you only worry about finite sample spaces. Check the last section in the lecture notes, Extension of probability.

Let's do some probability examples.

Example 1: Tossing a fair coin

The sample space and probabilities of this experiment are:

$$\Omega = \{H, T\}, \ \text{and} \ P(\{H\}) = P(\{T\}) = \frac{1}{2} \ .$$

We can represent our probability as the following function:

  • $P : \{ \{H\} , \{T\}, \{H,T\}, \{\} \} \to \{0,\frac{1}{2},1\}$,
  • with $P(\{H\})=\frac{1}{2}$, $P(\{T\})=\frac{1}{2}$, $P(\{H,T\})=1$ and $P(\{\})=0$.
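
Since the sample space is finite we can, just as a sketch, write this probability function out explicitly in Python as a dictionary from events to numbers (frozenset is used because dictionary keys must be immutable; this is our own illustration, not the class used later):

In [ ]:
# the fair-coin probability map written out event by event
P = {frozenset(): 0,
     frozenset({'H'}): 1/2,
     frozenset({'T'}): 1/2,
     frozenset({'H', 'T'}): 1}
P[frozenset({'H'})]   # P({H}) = 1/2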

Notational convenience: the outcomes, which are the elements of the sample space, are often denoted without set brackets; for example, $P(\{H\})$ is denoted by $P(H)$ for brevity.

Check that all our axioms are satisfied:

  1. Yes! Because $0 \le P(H) = P(T) = \frac{1}{2} \le 1$ and $0 \le P(\Omega) = 1 \le 1$.
  2. Yes! Because $P(\Omega) = P(\{H, T\}) = P(\{H\}) + P(\{T\}) = \frac{1}{2} + \frac{1}{2} = 1$.
  3. Yes! Because $P(\{H, T\}) = P(\{H\}) + P(\{T\})$ and $\{H\} \cap \{T\} = \emptyset$.

See this video about tossing a coin by Persi Diaconis of Stanford's Statistics Department.

Example 2: Tossing an unfair coin

$\Omega = \{H, T\}$, $P(H) =\frac{3}{4}$, $P(T) = \frac{1}{4}$. So, in the long run, the coin lands heads 3 out of 4 times and lands tails only 1 out of 4 times.

Check that all our axioms are satisfied:

  1. $0 \le P(H) = \frac{3}{4} \le 1$, $0 \le P(T) = \frac{1}{4} \le 1$ and $0 \le P(\Omega) = 1 \le 1$.
  2. $P(\Omega) = P(\{H, T\}) = P(\{H\}) + P(\{T\}) = \frac{3}{4} + \frac{1}{4} = 1$.
  3. $P(\{H, T\}) = P(\{H\}) + P(\{T\})$.

Yes, all three axioms are satisfied by the probability for this unfair coin experiment too.

Example 3: New Zealand Lotto

In New Zealand Lotto, the balls drawn are numbered 1 to 40. The number on a ball is an outcome.

$\Omega = \{1, 2, \ldots,40\}$, $P(\omega) = \frac{1}{40}$ for each $\omega \in \Omega$ (i.e., $P(1) = P(2) = \ldots = P(40) = \frac{1}{40}$)

Now, consider the event that the first ball is even. What is the probability of this event, $P(\{2, 4, 6, \ldots, 38, 40\})$?

$$ \begin{array}{lcll} P(\{2, 4, \ldots, 38, 40\}) & = & P \left( \{2\} \cup \{4\} \cup \cdots \cup \{38\} \cup \{40\} \right) & \text{(defn. of set union)}\\ & = & P( \{2\}) +P( \{4\}) + \cdots + P(\{38\}) + P( \{40\}) & \text{(extn. Axiom 3)} \\ & = & \sum_{i \in \{2, 4, \ldots, 40\}} P(\{i\}) & \\ & = & 20 \times \frac{1}{40} & \\ & = & \frac{1}{2} & \end{array} $$

Similarly for the probability of an odd ball:

$$ \begin{array}{lcl} P(\{1, 3, 5, \ldots, 37, 39\}) & = & P \left( \{1\} \cup \{3\} \cup \cdots \cup \{37\} \cup \{39\} \right)\\ & = & P(\{1\}) +P( \{3\}) + \cdots + P(\{37\}) + P( \{39\})\\ & = & \sum_{i \in \{1, 3, \ldots, 37, 39\}} P(\{i\}) \\ & = & 20 \times \frac{1}{40} \\ & = & \frac{1}{2} \end{array} $$
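
We can mirror either computation directly in Python as a one-off sketch (the ProbyMap class below automates such sums):

In [ ]:
# P(even ball) = 20 * (1/40) = 1/2, up to floating-point rounding
sum(1/40 for ball in range(2, 41, 2))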

Aside: the set of all possible subsets of $\Omega$ is called the power set and is denoted by $2^{\Omega}$. The power set contains all events of a sample space and is a natural domain for the probability function defined on a sample space with finitely many outcomes. We will see more on this later.

Computer Representations of Mathematical Concepts and Objects

So we have probabilities associated with events satisfying some axioms -- that sounds like a good use for a dictionary? That's basically what we are going to use, but we 'wrap' the dictionary up in some extra code that does things like check that the probabilities add to 1 and that each element in the list of outcomes we have given is unique. If you are interested in programming, we have coded our own type, or class, called ProbyMap (it's given in the next cell, and you can simply evaluate it and ignore all the details completely!). Once the class is coded, we create a ProbyMap by providing the sample space and the probabilities. You don't have to worry about how the class is implemented, but note that you may often want to use the computer to create a computerised representation of a mathematical concept. Once the concept (a discrete probability map in our case) is implemented, we can use the computer to automate the mundane tasks of large-scale computations.

In [ ]:
# create a class for a probability map - if you are new to Python just evaluate and skip this cell
# This was coded by Jenny Harlow

import copy
class ProbyMap(object):                # class definition
    'Probability map class'
    def __init__(self, sspace, probs): # constructor
        self.__probmap = {}  # default probmap is empty
        # make checks on the objects given as sspace and probs
        try:                           
            sspace_set = set(sspace) # check that we can make the sample space into a set
            assert len(sspace_set) == len(sspace) # and not lose any elements
            prob_list = list(probs) # and we can make the probs into a list
            probsum = sum(prob_list) # and we can sum the probs
            assert abs(probsum - 1) < 1e-10 # and the probs sum to 1
            assert len(prob_list) == len(sspace_set) # and there is proby for each event
            
            self.__probmap = dict(zip(list(sspace),prob_list))    # map from sspace to probs
                    
        except TypeError as diag: # if there any problems with types
            init_error = 1
            print (str(diag))
            
        except AssertionError as e:
            init_error = 1
            print ("Check sample space and probabilities")
            
                  
    def P(self, events):
        '''Return the probability of an event.
        
        events is a collection of outcomes from the sample space; the event's
        probability is the sum of the probabilities of these outcomes.'''
        
        retvalue = 0
        try:                              
            events_set = set(events) # check we can make a set out of the events
            assert len(events_set) == len(events) # and not lose any events
            assert events_set <= set(self.__probmap.keys()) # events subset of sample space
        
            for ev in events:     # add each mapped probability to the return value
                retvalue += self.__probmap[ev]
                
        except TypeError as diag:
            print (str(diag))      
        
        except AssertionError:
            print ("Check your events")
        
        return retvalue
        
    def __str__(self):                            # redefine printable string rep
        'Printable representation of the object.'
        num_keys = len(self.__probmap.keys())
        counter = 0
        retval = '{'
        for each_key in self.__probmap:
            counter += 1
            retval += str(each_key)
            retval += ': '
            retval += "%.3f" % self.__probmap[each_key]
            if counter < num_keys:
                retval += ', '
        retval += '}'        
                
        return retval
        
    __repr__ = __str__
        
    def get_probmap(self):                        # get a deep copy of the proby map
        return copy.deepcopy(self.__probmap)      # getter cannot alter object's map
    
    probmap = property(get_probmap)               # allow read access via .probmap
    
    def get_ref_probmap(self):                    # get a reference to the real probmap
        return self.__probmap                     # getter can alter the object's map
        
    
    ref_probmap = property(get_ref_probmap)       # allow access via .ref_probmap
    
    @staticmethod
    def dictExp(big_map, small_map):
        '''Internal helper function for __pow__(...).
        
        Takes two proby map dictionaries and returns one mult by other.'''
        new_bl = {}
        for sle in small_map:
            for ble in big_map:
                new_key = str(ble) + ' ' + str (sle)
                new_bl[new_key] = big_map[ble]*small_map[sle]
        return new_bl
        
    def __pow__(self, x):
        '''probability map exponentiated.'''
        try:
            assert isinstance(x, int)  # only integer powers make sense here
            pmap = copy.deepcopy(self.__probmap) # copy the probability map dictionary
            new_pmap = copy.deepcopy(self.__probmap) # and another copy
            for i in range(x-1):
                new_pmap = self.dictExp(new_pmap, pmap)
            
            return ProbyMap(new_pmap.keys(), new_pmap.values())       
            
        except AssertionError as e:
            print ("cannot raise to non-integer power")
            return None
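
Before moving on, here is a quick sanity check of the class on the fair coin from Example 1 (our own usage sketch; the expected probabilities are $\frac{1}{2}$ and $1$):

In [ ]:
# sanity check: a ProbyMap for the fair coin of Example 1
probMapCoin = ProbyMap(sspace=['H', 'T'], probs=[1/2, 1/2])
print(probMapCoin)                # {H: 0.500, T: 0.500}
print(probMapCoin.P(['H']))       # P({H}) = 0.5
print(probMapCoin.P(['H', 'T']))  # P({H, T}) = 1.0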

Example 4: Experiments, outcomes, sample spaces, events, and the probability of events

Let's go back to the well-mixed fruit bowl experiment. The fruit bowl contains:

  • 2 oranges
  • 3 apples
  • 1 lemon

The experiment is to take one piece of fruit from the bowl and the outcome is the type of fruit we get.

The sample space is $\Omega = \{orange, apple, lemon\}$

We can use the Python list to create this sample space (a list is a bit easier to use than a set, but using a list means that we are responsible for making sure that each element contained in it is unique).

In [ ]:
# sample space is the set of distinct type of fruits in the bowl
samplespace = ['orange', 'apple', 'lemon']

We can also use a list to specify the probability of each outcome in the sample space. The probabilities can be calculated by knowing how many fruits of each kind there are in the bowl. We say that the fruit bowl is 'well-stirred', which essentially means that when we pick a fruit it really is a 'random sample' from the bowl (for example, we have not carefully put all the apples on top so that someone will almost certainly get an apple when they take a fruit). More on this later in the course! Note that the probabilities encoded by a list named probabilities are in the same order as the outcomes in the samplespace list.

In [ ]:
# probabilities take into account the number of each type of fruit in the "well-stirred" fruit bowl
probabilities = [2/6, 3/6, 1/6]
In [ ]:
probMapFruitbowl = ProbyMap(sspace = samplespace, probs=probabilities) # make our probability map
probMapFruitbowl         # disclose our probability map

We can use our probability map to find the probability of a single outcome like this:

In [ ]:
# Find the probability of outcome 'lemon'
probMapFruitbowl.P(['lemon'])

We can also use our probability map to find the probability of an event (set of outcomes).

In [ ]:
# Find the probability of the event {lemon, orange}
probMapFruitbowl.P(['lemon','orange'])

The probability map implemented by ProbyMap is thus essentially a map or dictionary with some additional bells and whistles.
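
To see this concretely, here is the fruit bowl map as a bare dictionary (a sketch of the idea only, not the class itself): an event's probability is just the sum of the mapped values of its outcomes.

In [ ]:
# the fruit bowl probability map as a bare dict over outcomes
probmap = {'orange': 2/6, 'apple': 3/6, 'lemon': 1/6}
sum(probmap[outcome] for outcome in ['lemon', 'orange'])   # P({lemon, orange})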

Next we will obtain the set of all events (the largest $\sigma$-algebra or $\sigma$-field in math lingo) from the outcomes in our sample space via the powerset function defined below, and find the probability of each event using our ProbyMap in a for loop.

In [ ]:
def powerset(samplespace):
    """Return a list of all subsets of the given collection.

    powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
    """
    from itertools import chain, combinations
    s = list(samplespace)
    return list(chain.from_iterable(combinations(s, r) for r in range(len(s)+1)))
In [ ]:
# make the set of all possible events from the set of outcomes 

setOfAllEvents = powerset(samplespace)  # powerset(A) returns the list of all subsets of A
setOfAllEvents   # disclose the set of all events
In [ ]:
# loop through the set of all events and print the computed probability
for event in setOfAllEvents:
    print ("P(", event, ") = ", probMapFruitbowl.P(event))

Example 5: Experiments with the English language

In English-language text there are 26 letters in the alphabet. The relative frequencies with which each letter appears are tabulated below:

E 13.0% H 3.5% W 1.6%
T 9.3% L 3.5% V 1.3%
N 7.8% C 3.0% B 0.9%
R 7.7% F 2.8% X 0.5%
O 7.4% P 2.7% K 0.3%
I 7.4% U 2.7% Q 0.3%
A 7.3% M 2.5% J 0.2%
S 6.3% Y 1.9% Z 0.1%
D 4.4% G 1.6%    

Using these relative frequencies as probabilities we can create a probability map for the letters in the English alphabet. We start by defining the sample space and the probabilities.

In [ ]:
alphaspace = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q',
              'R','S','T','U','V','W','X','Y','Z']
alphaRelFreqs = [73/1000,9/1000,30/1000,44/1000,130/1000,28/1000,16/1000,35/1000,74/1000,
                 2/1000,3/1000,35/1000, 25/1000,78/1000,74/1000,27/1000,3/1000,77/1000,63/1000,
                 93/1000,27/1000,13/1000,16/1000,5/1000,19/1000,1/1000]

Then we create the probability map, represented by a ProbyMap object.

In [ ]:
probMapLetters = ProbyMap(sspace = alphaspace, probs=alphaRelFreqs) # make our probability map
probMapLetters         # disclose our probability map

Please do NOT try to list the set of all subsets (events) of the 26 alphabet set (sample space): there are over 67 million events and the computer will probably crash by running out of memory! You can see how large a number we are talking about by evaluating the next cell which calculates $2^{26}$ for you.

In [ ]:
2**26

Instead of asking for the probability of each event (over 67 million of them to exhaustively march through!) we define some events of interest, say the vowels in the alphabet or the set of letters that make up a name.

In [ ]:
vowels = ['A', 'E', 'I', 'O', 'U']

And we can get the probability that a letter drawn from a 'well-stirred' jumble of English letters is a vowel.

In [ ]:
probMapLetters.P(vowels)

We can make ourselves another set of letters and find probabilities for that too. In the cell below, we go straight from a string to a set. The reason that we can do this is that a string or str is in fact another collection; str is a sequence type just like list.

In [ ]:
Cat=set("CAT") # make a set from a string
Cat            # disclose the set Cat you have built
In [ ]:
probMapLetters.P(Cat)

Try either adapting what we have above, or doing the same thing in some of the cells below, to find the probabilities of other sets of letters yourself. For example, what is the probability of your name in the above sense?

In [ ]:
 
In [ ]:
 

The crucial point of the above exercise, with our own implementation of a Python class for probability maps, is to show how computers and mathematics can go hand in hand to quickly generalize operations over specific instances of a large class of mathematical notions (Examples 4 and 5, for probability models on finite sample spaces, in our case).

Counting words in a text file

Now we will load a file and count its words; this will allow us to estimate probabilities, etc.

SMS spam data

In the following problem we will explore SMS spam texts. The dataset is the SMS Spam Collection Dataset and we have provided for you a way to load the data.

The SMS data is stored in the file data/spam.csv; let's have a look at the head of that file. Recall the BASH command head, which lists the first lines of a file. In this case we passed the option -n 4 to tell it to show only 4 lines.

In [1]:
%%sh 
head -n 4 data/spam.csv
v1,v2,,,
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",,,
ham,Ok lar... Joking wif u oni...,,,
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,,,

From the above we can see that it is a so-called "comma separated values" or "csv" file. There are many ways of loading such a file, but the most pythonic way is to use the csv package, although it is a bit slow. Later you will probably use the Pandas package to load tabular data, but it is too black-box for us at this stage. Below we provide a function to load this data. A short breakdown is perhaps necessary:

  • The first row is the header; this puts names to the values.
  • Each row has five columns, but only the first two have names.
  • The first column, called v1, is the label for the SMS text, which can be found in column v2.
In [2]:
def load_sms():
    """
    A wrapper function to load the sms data
    """
    import csv
    lines = []
    hamspam = {'ham': 0, 'spam': 1}
    with open('data/spam.csv', mode='r', encoding='latin-1') as f:
        reader = csv.reader(f)
        # When using the csv reader, each time you use the function
        # next on it, it will spit out a list split at the ','
        header = next(reader)
        # We store this as ("txt",label), where we have used the function
        # hamspam to convert from "ham","spam" to 0 and 1.
        lines = [(line[1],hamspam[line[0]]) for line in reader]
        
    return lines
In [3]:
sms_data = load_sms()
sms_data[:2]
Out[3]:
[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  0),
 ('Ok lar... Joking wif u oni...', 0)]

We will now do something interesting with this data.

$$ \Omega = \{\text{All possible sms texts}\} $$

For each outcome $\omega$ (the text) we assign the value $0$ if it is not spam and $1$ if it is spam; we can represent this with a function $X(\omega)$. In our dataset, $\omega$ is just the first value in the tuple and $X(\omega)$ is the second value.

So in the case when $\omega = $ 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...' we have $X(\omega) = 0$.
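
In code, each entry of sms_data is exactly such a pair $(\omega, X(\omega))$:

In [ ]:
# each entry of sms_data is a pair (omega, X(omega))
omega, X_of_omega = sms_data[0]
X_of_omega    # 0, i.e. the first text is not spam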

$X$ is actually a coming attraction, namely the concept of a random variable, which assigns values to outcomes. We can also define $A$ as the event "the text is spam"; thus $\omega \in A$ if and only if $X(\omega) = 1$.

We can ignore what $\omega$ actually is and rewrite the problem so that we only look at the values of $X$. In this case

$$ \Omega = \{0,1\} $$
In [ ]:
samplespace_sms_X = [0,1]

How do we actually get the probabilities? Well, since we have only two outcomes, we can use the LTRF idea from above.

In [4]:
# Number of spam texts divided by the total number of texts
N_spam_n = sum([sms[1] for sms in sms_data]) / len(sms_data)
N_spam_n
Out[4]:
0.13406317300789664
In [ ]:
relFreqs_sms_X = [1-N_spam_n, N_spam_n]
sms_X_probmap = ProbyMap(sspace = samplespace_sms_X, probs=relFreqs_sms_X) # make our probability map
In [ ]:
sms_X_probmap

So you see, the spam/ham problem takes the same form as the biased coin flip.

Counting words

Let's check what happens when we count the frequency of a word in the spam dataset. Let's choose the word free. What we want is a function that takes the SMS text as input and outputs $0$ if the word does not appear and $1$ if it does.

In [6]:
sms_data[:2]
Out[6]:
[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  0),
 ('Ok lar... Joking wif u oni...', 0)]
In [7]:
def free_in_txt(txt):
    # substring match: txt = "freestyle" and txt = "free films" both return 1;
    # the match is case-sensitive, so txt = "Free" returns 0
    if 'free' in txt:
        return 1
    return 0
In [8]:
free_in_txt("Hello there")
Out[8]:
0
In [10]:
free_in_txt("Hello there I am free")
Out[10]:
1
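
As the comments in free_in_txt note, the test is a case-sensitive substring match: 'freestyle' counts, but 'Free' does not. If you want to also catch capitalised variants, one possible fix (sketched below with our own illustrative name free_in_txt_lower; it is not used in the counts that follow) is to lower-case the text first.

In [ ]:
# an illustrative case-insensitive variant of free_in_txt
def free_in_txt_lower(txt):
    return 1 if 'free' in txt.lower() else 0

free_in_txt_lower("Free entry in 2 a wkly comp")   # now returns 1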

$Y(\omega) = 1$ if free appears in the text and $0$ if it does not. Again, we can define $B$ as the event "the word free appears in the text"; thus $\omega \in B$ if and only if $Y(\omega) = 1$.

In [12]:
sms_free_in_word = [free_in_txt(txt) for txt,label in sms_data]
In [17]:
sum(sms_free_in_word)
Out[17]:
122
In [19]:
#samplespace_sms_Y = samplespace_sms_X
# Relative frequency of texts containing the word free
N_free_n = sum(sms_free_in_word) / len(sms_free_in_word)
N_free_n
Out[19]:
0.02189519023689878
In [ ]:
## YouTry
# Create a probability map with the sample space and probabilities
# for the above example.

Conditional probability

Now that we have the two events $A$ and $B$ or the two random variables $X$ and $Y$ we can try to compute the following quantity

$$ P(A \mid B) := \frac{P(A \cap B)}{P(B)} $$

or equivalently

$$ P(X = 1 \mid Y = 1) := \frac{P(X = 1 \text{ and } Y = 1)}{P(Y = 1)}. $$

What does this mean? It is the probability that our text is spam given that we know it contains the word "free".

$P(A \cap B)$ is the probability that the events $A$ and $B$ happen at the same time, i.e. $X = Y = 1$. We can compute this easily:

In [20]:
sms_data[:2]
Out[20]:
[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  0),
 ('Ok lar... Joking wif u oni...', 0)]
In [21]:
X = [label for txt,label in sms_data]
Y = sms_free_in_word
X_and_Y = [x*y for x,y in zip(X,Y)]
N_X_and_Y = sum(X_and_Y)/len(X_and_Y)
N_X_and_Y
Out[21]:
0.01094759511844939
In [22]:
# So let's just divide this by N_free_n from above and we get
N_X_and_Y/N_free_n
Out[22]:
0.5

So the probability jumped from about $0.134$ up to $1/2$: if we see the word free in the SMS, we are much more certain that it is spam.

Another way to do the same thing is to take our original data, remove all texts that do not contain the word free, and compute the relative frequency of spam in the remaining data:

In [ ]:
free_only_sms = [(txt, label) for txt,label in sms_data if free_in_txt(txt)]
sum([sms[1] for sms in free_only_sms])/len(free_only_sms)

Dependence and independence

From the above computation, would you say that the events $A$ and $B$ are dependent or independent?
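
One way to check numerically, as a sketch using the relative-frequency estimates computed above, is to compare the estimate of $P(A \cap B)$ with the product $P(A)\,P(B)$: for independent events the two would agree.

In [ ]:
# if A and B were independent we would have P(A and B) = P(A) * P(B);
# a large discrepancy between the two estimates is evidence of dependence
print(N_X_and_Y)             # estimate of P(A and B)
print(N_spam_n * N_free_n)   # product of the marginal estimates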