Introduction to Data Science

1MS041, 2023

©2023 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

Second notebook on Random variables

Let's take another look at the SMS spam dataset and revisit our concepts there.

In [1]:
from Utils import load_sms
sms_data = load_sms()
sms_data[:2]
Out[1]:
[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  0),
 ('Ok lar... Joking wif u oni...', 0)]

Let $X$ represent each SMS text (an entry in the list), and let $Y$ represent whether the text is spam or not, i.e. $Y \in \{0,1\}$. Thus $\mathbb{P}(Y = 1)$ is the probability that we get a spam. The goal is to estimate: $$ \mathbb{P}(Y = 1 \mid \text{"free" or "prize" is in } X) \enspace . $$ That is, the probability that the SMS is spam given that "free" or "prize" occurs in the SMS. Hint: it is good to normalize the case of the words so that we also find "Free" and "Prize"; this can be done with text.lower() if text is a string.

To do this we can create a new random variable $Z$ which is $1$ if "free" or "prize" appears in $X$ and $0$ otherwise.

In [3]:
interesting_words=set(['free','prize'])
TF10 = {True: 1, False: 0}
Z_obs = [TF10[not interesting_words.isdisjoint([word.lower() for word in line[0].split(' ')])] for line in sms_data]
In [4]:
Z_obs[:10]
Out[4]:
[0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
In [5]:
Y_obs = [y for x,y in sms_data]
Y_obs[:10]
Out[5]:
[0, 0, 1, 0, 0, 1, 0, 0, 1, 1]
In [6]:
import numpy as np
def F_X_12(x):
    TF10 = {True: 1, False: 0}
    return np.mean([TF10[(y <= x[0]) and (z <= x[1])] for y,z in zip (Y_obs,Z_obs)])
In [10]:
F_X_12([1,0])
Out[10]:
0.9551328068916008

This is the joint distribution function (JDF) $F_{Y,Z}(y,z) = \mathbb{P}(Y \le y, Z \le z)$ for this problem, which we can tabulate:

In [14]:
print("\tz <= 0 \t\tz <= 1")
for x1 in range(0,2):
    print("y <= %d \t" % x1,end='')
    for x2 in range(0,2):
        print("%.2f" % (F_X_12((x1,x2))),end='\t\t')
    print('\n')
	z <= 0 		z <= 1
y <= 0 	0.86		0.87		

y <= 1 	0.96		1.00		

In [15]:
F_X_12((0,0)) == F_X_12((0,1))*F_X_12((1,0))
Out[15]:
False
In [16]:
F_X_12((0,1))*F_X_12((1,0))
Out[16]:
0.8270846721557743
In [ ]:
# Are they independent? If so, the JDF would be the product of the
# marginal DFs for Y and Z, but
0.865936826992103*0.955132806891601

The product, $\approx 0.827$, is not equal to $F_{Y,Z}(0,0) \approx 0.858$, so $Y$ and $Z$ are not independent. So let's try to estimate the probability that $Y=1$ given that $Z = 1$, again by filtering:

In [ ]:
np.mean([y for z,y in zip(Z_obs,Y_obs) if z == 1])

Compare that with the marginal probability of $Y = 1$, which according to our JDF is $1 - 0.866 = 0.134$.

In [ ]:
# Or we can just compute it directly
np.mean(Y_obs)

What we see from this is that knowing that the words "free" or "prize" appeared in the SMS text makes us much more certain that it is spam. We also see that this is hard to read off directly from the JDF, although the information it carries is equivalent.
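To see that equivalence concretely, here is a minimal sketch (with hypothetical toy observations standing in for Y_obs and Z_obs) showing that the conditional probability obtained by filtering agrees with the one read off the JDF via inclusion-exclusion:

```python
import numpy as np

# Hypothetical toy observations standing in for Y_obs and Z_obs
Y = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
Z = np.array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0])

def F(y, z):
    """Empirical joint distribution function P(Y <= y, Z <= z)."""
    return np.mean((Y <= y) & (Z <= z))

# Conditional probability P(Y = 1 | Z = 1) by filtering
p_filter = np.mean(Y[Z == 1])

# The same quantity from the JDF via inclusion-exclusion:
# P(Y = 1, Z = 1) = 1 - F(0,1) - F(1,0) + F(0,0),  P(Z = 1) = 1 - F(1,0)
p_jdf = (1 - F(0, 1) - F(1, 0) + F(0, 0)) / (1 - F(1, 0))

print(p_filter, p_jdf)  # identical values
```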

In [17]:
x = np.random.normal(size=100)
In [18]:
np.mean(x)
Out[18]:
-0.08265975892373005
In [19]:
g = lambda x: x**2
In [23]:
mean = np.mean(x)
y = x - mean
np.mean(y**4)  # fourth central moment of the sample
Out[23]:
3.588674726116084
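The cell above computes the fourth central moment of the sample. Dividing it by the squared variance gives the (non-excess) kurtosis, which for a normal sample is close to 3; a quick sketch, assuming scipy is available:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

# Fourth central moment divided by the squared variance = kurtosis
mu = np.mean(x)
m4 = np.mean((x - mu) ** 4)
k_manual = m4 / np.var(x) ** 2

k_scipy = kurtosis(x, fisher=False)  # non-excess kurtosis
print(k_manual, k_scipy)             # both close to 3 for a normal sample
```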

Moments etc

In [ ]:
import numpy as np
x = np.random.normal(size=100)
In [ ]:
x

Sample mean

In [ ]:
np.mean(x)

Sample variance

In [ ]:
np.var(x)

Or by doing it yourself

In [ ]:
mu = np.mean(x)
np.mean(np.power(x-mu,2))

For higher moments we can use scipy:

In [ ]:
from scipy.stats import skew, kurtosis
In [ ]:
skew(x)
In [ ]:
kurtosis(x,fisher=False)

Moments and tail behavior

In [25]:
def standardize(data):
    mean = np.mean(data)
    std = np.sqrt(np.var(data))
    return (data-mean)/std
In [26]:
import numpy as np
chi2 = np.random.chisquare(4,size=10000)
normal = np.random.normal(size=10000)
import matplotlib.pyplot as plt
_=plt.hist(standardize(chi2),bins=50,alpha=0.5)
_=plt.hist(standardize(normal),bins=50,alpha=0.5)
plt.xlim(-3,5)
Out[26]:
(-3.0, 5.0)
In [27]:
from scipy.stats import skew, kurtosis
def print_basic_stats(data):
    print("mean: %.2f\tstd: %.2f\tskew: %.2f\tkurtosis: %.2f" % (np.mean(data),np.std(data),skew(data),kurtosis(data,fisher=False)))
In [29]:
print_basic_stats(standardize(normal))
mean: -0.00	std: 1.00	skew: -0.01	kurtosis: 2.97
In [30]:
print_basic_stats(standardize(chi2))
mean: 0.00	std: 1.00	skew: 1.40	kurtosis: 5.86
In [ ]:
print_basic_stats(standardize(np.sqrt(chi2)))
In [ ]:
np.mean(np.power(standardize(chi2),3)) # Skewness
In [ ]:
np.mean(np.power(standardize(chi2),4)) # kurtosis

Transformations of random variables

Consider a Binomial random variable

In [ ]:
n = 10
p = 0.5
x = np.random.binomial(n,p,size=1000)

Let's plot the empirical mass function:

In [ ]:
from Utils import makeEMF,makeEDF,plotEDF,plotEMF
plotEMF(makeEMF(x))

Now consider the transformed random variable $g(X)$ with $g(x) = \sin(x)$:

In [ ]:
plotEMF(makeEMF(np.sin(x)))
In [ ]:
plotEDF(makeEDF(np.sin(x)))

Can we compute this thing? What is $\sin^{[-1]}$?

Since $X$ is discrete with $\mathbb{X}=\{0,1,\ldots,10\}$, we can compute the support $\mathbb{Y} = \sin(\mathbb{X})$ directly.

In [ ]:
Y_space = np.sort(np.sin(np.arange(0,11)))
sin_inv = dict(zip(np.sin(np.arange(0,11)),np.arange(0,11)))
In [ ]:
from scipy.special import binom as binomial
plotEMF([(y,binomial(n,sin_inv[y])*(p**sin_inv[y])*((1-p)**(n-sin_inv[y]))) for y in Y_space])
In [ ]:
from Utils import emfToEdf  # assuming Utils also provides emfToEdf
plotEDF(emfToEdf([(y,binomial(n,sin_inv[y])*(p**sin_inv[y])*((1-p)**(n-sin_inv[y]))) for y in Y_space]))
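As a sanity check on this change of variables: $\sin$ is injective on $\{0,1,\ldots,10\}$, so each $y$ in the image has exactly one preimage and $\mathbb{P}(Y = \sin(x)) = \mathbb{P}(X = x)$; in particular the transformed PMF still sums to 1. A minimal self-contained sketch comparing the exact transformed PMF against a simulated estimate:

```python
import numpy as np
from scipy.special import binom as binomial

n, p = 10, 0.5
xs = np.arange(0, n + 1)
pmf_X = binomial(n, xs) * p**xs * (1 - p) ** (n - xs)

# sin takes distinct values on the integers 0..10, so the map is invertible
# on the support and P(Y = sin(x)) = P(X = x)
pmf_Y = dict(zip(np.sin(xs), pmf_X))
print(sum(pmf_Y.values()))  # the transformed PMF still sums to 1

# Compare with an empirical estimate from simulated data
rng = np.random.default_rng(1)
sample = np.sin(rng.binomial(n, p, size=100_000))
emp = np.mean(np.isclose(sample, np.sin(5)))
print(emp, pmf_Y[np.sin(5)])  # both close to binomial(10,5) * 0.5**10
```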