Introduction to Data Science

1MS041, 2023

©2023 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

Second notebook on Random variables

Let's take another look at the SMS spam dataset and revisit our concepts there.

In [1]:
from Utils import load_sms
sms_data = load_sms()
sms_data[:2]
Out[1]:
[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  0),
 ('Ok lar... Joking wif u oni...', 0)]

Let $X$ represent each SMS text (an entry in the list), and let $Y$ represent whether the text is spam or not, i.e. $Y \in \{0,1\}$. Thus $\mathbb{P}(Y = 1)$ is the probability that we get a spam. The goal is to estimate: $$ \mathbb{P}(Y = 1 \mid \text{"free" or "prize" is in } X) \enspace . $$ That is, the probability that the SMS is spam given that "free" or "prize" occurs in the SMS. Hint: it is good to normalize the case of the words so that we also find "Free" and "Prize"; this can be done with text.lower() if text is a string.

To do this we can create a new random variable $Z$ which is $1$ if "free" or "prize" appears in $X$ and $0$ otherwise.

In [3]:
interesting_words=set(['free','prize'])
TF10 = {True: 1, False: 0}
Z_obs = [TF10[not interesting_words.isdisjoint([word.lower() for word in line[0].split(' ')])] for line in sms_data]
In [4]:
Z_obs[:10]
Out[4]:
[0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
In [5]:
Y_obs = [y for x,y in sms_data]
Y_obs[:10]
Out[5]:
[0, 0, 1, 0, 0, 1, 0, 0, 1, 1]
In [6]:
import numpy as np
def F_X_12(x):
    TF10 = {True: 1, False: 0}
    return np.mean([TF10[(y <= x[0]) and (z <= x[1])] for y,z in zip (Y_obs,Z_obs)])
In [10]:
F_X_12([1,0])
Out[10]:
0.9551328068916008

This is the joint distribution function (JDF) $F_{Y,Z}(y,z) = \mathbb{P}(Y \le y, Z \le z)$ for this problem, which we can tabulate:

In [14]:
print("\tz <= 0 \t\tz <= 1")
for x1 in range(0,2):
    print("y <= %d \t" % x1,end='')
    for x2 in range(0,2):
        print("%.2f" % (F_X_12((x1,x2))),end='\t\t')
    print('\n')
	z <= 0 		z <= 1
y <= 0 	0.86		0.87		

y <= 1 	0.96		1.00		

In [15]:
F_X_12((0,0)) == F_X_12((0,1))*F_X_12((1,0))
Out[15]:
False
In [16]:
F_X_12((0,1))*F_X_12((1,0))
Out[16]:
0.8270846721557743
In [ ]:
# Are they independent? If so, the JDF would be the product of the
# marginal DFs for Y and Z, but
0.865936826992103*0.955132806891601

The product, $\approx 0.827$, is not equal to $F_{Y,Z}(0,0) \approx 0.858$, so $Y$ and $Z$ are not independent. So let's try to estimate the probability that $Y=1$ given that $Z = 1$, again by filtering:

In [ ]:
np.mean([y for z,y in zip(Z_obs,Y_obs) if z == 1])

Compare that with the marginal probability of $Y = 1$, which according to our JDF is $1 - 0.866 = 0.134$.

In [ ]:
# Or we can just compute it directly
np.mean(Y_obs)

What we see from this is that knowing that the words "free" or "prize" appeared in the SMS text makes us much more certain that it is spam. We also see that this is hard to read off directly from the JDF, although the information it carries is equivalent.
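To see that equivalence concretely, here is a minimal sketch (with hypothetical toy observations standing in for Y_obs and Z_obs) showing that the conditional probability obtained by filtering agrees with the one read off the JDF via inclusion-exclusion:

```python
import numpy as np

# Hypothetical toy observations standing in for Y_obs and Z_obs
Y = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
Z = np.array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0])

def F(y, z):
    """Empirical joint distribution function P(Y <= y, Z <= z)."""
    return np.mean((Y <= y) & (Z <= z))

# Conditional probability P(Y = 1 | Z = 1) by filtering
p_filter = np.mean(Y[Z == 1])

# The same quantity from the JDF via inclusion-exclusion:
# P(Y = 1, Z = 1) = 1 - F(0,1) - F(1,0) + F(0,0),  P(Z = 1) = 1 - F(1,0)
p_jdf = (1 - F(0, 1) - F(1, 0) + F(0, 0)) / (1 - F(1, 0))

print(p_filter, p_jdf)  # identical values
```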

In [17]:
x = np.random.normal(size=100)
In [18]:
np.mean(x)
Out[18]:
-0.08265975892373005
In [19]:
g = lambda x: x**2
In [23]:
mean = np.mean(x)
y = x - mean
np.mean(y**4)  # fourth central moment of the sample
Out[23]:
3.588674726116084
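The cell above computes the fourth central moment of the sample. Dividing it by the squared variance gives the (non-excess) kurtosis, which for a normal sample is close to 3; a quick sketch, assuming scipy is available:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

# Fourth central moment divided by the squared variance = kurtosis
mu = np.mean(x)
m4 = np.mean((x - mu) ** 4)
k_manual = m4 / np.var(x) ** 2

k_scipy = kurtosis(x, fisher=False)  # non-excess kurtosis
print(k_manual, k_scipy)             # both close to 3 for a normal sample
```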

Moments etc

In [ ]:
import numpy as np
x = np.random.normal(size=100)
In [ ]:
x

Sample mean

In [ ]:
np.mean(x)

Sample variance

In [ ]:
np.var(x)

Or by doing it yourself

In [ ]:
mu = np.mean(x)
np.mean(np.power(x-mu,2))

For higher moments we can use scipy:

In [ ]:
from scipy.stats import skew, kurtosis
In [ ]:
skew(x)
In [ ]:
kurtosis(x,fisher=False)

Moments and tail behavior

In [25]:
def standardize(data):
    mean = np.mean(data)
    std = np.sqrt(np.var(data))
    return (data-mean)/std
In [26]:
import numpy as np
chi2 = np.random.chisquare(4,size=10000)
normal = np.random.normal(size=10000)
import matplotlib.pyplot as plt
_=plt.hist(standardize(chi2),bins=50,alpha=0.5)
_=plt.hist(standardize(normal),bins=50,alpha=0.5)
plt.xlim(-3,5)
Out[26]:
(-3.0, 5.0)
In [27]:
from scipy.stats import skew, kurtosis
def print_basic_stats(data):
    print("mean: %.2f\tstd: %.2f\tskew: %.2f\tkurtosis: %.2f" % (np.mean(data),np.std(data),skew(data),kurtosis(data,fisher=False)))
In [29]:
print_basic_stats(standardize(normal))
mean: -0.00	std: 1.00	skew: -0.01	kurtosis: 2.97
In [30]:
print_basic_stats(standardize(chi2))
mean: 0.00	std: 1.00	skew: 1.40	kurtosis: 5.86
In [ ]:
print_basic_stats(standardize(np.sqrt(chi2)))
In [ ]:
np.mean(np.power(standardize(chi2),3)) # Skewness
In [ ]:
np.mean(np.power(standardize(chi2),4)) # kurtosis

Transformations of random variables

Consider a Binomial random variable

In [ ]:
n = 10
p = 0.5
x = np.random.binomial(n,p,size=1000)

Let's plot the empirical mass function:

In [ ]:
from Utils import makeEMF,makeEDF,plotEDF,plotEMF
plotEMF(makeEMF(x))

Now consider the transformed random variable $g(X)$ with $g(x) = \sin(x)$:

In [ ]:
plotEMF(makeEMF(np.sin(x)))
In [ ]:
plotEDF(makeEDF(np.sin(x)))

Can we compute this thing? What is $\sin^{[-1]}$?

Since $X$ is discrete with $\mathbb{X}=\{0,1,\ldots,10\}$, we can compute the support $\mathbb{Y} = \sin(\mathbb{X})$ directly.

In [ ]:
Y_space = np.sort(np.sin(np.arange(0,11)))
sin_inv = dict(zip(np.sin(np.arange(0,11)),np.arange(0,11)))
In [ ]:
from scipy.special import binom as binomial
plotEMF([(y,binomial(n,sin_inv[y])*(p**sin_inv[y])*((1-p)**(n-sin_inv[y]))) for y in Y_space])
In [ ]:
from Utils import emfToEdf  # assuming Utils also provides emfToEdf
plotEDF(emfToEdf([(y,binomial(n,sin_inv[y])*(p**sin_inv[y])*((1-p)**(n-sin_inv[y]))) for y in Y_space]))
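As a sanity check on this change of variables: $\sin$ is injective on $\{0,1,\ldots,10\}$, so each $y$ in the image has exactly one preimage and $\mathbb{P}(Y = \sin(x)) = \mathbb{P}(X = x)$; in particular the transformed PMF still sums to 1. A minimal self-contained sketch comparing the exact transformed PMF against a simulated estimate:

```python
import numpy as np
from scipy.special import binom as binomial

n, p = 10, 0.5
xs = np.arange(0, n + 1)
pmf_X = binomial(n, xs) * p**xs * (1 - p) ** (n - xs)

# sin takes distinct values on the integers 0..10, so the map is invertible
# on the support and P(Y = sin(x)) = P(X = x)
pmf_Y = dict(zip(np.sin(xs), pmf_X))
print(sum(pmf_Y.values()))  # the transformed PMF still sums to 1

# Compare with an empirical estimate from simulated data
rng = np.random.default_rng(1)
sample = np.sin(rng.binomial(n, p, size=100_000))
emp = np.mean(np.isclose(sample, np.sin(5)))
print(emp, pmf_Y[np.sin(5)])  # both close to binomial(10,5) * 0.5**10
```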