©2023 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)
Let's take another look at the spam dataset and apply our concepts there.
from Utils import load_sms
sms_data = load_sms()
sms_data[:2]
Let $X$ represent each SMS text (an entry in the list), and let $Y$ represent whether the text is spam or not, i.e. $Y \in \{0,1\}$. Thus $\mathbb{P}(Y = 1)$ is the probability that we get a spam message. The goal is to estimate:
$$
\mathbb{P}(Y = 1 | \text{"free" or "prize" is in } X) \enspace .
$$
That is, the probability that the SMS is spam given that "free" or "prize" occurs in the SMS.
Hint: it is good to remove the upper/lower case of words so that we can also find "Free" and "Prize"; this can be done with text.lower() if text is a string.
To do this we can create a new random variable $Z$ which is $1$ if "free" or "prize" appears in $X$, and $0$ otherwise.
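For reference, the target can be rewritten in terms of $Z$. The following is a sketch of the definition of the conditional probability together with its plug-in estimate from observations $(y_i, z_i)$, $i = 1, \ldots, n$ (this index notation is introduced here only for illustration):

$$
\mathbb{P}(Y = 1 \mid Z = 1) = \frac{\mathbb{P}(Y = 1, Z = 1)}{\mathbb{P}(Z = 1)} \approx \frac{\sum_{i=1}^{n} y_i z_i}{\sum_{i=1}^{n} z_i} \enspace .
$$

The right-hand side is the fraction of spam among the messages with $z_i = 1$, which is what filtering on $Z = 1$ and averaging $Y$ computes.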
interesting_words=set(['free','prize'])
TF10 = {True: 1, False: 0}
Z_obs = [TF10[not interesting_words.isdisjoint([word.lower() for word in line[0].split(' ')])] for line in sms_data]
Z_obs[:10]
Y_obs = [y for x,y in sms_data]
Y_obs[:10]
import numpy as np
def F_X_12(x):
    # Empirical joint distribution function: fraction of samples with y <= x[0] and z <= x[1]
    TF10 = {True: 1, False: 0}
    return np.mean([TF10[(y <= x[0]) and (z <= x[1])] for y,z in zip(Y_obs,Z_obs)])
F_X_12([1,0])
This is the joint distribution function (JDF) of $(Y, Z)$ for this problem:
print("\tz <= 0 \t\tz <= 1")
for x1 in range(0,2):
    print("y <= %d \t" % x1,end='')
    for x2 in range(0,2):
        print("%.2f" % (F_X_12((x1,x2))),end='\t\t')
    print('\n')
F_X_12((1,0))
F_X_12((0,0)) == F_X_12((0,1))*F_X_12((1,0))
F_X_12((0,1))*F_X_12((1,0))
# Are they independent? If so, then the JDF is just the product of the
# DFs for Y and Z, but
0.865936826992103*0.955132806891601
This product is about 0.827, which is not 0.858, so $Y$ and $Z$ are not independent. So let's try to estimate the probability that $Y = 1$ given that $Z = 1$. Let's again do that by filtering:
np.mean([y for z,y in zip(Z_obs,Y_obs) if z == 1])
Compare that with the marginal probability of $Y = 1$, which according to our JDF is $1 - 0.866 = 0.134$.
# Or we can just compute it directly
np.mean(Y_obs)
What we see from this is that, knowing that the word "free" or "prize" appeared in the SMS text, we are much more certain that it is spam. We also see that this can be hard to read off directly from the JDF, although the two views are equivalent.
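To make the equivalence concrete, here is a minimal sketch on a small made-up sample. The lists `y_toy` and `z_toy` and the helper `jdf` are illustrative names, not part of the SMS data. It recovers $\mathbb{P}(Y = 1 \mid Z = 1)$ from the four JDF values by inclusion-exclusion and compares the result with direct filtering.

```python
import numpy as np

# Hypothetical toy observations (NOT the SMS data):
# y = spam label, z = keyword indicator
y_toy = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
z_toy = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0]

def jdf(y, z):
    """Empirical joint distribution function P(Y <= y, Z <= z)."""
    return np.mean([(yi <= y) and (zi <= z) for yi, zi in zip(y_toy, z_toy)])

# P(Y = 1, Z = 1) by inclusion-exclusion on the JDF
p_y1_z1 = jdf(1, 1) - jdf(0, 1) - jdf(1, 0) + jdf(0, 0)
# P(Z = 1) = 1 - P(Z <= 0) = 1 - F(1, 0)
p_z1 = 1 - jdf(1, 0)
cond_from_jdf = p_y1_z1 / p_z1

# The same quantity obtained by filtering on z == 1
cond_by_filter = np.mean([yi for yi, zi in zip(y_toy, z_toy) if zi == 1])

print(cond_from_jdf, cond_by_filter)
```

Both numbers agree up to floating-point rounding. The JDF route needs four evaluations and a subtraction, which is why the dependence is harder to see there than in the filtered mean.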
x = np.random.normal(size=100)
np.mean(x)
g = lambda x: x**2
mean = np.mean(x)
y = x-mean
np.mean(y**4)
import numpy as np
x = np.random.normal(size=100)
x
np.mean(x)
np.var(x)
Or we can compute the variance ourselves:
mu = np.mean(x)
np.mean(np.power(x-mu,2))
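One detail worth checking is which normalisation `np.var` uses. The sketch below uses an arbitrary seed and a variable name `sample` chosen only for this illustration. It shows that the default (`ddof=0`) divides by $n$ and therefore matches the manual formula, while `ddof=1` divides by $n - 1$.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
sample = rng.normal(size=100)
n_obs = len(sample)

var_pop = np.var(sample)               # divides by n (ddof=0)
var_unbiased = np.var(sample, ddof=1)  # divides by n - 1

# Manual computation: mean of squared deviations from the sample mean
manual = np.mean((sample - np.mean(sample)) ** 2)

print(var_pop, manual, var_unbiased)
```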
For higher moments, we can use SciPy:
from scipy.stats import skew, kurtosis
skew(x)
kurtosis(x,fisher=False)
def standardize(data):
    # Shift to mean 0 and scale to (population) standard deviation 1
    mean = np.mean(data)
    std = np.sqrt(np.var(data))
    return (data-mean)/std
import numpy as np
chi2 = np.random.chisquare(4,size=10000)
normal = np.random.normal(size=10000)
import matplotlib.pyplot as plt
_=plt.hist(standardize(chi2),bins=50,alpha=0.5)
_=plt.hist(standardize(normal),bins=50,alpha=0.5)
plt.xlim(-3,5)
from scipy.stats import skew, kurtosis
def print_basic_stats(data):
    print("mean: %.2f\tstd: %.2f\tskew: %.2f\tkurtosis: %.2f" % (np.mean(data),np.std(data),skew(data),kurtosis(data,fisher=False)))
print_basic_stats(standardize(normal))
print_basic_stats(standardize(chi2))
print_basic_stats(standardize(np.sqrt(chi2)))
np.mean(np.power(standardize(chi2),3)) # Skewness
np.mean(np.power(standardize(chi2),4)) # kurtosis
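The two lines above compute skewness and kurtosis as the third and fourth moments of the standardized data. As a sanity check, the following sketch (with its own seeded sample, named `sample` only for this example) compares those manual moments with `scipy.stats.skew` and `scipy.stats.kurtosis(..., fisher=False)`, which use the same population-moment definitions by default.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
sample = rng.chisquare(4, size=10_000)

# Standardize with the population standard deviation (ddof=0), as np.var does by default
z = (sample - np.mean(sample)) / np.sqrt(np.var(sample))

manual_skew = np.mean(z**3)  # third standardized moment
manual_kurt = np.mean(z**4)  # fourth standardized moment (non-excess kurtosis)

print(manual_skew, skew(sample))
print(manual_kurt, kurtosis(sample, fisher=False))
```

For a chi-square distribution with $k$ degrees of freedom the theoretical skewness is $\sqrt{8/k}$ and the kurtosis is $3 + 12/k$, so for $k = 4$ the estimates should land near $1.41$ and $6$, with noticeably more sampling noise in the kurtosis.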
Consider a Binomial random variable
n = 10
p = 0.5
x = np.random.binomial(n,p,size=1000)
Let's plot the empirical mass function:
from Utils import makeEMF,makeEDF,plotEDF,plotEMF
plotEMF(makeEMF(x))
Now suppose we apply the function $g(x) = \sin(x)$ to $X$, as the code below does:
plotEMF(makeEMF(np.sin(x)))
plotEDF(makeEDF(np.sin(x)))
Can we compute the distribution of $Y = g(X)$ exactly? For that we need the inverse image $\sin^{[-1]}$.
Since $X$ is discrete with $\mathbb{X}=\{0,1,\ldots,10\}$, we can simply enumerate $\mathbb{Y}$.
Y_space = np.sort(np.sin(np.arange(0,11)))
sin_inv = dict(zip(np.sin(np.arange(0,11)),np.arange(0,11)))
from scipy.special import binom as binomial
from Utils import emfToEdf  # assumed to live in Utils next to makeEMF/makeEDF
# P(Y = y) = P(X = sin_inv[y]), since each y in Y_space has exactly one preimage
emf_Y = [(y,binomial(n,sin_inv[y])*(p**sin_inv[y])*((1-p)**(n-sin_inv[y]))) for y in Y_space]
plotEMF(emf_Y)
plotEDF(emfToEdf(emf_Y))
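The computation above works because every value of $\sin$ on $\{0,1,\ldots,10\}$ has exactly one preimage. When $g$ is not one-to-one, the inverse image $g^{[-1]}(y)$ is a set and the probabilities of its elements have to be added. The sketch below shows one possible way to do this. The helper `pushforward_pmf` and the names `n_trials`, `p_success`, `pmf_X` are hypothetical names introduced for this example, and `scipy.stats.binom.pmf` is used in place of the hand-written binomial formula.

```python
import numpy as np
from collections import defaultdict
from scipy.stats import binom

def pushforward_pmf(g, support, pmf):
    """Return {y: P(g(X) = y)} by summing P(X = x) over all x with g(x) = y."""
    out = defaultdict(float)
    for x in support:
        out[g(x)] += pmf(x)
    return dict(out)

n_trials, p_success = 10, 0.5  # same Binomial(10, 0.5) as above
support = range(n_trials + 1)
pmf_X = lambda k: binom.pmf(k, n_trials, p_success)

# One-to-one case: each y = sin(x) has a single preimage
pmf_sin = pushforward_pmf(np.sin, support, pmf_X)

# Non-injective case: (x - 5)**2 sends x and 10 - x to the same value
pmf_sq = pushforward_pmf(lambda k: (k - 5) ** 2, support, pmf_X)

print(len(pmf_sin), len(pmf_sq))
print(sum(pmf_sin.values()), sum(pmf_sq.values()))
```

In the non-injective case the eleven points of $\mathbb{X}$ collapse to six values of $Y$, and both mass functions still sum to one.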