Introduction to Data Science

1MS041, 2023

©2023 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

Estimation of mean

  • Load data
  • Make assumptions
  • Estimate mean
  • Write down confidence interval
In [66]:
%%bash
ls data
CORIS.csv
NYPowerBall.csv
auto.csv
co2_mm_mlo.txt
digits.csv
earthquakes.csv
earthquakes.csv.zip
earthquakes.tgz
earthquakes_small.csv
final.csv
final.csv.zip
final.tgz
flights.csv
indoor_train.csv
leukemia.csv
mammography.mat
portland.csv
pride_and_prejudice.txt
rainfallInChristchurch.csv
ratings.csv
spam.csv
visits_clean.csv
In [67]:
%%bash
head -n 2 data/NYPowerBall.csv
Draw Date,Winning Numbers,Multiplier
02/09/2019,01 02 03 07 39 25,3
In [68]:
import csv

with open("data/NYPowerBall.csv",mode='r') as f:
    reader = csv.reader(f)
    header = next(reader)
    #data = [i for i in reader]
    data = list(reader)
In [69]:
data[:2]
Out[69]:
[['02/09/2019', '01 02 03 07 39 25', '3'],
 ['02/06/2019', '05 13 28 38 63 21', '5']]
In [70]:
list_list_list = [line[1].split(' ') for line in data]  # column 1 holds the winning numbers as a space-separated string
In [71]:
list_list_list[:2]
Out[71]:
[['01', '02', '03', '07', '39', '25'], ['05', '13', '28', '38', '63', '21']]
In [72]:
flattened_list = sum(list_list_list,[])  # concatenate all per-draw lists into one flat list
In [73]:
import numpy as np
list_list_list_arr = np.array(list_list_list)
In [75]:
list_list_list_arr.flatten().shape
Out[75]:
(5646,)
In [77]:
int_list = [int(i) for i in flattened_list]
In [78]:
int_list[:2]
Out[78]:
[1, 2]
In [79]:
int_arr = np.array(int_list)
In [80]:
int_arr.shape
Out[80]:
(5646,)
In [81]:
np.max(int_arr)
Out[81]:
69
In [82]:
np.min(int_arr)
Out[82]:
1
In [84]:
len(np.unique(int_arr))
Out[84]:
69

Assumption about data

  • IID (independent and identically distributed): is this really true?
  • Within a single drawing the numbers come out without replacement, so the n:th and (n-1):th numbers of the same drawing are dependent; across drawings, however, the n:th numbers form an IID sequence (a quick check follows this list).
  • Discrete distribution with all numbers between 1 and 69.
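
A quick check of the within-draw dependence (a sketch, assuming, as the data suggests, that the first five numbers of each row are the white balls): they are drawn without replacement, so they should always be distinct.
In [ ]:
# The five white balls of one drawing are drawn without replacement and are
# therefore always distinct; independent draws would occasionally repeat.
rows = np.array(list_list_list, dtype=int)
print(all(len(set(row[:5])) == 5 for row in rows))  # expected: True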

What to do

Compute a confidence interval, but of what?

  • The mean of the first draw
  • Hoeffding's inequality, $a = 1$ and $b = 69$ (a sketch of this interval appears after the Powerball interval below)
In [85]:
reshaped_int_arr = int_arr.reshape(-1,6)  # one row per drawing, six numbers each
In [87]:
first_draws = reshaped_int_arr[:,0]
In [94]:
list_list_list[:10]
Out[94]:
[['01', '02', '03', '07', '39', '25'],
 ['05', '13', '28', '38', '63', '21'],
 ['10', '17', '18', '43', '65', '13'],
 ['02', '12', '16', '29', '54', '06'],
 ['08', '12', '20', '21', '32', '10'],
 ['23', '25', '47', '48', '50', '24'],
 ['05', '08', '41', '65', '66', '20'],
 ['14', '29', '31', '56', '61', '01'],
 ['07', '36', '48', '57', '58', '24'],
 ['06', '19', '37', '49', '59', '22']]
In [91]:
np.unique(first_draws)
Out[91]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 43, 44, 48, 50])
In [100]:
# Double check: the range of the last number in each row
np.unique([int(i[5]) for i in list_list_list])
Out[100]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39])
In [101]:
power_ball = reshaped_int_arr[:,-1]
In [102]:
power_ball_mean = np.mean(power_ball)
power_ball_mean
Out[102]:
17.14452709883103
In [103]:
39/2
Out[103]:
19.5

Hoeffding's inequality: assuming $P(X \in [a,b]) = 1$, $$ P(|\overline{X}_n - E[X]| > \epsilon) \leq 2 e^{-\frac{2 n \epsilon^2}{(b-a)^2}} $$

Setting the right-hand side equal to $\alpha$ and solving for $\epsilon$:

$$ \alpha = 2 e^{-\frac{2 n \epsilon^2}{(b-a)^2}} $$

$$ \epsilon = \sqrt{\frac{(b-a)^2}{2n}\ln(2/\alpha)} $$
In [104]:
def compute_epsilon(n,a,b,alpha):
    # Hoeffding half-width: the smallest epsilon whose two-sided bound equals alpha
    return np.sqrt((a-b)**2/(2*n) * np.log(2/alpha))
In [105]:
alpha = 0.05
a = 1
b = 39
n = len(power_ball)
epsilon = compute_epsilon(n,a,b,alpha)
epsilon
Out[105]:
1.6823680763069317
In [106]:
(power_ball_mean-epsilon,power_ball_mean+epsilon)
Out[106]:
(15.462159022524098, 18.826895175137963)
In [107]:
(39+1)/2
Out[107]:
20.0

If the last number were uniform on $\{1,\ldots,39\}$, its mean would be $(39+1)/2 = 20$, which falls outside the 95% interval above, so the uniform assumption over the whole range looks doubtful for this data.
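
For completeness, the original plan targeted the mean of the first draw with $a = 1$ and $b = 69$; the same recipe applies (a sketch reusing `compute_epsilon` and `first_draws` from above):
In [ ]:
# 95% Hoeffding interval for the mean of the first winning number.
eps_first = compute_epsilon(len(first_draws), 1, 69, 0.05)
first_mean = np.mean(first_draws)
(first_mean - eps_first, first_mean + eps_first)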
In [108]:
from Utils import discrete_histogram
discrete_histogram(power_ball)

Likelihood of parameter

  • Load data
  • Work with dates
  • Make assumptions
  • Write down the Risk
  • Split into two parts
  • Minimize the risk numerically on train
  • Test the risk on test
  • Make confidence intervals (see the sketch at the end of this section)
In [109]:
%%bash
ls data
CORIS.csv
NYPowerBall.csv
auto.csv
co2_mm_mlo.txt
digits.csv
earthquakes.csv
earthquakes.csv.zip
earthquakes.tgz
earthquakes_small.csv
final.csv
final.csv.zip
final.tgz
flights.csv
indoor_train.csv
leukemia.csv
mammography.mat
portland.csv
pride_and_prejudice.txt
rainfallInChristchurch.csv
ratings.csv
spam.csv
visits_clean.csv
In [111]:
%%bash
head -n 2 data/earthquakes.csv
publicid,eventtype,origintime,modificationtime,longitude, latitude, magnitude, depth,magnitudetype,depthtype,evaluationmethod,evaluationstatus,evaluationmode,earthmodel,usedphasecount,usedstationcount,magnitudestationcount,minimumdistance,azimuthalgap,originerror,magnitudeuncertainty
2018p368955,,2018-05-17T12:19:35.516Z,2018-05-17T12:21:54.953Z,178.4653957,-37.51944533,2.209351541,20.9375,M,,NonLinLoc,,automatic,nz3drx,12,12,6,0.1363924727,261.0977462,0.8209633086,0
In [115]:
import csv

with open("data/earthquakes.csv",mode='r') as f:
    reader = csv.reader(f,skipinitialspace=True)
    header = next(reader)
    #data = [i for i in reader]
    data = list(reader)
In [128]:
print(header[2],data[0][2])
origintime 2018-05-17T12:19:35.516Z
In [124]:
format_string = "%Y-%m-%dT%H:%M:%S.%fZ"  # matches timestamps like 2018-05-17T12:19:35.516Z
In [127]:
from datetime import datetime
datetime.strptime(data[0][2],format_string)
Out[127]:
datetime.datetime(2018, 5, 17, 12, 19, 35, 516000)
In [129]:
origin_time = [datetime.strptime(line[2],format_string) for line in data]
In [132]:
(origin_time[0]-origin_time[1]).total_seconds()
Out[132]:
2470.87
In [133]:
origin_time_arr = np.array(origin_time)
In [135]:
sorted_origin_time = np.sort(origin_time_arr)
In [136]:
sorted_origin_time[1:]-sorted_origin_time[:-1]
Out[136]:
array([datetime.timedelta(seconds=530, microseconds=184000),
       datetime.timedelta(seconds=551, microseconds=417000),
       datetime.timedelta(seconds=764, microseconds=156000), ...,
       datetime.timedelta(seconds=2700, microseconds=148000),
       datetime.timedelta(seconds=3098, microseconds=120000),
       datetime.timedelta(seconds=2470, microseconds=870000)],
      dtype=object)
In [138]:
time_between = np.diff(sorted_origin_time)
In [139]:
time_between_seconds = [time.total_seconds() for time in time_between]
In [143]:
import matplotlib.pyplot as plt
_=plt.hist(time_between_seconds,bins=200)

We construct the model space $\mathcal{M} = \{f_\lambda(x) = \lambda e^{-\lambda x}: \lambda > 0\}$

Now define the log-loss $$ L(f_\lambda,x) = -\ln(f_\lambda(x)) $$

The risk is the expected loss $$ R(f_\lambda) = E[L(f_\lambda,X)] $$

We instead minimize the empirical risk, given i.i.d. data $X = \{X_1,\ldots,X_n\}$ $$ \hat R_n(f_\lambda) = \frac{1}{n} \sum_{i=1}^n L(f_\lambda,X_i) $$

The goal is to minimize $\hat R_n(f_\lambda)$ w.r.t. $\lambda$.

$$ \ln(f_\lambda(x)) = \ln(\lambda) - \lambda x $$
$$ \hat R_n(f_\lambda) = \frac{1}{n} \sum_{i=1}^n (\lambda X_i - \ln(\lambda)) $$
$$ \frac{d}{d\lambda} \hat R_n(f_\lambda) = \frac{1}{n} \sum_{i=1}^n (X_i - 1/\lambda) $$
$$ \frac{d}{d\lambda} \hat R_n(f_\lambda) = 0 $$

gives $$ 1/\hat \lambda = \frac{1}{n} \sum_{i=1}^n X_i $$
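
The plan says to minimize the risk numerically; as a quick sketch (assuming `scipy` is available, which this notebook does not otherwise use), we can confirm that a numerical minimizer of $\hat R_n$ agrees with the closed-form answer $1/\bar{X}_n$:
In [ ]:
from scipy.optimize import minimize_scalar

X = np.array(time_between_seconds)

def empirical_risk(lam):
    # average log-loss of the exponential model: mean(lam*X_i - ln(lam))
    return np.mean(lam * X - np.log(lam))

res = minimize_scalar(empirical_risk, bounds=(1e-8, 1.0), method='bounded')
print(res.x, 1/np.mean(X))  # the two values should agree closely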

In [146]:
import matplotlib.pyplot as plt
_=plt.hist(time_between_seconds,bins=200,density=True)
hat_lambda = 1/np.mean(time_between_seconds)
x_plot = np.linspace(0,20000,100)
plt.plot(x_plot,hat_lambda*np.exp(-hat_lambda*x_plot))
Out[146]:
[<matplotlib.lines.Line2D at 0x1165dfd30>]

Let's say we want to predict the time to the next earthquake. Since an exponential distribution with rate $\hat \lambda$ has mean $1/\hat \lambda$, a natural prediction is $1/\hat \lambda$.

Train/test split: thanks to the i.i.d. assumption we can split the dataset into two parts however we like; below we simply take the first half for training and the second half for testing (a random split, sketched after the next cell, works equally well).

In [147]:
n_train = len(time_between_seconds)//2  # half the data for training
train_set = time_between_seconds[:n_train]
test_set = time_between_seconds[n_train:]
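
Under the i.i.d. assumption the contiguous split above is fine; an equivalent random split could be sketched as follows (fixed seed for reproducibility, illustrative variable names):
In [ ]:
# Random train/test split; equivalent to the contiguous split under IID.
rng = np.random.default_rng(0)
shuffled = rng.permutation(time_between_seconds)
train_set_r, test_set_r = shuffled[:n_train], shuffled[n_train:]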
In [148]:
hat_lambda_train = 1/np.mean(train_set)
guess = 1/hat_lambda_train
In [149]:
guess
Out[149]:
1411.6221267726278
In [152]:
_=plt.hist(np.abs(guess-test_set),bins=100)
In [153]:
average_error = np.mean(np.abs(guess-test_set))
print(average_error)
1113.396468268911
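
The plan above ends with "Make confidence intervals". The absolute errors here are not bounded, so Hoeffding's inequality does not apply directly; a minimal sketch of a normal-approximation (CLT) interval for the expected absolute prediction error:
In [ ]:
# Approximate 95% CLT interval for the mean absolute test error.
errors = np.abs(guess - np.array(test_set))
se = np.std(errors, ddof=1) / np.sqrt(len(errors))
(np.mean(errors) - 1.96*se, np.mean(errors) + 1.96*se)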