Introduction to Data Science

1MS041, 2023

©2023 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

Estimation of mean

  • Load data
  • Make assumptions
  • Estimate mean
  • Write down confidence interval
In [66]:
%%bash
ls data
CORIS.csv
NYPowerBall.csv
auto.csv
co2_mm_mlo.txt
digits.csv
earthquakes.csv
earthquakes.csv.zip
earthquakes.tgz
earthquakes_small.csv
final.csv
final.csv.zip
final.tgz
flights.csv
indoor_train.csv
leukemia.csv
mammography.mat
portland.csv
pride_and_prejudice.txt
rainfallInChristchurch.csv
ratings.csv
spam.csv
visits_clean.csv
In [67]:
%%bash
head -n 2 data/NYPowerBall.csv
Draw Date,Winning Numbers,Multiplier
02/09/2019,01 02 03 07 39 25,3
In [68]:
import csv

with open("data/NYPowerBall.csv",mode='r') as f:
    reader = csv.reader(f)
    header = next(reader)
    #data = [i for i in reader]
    data = list(reader)
In [69]:
data[:2]
Out[69]:
[['02/09/2019', '01 02 03 07 39 25', '3'],
 ['02/06/2019', '05 13 28 38 63 21', '5']]
In [70]:
list_list_list = [line[1].split(' ') for line in data]  # column 1 holds the winning numbers as a space-separated string
In [71]:
list_list_list[:2]
Out[71]:
[['01', '02', '03', '07', '39', '25'], ['05', '13', '28', '38', '63', '21']]
In [72]:
flattened_list = sum(list_list_list,[])  # concatenate all per-draw lists into one flat list
In [73]:
import numpy as np
list_list_list_arr = np.array(list_list_list)
In [75]:
list_list_list_arr.flatten().shape
Out[75]:
(5646,)
In [77]:
int_list = [int(i) for i in flattened_list]
In [78]:
int_list[:2]
Out[78]:
[1, 2]
In [79]:
int_arr = np.array(int_list)
In [80]:
int_arr.shape
Out[80]:
(5646,)
In [81]:
np.max(int_arr)
Out[81]:
69
In [82]:
np.min(int_arr)
Out[82]:
1
In [84]:
len(np.unique(int_arr))
Out[84]:
69

Assumption about data

  • IID (independent and identically distributed): is this really true?
  • Within a single drawing the numbers come out without replacement, so the n:th and (n-1):th numbers of the same drawing are dependent; across drawings, however, the n:th numbers form an IID sequence (a quick check follows this list).
  • Discrete distribution with all numbers between 1 and 69.
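
A quick check of the within-draw dependence (a sketch, assuming, as the data suggests, that the first five numbers of each row are the white balls): they are drawn without replacement, so they should always be distinct.
In [ ]:
# The five white balls of one drawing are drawn without replacement and are
# therefore always distinct; independent draws would occasionally repeat.
rows = np.array(list_list_list, dtype=int)
print(all(len(set(row[:5])) == 5 for row in rows))  # expected: True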

What to do

Compute a confidence interval, but of what?

  • The mean of the first draw
  • Hoeffding's inequality, $a = 1$ and $b = 69$ (a sketch of this interval appears after the Powerball interval below)
In [85]:
reshaped_int_arr = int_arr.reshape(-1,6)  # one row per drawing, six numbers each
In [87]:
first_draws = reshaped_int_arr[:,0]
In [94]:
list_list_list[:10]
Out[94]:
[['01', '02', '03', '07', '39', '25'],
 ['05', '13', '28', '38', '63', '21'],
 ['10', '17', '18', '43', '65', '13'],
 ['02', '12', '16', '29', '54', '06'],
 ['08', '12', '20', '21', '32', '10'],
 ['23', '25', '47', '48', '50', '24'],
 ['05', '08', '41', '65', '66', '20'],
 ['14', '29', '31', '56', '61', '01'],
 ['07', '36', '48', '57', '58', '24'],
 ['06', '19', '37', '49', '59', '22']]
In [91]:
np.unique(first_draws)
Out[91]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 43, 44, 48, 50])
In [100]:
# Double check: the range of the last number in each row
np.unique([int(i[5]) for i in list_list_list])
Out[100]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39])
In [101]:
power_ball = reshaped_int_arr[:,-1]
In [102]:
power_ball_mean = np.mean(power_ball)
power_ball_mean
Out[102]:
17.14452709883103
In [103]:
39/2
Out[103]:
19.5

Hoeffding's inequality: assuming $P(X \in [a,b]) = 1$, $$ P(|\overline{X}_n - E[X]| > \epsilon) \leq 2 e^{-\frac{2 n \epsilon^2}{(b-a)^2}} $$

Setting the right-hand side equal to $\alpha$ and solving for $\epsilon$:

$$ \alpha = 2 e^{-\frac{2 n \epsilon^2}{(b-a)^2}} $$

$$ \epsilon = \sqrt{\frac{(b-a)^2}{2n}\ln(2/\alpha)} $$
In [104]:
def compute_epsilon(n,a,b,alpha):
    # Hoeffding half-width: the smallest epsilon whose two-sided bound equals alpha
    return np.sqrt((a-b)**2/(2*n) * np.log(2/alpha))
In [105]:
alpha = 0.05
a = 1
b = 39
n = len(power_ball)
epsilon = compute_epsilon(n,a,b,alpha)
epsilon
Out[105]:
1.6823680763069317
In [106]:
(power_ball_mean-epsilon,power_ball_mean+epsilon)
Out[106]:
(15.462159022524098, 18.826895175137963)
In [107]:
(39+1)/2
Out[107]:
20.0

If the last number were uniform on $\{1,\ldots,39\}$, its mean would be $(39+1)/2 = 20$, which falls outside the 95% interval above, so the uniform assumption over the whole range looks doubtful for this data.
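
For completeness, the original plan targeted the mean of the first draw with $a = 1$ and $b = 69$; the same recipe applies (a sketch reusing `compute_epsilon` and `first_draws` from above):
In [ ]:
# 95% Hoeffding interval for the mean of the first winning number.
eps_first = compute_epsilon(len(first_draws), 1, 69, 0.05)
first_mean = np.mean(first_draws)
(first_mean - eps_first, first_mean + eps_first)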
In [108]:
from Utils import discrete_histogram
discrete_histogram(power_ball)

Likelihood of parameter

  • Load data
  • Work with dates
  • Make assumptions
  • Write down the Risk
  • Split into two parts
  • Minimize the risk numerically on train
  • Test the risk on test
  • Make confidence intervals (see the sketch at the end of this section)
In [109]:
%%bash
ls data
CORIS.csv
NYPowerBall.csv
auto.csv
co2_mm_mlo.txt
digits.csv
earthquakes.csv
earthquakes.csv.zip
earthquakes.tgz
earthquakes_small.csv
final.csv
final.csv.zip
final.tgz
flights.csv
indoor_train.csv
leukemia.csv
mammography.mat
portland.csv
pride_and_prejudice.txt
rainfallInChristchurch.csv
ratings.csv
spam.csv
visits_clean.csv
In [111]:
%%bash
head -n 2 data/earthquakes.csv
publicid,eventtype,origintime,modificationtime,longitude, latitude, magnitude, depth,magnitudetype,depthtype,evaluationmethod,evaluationstatus,evaluationmode,earthmodel,usedphasecount,usedstationcount,magnitudestationcount,minimumdistance,azimuthalgap,originerror,magnitudeuncertainty
2018p368955,,2018-05-17T12:19:35.516Z,2018-05-17T12:21:54.953Z,178.4653957,-37.51944533,2.209351541,20.9375,M,,NonLinLoc,,automatic,nz3drx,12,12,6,0.1363924727,261.0977462,0.8209633086,0
In [115]:
import csv

with open("data/earthquakes.csv",mode='r') as f:
    reader = csv.reader(f,skipinitialspace=True)
    header = next(reader)
    #data = [i for i in reader]
    data = list(reader)
In [128]:
print(header[2],data[0][2])
origintime 2018-05-17T12:19:35.516Z
In [124]:
format_string = "%Y-%m-%dT%H:%M:%S.%fZ"  # matches timestamps like 2018-05-17T12:19:35.516Z
In [127]:
from datetime import datetime
datetime.strptime(data[0][2],format_string)
Out[127]:
datetime.datetime(2018, 5, 17, 12, 19, 35, 516000)
In [129]:
origin_time = [datetime.strptime(line[2],format_string) for line in data]
In [132]:
(origin_time[0]-origin_time[1]).total_seconds()
Out[132]:
2470.87
In [133]:
origin_time_arr = np.array(origin_time)
In [135]:
sorted_origin_time = np.sort(origin_time_arr)
In [136]:
sorted_origin_time[1:]-sorted_origin_time[:-1]
Out[136]:
array([datetime.timedelta(seconds=530, microseconds=184000),
       datetime.timedelta(seconds=551, microseconds=417000),
       datetime.timedelta(seconds=764, microseconds=156000), ...,
       datetime.timedelta(seconds=2700, microseconds=148000),
       datetime.timedelta(seconds=3098, microseconds=120000),
       datetime.timedelta(seconds=2470, microseconds=870000)],
      dtype=object)
In [138]:
time_between = np.diff(sorted_origin_time)
In [139]:
time_between_seconds = [time.total_seconds() for time in time_between]
In [143]:
import matplotlib.pyplot as plt
_=plt.hist(time_between_seconds,bins=200)

We construct the model space $\mathcal{M} = \{f_\lambda(x) = \lambda e^{-\lambda x}: \lambda > 0\}$

Now define the log-loss $$ L(f_\lambda,x) = -\ln(f_\lambda(x)) $$

The risk is the expected loss $$ R(f_\lambda) = E[L(f_\lambda,X)] $$

We instead minimize the empirical risk, given i.i.d. data $X = \{X_1,\ldots,X_n\}$ $$ \hat R_n(f_\lambda) = \frac{1}{n} \sum_{i=1}^n L(f_\lambda,X_i) $$

The goal is to minimize $\hat R_n(f_\lambda)$ w.r.t. $\lambda$.

$$ \ln(f_\lambda(x)) = \ln(\lambda) - \lambda x $$
$$ \hat R_n(f_\lambda) = \frac{1}{n} \sum_{i=1}^n (\lambda X_i - \ln(\lambda)) $$
$$ \frac{d}{d\lambda} \hat R_n(f_\lambda) = \frac{1}{n} \sum_{i=1}^n (X_i - 1/\lambda) $$
$$ \frac{d}{d\lambda} \hat R_n(f_\lambda) = 0 $$

gives $$ 1/\hat \lambda = \frac{1}{n} \sum_{i=1}^n X_i $$
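
The plan says to minimize the risk numerically; as a quick sketch (assuming `scipy` is available, which this notebook does not otherwise use), we can confirm that a numerical minimizer of $\hat R_n$ agrees with the closed-form answer $1/\bar{X}_n$:
In [ ]:
from scipy.optimize import minimize_scalar

X = np.array(time_between_seconds)

def empirical_risk(lam):
    # average log-loss of the exponential model: mean(lam*X_i - ln(lam))
    return np.mean(lam * X - np.log(lam))

res = minimize_scalar(empirical_risk, bounds=(1e-8, 1.0), method='bounded')
print(res.x, 1/np.mean(X))  # the two values should agree closely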

In [146]:
import matplotlib.pyplot as plt
_=plt.hist(time_between_seconds,bins=200,density=True)
hat_lambda = 1/np.mean(time_between_seconds)
x_plot = np.linspace(0,20000,100)
plt.plot(x_plot,hat_lambda*np.exp(-hat_lambda*x_plot))
Out[146]:
[<matplotlib.lines.Line2D at 0x1165dfd30>]

Let's say we want to predict the time to the next earthquake. Since an exponential distribution with rate $\hat \lambda$ has mean $1/\hat \lambda$, a natural prediction is $1/\hat \lambda$.

Train/test split: thanks to the i.i.d. assumption we can split the dataset into two parts however we like; below we simply take the first half for training and the second half for testing (a random split, sketched after the next cell, works equally well).

In [147]:
n_train = len(time_between_seconds)//2  # half the data for training
train_set = time_between_seconds[:n_train]
test_set = time_between_seconds[n_train:]
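
Under the i.i.d. assumption the contiguous split above is fine; an equivalent random split could be sketched as follows (fixed seed for reproducibility, illustrative variable names):
In [ ]:
# Random train/test split; equivalent to the contiguous split under IID.
rng = np.random.default_rng(0)
shuffled = rng.permutation(time_between_seconds)
train_set_r, test_set_r = shuffled[:n_train], shuffled[n_train:]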
In [148]:
hat_lambda_train = 1/np.mean(train_set)
guess = 1/hat_lambda_train
In [149]:
guess
Out[149]:
1411.6221267726278
In [152]:
_=plt.hist(np.abs(guess-test_set),bins=100)
In [153]:
average_error = np.mean(np.abs(guess-test_set))
print(average_error)
1113.396468268911
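
The plan above ends with "Make confidence intervals". The absolute errors here are not bounded, so Hoeffding's inequality does not apply directly; a minimal sketch of a normal-approximation (CLT) interval for the expected absolute prediction error:
In [ ]:
# Approximate 95% CLT interval for the mean absolute test error.
errors = np.abs(guess - np.array(test_set))
se = np.std(errors, ddof=1) / np.sqrt(len(errors))
(np.mean(errors) - 1.96*se, np.mean(errors) + 1.96*se)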