Introduction to Data Science ¶

1MS041, 2023¶

from Utils import showURL

showURL("https://en.wikipedia.org/wiki/Continuous_uniform_distribution")

Integrate the uniform distribution using Sympy and simulate¶

from sympy import var, integrate

a = var('a')
b = var('b')
x = var('x')

f=1/(b-a)

F_prim = integrate(f,x)

# We need that F(a) = 0
# F(a) = F_prim(a) + C
# C = -F_prim(a)

F = F_prim - a/(b-a)
F

We basically start by solving for $y$ $$ F(x) = y $$ This is solvable because F is strictly increasing at least in the interval $x \in (a,b)$.

$$ F(x) = y $$$$ \frac{x-a}{b-a} = y $$

$$ F^{-1}(y) = y(b-a)+a $$

import numpy as np
y = np.random.uniform(0,1,10000)
x = y*(20-10)+10

import matplotlib.pyplot as plt
_=plt.hist(x)

Inversion sampling¶

Inversion sampling starts with a distribution function $F$. We want to simulate from $F$. We first compute $F^{-1}(y)$ and then we let $X$ be a uniform $(0,1)$ random variable and then we define $$ Y = F^{-1}(X) $$ then $Y$ has distribution function $F$.

$$ F_Y(y) = P(Y \leq y) = P(F^{-1}(X) \leq y) = P(X \leq F(y)) $$

Rename $F(y) = x$ then $$ P(X \leq x) = F_X(x) = F_X(F(y)) $$

Since $X$ is uniform $(0,1)$ we have that $F_X(x) = x$ if $x \in (0,1)$

Conclusion is that $F_Y(y) = F(y)$.

Integrating the exponential and inversion sampling¶

showURL("https://en.wikipedia.org/wiki/Exponential_distribution")

lam = var('lambda')
x = var('x')
from sympy import exp
f = lam*exp(-lam*x)
f

F_prim = integrate(f,x)

F_prim

F = F_prim+1
F

$$ F(x) = y $$$$ 1-e^{-\lambda x} = y $$$$ 1-y = e^{-\lambda x} $$

$$ \ln(1-y)/(-\lambda) = x $$$$ F^{-1}(y) = \frac{\ln(1-y)}{-\lambda} $$

y = np.random.uniform(0,1,10000)
x = np.log(1-y)/(-1)

_=plt.hist(x,bins=100,density=True)
x_plot = np.linspace(0,9,100)
plt.plot(x_plot,np.exp(-x_plot))

[<matplotlib.lines.Line2D at 0x169c54520>]

Loading the co2 data and plotting¶

%%bash
ls data

CORIS.csv
NYPowerBall.csv
auto.csv
co2_mm_mlo.txt
digits.csv
earthquakes.csv
earthquakes.csv.zip
earthquakes.tgz
earthquakes_small.csv
final.csv
final.csv.zip
final.tgz
flights.csv
indoor_train.csv
leukemia.csv
mammography.mat
portland.csv
pride_and_prejudice.txt
rainfallInChristchurch.csv
ratings.csv
spam.csv

%%bash
head -n 100 data/co2_mm_mlo.txt

# --------------------------------------------------------------------
# USE OF NOAA ESRL DATA
# 
# These data are made freely available to the public and the
# scientific community in the belief that their wide dissemination
# will lead to greater understanding and new scientific insights.
# The availability of these data does not constitute publication
# of the data.  NOAA relies on the ethics and integrity of the user to
# ensure that ESRL receives fair credit for their work.  If the data 
# are obtained for potential use in a publication or presentation, 
# ESRL should be informed at the outset of the nature of this work.  
# If the ESRL data are essential to the work, or if an important 
# result or conclusion depends on the ESRL data, co-authorship
# may be appropriate.  This should be discussed at an early stage in
# the work.  Manuscripts using the ESRL data should be sent to ESRL
# for review before they are submitted for publication so we can
# insure that the quality and limitations of the data are accurately
# represented.
# 
# Contact:   Pieter Tans (303 497 6678; pieter.tans@noaa.gov)
# 
# File Creation:  Thu Dec  6 13:26:20 2018
# 
# RECIPROCITY
# 
# Use of these data implies an agreement to reciprocate.
# Laboratories making similar measurements agree to make their
# own data available to the general public and to the scientific
# community in an equally complete and easily accessible form.
# Modelers are encouraged to make available to the community,
# upon request, their own tools used in the interpretation
# of the ESRL data, namely well documented model code, transport
# fields, and additional information necessary for other
# scientists to repeat the work and to run modified versions.
# Model availability includes collaborative support for new
# users of the models.
# --------------------------------------------------------------------
#  
#  
# See www.esrl.noaa.gov/gmd/ccgg/trends/ for additional details.
#  
# Data from March 1958 through April 1974 have been obtained by C. David Keeling
# of the Scripps Institution of Oceanography (SIO) and were obtained from the
# Scripps website (scrippsco2.ucsd.edu).
#
# The "average" column contains the monthly mean CO2 mole fraction determined
# from daily averages.  The mole fraction of CO2, expressed as parts per million
# (ppm) is the number of molecules of CO2 in every one million molecules of dried
# air (water vapor removed).  If there are missing days concentrated either early
# or late in the month, the monthly mean is corrected to the middle of the month
# using the average seasonal cycle.  Missing months are denoted by -99.99.
# The "interpolated" column includes average values from the preceding column
# and interpolated values where data are missing.  Interpolated values are
# computed in two steps.  First, we compute for each month the average seasonal
# cycle in a 7-year window around each monthly value.  In this way the seasonal
# cycle is allowed to change slowly over time.  We then determine the "trend"
# value for each month by removing the seasonal cycle; this result is shown in
# the "trend" column.  Trend values are linearly interpolated for missing months.
# The interpolated monthly mean is then the sum of the average seasonal cycle
# value and the trend value for the missing month.
#
# NOTE: In general, the data presented for the last year are subject to change, 
# depending on recalibration of the reference gas mixtures used, and other quality
# control procedures. Occasionally, earlier years may also be changed for the same
# reasons.  Usually these changes are minor.
#
# CO2 expressed as a mole fraction in dry air, micromol/mol, abbreviated as ppm
#
#  (-99.99 missing data;  -1 no data for #daily means in month)
#
#            decimal     average   interpolated    trend    #days
#             date                             (season corr)
1958   3    1958.208      315.71      315.71      314.62     -1
1958   4    1958.292      317.45      317.45      315.29     -1
1958   5    1958.375      317.50      317.50      314.71     -1
1958   6    1958.458      -99.99      317.10      314.85     -1
1958   7    1958.542      315.86      315.86      314.98     -1
1958   8    1958.625      314.93      314.93      315.94     -1
1958   9    1958.708      313.20      313.20      315.91     -1
1958  10    1958.792      -99.99      312.66      315.61     -1
1958  11    1958.875      313.33      313.33      315.31     -1
1958  12    1958.958      314.67      314.67      315.61     -1
1959   1    1959.042      315.62      315.62      315.70     -1
1959   2    1959.125      316.38      316.38      315.88     -1
1959   3    1959.208      316.71      316.71      315.62     -1
1959   4    1959.292      317.72      317.72      315.56     -1
1959   5    1959.375      318.29      318.29      315.50     -1
1959   6    1959.458      318.15      318.15      315.92     -1
1959   7    1959.542      316.54      316.54      315.66     -1
1959   8    1959.625      314.80      314.80      315.81     -1
1959   9    1959.708      313.84      313.84      316.55     -1
1959  10    1959.792      313.26      313.26      316.19     -1
1959  11    1959.875      314.80      314.80      316.78     -1
1959  12    1959.958      315.58      315.58      316.52     -1
1960   1    1960.042      316.43      316.43      316.51     -1
1960   2    1960.125      316.97      316.97      316.47     -1
1960   3    1960.208      317.58      317.58      316.49     -1
1960   4    1960.292      319.02      319.02      316.86     -1
1960   5    1960.375      320.03      320.03      317.24     -1
1960   6    1960.458      319.59      319.59      317.36     -1

with open("data/co2_mm_mlo.txt",mode="r") as f:
    current_line = f.readline()
    while (current_line[0] == '#'):
        current_line = f.readline()
    print(current_line)

1958   3    1958.208      315.71      315.71      314.62     -1

import csv
data_raw = []
with open("data/co2_mm_mlo.txt",mode="r") as f:
    current_pos = f.tell()
    current_line = f.readline()
    while (current_line[0] == '#'):
        current_pos = f.tell()
        current_line = f.readline()
    f.seek(current_pos)
    csv_reader = csv.reader(f,delimiter=' ',skipinitialspace=True)
    for line in csv_reader:
        data_raw.append(line)

data_raw[0]

['1958', '3', '1958.208', '315.71', '315.71', '314.62', '-1']

len(data_raw)

729

What is a schema?¶

schema = [int, int, float, float, float, float, int]

type(int("123"))

int

data_parsed = [[typ(item) for item,typ in zip(line,schema)] for line in data_raw]

data_parsed[0]

[1958, 3, 1958.208, 315.71, 315.71, 314.62, -1]

Loading data into a numpy array and types¶

data_array = np.array(data_parsed)

data_array.shape

(729, 7)

data_array[0,0]

1958.0

Some basic stats using the Utils package¶

# Basic stats
from Utils import basic_stats
basic_stats(data_array[:,4])

mean: 353.79	std: 27.53	skew: 0.34	kurtosis: 1.93

data_array[0:10,4]

array([315.71, 317.45, 317.5 , 317.1 , 315.86, 314.93, 313.2 , 312.66,
       313.33, 314.67])

x = np.random.normal(0,27.53,10000)
basic_stats(x)

mean: -0.02	std: 27.70	skew: 0.01	kurtosis: 3.01

_=plt.hist(data_array[:,4],bins=50)

# EDF
from Utils import makeEDF, plotEDF
plotEDF(makeEDF(data_array[:,4]))

Introduction to Data Science¶