Make sure you pass the # ... Test
cells and
submit your solution notebook in the corresponding assignment on the course website. You can submit multiple times before the deadline and your highest score will be used.
The purpose of this problem is to show that you can implement your own sampler, this will be built in the following three steps:
Using the Accept-Reject sampler (Algorithm 1 in TFDS notes) with sampling density given by the uniform $[0,1]$ distribution.
def problem1_LCG(size=None, seed = 0):
"""
A linear congruential generator that generates pseudo random numbers according to size.
Parameters
-------------
size : an integer denoting how many samples should be produced
seed : the starting point of the LCG, i.e. u0 in the notes.
Returns
-------------
out : a list of the pseudo random numbers
"""
XXX
return XXX
def problem1_uniform(generator=None, period = 1, size=None, seed=0):
"""
Takes a generator and produces samples from the uniform [0,1] distribution according
to size.
Parameters
-------------
generator : a function of type generator(size,seed) and produces the same result as problem1_LCG, i.e. pseudo random numbers in the range {0,1,...,period-1}
period : the period of the generator
seed : the seed to be used in the generator provided
size : an integer denoting how many samples should be produced
Returns
--------------
out : a list of the uniform pseudo random numbers
"""
XXX
return XXX
def problem1_accept_reject(uniformGenerator=None, n_iterations=None, seed=0):
"""
Takes a generator that produces uniform pseudo random [0,1] numbers
and produces samples from (pi/2)*abs(sin(x*2*pi)) using an Accept-Reject
sampler with the uniform distribution as the proposal distribution.
Runs n_iterations
Parameters
-------------
generator : a function of the type generator(size,seed) that produces uniform pseudo random
numbers from [0,1]
seed : the seed to be used in the generator provided
n_iterations : an integer denoting how many attempts should be made in the accept-reject sampler
Returns
--------------
out : a list of the pseudo random numbers with the specified distribution
"""
XXX
return XXX
Evaluate cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.
# If you managed to solve all three parts you can test the following code to see if it runs
# you have to change the period to match your LCG though, this is marked as XXX.
# It is a very good idea to check these things using the histogram function in sagemath
# try with a larger number of samples, up to 10000 should run
print("LCG output: %s" % problem1_LCG(size=10, seed = 1))
period = 2147483648
print("Uniform sampler %s" % problem1_uniform(generator=problem1_LCG, period = period, size=10, seed=1))
uniform_sampler = lambda size,seed: problem1_uniform(generator=problem1_LCG, period = period, size=size, seed=seed)
print("Accept-Reject sampler %s" % problem1_accept_reject(uniformGenerator = uniform_sampler,n_iterations=20,seed=1))
# If however you did not manage to implement either part 1 or part 2 but still want to check part 3, you can run the code below
def testUniformGenerator(size,seed):
import random
random.seed(seed)
return [random.uniform(0,1) for s in range(size)]
print("Accept-Reject sampler %s" % problem1_accept_reject(uniformGenerator=testUniformGenerator, n_iterations=20, seed=1))
The dataset Travel Dataset - Datathon 2019
is a simulated dataset designed to mimic real corporate travel systems -- focusing on flights and hotels. The file is at data/flights.csv
, i.e. you can use the path data/flights.csv
from the notebook to access the file.
data/flights.csv
'Aracaju (SE)'
would correspond to $0$. Each row of the file corresponds to one flight, i.e. it has a starting city and an ending city. We model this as a stationary Markov chain, i.e. each user's travel trajectory is a realization of the Markov chain, $X_t$. Here, $X_t$ is the current city the user is at, at step $t$, and $X_{t+1}$ is the city the user travels to at the next time step. This means that to each row in the file there is a corresponding pair $(X_{t},X_{t+1})$. The stationarity assumption gives that for all $t$ there is a transition density $p$ such that $P(X_{t+1} = y | X_t = x) = p(x,y)$ (for all $x,y$). The transition matrix should be n_cities
x n_citites
in size.number_of_cities = XXX
number_of_userCodes = XXX
number_of_observations = XXX
# This is a very useful function that you can use for part 2. You have seen this before when parsing the
# pride and prejudice book.
def makeFreqDict(myDataList):
'''Make a frequency mapping out of a list of data.
Param myDataList, a list of data.
Return a dictionary mapping each unique data value to its frequency count.'''
freqDict = {} # start with an empty dictionary
for res in myDataList:
if res in freqDict: # the data value already exists as a key
freqDict[res] = freqDict[res] + 1 # add 1 to the count using sage integers
else: # the data value does not exist as a key value
freqDict[res] = 1 # add a new key-value pair for this new data value, frequency 1
return freqDict # return the dictionary created
cities = XXX
unique_cities = sorted(set(cities)) # The unique cities
n_cities = len(unique_cities) # The number of unique citites
# Count the different transitions
transitions = XXX # A list containing tuples ex: ('Aracaju (SE)','Rio de Janeiro (RJ)') of all transitions in the text
transition_counts = XXX # A dictionary that counts the number of each transition
# ex: ('Aracaju (SE)','Rio de Janeiro (RJ)'):4
indexToCity = XXX # A dictionary that maps the n-1 number to the n:th unique_city,
# ex: 0:'Aracaju (SE)'
cityToIndex = XXX # The inverse function of indexToWord,
# ex: 'Aracaju (SE)':0
# Part 3, finding the maximum likelihood estimate of the transition matrix
transition_matrix = XXX # a numpy array of size (n_cities,n_cities)
# The transition matrix should be ordered in such a way that
# p_{'Aracaju (SE)','Rio de Janeiro (RJ)'} = transition_matrix[cityToIndex['Aracaju (SE)'],cityToIndex['Rio de Janeiro (RJ)']]
# and represents the probability of travelling Aracaju (SE)->Rio de Janeiro (RJ)
# Make sure that the transition_matrix does not contain np.nan from division by zero for instance
# This should be a numpy array of length n_cities which sums to 1 and is all positive
stationary_distribution_problem2 = XXX
# Compute the return probability for part 3 of PROBLEM 2
return_probability_problem2 = XXX
Evaluate cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.
# Once you have created all your functions, you can make a small test here to see
# what would be generated from your model.
import numpy as np
start = np.zeros(shape=(n_cities,1))
start[cityToIndex['Aracaju (SE)'],0] = 1
current_pos = start
for i in range(10):
random_word_index = np.random.choice(range(n_cities),p=current_pos.reshape(-1))
current_pos = np.zeros_like(start)
current_pos[random_word_index] = 1
print(indexToCity[random_word_index],end='->')
current_pos = (current_pos.T@transition_matrix).T
Derive the maximum likelihood estimate for $n$ IID samples from a random variable with the following probability density function: $$ f(x; \lambda) = \frac{1}{24} \lambda^5 x^4 \exp(-\lambda x), \qquad \text{ where, } \lambda>0, x > 0 $$
You can solve the MLe by hand (using pencil paper or using key-strokes). Present your solution as the return value of a function called def MLeForAssignment2Problem3(x)
, where x
is a list of $n$ input data points.
# do not change the name of the function, just replace XXX with the appropriate expressions for the MLe
def MLeForAssignment2Problem3(x):
'''write comment of what this function does'''
XXX
XXX
return XXX
Use the Multi-dimensional Constrained Optimisation example (in 07-Optimization.ipynb
) to numerically find the MLe for the mean and variance parameter based on normallySimulatedDataSamples
, an array obtained by a specific simulation of $30$ IID samples from the $Normal(10,2)$ random variable.
Recall that $Normal(\mu, \sigma^2)$ RV has the probability density function given by:
$$ f(x ;\mu, \sigma) = \displaystyle\frac{1}{\sigma\sqrt{2\pi}}\exp\left(\frac{-1}{2\sigma^2}(x-\mu)^2\right) $$The two parameters, $\mu \in \mathbb{R} := (-\infty,\infty)$ and $\sigma \in (0,\infty)$, are sometimes referred to as the location and scale parameters.
You know that the log likelihood function for $n$ IID samples from a Normal RV with parameters $\mu$ and $\sigma$ simply follows from $\sum_{i=1}^n \log(f(x_i; \mu,\sigma))$, based on the IID assumption.
NOTE: When setting bounding boxes for $\mu$ and $\sigma$ try to start with some guesses like $[-20,20]$ and $[0.1,5.0]$ and make it larger if the solution is at the boundary. Making the left bounding-point for $\sigma$ too close to $0.0$ will cause division by zero Warnings. Other numerical instabilities can happen in such iterative numerical solutions to the MLe. You need to be patient and learn by trial-and-error. You will see the mathematical theory in more details in a future course in scientific computing/optimisation. So don't worry too much now except learning to use it for our problems.
import numpy as np
from scipy import optimize
# do NOT change the next three lines
np.random.seed(123456) # set seed
# simulate 30 IID samples drawn from Normal(10,2)RV
normallySimulatedDataSamples = np.random.normal(10,2,30)
# define the negative log likelihoo function you want to minimise by editing XXX
def negLogLklOfIIDNormalSamples(parameters):
'''return the -log(likelihood) of normallySimulatedDataSamples with mean and var parameters'''
mu_param=parameters[0]
sigma_param=parameters[1]
XXX
XXX # add more or less lines as you need
return XXX
# you should only change XXX below and not anything else
parameter_bounding_box=((XXX, XXX), (XXX, XXX)) # specify the constraints for each parameter - some guess work...
initial_arguments = np.array([XXX, XXX]) # point in 2D to initialise the minimize algorithm
result_Ass2Prob4 = optimize.minimize(XXX, initial_arguments, bounds=parameter_bounding_box, XXX)
# call the minimize method above finally! you need to play a bit to get initial conditions and bounding box ok
result_Ass2Prob4