Assignment 3 for Course 1MS041
Make sure you pass the # ... Test cells and submit your solution notebook in the corresponding assignment on the course website. You can submit multiple times before the deadline and your highest score will be used.
Download the updated data folder from the course GitHub website, or download the file https://github.com/datascience-intro/1MS041-2024/blob/main/notebooks/data/smhi.csv directly and put it inside your data folder, i.e. you want the path data/smhi.csv. The data was acquired from SMHI (Swedish Meteorological and Hydrological Institute) and consists of hourly wind measurements at the Uppsala Aut station, namely wind speed and wind direction. Your goal is to load the data and work with it a bit. The code you produce should load the file as it is. Please do not alter the file, as the autograder will only have access to the original file.
The file information is in Swedish so you need to use some translation service, for instance Google translate or ChatGPT.
- [2p] Load the file, for instance using the csv package. Put the wind-direction as a numpy array and the wind-speed as another numpy array.
- [2p] Use the wind-direction which is an angle in degrees and convert it into a point on the unit circle. Store the x_coordinate as one array and the y_coordinate as another. From these coordinates, construct the wind-velocity vector.
- [2p] Calculate the average wind velocity and convert it back to direction and compare it to just taking average of the wind direction as given in the data-file.
- [2p] The wind velocity is a $2$-dimensional random variable. Calculate the empirical covariance matrix, which should be a numpy array of shape (2,2).
Something for you to think about: are you more likely to have a headwind or not when going to the university in the morning?
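The difference between averaging vectors and averaging raw angles is easiest to see on a tiny example. The sketch below uses two made-up directions and speeds (not taken from smhi.csv) and the standard mathematical convention, with the angle measured counter-clockwise from the positive x-axis. The helper names direction_to_unit_vector and vector_to_angle_degrees are hypothetical and only serve the illustration.

```python
import numpy as np

def direction_to_unit_vector(angle_degrees):
    """Map angles in degrees to (x, y) points on the unit circle."""
    radians = np.deg2rad(angle_degrees)
    return np.cos(radians), np.sin(radians)

def vector_to_angle_degrees(x, y):
    """Convert a vector back to an angle in degrees in [0, 360)."""
    return np.rad2deg(np.arctan2(y, x)) % 360

# Made-up directions on either side of 0 degrees, with made-up speeds.
angles = np.array([350.0, 30.0])
speeds = np.array([1.0, 3.0])

# Unit-circle coordinates, then velocity = speed * unit vector.
x, y = direction_to_unit_vector(angles)
velocity_x, velocity_y = speeds * x, speeds * y

# Angle of the mean unit vector, angle of the mean velocity, and naive mean.
mean_unit_angle = vector_to_angle_degrees(x.mean(), y.mean())
mean_velocity_angle = vector_to_angle_degrees(velocity_x.mean(), velocity_y.mean())
naive_mean_angle = angles.mean()

print(mean_unit_angle, mean_velocity_angle, naive_mean_angle)
```

In this toy case the naive mean points almost opposite to both observations, because the raw angles wrap around at 360 degrees while the vectors do not.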
problem1_wind_direction = XXX
problem1_wind_speed = XXX
# The wind direction is given as a compass direction in degrees (0-360)
# convert it to x and y coordinates using the standard mathematical convention
problem1_wind_direction_x_coordinate = XXX
problem1_wind_direction_y_coordinate = XXX
problem1_wind_velocity_x_coordinate = XXX
problem1_wind_velocity_y_coordinate = XXX
# Put the average wind velocity x and y coordinates here in these variables
problem1_average_wind_velocity_x_coordinate = XXX
problem1_average_wind_velocity_y_coordinate = XXX
# First calculate the angle of the average wind velocity vector in degrees
problem1_average_wind_velocity_angle_degrees = XXX
# Then calculate the average angle of the wind direction in degrees (using the wind direction in the data)
problem1_average_wind_direction_angle_degrees = XXX
# Finally, are they the same? Answer as a boolean value (True or False)
problem1_same_angle = XXX
problem1_wind_velocity_covariance_matrix = XXX
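As a rough illustration of an empirical covariance matrix for a 2-dimensional variable, the sketch below uses randomly generated samples (not the SMHI data). It assumes the unbiased estimator with denominator n - 1, which is the default of np.cov. Whether the assignment expects n - 1 or n is not stated here, so treat that choice as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up samples standing in for the (x, y) components of a velocity.
vx = rng.normal(size=500)
vy = 0.5 * vx + rng.normal(size=500)

# np.cov treats each row as one variable, so stack the components as rows.
cov = np.cov(np.vstack([vx, vy]))

# The same estimate written out by hand: centre the data, divide by n - 1.
centered = np.vstack([vx - vx.mean(), vy - vy.mean()])
cov_manual = centered @ centered.T / (vx.size - 1)

print(cov.shape)
print(np.allclose(cov, cov_manual))
```

Passing bias=True to np.cov would switch the denominator to n.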
For this problem you will need the pandas package and the sklearn package. Inside the data folder from the course website you will find a file called indoor_train.csv. This file contains a number of positions (X,Y,Z) together with a location number. The idea is to assign a room number (Location) to the coordinates (X,Y,Z).
- [2p] Take the data in the file indoor_train.csv and load it using pandas into a dataframe df_train.
- [3p] From this dataframe df_train, create two numpy arrays, Xtrain and Ytrain. They should have shapes (1154,3) and (1154,) respectively. Their dtype should be float64 and int64 respectively.
- [3p] Train a Support Vector Classifier, sklearn.svm.SVC, on Xtrain, Ytrain with kernel='linear' and name the trained model svc_train.
To mimic how Kaggle works, the autograder has access to a hidden test set and will test your fitted model.
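The workflow described above can be sketched on a small made-up dataframe. The column names X, Y, Z and Location are assumptions based on the description, not read from indoor_train.csv, and the labelling rule is invented so that the classes are linearly separable. All names ending in _demo are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up stand-in for a table of positions; column names are assumptions.
df_demo = pd.DataFrame({
    "X": rng.normal(size=60),
    "Y": rng.normal(size=60),
    "Z": rng.normal(size=60),
})
# Invented labelling rule, used only to get two separable classes.
df_demo["Location"] = (df_demo["X"] + df_demo["Y"] > 0).astype(int)

# Request the dtypes explicitly; the default integer width is platform dependent.
X_demo = df_demo[["X", "Y", "Z"]].to_numpy(dtype=np.float64)
Y_demo = df_demo["Location"].to_numpy(dtype=np.int64)

# Fit a support vector classifier with a linear kernel.
svc_demo = SVC(kernel="linear")
svc_demo.fit(X_demo, Y_demo)

print(X_demo.shape, X_demo.dtype, Y_demo.shape, Y_demo.dtype)
print(svc_demo.score(X_demo, Y_demo))
```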
df_train = XXX
Xtrain = XXX
Ytrain = XXX
svc_train = XXX
Let us build a proportional model ($\mathbb{P}(Y=1 \mid X) = G(\beta_0+\beta \cdot X)$ where $G$ is the logistic function) for the spam vs not spam data. Here we assume that the features are the presence or absence of a word: let $X_1,X_2,X_3$ denote the presence (1) or absence (0) of the words $("free", "prize", "win")$.
- [2p] Load the file data/spam.csv and create two numpy arrays: problem3_X, which has shape (n_texts,3) where each feature in problem3_X corresponds to $X_1,X_2,X_3$ from above, and problem3_Y, which has shape (n_texts,) and consists of a $1$ if the email is spam and $0$ if it is not. Split this data into train, calibration and test sets with the split $40\%$, $20\%$, $40\%$, and put this data in the designated variables in the code cell.
- [2p] Follow the calculation from the lecture notes where we derive the logistic regression and implement the final loss function inside the class ProportionalSpam. You can use the Test cell to check that it gives the correct value for a test-point.
- [2p] Train the model problem3_ps on the training data. The goal is to calibrate the probabilities output from the model. Start by creating a new variable problem3_X_pred (shape (n_samples,1)) which consists of the predictions of problem3_ps on the calibration dataset. Then train a calibration model using sklearn.tree.DecisionTreeRegressor, and store this trained model in problem3_calibrator. Recall that calibration error is the following for a fixed function $f$ $$ \sqrt{\mathbb{E}[|\mathbb{E}[Y \mid f(X)] - f(X)|^2]}. $$
- [2p] Use the trained model problem3_ps and the calibrator problem3_calibrator to make final predictions on the testing data, and store the prediction in problem3_final_predictions.
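One possible way to produce a 40/20/40 split is to shuffle the row indices once and slice them. The sketch below does this on made-up binary features (not the spam data). Whether the rows should be shuffled, and with which seed, is an assumption, and the names ending in _demo are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up binary features and labels standing in for the spam data.
X_demo = rng.integers(0, 2, size=(100, 3))
Y_demo = rng.integers(0, 2, size=100)

# Shuffle the row indices once, then cut them into 40% / 20% / 40%.
n = X_demo.shape[0]
perm = rng.permutation(n)
n_train = int(0.4 * n)
n_calib = int(0.2 * n)

train_idx = perm[:n_train]
calib_idx = perm[n_train:n_train + n_calib]
test_idx = perm[n_train + n_calib:]

X_train_demo, Y_train_demo = X_demo[train_idx], Y_demo[train_idx]
X_calib_demo, Y_calib_demo = X_demo[calib_idx], Y_demo[calib_idx]
X_test_demo, Y_test_demo = X_demo[test_idx], Y_demo[test_idx]

print(X_train_demo.shape, X_calib_demo.shape, X_test_demo.shape)
```

Slicing a single permutation guarantees that the three sets are disjoint and together cover every row.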
problem3_X = XXX
problem3_Y = XXX
problem3_X_train = XXX
problem3_X_calib = XXX
problem3_X_test = XXX
problem3_Y_train = XXX
problem3_Y_calib = XXX
problem3_Y_test = XXX
print(problem3_X_train.shape,problem3_X_calib.shape,problem3_X_test.shape,problem3_Y_train.shape,problem3_Y_calib.shape,problem3_Y_test.shape)
class ProportionalSpam(object):
    def __init__(self):
        self.coeffs = None
        self.result = None

    # define the objective/cost/loss function we want to minimise
    def loss(self,X,Y,coeffs):
        return XXX

    def fit(self,X,Y):
        import numpy as np
        from scipy import optimize
        #Use the f above together with an optimization method from scipy
        #to find the coefficients of the model
        opt_loss = lambda coeffs: self.loss(X,Y,coeffs)
        initial_arguments = np.zeros(shape=X.shape[1]+1)
        self.result = optimize.minimize(opt_loss, initial_arguments,method='cg')
        self.coeffs = self.result.x

    def predict(self,X):
        import numpy as np # imported here as well, since the import in fit is local to fit
        #Use the trained model to predict Y
        if (self.coeffs is not None):
            G = lambda x: np.exp(x)/(1+np.exp(x))
            return np.round(10*G(np.dot(X,self.coeffs[1:])+self.coeffs[0]))/10 # This rounding is to help you with the calibration
problem3_ps = XXX
problem3_X_pred = XXX
problem3_calibrator = XXX
problem3_final_predictions = XXX
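To illustrate the calibration step in isolation, the sketch below fits a DecisionTreeRegressor on made-up scores that lie on a grid of tenths, mimicking the rounded outputs of predict. The miscalibration rule (true probability equal to the squared score) and all names ending in _demo are invented for the example. With default settings and only eleven distinct inputs, the tree ends up predicting the observed label frequency at each score.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Made-up scores on a grid of tenths, mimicking rounded model outputs.
raw_demo = rng.integers(0, 11, size=2000)
scores_demo = raw_demo / 10
# Invented miscalibration: the true probability is the score squared.
labels_demo = (rng.random(2000) < scores_demo**2).astype(int)

# The regressor expects a 2-D feature array, hence the reshape to (n, 1).
calibrator_demo = DecisionTreeRegressor()
calibrator_demo.fit(scores_demo.reshape(-1, 1), labels_demo)

# Calibrated probability for each possible score 0.0, 0.1, ..., 1.0.
grid_demo = (np.arange(11) / 10).reshape(-1, 1)
calibrated_demo = calibrator_demo.predict(grid_demo)

print(np.round(calibrated_demo, 2))
```

Final predictions on new data would then be obtained by passing the model's scores, reshaped to (n, 1), through the fitted calibrator.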
Local Test for Assignment 3, PROBLEM 3
Evaluate cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.
try:
    import numpy as np
    test_instance = ProportionalSpam()
    test_loss = test_instance.loss(np.array([[1,0,1],[0,1,1]]),np.array([1,0]),np.array([1.2,0.4,0.3,0.9]))
    assert (np.abs(test_loss-1.2828629432232497) < 1e-6)
    print("Your loss was correct for a test point")
except:
    print("Your loss was not correct on a test point")
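For orientation, the reference value in the test above is what one obtains from the mean negative log-likelihood of a logistic model, with coeffs[0] as the intercept $\beta_0$ and coeffs[1:] as $\beta$. The sketch below is written as a standalone function under that assumption. It is not taken from the lecture notes, and the name mean_logistic_nll is hypothetical.

```python
import numpy as np

def mean_logistic_nll(X, Y, coeffs):
    """Mean negative log-likelihood of a logistic model (hypothetical helper)."""
    z = coeffs[0] + X @ coeffs[1:]
    # -log G(z) = log(1 + exp(-z)) and -log(1 - G(z)) = log(1 + exp(z));
    # np.logaddexp(0, t) evaluates log(1 + exp(t)) without overflow.
    return np.mean(Y * np.logaddexp(0, -z) + (1 - Y) * np.logaddexp(0, z))

# The same test point as in the local test cell.
X_check = np.array([[1, 0, 1], [0, 1, 1]])
Y_check = np.array([1, 0])
coeffs_check = np.array([1.2, 0.4, 0.3, 0.9])

loss_check = mean_logistic_nll(X_check, Y_check, coeffs_check)
print(loss_check)
```

Using np.logaddexp instead of np.log(1 + np.exp(z)) is a design choice for numerical stability when z is large in magnitude.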