Make sure you pass the `# ... Test` cells and submit your solution notebook in the corresponding assignment on the course website. You can submit multiple times before the deadline and your highest score will be used.
Consider the data `X` and `y` in the cell below. `X` denotes $20$ points in $\mathbb{R}^2$ and `y` corresponds to the labels for these points, i.e. it is a classification problem.

1. Implement the function `perceptron` by filling in `XXX`.
2. [2p] Use your implemented `perceptron` function to compute a vector (numpy array) $\hat w$ with shape `(3,1)` such that
   $$
   (\hat w \cdot \hat x_i) l_i > 0, \quad \forall i=1,\ldots,20
   $$
   and put your answer in `hat_w` below (the last dimension is the bias dimension, i.e. the added dimension we used to derive the perceptron).
3. [3p] Use the vector $\hat w$ that you just found and compute $r = \max_i |x_i|$ (put your result in `r`). Finally, use this to give an upper bound on the number of iterations needed for the perceptron algorithm to converge on this dataset, see chapter 8 in the ITDS notes. Put the result in `iteration_bound`.
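For reference, the standard perceptron convergence theorem (a Novikoff-style bound; whether chapter 8 of the ITDS notes states it in exactly this form is an assumption on my part) bounds the number of updates by

$$
\#\{\text{iterations}\} \;\le\; \left(\frac{r}{\gamma}\right)^2,
\qquad
\gamma = \min_{i} \frac{(\hat w \cdot \hat x_i)\, l_i}{|\hat w|},
$$

where $\hat w$ is any vector that separates the (bias-augmented) data, $\gamma$ is its margin, and $r$ is the largest norm among the data points as defined above.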
import numpy as np
X = np.array([[0.14774693918368506,0.8537253157278155],[-0.1755517430286779,0.8979710703337818],[0.5227216475286975,0.7448281947022451],[-0.5071170511153492,0.8002027400836075],[-0.39436968212400453,1.0177689414422981],[-0.3983065780966649,1.0443663197782966],[-0.08652771617599643,0.48036820824519255],[0.15352541170101042,0.6820807981911706],[-0.3303348532791869,1.120673883903539],[-0.2656220857139274,0.8526638282828739],[0.7259603693529442,0.25428467532034965],[0.4577253912481767,-0.2358809079980879],[0.9722462145222105,0.13128550836973255],[0.4089349951770505,-0.09503914544452634],[0.9718156747909192,0.3524307824261209],[1.2009353774940565,-0.25004126389987974],[1.271791635779178,-0.07571928320750206],[0.36784476124502913,-0.23743021661715671],[0.8918396050420891,-0.1029336332277948],[0.4501578013678095,-0.13188266835015783]])+np.array([10,0]).reshape(1,-1)
y = np.array([1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0])
# Part 1
def perceptron(X_in, labels, max_iter=1000):
    '''Runs the perceptron algorithm on X_in, labels, and does a maximum of max_iter iterations'''
    w = XXX
    # Don't forget the addition of the extra dimension to encode the
    # bias in the perceptron, i.e. adding the extra dimension with value 1
    return w  # Make sure that w has the shape described in the problem
# Part 1
hat_w = XXX
# Part 2
r = XXX
iteration_bound = XXX
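One possible way to fill in the placeholders above is sketched below. This is a minimal sketch, not the official solution: the function name `perceptron_sketch`, the zero initialization, and the use of the bias-augmented points when computing the norms are my own choices, so check them against the ITDS notes.

```python
# A minimal sketch of the perceptron with a bias dimension appended -- not the official solution.
import numpy as np

def perceptron_sketch(X_in, labels, max_iter=1000):
    '''Perceptron on bias-augmented data; returns w with shape (d+1, 1).'''
    n, d = X_in.shape
    X_hat = np.concatenate([X_in, np.ones((n, 1))], axis=1)   # append the constant-1 bias dimension
    w = np.zeros((d + 1, 1))
    for _ in range(max_iter):
        updated = False
        for i in range(n):
            if labels[i] * (X_hat[i] @ w).item() <= 0:        # misclassified (or on the boundary)
                w = w + labels[i] * X_hat[i].reshape(-1, 1)   # perceptron update
                updated = True
        if not updated:                                       # all points correctly classified
            break
    return w

# Example usage on the data above (sketch):
w_sketch = perceptron_sketch(X, y)                            # would have shape (3, 1)
X_hat = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
r_sketch = np.max(np.linalg.norm(X_hat, axis=1))              # largest norm; check whether the notes
                                                              # use the raw or the augmented points
gamma = np.min(y * (X_hat @ w_sketch).flatten()) / np.linalg.norm(w_sketch)
bound_sketch = (r_sketch / gamma) ** 2                        # Novikoff-style bound (see the note above)
```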
For this problem you will need the pandas package and the sklearn package. If you download the updated data folder from the course website you will find a file called `indoor_train.csv`. This file contains a set of positions in $(X, Y, Z)$ coordinates together with a location number; the idea is to assign a room number (Location) to the coordinates (X, Y, Z).

1. Load `indoor_train.csv` using pandas into a dataframe `df_train`.
2. From `df_train`, create two numpy arrays, `Xtrain` and `Ytrain`. They should have shapes `(1154,3)` and `(1154,)` respectively, and their `dtype` should be `float64` and `int64` respectively.
3. Train a Support Vector Classifier, `sklearn.svm.SVC`, on `Xtrain`, `Ytrain` with `kernel='linear'` and name the trained model `svc_train`.

To mimic how Kaggle works, the Autograder has access to a hidden test set and will test your fitted model.
df_train = XXX
Xtrain = XXX
Ytrain = XXX
svc_train = XXX
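A minimal sketch of one way to fill these in follows. The path `data/indoor_train.csv` and the column names `X`, `Y`, `Z`, `Location` are assumptions on my part; inspect `df_train.columns` and adjust accordingly.

```python
# A minimal sketch, not the official solution. Path and column names are assumptions.
import pandas as pd
from sklearn.svm import SVC

df_train = pd.read_csv('data/indoor_train.csv')
df_train.columns = [c.strip() for c in df_train.columns]         # guard against stray whitespace in headers

Xtrain = df_train[['X', 'Y', 'Z']].to_numpy(dtype='float64')     # expected shape (1154, 3)
Ytrain = df_train['Location'].to_numpy(dtype='int64')            # expected shape (1154,)

svc_train = SVC(kernel='linear')
svc_train.fit(Xtrain, Ytrain)
```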
In the following problem we will explore SMS spam texts. The dataset is the SMS Spam Collection Dataset and we have provided for you a way to load the data. If you run the appropriate cell below, the result will be in the `spam_no_spam` variable. The result is a `list` of `tuples`, with the first position in the tuple being the SMS text and the second being a flag: `0 = not spam` and `1 = spam`.
[3p] Let $X$ be the random variable that represents each SMS text (an entry in the list), and let $Y$ represent whether the text is spam or not, i.e. $Y \in \{0,1\}$. Thus $\mathbb{P}(Y = 1)$ is the probability that we get a spam message. The goal is to estimate:
$$
\mathbb{P}(Y = 1 | \text{"free" or "prize" is in } X) \enspace .
$$
That is, the probability that the SMS is spam given that "free" or "prize" occurs in the SMS (this is the precision). Hint: it is a good idea to lower-case the text so that we also catch "Free" and "Prize"; this can be done with `text.lower()` if `text` is a string.

[3p] Estimate the probability that the word "free" or "prize" is in the text given that it is spam (this is the recall), i.e. estimate
$$ \mathbb{P}(\text{"free" or "prize" is in } X \mid Y = 1) \enspace . $$
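Both quantities can be obtained with plug-in (sample-fraction) estimators; whether the notes intend exactly this estimator is an assumption on my part, but it is the natural one here:

$$
\widehat{\mathbb{P}}\big(Y=1 \mid \text{"free" or "prize" in } X\big)
= \frac{\#\{i : \text{"free" or "prize" in } x_i \text{ and } y_i = 1\}}{\#\{i : \text{"free" or "prize" in } x_i\}},
\qquad
\widehat{\mathbb{P}}\big(\text{"free" or "prize" in } X \mid Y=1\big)
= \frac{\#\{i : \text{"free" or "prize" in } x_i \text{ and } y_i = 1\}}{\#\{i : y_i = 1\}} .
$$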
# Run this cell to get the SMS text data
def load_sms():
    import csv
    lines = []
    hamspam = {'ham': 0, 'spam': 1}
    with open('data/spam.csv', mode='r', encoding='latin-1') as f:
        reader = csv.reader(f)
        header = next(reader)
        lines = [(line[1], hamspam[line[0]]) for line in reader]
    return lines

spam_no_spam = load_sms()
# fill in the estimate for part 1 here (should be a number between 0 and 1)
problem4_hatP = XXX
# fill in the estimate for part 2 here (should be a number between 0 and 1)
problem4_hatP2 = XXX
# fill in the calculated l from part 3 here
problem4_l = XXX
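A minimal sketch of the plug-in estimates for parts 1 and 2 is given below; it is not the official solution, and the helper name `contains_free_or_prize` is my own. Part 3 (the quantity `problem4_l`) is not sketched here since its statement does not appear in this section.

```python
# A minimal sketch for parts 1 and 2, not the official solution.
def contains_free_or_prize(text):
    t = text.lower()                             # so that "Free"/"Prize" are also matched
    return ('free' in t) or ('prize' in t)

n_spam = sum(flag for _, flag in spam_no_spam)
n_hit = sum(1 for text, _ in spam_no_spam if contains_free_or_prize(text))
n_hit_and_spam = sum(1 for text, flag in spam_no_spam if flag == 1 and contains_free_or_prize(text))

problem4_hatP = n_hit_and_spam / n_hit           # estimate of P(Y = 1 | "free" or "prize" in X)
problem4_hatP2 = n_hit_and_spam / n_spam         # estimate of P("free" or "prize" in X | Y = 1)
```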