Make sure you pass the `# ... Test` cells and submit your solution notebook in the corresponding assignment on the course website. You can submit multiple times before the deadline, and your highest score will be used.
This time the assignment consists of only one problem, but we will do a more comprehensive analysis.
Consider the dataset `mammography.mat` that you can download from http://odds.cs.stonybrook.edu/mammography-dataset/. Below you can find the code to load this file into `X` and `Y`; you just need to put the file in your `data` folder. During mammography, the doctor looks for something called calcification, which is calcium deposits in the breast tissue and is used as an indicator of cancer. In this dataset, `X` contains the measurements (there are 6 features) and `Y` is a 0-1 label, where 1 represents calcification and 0 represents no calcification.
Train a model, for instance `LogisticRegression`; for it to work well you would need `StandardScaler`, and you should put everything in a `Pipeline`.
Use `classification_report_interval` from `Utils` to compute the intervals for precision, recall, etc. on the test set with `union_bound` correction at 95% confidence.
Complete the function `cost` so that it computes the cost of a prediction model under a certain prediction threshold (recall our precision-recall lecture and the `predict_proba` function of trained models). What would be the cost, on the test set, of a model that always classifies someone as not calcified?
import scipy.io as so
import numpy as np
data = so.loadmat('data/mammography.mat')
np.random.seed(0)
shuffle_index = np.arange(0,len(data['X']))
np.random.shuffle(shuffle_index)
X = data['X'][shuffle_index,:]
Y = data['y'][shuffle_index,:].flatten()
# Part 1
# Split the X,Y into three parts, make sure that the sizes are
# (6709, 6), (1677, 6), (2797, 6), (6709,), (1677,), (2797,)
X_train, X_test, X_valid, Y_train, Y_test, Y_valid = XXX
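Since the data is already shuffled, one way to hit the required sizes is to slice the arrays at fixed fractions of the rows. The sketch below only illustrates that idea on synthetic data with the same number of rows (11183 = 6709 + 1677 + 2797). The helper name `three_way_split` and the fractions 0.6 and 0.15 are assumptions inferred from the target sizes, not something the assignment prescribes.

```python
import numpy as np

def three_way_split(X, Y, train_frac=0.6, test_frac=0.15):
    """Hypothetical helper: slice pre-shuffled data into train, test and validation parts."""
    n = len(X)
    n_train = int(train_frac * n)   # rounded down
    n_test = int(test_frac * n)     # rounded down
    train = slice(0, n_train)
    test = slice(n_train, n_train + n_test)
    valid = slice(n_train + n_test, n)  # whatever remains
    return X[train], X[test], X[valid], Y[train], Y[test], Y[valid]

# Synthetic stand-in with the same shape as the mammography data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(11183, 6))
Y_demo = rng.integers(0, 2, size=11183)

parts = three_way_split(X_demo, Y_demo)
shapes = [p.shape for p in parts]
```

Note that the return order matches the assignment cell: train, test, validation for `X`, then the same for `Y`.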
# Part 2
# Use X_train,Y_train to train a model from sklearn. Make sure it has the predict_proba function
# for instance LogisticRegression
model = XXX
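As a rough illustration of the scaler-plus-classifier idea, the sketch below chains `StandardScaler` and `LogisticRegression` in a scikit-learn `Pipeline` and checks that the result exposes `predict_proba`. The data is synthetic and the step names `"scaler"` and `"classifier"` are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6)) * np.array([1, 10, 100, 1, 10, 100])
Y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10 > 0).astype(int)

# The scaler is fitted on the training data only, then reused at prediction time
model_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
model_demo.fit(X_demo, Y_demo)

# Column 1 holds the predicted probability of class 1
proba_demo = model_demo.predict_proba(X_demo)[:, 1]
```

Putting the scaler inside the pipeline means the same scaling is applied automatically whenever the model predicts on the test or validation set.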
# Part 3
# Compute precision and recall on the test set using
from Utils import classification_report_interval
# Each precision and recall should be a tuple, for instance you can write
# precision0 = (0.9,0.95)
precision0 = XXX
recall0 = XXX
precision1 = XXX
recall1 = XXX
# The code below will check that you supply the proper type
assert(type(precision0) == tuple)
assert(len(precision0) == 2)
assert(type(recall0) == tuple)
assert(len(recall0) == 2)
assert(type(precision1) == tuple)
assert(len(precision1) == 2)
assert(type(recall1) == tuple)
assert(len(recall1) == 2)
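To give a feeling for what a union bound correction does, the sketch below builds four simultaneous intervals from Hoeffding's inequality by giving each one a quarter of the total error budget. This is not the course's `Utils` implementation, whose internals are not shown here; the function names and all numbers are made up. Note that the sample size for a precision is the number of predictions of that class, while for a recall it is the number of true members of that class.

```python
import math

def hoeffding_interval(estimate, n, alpha):
    """Two-sided Hoeffding interval for the mean of n i.i.d. values in [0, 1]."""
    half_width = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    return (max(0.0, estimate - half_width), min(1.0, estimate + half_width))

def union_bound_intervals(estimates, alpha=0.05):
    """Split alpha evenly so that all intervals hold at once with probability 1 - alpha."""
    alpha_each = alpha / len(estimates)
    return {name: hoeffding_interval(p, n, alpha_each)
            for name, (p, n) in estimates.items()}

# Made-up (point estimate, sample size) pairs, for illustration only
demo_estimates = {
    "precision0": (0.98, 1600),
    "recall0": (0.99, 1640),
    "precision1": (0.70, 40),
    "recall1": (0.50, 37),
}
demo_intervals = union_bound_intervals(demo_estimates)
uncorrected = hoeffding_interval(0.70, 40, 0.05)
```

The corrected interval for `precision1` is wider than `uncorrected`, which is the price paid for all four statements holding simultaneously.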
# Part 4
def cost(model,threshold,X,Y):
    pred_proba = model.predict_proba(X)[:,1]
    predictions = (pred_proba >= threshold)*1
    # Fill in what is missing to compute the cost and return it
    # Note that we are interested in average cost (cost per person)
    return XXX
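The general shape of an average cost computation can be sketched as follows. The actual costs per outcome are not stated in this section, so `cost_fp` and `cost_fn` below are hypothetical parameters with made-up values, and the function works directly on predicted probabilities rather than on a model so that it stays self-contained.

```python
import numpy as np

def average_cost(pred_proba, Y, threshold, cost_fp, cost_fn):
    """Average cost per person; cost_fp and cost_fn are hypothetical unit costs."""
    predictions = (np.asarray(pred_proba) >= threshold) * 1
    Y = np.asarray(Y)
    false_positives = np.sum((predictions == 1) & (Y == 0))
    false_negatives = np.sum((predictions == 0) & (Y == 1))
    return float(cost_fp * false_positives + cost_fn * false_negatives) / len(Y)

# Tiny made-up example: four people, two of them with calcification
proba_demo = [0.1, 0.8, 0.4, 0.9]
Y_demo = [0, 1, 1, 0]

cost_at_half = average_cost(proba_demo, Y_demo, 0.5, cost_fp=1.0, cost_fn=10.0)

# A threshold above 1 never predicts class 1, which mimics the
# "always not calcified" model: every true positive becomes a false negative
naive_cost_demo = average_cost(proba_demo, Y_demo, 1.1, cost_fp=1.0, cost_fn=10.0)
```

With these made-up costs, the naive model pays `cost_fn` for every person who actually has calcification, so its average cost is `cost_fn` times the fraction of positives.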
# Part 4
# Fill in the naive cost of the model that would classify all as non-calcified on the test set
naive_cost = XXX
# Part 5
optimal_threshold = XXX
cost_at_optimal_threshold = XXX
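The description of this part is not included above, so the following is only a guess at its intent: assuming the goal is the threshold that minimizes the average cost, a simple grid search could look like the sketch below. The unit costs, the grid resolution and the data are all made up, and which data split to search on is left open here.

```python
import numpy as np

def average_cost(pred_proba, Y, threshold, cost_fp=1.0, cost_fn=10.0):
    """Average cost per person with hypothetical unit costs."""
    predictions = (np.asarray(pred_proba) >= threshold) * 1
    Y = np.asarray(Y)
    false_positives = np.sum((predictions == 1) & (Y == 0))
    false_negatives = np.sum((predictions == 0) & (Y == 1))
    return float(cost_fp * false_positives + cost_fn * false_negatives) / len(Y)

# Made-up predicted probabilities and labels
proba_demo = np.array([0.055, 0.155, 0.255, 0.355, 0.555, 0.755, 0.955])
Y_demo = np.array([0, 0, 0, 1, 0, 1, 1])

# Evaluate the cost on an evenly spaced grid of candidate thresholds
thresholds = np.linspace(0.0, 1.0, 101)
costs = np.array([average_cost(proba_demo, Y_demo, t) for t in thresholds])

# np.argmin returns the first minimizer if several thresholds tie
best_index = int(np.argmin(costs))
optimal_threshold_demo = float(thresholds[best_index])
cost_at_optimal_demo = float(costs[best_index])
```

Because a false negative is ten times as expensive as a false positive in this toy setup, the minimizing threshold ends up well below 0.5.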
# Part 6
cost_at_optimal_threshold_validation = XXX
# Report the cost interval as a tuple cost_interval = (a,b)
cost_interval = XXX
# The code below will tell you if you filled in the intervals correctly
assert(type(cost_interval) == tuple)
assert(len(cost_interval) == 2)
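One common way to turn an average of bounded, independent costs into a confidence interval is Hoeffding's inequality. Whether that is the intended method for this part is an assumption, as are the cost range and the numbers used below. The sample size 2797 matches the validation set.

```python
import math

def hoeffding_mean_interval(sample_mean, n, lower, upper, alpha=0.05):
    """Two-sided Hoeffding interval for the mean of n i.i.d. values in [lower, upper]."""
    half_width = (upper - lower) * math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    # Clip to the known range of the cost
    return (max(lower, sample_mean - half_width),
            min(upper, sample_mean + half_width))

# Made-up average cost of 0.8 with per-person costs assumed to lie in [0, 10]
demo_interval = hoeffding_mean_interval(sample_mean=0.8, n=2797, lower=0.0, upper=10.0)
demo_half_width = (demo_interval[1] - demo_interval[0]) / 2.0
```

The width scales linearly with the range of the cost, so a large worst-case cost makes the interval wide even when most people incur no cost at all.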
# Part 7
variance_of_C = XXX
# Part 7
# A 95% confidence interval of the random variable C using Bennett's inequality
interval_of_C = XXX
assert(type(interval_of_C) == tuple)
assert(len(interval_of_C) == 2)
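Bennett's inequality uses the variance in addition to the range, which can give a much narrower interval than Hoeffding's when the variance is small. The sketch below assumes that `C` is a per-person cost with $|C - E[C]| \le b$ and that the interval is for its mean; neither is spelled out in this section. It also treats the variance as known, so plugging in an estimated variance is an approximation. The two-sided bound is $2\exp\left(-\frac{n\sigma^2}{b^2}\,h\!\left(\frac{bt}{\sigma^2}\right)\right)$ with $h(u) = (1+u)\log(1+u) - u$, and it is inverted numerically by bisection.

```python
import math

def bennett_h(u):
    """h(u) = (1 + u) * log(1 + u) - u, the function in Bennett's inequality."""
    return (1.0 + u) * math.log1p(u) - u

def bennett_tail(t, n, variance, b):
    """Two-sided Bennett bound on P(|sample mean - true mean| >= t)."""
    return 2.0 * math.exp(-(n * variance / b ** 2) * bennett_h(b * t / variance))

def bennett_half_width(n, variance, b, alpha=0.05):
    """Smallest t (up to bisection precision) whose Bennett bound is at most alpha."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    low, high = 0.0, b  # a deviation of the mean cannot exceed b
    if bennett_tail(high, n, variance, b) > alpha:
        return high  # the bound is vacuous for this n
    for _ in range(200):
        mid = (low + high) / 2.0
        if bennett_tail(mid, n, variance, b) > alpha:
            low = mid
        else:
            high = mid
    return high

# Made-up numbers: n = 2797, variance 0.01, deviations bounded by b = 1
t_demo = bennett_half_width(n=2797, variance=0.01, b=1.0)
interval_demo = (0.05 - t_demo, 0.05 + t_demo)  # around a made-up mean of 0.05

# Hoeffding half-width for values in [0, 1], for comparison
hoeffding_demo = math.sqrt(math.log(2.0 / 0.05) / (2.0 * 2797))
```

In this toy setting `t_demo` is several times smaller than `hoeffding_demo`, which shows why the variance is worth computing first.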