Introduction to Data Science

1MS041, 2023

©2023 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

Other measurements of performance

Recall that in the logistic regression case our function $G(x) \in [0,1]$ represents the probability of the label being $1$. We then used the following rule to construct a decision function from $G$: $$ g(x) = \begin{cases} 1, & \text{if } G(x) > 1/2 \\ 0, & \text{otherwise.} \end{cases} $$ The threshold $1/2$ can be changed in order to trade off precision against recall.

Let's consider the function $$ g_\alpha(x) = \begin{cases} 1, & \text{if } G(x) > \alpha \\ 0, & \text{otherwise,} \end{cases} $$ where $\alpha \in [0,1]$. For each such $\alpha$ we get a precision and a recall, i.e. $$ \begin{aligned} \text{Precision:} \quad \text{Pr}(\alpha) &= P(Y = 1 \mid g_\alpha(X) = 1) \\ \text{Recall:} \quad \text{Re} (\alpha) &= P(g_\alpha(X) = 1 \mid Y = 1). \end{aligned} $$
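To make the definitions concrete, here is a minimal sketch (not part of the original notes) of how $\text{Pr}(\alpha)$ and $\text{Re}(\alpha)$ can be estimated at a single threshold, assuming a boolean numpy array y_true of labels and an array scores of predicted probabilities $G(x)$; the function name is just a placeholder.

import numpy as np

def precision_recall_at(alpha, y_true, scores):
    # y_true: boolean array of true labels; scores: predicted probabilities G(x)
    pred = scores > alpha              # g_alpha(X) on the sample
    precision = np.mean(y_true[pred])  # estimate of P(Y = 1 | g_alpha(X) = 1)
    recall = np.mean(pred[y_true])     # estimate of P(g_alpha(X) = 1 | Y = 1)
    return precision, recall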

These quantities can be plotted as functions of $\alpha$, as we will see below.

In [1]:
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load the handwritten-digits dataset
digits = load_digits()

# Binary task: is the digit 5 or larger?
labels = digits['target'] >= 5
X = digits['data']

X_train, X_test, Y_train, Y_test = train_test_split(X, labels)

# Linear SVM with probability estimates enabled (needed for predict_proba)
per = SVC(kernel='linear', probability=True)
per.fit(X_train, Y_train)
Out[1]:
SVC(kernel='linear', probability=True)
In [10]:
# Same setup but with an RBF kernel, for comparison
per_rbf = SVC(kernel='rbf', probability=True)
per_rbf.fit(X_train, Y_train)
Out[10]:
SVC(probability=True)
In [2]:
from Utils import classification_report_interval
In [3]:
print(classification_report_interval(Y_test,per.predict(X_test)))
            labels           precision             recall

             False  0.89 : [0.77,1.00] 0.86 : [0.74,0.99]
              True  0.87 : [0.74,0.99] 0.90 : [0.77,1.00]

          accuracy                                        0.88 : [0.79,0.97]

In [ ]:
# Predicted class probabilities on the test set; column 1 is P(Y = 1 | x)
per.predict_proba(X_test)
In [11]:
from sklearn.metrics import precision_recall_curve

# Precision and recall at every threshold, for both models
prec, rec, thresh = precision_recall_curve(Y_test, per.predict_proba(X_test)[:, 1])
prec_rbf, rec_rbf, thresh_rbf = precision_recall_curve(Y_test, per_rbf.predict_proba(X_test)[:, 1])

Tradeoff between precision and recall

In [8]:
# Precision and recall as functions of the decision threshold
# (prec and rec have one more entry than thresh, hence the [:-1])
plt.plot(thresh, prec[:-1])
plt.plot(thresh, rec[:-1])
[Figure: precision and recall plotted against the decision threshold]

Precision and recall curve

It is customary to plot precision as a function of recall, in the so-called precision-recall curve.

In [14]:
plt.plot(rec,prec)
plt.plot(rec_rbf,prec_rbf)
plt.xlim(0,1)
plt.ylim(0,1)
[Figure: precision-recall curves for the linear and RBF models]

Average precision

Define $$ p(r) = \text{Pr}(\text{Re}^{-1}(r)), $$

then $p(r)$ is the curve above, with recall $r$ on the $x$-axis and precision on the $y$-axis. Average precision is thus just $$ \text{AP} = \int_0^1 p(r)\, dr, $$ also called the Area Under the Precision-Recall Curve.
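Numerically this integral can be approximated by a step-function (Riemann) sum over the points returned by precision_recall_curve. A minimal sketch using the prec and rec arrays computed above (the result should be close to average_precision_score below):

import numpy as np

# rec is non-increasing, so -diff(rec) gives the positive recall increments;
# weighting each increment by the precision at that point approximates the integral
ap_approx = -np.sum(np.diff(rec) * prec[:-1])
print(ap_approx)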

Write $t(\alpha) = \mathbb{P}(g_\alpha(X) = 1)$ for the detection level, and note that $t(1) = 0$ and $t(0) = 1$. By Bayes' rule $$ p(t) = \text{Pr}(\alpha) = \mathbb{P}(Y=1 \mid g_\alpha(X)=1) = \text{Re}(\alpha) \frac{\mathbb{P}(Y = 1)}{\mathbb{P}(g_\alpha(X) = 1)} $$ and $$ r(t) = \text{Re}(\alpha), $$ so that $$ p(t) = \frac{\mathbb{P}(Y = 1)\, r(t)}{t}. $$

This equation captures the trade-off between precision and recall as functions of the detection level.
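As a sanity check of this identity, here is a small sketch (assuming the fitted model per and the test split from above; the threshold 0.3 is arbitrary) comparing the empirical precision with $\mathbb{P}(Y=1)\, r(t)/t$:

alpha = 0.3                      # arbitrary threshold, for illustration only
scores = per.predict_proba(X_test)[:, 1]
pred = scores > alpha            # g_alpha(X)

t = pred.mean()                  # detection level, P(g_alpha(X) = 1)
prior = Y_test.mean()            # P(Y = 1)
r = pred[Y_test].mean()          # Re(alpha)
p = Y_test[pred].mean()          # Pr(alpha)

print(p, prior * r / t)          # the two numbers should agree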

It may look like average precision is a strange quantity, and indeed it is. See for instance the review paper on the course website.

In [16]:
from sklearn.metrics import average_precision_score
average_precision_score(Y_test,per.predict_proba(X_test)[:,1])
Out[16]:
0.9481477402057775
In [17]:
average_precision_score(Y_test,per_rbf.predict_proba(X_test)[:,1])
Out[17]:
0.9950172870414039

Receiver Operating Characteristic (ROC)

Let's consider, for each threshold $\alpha$, how often each class is flagged as positive: $$ \text{FPR}(\alpha) = \mathbb{P}(g_\alpha(X)=1 \mid Y = 0), \quad \text{known as the false positive rate,} $$ $$ \text{Re}(\alpha) = \mathbb{P}(g_\alpha(X)=1 \mid Y = 1), \quad \text{known as the true positive rate.} $$
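In the same spirit as the precision/recall sketch earlier, these two rates can be estimated directly at a single threshold (again assuming per, X_test and Y_test from above; the threshold is arbitrary):

alpha = 0.3
pred = per.predict_proba(X_test)[:, 1] > alpha   # g_alpha(X)

fpr_alpha = pred[~Y_test].mean()                 # P(g_alpha(X) = 1 | Y = 0)
tpr_alpha = pred[Y_test].mean()                  # P(g_alpha(X) = 1 | Y = 1) = Re(alpha)
print(fpr_alpha, tpr_alpha)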

We can plot these using sklearn as follows

In [21]:
from sklearn.metrics import roc_curve

# False positive rate and true positive rate at every threshold, for both models
fpr, tpr, thresholds = roc_curve(Y_test, per.predict_proba(X_test)[:, 1])
fpr_rbf, tpr_rbf, thresholds_rbf = roc_curve(Y_test, per_rbf.predict_proba(X_test)[:, 1])
In [19]:
plt.plot(thresholds,fpr)
plt.plot(thresholds,tpr)
[Figure: FPR and TPR plotted against the thresholds]

However, it is more common to plot them against each other:

In [22]:
plt.plot(fpr,tpr)
plt.plot(fpr_rbf,tpr_rbf)
[Figure: ROC curves for the linear and RBF models]

This is the plot of $\text{Re}(\text{FPR}^{-1}(r))$.

There is also the AUC (Area Under the Curve), which is defined as $$ \int_0^1 \text{Re}(\text{FPR}^{-1}(r)) dr = -\int_0^1 \text{Re}(\alpha) \text{FPR}'(\alpha) d\alpha $$

Both the AP (Average Precision) and the AUC are used as single-number performance metrics for a classifier.
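For completeness, the ROC AUC for the two models above can be computed with sklearn's roc_auc_score; a short sketch (the exact values depend on the random train/test split):

from sklearn.metrics import roc_auc_score

print(roc_auc_score(Y_test, per.predict_proba(X_test)[:, 1]))
print(roc_auc_score(Y_test, per_rbf.predict_proba(X_test)[:, 1]))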

Let $Z = G(X)$, where $G$ is the predicted probability; then $Z$ has density $f_Z$. Let $f_{Z,Y}$ be the joint density of $Z$ and $Y$.

Let $$ \begin{aligned} f_1(z) &= f_{Z \mid Y=1}(z) \\ f_0(z) &= f_{Z \mid Y=0}(z), \end{aligned} $$ so that we simply get $$ \begin{aligned} \text{FPR}(\alpha) &= \int_{\alpha}^1 f_0(z)\,dz \\ \text{Re}(\alpha) &= \int_{\alpha}^1 f_1(z)\,dz. \end{aligned} $$ As such $$ \text{Re}'(\alpha) = -f_1(\alpha), \qquad \text{FPR}'(\alpha) = -f_0(\alpha), $$ and we can write $$ -\int_0^1 \text{Re}(\alpha) \text{FPR}'(\alpha)\, d\alpha = \int_0^1 \int_z^1 f_1(z') f_0(z)\, dz'\, dz. $$ Let $Z_1$ be a random variable with density $f_1$ and $Z_0$ one with density $f_0$. Then we can write the above as $$ \mathbb{P}(Z_1 > Z_0) = \int_0^1 \int_z^1 f_1(z') f_0(z)\, dz'\, dz. $$ It is useful to think about what this probability means: if we take a randomly chosen sample from the positive class and call it $X_1$, and do the same for class $0$ and call it $X_0$, then the AUC is the probability that $G(X_1) > G(X_0)$.
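To see this interpretation at work, here is a small sketch (using the fitted per from above) that compares the empirical $\mathbb{P}(G(X_1) > G(X_0))$, computed over all positive/negative pairs in the test set, with roc_auc_score; ties are counted as $1/2$, matching the usual convention.

from sklearn.metrics import roc_auc_score

scores = per.predict_proba(X_test)[:, 1]
z1 = scores[Y_test]        # scores G(X) for the positive class
z0 = scores[~Y_test]       # scores G(X) for the negative class

# Fraction of positive/negative pairs with the positive score larger (ties count 1/2)
pairwise = ((z1[:, None] > z0[None, :]).mean()
            + 0.5 * (z1[:, None] == z0[None, :]).mean())

print(pairwise, roc_auc_score(Y_test, scores))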