12. Linear Regression

  • Regression
    • linear models and their least-squares estimators
    • assessing fit using diagnostic plots (residual analysis)
    • multiple linear regression
    • prediction
    • prelude to statistical ML
  • Introduction to R in SageMath Jupyter IPython Notebook - SageMath/R

Introduction

Regression is a method for studying the relationship between a response variable $Y$ and a covariate $X$. The covariate is also called a feature or a predictor variable.

A simple way to summarise the relationship between $X$ and $Y$ is through the regression function $r(x)$:

$$ r(x) = E(Y | X=x) = \int y \, f(y|x) dy $$

Our objective is to estimate the regression function $r(x)$ from data of the form:

$$ (Y_1,X_1),(Y_2,X_2),\ldots,(Y_n,X_n) \overset{IID}{\sim} F_{X,Y} $$

We assume that $F_{X,Y}$, the joint distribution of $X$ and $Y$, is parametric and $r$ is linear.

Simple Linear Regression

The simple linear regression model is the case when $X_i$ is real-valued (one-dimensional) and $r(x)$ is assumed to be linear:

$$ r(x) = \beta_0 + \beta_1 x, \qquad \text{and } \quad V(Y | X=x)=\sigma^2 \, \text{ is independent of } x $$

Thus the simple linear regression model is the following:

$$ \boxed{ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad \text{ where, } \quad E(\epsilon_i | X_i)=0 \text{ and } V(\epsilon_i | X_i)=\sigma^2 } $$
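For concreteness, here is a minimal simulation sketch of data from this model; the parameter values $\beta_0=1$, $\beta_1=2$, $\sigma=0.5$ and the sample size $n=50$ are arbitrary illustrative choices, not values from the notes.

import numpy as np

# illustrative, arbitrarily chosen parameters for this sketch
beta0_true, beta1_true, sigma_true, n = 1.0, 2.0, 0.5, 50

np.random.seed(0)
X = np.random.uniform(0, 10, size=n)               # covariate values
epsilon = np.random.normal(0, sigma_true, size=n)  # noise with E(eps|X)=0 and V(eps|X)=sigma^2
Y = beta0_true + beta1_true*X + epsilon            # responses from the simple linear regression model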

The unknown parameters and their estimates in the model are:

  • the intercept $\beta_0$ and its estimate $\widehat{\beta}_0$,
  • the slope $\beta_1$ and its estimate $\widehat{\beta}_1$ and
  • the variance $\sigma^2$ and its estimate $\widehat{\sigma}^2$

The fitted line is: $$ \widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x $$

The fitted or predicted values are: $$ \widehat{Y}_i = \widehat{r}(X_i) $$

The residuals are: $$ \widehat{\epsilon}_i = Y_i-\widehat{Y}_i=Y_i-\left(\widehat{\beta}_0 + \widehat{\beta}_1 X_i\right) $$

The residual sum of squares or RSS, that measures how well the line fits the data, is defined by $$ RSS = \sum_{i=1}^n \widehat{\epsilon}_i^2 $$

The least squares estimates are the values $\widehat{\beta}_0$ and $\widehat{\beta}_1$ that minimise $RSS$ and they are given by:

$$ \boxed{ \widehat{\beta}_1 = \displaystyle{\frac{\sum_{i=1}^n(X_i-\overline{X}_n)(Y_i-\overline{Y}_n)}{\sum_{i=1}^n(X_i-\overline{X}_n)^2}} \, , \qquad \widehat{\beta}_0 = \displaystyle{\overline{Y}_n - \widehat{\beta}_1 \overline{X}_n} \, , \qquad \widehat{\sigma}^2 = \displaystyle{\left(\frac{1}{n-2}\right) \sum_{i=1}^n \widehat{\epsilon}_i^2} } $$

Least Squares and Maximum Likelihood

Suppose we add the assumption about the model's noise that

$$\boxed{\displaystyle{\epsilon_i | X_i \sim Normal(0,\sigma^2) \quad \text{ i.e., }\quad Y_i|X_i \sim Normal(\mu_i,\sigma^2), \quad \text{ where } \quad \mu_i = \beta_0+\beta_1 X_i }}$$

Then, the total likelihood function is:

$$ \begin{align} \displaystyle{\prod_{i=1}^n f(X_i,Y_i)} \, &= \displaystyle{\prod_{i=1}^n f_X(X_i) \, f_{Y|X}(Y_i|X_i)}\\ &= \displaystyle{\prod_{i=1}^n f_X(X_i) \, \prod_{i=1}^n f_{Y|X}(Y_i|X_i)}\\ &=: L_{n,X} \, L_{n,Y|X} \end{align} $$

where, $L_{n,X}:=\prod_{i=1}^n f_X(X_i)$ is the marginal likelihood of $X_1,\ldots,X_n$ that does not depend on the parameters $(\beta_0,\beta_1,\sigma)$, and $L_{n,Y|X}:=\prod_{i=1}^n f_{Y|X}(Y_i|X_i)$ is the conditional likelihood that does depend on the parameters. Therefore the conditional likelihood is the one we consider:

$$ \begin{align} L(\beta_0,\beta_1,\sigma) \quad &\propto \quad \displaystyle{\prod_{i=1}^n f(X_i,Y_i)} \\ &\propto \quad L_{n,Y|X} = \displaystyle{\prod_{i=1}^n f_{Y|X}(Y_i|X_i)}\\ &\propto \quad \displaystyle{\sigma^{-n} \exp\left(-\frac{1}{2 \sigma^2}\sum_{i=1}^n\left(Y_i-\mu_i\right)^2 \right)}\\ \end{align} $$

and the conditional log-likelihood is:

$$ \boxed{ l(\beta_0,\beta_1,\sigma) \quad =\quad \displaystyle{-n \log(\sigma) -\frac{1}{2 \sigma^2} \sum_{i=1}^n\left(Y_i-\mu_i\right)^2 } } $$

To find the MLE of $(\beta_0,\beta_1)$ we need to maximise $l(\beta_0,\beta_1,\sigma)$ for a given $\sigma$. From the above expression it is clear that maximising the log-likelihood is equivalent to minimising the residual sum of squares or RSS given by

$$ \boxed{ \sum_{i=1}^n\left(Y_i-\mu_i\right)^2 } $$

Therefore, we have shown the following Theorem.

Theorem [MLE is LSE]

Under the assumption of normally distributed noise, the maximum likelihood estimator (MLE) is the least squares estimator (LSE).

We can also maximise $l(\beta_0,\beta_1,\sigma)$ over $\sigma$: setting $\frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n \widehat{\epsilon}_i^2 = 0$ and solving gives the MLE for $\sigma$:

$$ \widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \widehat{\epsilon}_i^2 \, . $$

But it is more common in practice to use the unbiased estimator, with $E(\widehat{\sigma}^2)=\sigma^2$, that we saw earlier for sample size $n>2$:

$$ \widehat{\sigma}^2 = \displaystyle{\left(\frac{1}{n-2}\right) \sum_{i=1}^n \widehat{\epsilon}_i^2} \, . $$

However the bias is usually quite small if $n$ is large.

Properties of the Least Squares Estimator (LSE)

It's finally time to obtain the standard errors and limiting distribution of the least squares estimator (which is also the MLE under normal noise).

In regression we are interested in the properties of the estimators conditional on the covariates

$$X_{1:n}:= (X_1,X_2,\ldots,X_n)$$

Conditional Mean and Variance of LSE

Let $\widehat{\beta}=(\widehat{\beta}_0,\widehat{\beta}_1)^T$ denote the least squares estimator (which is also the MLE). Then

$$ \begin{align} E \left(\widehat{\beta} \, | \, X_{1:n} \right) &= \displaystyle{\left( {\begin{array}{c} \beta_0 \\ \beta_1 \\ \end{array} } \right)}\\ V \left(\widehat{\beta} \, | \, X_{1:n} \right) &= \displaystyle{\frac{\sigma^2}{n s_X^2} \left( {\begin{array}{cc} \frac{1}{n}\sum_{i=1}^n X_i^2 & -\overline{X}_n \\ -\overline{X}_n & 1\\ \end{array} } \right)} \end{align} $$

where,

$$ s_X^2 = \frac{1}{n} \sum_{i=1}^n \left(X_i -\overline{X}_n\right)^2 $$

Estimated Standard Errors

The estimated standard errors for $\widehat{\beta}_0$ and $\widehat{\beta}_1$, or more precisely, the estimated standard errors conditional on the covariates, are given by the square-roots of the diagonal terms of the variance-covariance matrix $V \left(\widehat{\beta} \, | \, X_{1:n} \right)$ with the estimate $\widehat{\sigma}$ substituted for $\sigma$, as follows:

$$ \begin{align} \widehat{se}\left(\widehat{\beta}_0\right) := \widehat{se}\left(\widehat{\beta}_0 \, | \, X_{1:n} \right) \, &= \, \frac{\widehat{\sigma}}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^n X_i^2}{n}}\\ \widehat{se}\left(\widehat{\beta}_1\right) := \widehat{se}\left(\widehat{\beta}_1 \, | \, X_{1:n}\right) \, &= \, \frac{\widehat{\sigma}}{s_X \sqrt{n}} \end{align} $$

Thus, under appropriate modeling assumptions in simple linear regression, we have the following four properties.

Four Asymptotic Properties of the LSE

1. Asymptotic Consistency

As $n \to \infty$, the LSE, i.e. $\widehat{\beta}_0$ and $\widehat{\beta}_1$, converges in probability to the parameters, i.e., $\beta_0,\beta_1$, generating the data $(Y_1,X_1),(Y_2,X_2),\ldots,(Y_n,X_n)$ as summarised below.

$$ \boxed{ \widehat{\beta}_0 \overset{P}{\to} \beta_0 \quad \text{ and } \quad \widehat{\beta}_1 \overset{P}{\to} \beta_1 } $$

2. Asymptotic Normality

As $n \to \infty$, the LSE, i.e. $\widehat{\beta}_0$ and $\widehat{\beta}_1$, standardised by the estimated standard errors, converges in distribution to a standard normal random variable, as summarised below.

$$ \boxed{ \frac{\widehat{\beta}_0 - \beta_0}{\widehat{se}\left(\widehat{\beta}_0\right)} \overset{d}{\to} Normal(0,1) \quad \text{ and } \quad \frac{\widehat{\beta}_1 - \beta_1}{\widehat{se}\left(\widehat{\beta}_1\right)} \overset{d}{\to} Normal(0,1) } $$

Note: with the estimated standard error plugged in, the exact distribution in the above is actually a Student-t distribution (with $n-2$ degrees of freedom here), but the normal distribution is a good approximation if $n > 30$.
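As a quick numerical check of this note, one can compare the normal and Student-t critical values using scipy.stats; a small sketch, where the choice $n=15$ matches the LSAT/GPA example used further below.

from scipy.stats import norm, t

alpha = 0.05
n = 15                                  # sample size of the LSAT/GPA example below
print(norm.ppf(1 - alpha/2))            # z_{alpha/2}, approximately 1.96
print(t.ppf(1 - alpha/2, df=n - 2))     # t critical value with n-2 degrees of freedom, approximately 2.16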

3. Approximate $1-\alpha$ Confidence Interval

An approximate $1-\alpha$ confidence interval for $\beta_0$ and $\beta_1$, obtained from the asymptotic normal distribution as $n$ gets large, is:

$$ \boxed{ \widehat{\beta}_0 \, \pm \, z_{\alpha/2} \, \widehat{se}\left(\widehat{\beta}_0\right) \quad \text{ and } \quad \widehat{\beta}_1 \, \pm \, z_{\alpha/2} \, \widehat{se}\left(\widehat{\beta}_1\right) } $$

4. The Wald Test

Recall that the Wald test statistic for testing the null hypothesis with null value $\beta^{(0)}$,

$$ H_0: \beta = \beta^{(0)} \quad \text{ versus } \quad H_1: \beta \neq \beta^{(0)}, \quad \text{ is } \quad W = \frac{\widehat{\beta}-\beta^{(0)}}{\widehat{se}\left(\widehat{\beta}\right)} \, . $$

Thus the Wald test for testing $H_0: \beta_1=0$ versus $H_1: \beta_1 \neq 0$ is to reject $H_0$ if $|W| > z_{\alpha/2}$ where $W=\frac{\widehat{\beta}_1}{\widehat{se}\left(\widehat{\beta}_1\right)}$.

Implementing Simple Linear Regression from Scratch

Using the above formulas we can implement Python functions to calculate the least squares estimates, $\widehat{\beta}_0$ and $\widehat{\beta}_1$, that minimise $RSS$.

In [9]:
import numpy as np 
import matplotlib.pyplot as plt

def estimate_coefficients(x, y): 
    # size of the dataset  
    n = np.size(x) 
    # mean of x and y
    mean_x, mean_y = np.mean(x), np.mean(y) 
    # xy cross-deviation and xx deviation
    SS_xy = np.sum(y*x) - n*mean_y*mean_x
    SS_xx = np.sum(x*x) - n*mean_x*mean_x
    # calculating LSE of regression coefficients 
    b1_hat = SS_xy / SS_xx 
    b0_hat = mean_y - b1_hat*mean_x 
    sigma_hat2 = np.mean((y - (b0_hat + b1_hat * x))^2)
    if n>2:
        sigma_hat2 = sigma_hat2*n/(n-2)
    sigma_hat=np.sqrt(sigma_hat2)
    return(b0_hat, b1_hat, sigma_hat)

def standard_errors(x,y):
    n = np.size(x) 
    b0_hat,b1_hat,s_hat = estimate_coefficients(x,y)
    mean_x = np.mean(x)
    s2X = np.mean( (x-mean_x)^2 )
    se_b1 = s_hat/np.sqrt(s2X*n)
    se_b0 = se_b1*np.sqrt(np.mean(x^2))
    return (se_b0, se_b1)

def plot_regression_line(x, y, b): 
    # plotting the data points on a graph
    plt.scatter(x, y, color = "m",marker = "o", s = 10) 
    # predicted response vector 
    y_pred = b[0] + b[1]*x 
    # plotting the fitted regression line
    plt.plot(x, y_pred, color = "b")
    # putting generic labels for x and y axis
    plt.xlabel('x') 
    plt.ylabel('y') 
    # function to show plotted graph
    plt.show()

def SimpleLinearRegression(x,y): 
    # estimating coefficients 
    b = estimate_coefficients(x, y) 
    print("Estimated coefficients:\nb0_hat = {} \nb1_hat = {}\nsigma_hat = {}".format(b[0], b[1],b[2])) 
    # plotting fitted regression line  with data
    plot_regression_line(x, y, b)
In [10]:
# Datasets for x and y 
LSAT=np.array([576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]) # LSAT data
GPA=np.array([3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 3.96]) # GPA data

SimpleLinearRegression(LSAT,GPA)
Estimated coefficients:
b0_hat = 0.5998612958803515 
b1_hat = 0.004267223518635576
sigma_hat = 0.28404612916999544

We can look at the residuals of the fitted line as follows.

In [12]:
predictedGPA = 0.5998612958803515 + 0.004267223518635576*LSAT
residuals = GPA - predictedGPA
plt.scatter(LSAT, residuals, color = "k",marker = "o", s = 10) 
plt.axhline() # draw a y=0 line
# putting generic labels for x and y axis
plt.ylabel('$\\epsilon_i$') 
plt.xlabel('LSAT')
plt.show()
# in general we want residuals to be Normally distributed about 0 with the same variance

Residual Analysis

Looking at the residuals $\widehat{\epsilon}_i$ in the above plot we notice that just $4$ of the $15$ datapoints are above $0$. If the $\epsilon_i$ were truly IID $Normal(0,\sigma^2)$, we would expect roughly the same number of points to be spread above and below zero, i.e., the $x$-axis, in an equally likely manner. Also, we would expect more points to be closer to zero and fewer points to be further away.

In conclusion, the residuals of the linear regression of GPA on LSAT do not look like they are normally distributed.

We could try different approaches to improve the model. For example, we could try to increase the sample size, or standardise the scales by subtracting the sample mean and dividing by the sample standard deviation for the $x$ and $y$ values separately and doing the regression with the standardised data, etc.

The ReliaWiki page on simple linear regression has some simple examples of residual plots that are useful for insight:

Examples of residual plots are shown in the figure on that page. (a) is a satisfactory plot with the residuals falling in a horizontal band with no systematic pattern. Such a plot indicates an appropriate regression model. (b) shows residuals falling in a funnel shape. Such a plot indicates an increase in the variance of the residuals, so the assumption of constant variance is violated here. A transformation of the data may be helpful in this case (see Transformations). If the residuals follow the pattern of (c) or (d), then this is an indication that the linear regression model is not adequate. Addition of higher-order terms to the regression model or a transformation of the data may be required in such cases. A plot of residuals may also show a pattern as seen in (e), indicating that the residuals increase (or decrease) as the run order sequence or time progresses. This may be due to factors such as operator-learning or instrument-creep and should be investigated further.
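In the spirit of the examples just described, another standard diagnostic is to plot the residuals against the fitted values rather than against the covariate. Here is a minimal sketch reusing the predictedGPA and residuals arrays computed in the earlier cell:

# residuals versus fitted values: ideally a patternless horizontal band around 0
plt.scatter(predictedGPA, residuals, color="k", marker="o", s=10)
plt.axhline(0)
plt.xlabel('fitted GPA')
plt.ylabel('$\\epsilon_i$')
plt.show()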

We can finally obtain 95% confidence intervals for the fitted parameters in the simple linear regression model and do a Wald test as follows.

In [13]:
b0_hat, b1_hat, s_hat = estimate_coefficients(LSAT,GPA)
se_b0,se_b1 = standard_errors(LSAT,GPA)
print ("Estimated standard errors for beta_0_hat and beta_1_hat are:")
print (se_b0,se_b1 )
print ("and the approximate 95% confidence intervals for beta_0_hat is:")
print ("          [ ", b0_hat-2*se_b0," , ", b0_hat+2*se_b0, " ]")
print ("and the approximate 95% confidence intervals for beta_1_hat is:")
print ("          [ ", b1_hat-2*se_b1," , ", b1_hat+2*se_b1, " ]")
print ("The Wald test for the null hypothesis H0 that beta_1 = 0 is:")
W = (b1_hat-0)/se_b1
if abs(W) > 2:
    print ("Reject H0 that beta_1=0 at alpha=0.05, since W = ",W)
else:
    print ("fail to reject H0 that beta_1=0 at alpha=0.05, since W = ",W)
Estimated standard errors for beta_0_hat and beta_1_hat are:
1.0927735237389298 0.0018163754932013147
and the approximate 95% confidence intervals for beta_0_hat is:
          [  -1.585685751597508  ,  2.785408343358211  ]
and the approximate 95% confidence intervals for beta_1_hat is:
          [  0.0006344725322329466  ,  0.007899974505038206  ]
The Wald test for the null hypothesis H0 that beta_1 = 0 is:
Reject H0 that beta_1=0 at alpha=0.05, since W =  2.34930692172834

Multiple Regression

This is just as simple, except that we have more than one covariate.

Now, let's suppose that the covariate (also called the feature, predictor or independent variable) is a vector of length $k$. So our data for regression is of the following form:

$$ (Y_1,X_1), (Y_2,X_2), \ldots, (Y_i,X_i), \ldots, (Y_n,X_n) $$

where, $X_i$ is a vector of length $k$ for the $i$-th observation or datapoint $(Y_i,X_i)$.

$$ X_i = (X_{i,1},X_{i,2},\ldots,X_{i,k}) \, . $$

Then the linear regression model is:

$$ Y_i = \displaystyle{\sum_{j=0}^k \beta_j X_{i,j} + \epsilon_i, \quad \text{ for } i \in \{1,2,\ldots,n\} } $$

where $\beta_0$ is the intercept term with $X_{i,0}=1$ for each $i \in \{1,2,\ldots,n\}$ and

$$ E \left( \epsilon_i \, | \, X_{i,1}, X_{i,2}, \ldots, X_{i,k} \right) = 0. $$

We can denote the model using matrices and vectors more conveniently as follows:

$$ Y = \displaystyle{\left( {\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array} } \right)} \, , \qquad X = \displaystyle{\left( {\begin{array}{cccc} 1& X_{1,1}& \ldots& X_{1,k} \\ 1& X_{2,1}& \ldots& X_{2,k} \\ \vdots & \vdots & \vdots & \vdots\\ 1& X_{n,1}& \ldots& X_{n,k} \end{array} } \right)} \, , \qquad \beta = \displaystyle{\left( {\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{array} } \right)} \, , \qquad \epsilon = \displaystyle{\left( {\begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{array} } \right)} \, . $$

With $X \in \mathbb{R}^{n \times (k+1)}$, i.e., $X$ being an $n \times (k+1)$ matrix, $\beta \in \mathbb{R}^{(k+1) \times 1}$, i.e., $\beta$ being a column vector with $k+1$ rows, and $\epsilon \in \mathbb{R}^{n \times 1}$, i.e., $\epsilon$ being a column vector with $n$ rows, we obtain the multiple regression model:

$$ \boxed{ Y = X \beta + \epsilon \, . } $$

Just as in the 1D case with $k=1$, the least squares estimate is as follows, under the assumption that $X^T X$ is invertible:

$$ \boxed{ \begin{align} \widehat{\beta} &= \left( X^T X\right)^{-1} X^T Y\\ V\left(\widehat{\beta} | X_{1:n} \right) &= \sigma^2 \left( X^T X \right)^{-1} \\ \widehat{\beta} &\approx Normal \left(\beta, \sigma^2 \left( X^T X\right)^{-1} \right) \end{align} \, . } $$

The estimate of the regression function is:

$$ \boxed{ \widehat{r}(x) = \sum_{j=0}^k \widehat{\beta}_j \, x_j \, . } $$

An unbiased estimate of $\sigma^2$ is:

$$ \widehat{\sigma}^2 = \left( \frac{1}{n-(k+1)} \right) \sum_{i=1}^n \widehat{\epsilon}_i^2 \, $$

where $\widehat{\epsilon}$ is the vector of residuals:

$$ \boxed{ \widehat{\epsilon}= Y - X \widehat{\beta} } \ , \text{ i.e.,} \quad \widehat{\epsilon} = \displaystyle{\left( {\begin{array}{c} \widehat{\epsilon}_1 \\ \widehat{\epsilon}_2 \\ \vdots \\ \widehat{\epsilon}_n \end{array} } \right)} = \displaystyle{\left( {\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array} } \right)} \ - \ \displaystyle{\left( {\begin{array}{cccc} 1& X_{1,1}& \ldots& X_{1,k} \\ 1& X_{2,1}& \ldots& X_{2,k} \\ \vdots & \vdots & \vdots & \vdots\\ 1& X_{n,1}& \ldots& X_{n,k} \end{array} } \right)} \ \displaystyle{\left( {\begin{array}{c} \widehat{\beta}_0 \\ \widehat{\beta}_1 \\ \vdots \\ \widehat{\beta}_k \end{array} } \right)} $$

An approximate $1-\alpha$ confidence interval for $\beta_j$ is

$$ \boxed{ \widehat{\beta}_j \pm z_{\alpha/2} \widehat{se}(\widehat{\beta}_j) } $$

where $\left(\widehat{se}(\widehat{\beta}_j)\right)^2$ is the $j$-th diagonal entry of the matrix $\widehat{\sigma}^2 (X^T X)^{-1}$.
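The boxed formulas above translate directly into a few lines of numpy. Here is a minimal self-contained sketch with an arbitrary illustrative design matrix (an intercept plus two covariates); in practice one would prefer scipy.linalg.lstsq or a QR/SVD-based solver to forming $(X^T X)^{-1}$ explicitly.

import numpy as np

# a small illustrative dataset (values chosen arbitrarily for this sketch)
X = np.array([[1, 2.0, 1.0],
              [1, 3.0, 0.5],
              [1, 5.0, 2.0],
              [1, 6.5, 1.5],
              [1, 8.0, 3.0]])              # n x (k+1) design matrix with a column of 1s, k = 2
y = np.array([3.1, 3.9, 6.8, 7.9, 10.2])

n, kp1 = X.shape                           # kp1 = k + 1
XtX_inv = np.linalg.inv(X.T @ X)           # (X^T X)^{-1}, assumed invertible
beta_hat = XtX_inv @ X.T @ y               # least squares estimate of beta
residuals = y - X @ beta_hat               # residual vector
sigma2_hat = residuals @ residuals / (n - kp1)    # unbiased estimate of sigma^2
se_beta = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # estimated standard errors of beta_hat
ci = np.column_stack([beta_hat - 2*se_beta, beta_hat + 2*se_beta])  # approx 95% CIs
print(beta_hat, se_beta)
print(ci)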

Solving Least Squares Using Numerical Linear Algebra Routine in scipy

We can use scipy.linalg.lstsq to get the least squares solution to our regression problems quite easily, including generalisation to multiple linear regression when the covariates are in more than 1 dimension.

Let us learn how to do a least squares fit with lstsq by setting up the right design matrix.

Example 1: Fitting a Line is Simple Linear Regression

In [7]:
from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np

# suppose we have the following data
x = np.array([1, 2.5, 3.5, 4, 5, 7, 8.5])
y = np.array([0.3, 1.1, 1.5, 2.0, 3.2, 6.6, 8.6])

#We want to fit a line of the form y = a + b*x to this data. We first form the 
#“design matrix” M, with a constant column of 1s and a column containing x
M1 = x[:, np.newaxis]^[0, 1]
M1
Out[7]:
array([[1. , 1. ],
       [1. , 2.5],
       [1. , 3.5],
       [1. , 4. ],
       [1. , 5. ],
       [1. , 7. ],
       [1. , 8.5]])
In [8]:
#We want to find the least-squares solution to 
#M1.dot(p) = y, where p is a vector with length 2 that holds the parameters a and b.
p, res, rnk, s = lstsq(M1, y)
p
Out[8]:
array([-1.93080357,  1.16875   ])
In [9]:
plt.plot(x, y, 'o', label='data')
xx = np.linspace(0, 9, 101)
yy = p[0] + p[1]*xx
plt.plot(xx, yy, label='least squares fit, $y = a + bx$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.show()

Example 2: Fitting a Quadratic is also Simple Linear Regression

Suppose we want to fit a quadratic polynomial of the form $y = a + bx^2$ to the same data. Then we first form the design matrix M2, with a constant column of 1s and a column containing x^2, as follows:

In [10]:
M2 = x[:, np.newaxis]^[0, 2]
M2
Out[10]:
array([[ 1.  ,  1.  ],
       [ 1.  ,  6.25],
       [ 1.  , 12.25],
       [ 1.  , 16.  ],
       [ 1.  , 25.  ],
       [ 1.  , 49.  ],
       [ 1.  , 72.25]])
In [11]:
# least square solution with M2
p, res, rnk, s = lstsq(M2, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(0, 9, 101)
yy = p[0] + p[1]*xx^2
plt.plot(xx, yy, label='least squares fit, $y = a + bx^2$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.show()

Example 3: Fitting a 3rd Order Polynomial is Multiple Linear Regression

Suppose we want to fit a degree-3 polynomial of the form $y = \beta_0 + \beta_1 x + \beta_2 x^2+ \beta_3 x^3$ to the same data. Then we first form the design matrix M3, with a constant column of 1s (i.e., x^0) and three additional columns containing x^1, x^2 and x^3, as follows:

In [12]:
# Fitting a cubic polynomial is the same idea
M3 = x[:, np.newaxis]^[0, 1, 2, 3]
M3
Out[12]:
array([[  1.   ,   1.   ,   1.   ,   1.   ],
       [  1.   ,   2.5  ,   6.25 ,  15.625],
       [  1.   ,   3.5  ,  12.25 ,  42.875],
       [  1.   ,   4.   ,  16.   ,  64.   ],
       [  1.   ,   5.   ,  25.   , 125.   ],
       [  1.   ,   7.   ,  49.   , 343.   ],
       [  1.   ,   8.5  ,  72.25 , 614.125]])
In [13]:
p, res, rnk, s = lstsq(M3, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(0, 9, 101)
yy = p[0] + p[1]*xx + p[2]*xx^2 + p[3]*xx^3
plt.plot(xx, yy, label='least squares fit, $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.show()

Sample Exam Problem 8

Using the lstsq method shown above, and data arrays x and y in the next cell that contain log light intensity and log surface temperature in a given range of measurements from nearby stars, compute the least squares estimates of $\beta_0$ and $\beta_1$ under the simple linear regression model with an intercept and a slope term. Make a plot similar to the one above with the data points and the fitted regression line.

In [14]:
# Sample Exam Problem 8 
# do not change this import and data block ########################
from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]
########### end of import and data block ##########################

# Replace only ZZZ by the right values
M1 = ZZZ # design matrix M1
b, res, rnk, s = lstsq(ZZZ,ZZZ)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(ZZZ, ZZZ, 101)
yy = ZZZ *xx
plt.plot(xx, yy, label='least squares fit')
plt.xlabel('log light intensity (X)')
plt.ylabel('log surface temperature (Y)')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.text(4, 4.7, r'$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x, \quad \
\widehat{\beta}_0 = $ %(b0)0.3f , $\widehat{\beta}_1 = $ %(b1)0.3f' % {'b0': b[0], 'b1': b[1]} )
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-6a4cd5f4215d> in <module>()
     16 
     17 # Replace only ZZZ by the right values
---> 18 M1 = ZZZ # design matrix M1
     19 b, res, rnk, s = lstsq(ZZZ,ZZZ)
     20 plt.plot(x, y, 'o', label='data')

NameError: name 'ZZZ' is not defined
In [15]:
# Sample Exam Problem 8 Solution
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]

from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np
M1 = x[:, np.newaxis]^[0, 1]
b, res, rnk, s = lstsq(M1, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(3.9, 5.8, 101)
yy = b[0] + b[1]*xx
plt.plot(xx, yy, label='least squares fit')
plt.xlabel('log light intensity (X)')
plt.ylabel('log surface temperature (Y)')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.text(4, 4.7, r'$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x, \quad \
\widehat{\beta}_0 = $ %(b0)0.3f , $\widehat{\beta}_1 = $ %(b1)0.3f' % {'b0': b[0], 'b1': b[1]} )
plt.show()

Prediction

Let's consider the 1D setting for simplicity of notation. Suppose we have estimated a regression model: $$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x $$ from data $(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n)$.

Now suppose we observe the value $X=x_*$ of the covariate of a new observation but do not observe the response $Y_*$ and want to predict it. An estimate of $Y_*$ is

$$ \boxed{ \widehat{Y}_* = \widehat{\beta}_0 + \widehat{\beta}_1 x_* \, . } $$

By the formula for the variance of the sum of two random variables:

$$ V(\widehat{Y}_*) = V(\widehat{\beta}_0 + \widehat{\beta}_1 x_*) = V(\widehat{\beta}_0) + x_*^2 V(\widehat{\beta}_1 ) + 2 x_* Cov (\widehat{\beta}_0,\widehat{\beta}_1) $$

We have all the needed terms to compute $V(\widehat{Y}_*)$ from the earlier result on the conditional variance of the least squares estimate:

$$ V \left( \widehat{\beta} \, | \, X_{1:n} \right) = \frac{\sigma^2}{n s_X^2} \left( {\begin{array}{cc} \frac{1}{n}\sum_{i=1}^n X_i^2 & -\overline{X}_n \\ -\overline{X}_n & 1\\ \end{array}} \right) $$

The estimated standard error $\widehat{se}(\widehat{Y}_*)$ is just $\sqrt{V(\widehat{Y}_*)}$ with $\widehat{\sigma}^2$ substituted in for $\sigma^2$. An approximate $1-\alpha$ confidence interval for $Y_*$ is called an approximate $1-\alpha$ prediction interval for $Y_*$ and is given by

$$ \boxed{ \widehat{Y}_* \pm z_{\alpha/2} \widehat{\xi}_n \, , \quad \text{ where } \quad \widehat{\xi}^2_n = \widehat{\sigma}^2 \left( \frac{\sum_{i=1}^n (X_i-x_*)^2}{n \sum_{i=1}^n (X_i-\overline{X}_n)^2} + 1 \right) } $$
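Here is a small sketch applying these formulas to the LSAT/GPA example from earlier, reusing the estimate_coefficients function and data arrays defined above; the new covariate value x_star = 600 is an arbitrary illustrative choice.

# approximate 95% prediction interval for a new response at x_star
x_star = 600                                        # hypothetical new LSAT score
b0_hat, b1_hat, s_hat = estimate_coefficients(LSAT, GPA)
Y_star_hat = b0_hat + b1_hat*x_star                 # point prediction of the new GPA

n = np.size(LSAT)
xbar = np.mean(LSAT)
xi2_hat = s_hat**2 * (np.sum((LSAT - x_star)**2) / (n*np.sum((LSAT - xbar)**2)) + 1)
half_width = 2*np.sqrt(xi2_hat)                     # z_{0.025} is approximately 2, as in the earlier cells
print(Y_star_hat - half_width, Y_star_hat + half_width)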

Multiple Regression on 2018 Swedish Election Data

If you are interested, you already have the basic skills to look at the data from the 2018 Swedish election using these ideas.

Try to model, say, the $\log$ of the number of district-level votes for the two most voted-for parties.

You can introduce the latitude of the district centres (if you have such information from a geospatial database you could join), the distance of the district to one of the four largest cities in Sweden, or socio-economic indicators of the district from the Swedish Central Statistical Bureau, etc., as covariates.

But this is a good project in itself and beyond the current scope (mainly due to time limitations).

Prelude to Statistical Machine Learning

Here, we just start you off on the path to more statistical modeling for purposes of prediction. Statistical learning theory, developed since the 1970s, is what mathematically justifies these methods.

Loss functions and gradient descent

In the above example with linear regression we wanted to minimise the vertical distance between the fitted line and the data; this vertical distance is a prime example of a loss function. In general, when we are faced with a regression problem, we want a way of measuring how good our model is. The quantity that we want to minimise is called the loss function, and its expectation (over different sample data-points) is called the risk. The mathematical statistical justification for this approach of minimising the expected loss or risk is called empirical risk minimisation, as we will see in the sequel in more detail.

Let us circle back to linear regression once again. The way the np.argmin method searched for the minimum of:

$$L(a,b) = \sum_{i=1}^N (y_i - f_{a,b}(x_i))^2$$

was by simply evaluating $L(a,b)$ at each of a set of proposed parameter values and picking the value that minimised $L(a,b)$. In the cell below the search is over the guesses for $b$ in the array prop_b, while for each $b$ the intercept $a$ is taken to be the mean of the corresponding unadjusted residuals.

In [16]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
x = np.linspace(-1,1,10)
epsilon = np.random.normal(loc=0,scale=1,size=len(x),)
y = 2*x + 1 + epsilon
In [17]:
# These are our guesses for b
prop_b = np.linspace(0,4,100)

# Compute the residual if we use these b's, remember this is with a = 0,
# as noted above the a can be taken to be the mean of the unadjusted residual for each b

unadj_residual = y.reshape(-1,1)-prop_b.reshape(1,-1)*x.reshape(-1,1)

# Compute the variance of the unadjusted residual
variance_residual = np.std(unadj_residual,axis=0)**2

# Plot the variance for each proposed value of b
plt.plot(prop_b,variance_residual)

# Find the minimum variance
min_index = np.argmin(variance_residual)

# Set a and b based on this minimum
a = np.mean(unadj_residual[:,min_index])
b = prop_b[min_index]

print("Optimal a: %f, optimal b: %f" % (a,b))
Optimal a: 0.902859, optimal b: 1.818182

np.argmin? # see the docstring for np.argmin and other functions/methods we are using throughout if you need to know right away.

This approach of evaluating the loss at a set of parameter values quickly becomes infeasible when the dimension of the problem is larger than $1$.

Even if we just have two guesses for each dimension of a parameter space with $d$ dimensions, we will need to evaluate the loss at $2^d$ parameter values. When $d=10, 100, 1000$ the number of evaluation points becomes $1024$, about $1.27 \times 10^{30}$ and about $1.07 \times 10^{301}$, respectively.

Often in big-data settings, the number of dimensions of the regression problem can easily be in the thousands. Thus, we need a systematic way to find the optimal parameters, i.e., the parameters that minimise the loss function.

In [18]:
def L(a,b):
    return np.mean((y-(a+b*x))^2)
In [19]:
plot3d(L,(a,0.5,1.5),(b,1.5,2.5))
Out[19]:

The iterative solution is called gradient descent and it goes like this (a plain numpy sketch of these updates is given right after this list):

  • Initialise: Start with some initial parameter values, say $(a_0,b_0) = (0,0)$ in our linear regression example, at iteration $i=0$.
  • Update: Then construct an update rule like the following to obtain the parameter values at iteration $i+1$ from those at iteration $i$:
    • $a_{i+1} = a_{i}-l \frac{\partial L}{\partial a}(a_{i},b_{i})$
    • $b_{i+1} = b_{i}-l \frac{\partial L}{\partial b}(a_{i},b_{i})$
    • where $l > 0$ is called the learning rate.
  • Stop: Finally, stop when a stopping rule such as $$\left| L(a_{i+1},b_{i+1})- L(a_{i},b_{i}) \right| < \tau$$ is satisfied, where $\tau$ is some tolerance threshold that says we are close enough to the minimum value found by our iteration.
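Here is a minimal non-interactive sketch of these update rules in plain numpy, using the analytical partial derivatives of the mean squared error loss for the simulated x and y above; the learning rate, starting point, iteration cap and tolerance are arbitrary illustrative choices.

# gradient descent on L(a,b) = mean((y - (a + b*x))**2) for the simulated data above
a_i, b_i = 0.0, 0.0      # initial parameter values
l = 0.1                  # learning rate
tau = 1e-8               # tolerance for the stopping rule
L_prev = np.mean((y - (a_i + b_i*x))**2)

for i in range(10000):
    # partial derivatives of the mean squared error loss at the current (a_i, b_i)
    dLda = -2*np.mean(y - (a_i + b_i*x))
    dLdb = -2*np.mean((y - (a_i + b_i*x))*x)
    a_i, b_i = a_i - l*dLda, b_i - l*dLdb
    L_curr = np.mean((y - (a_i + b_i*x))**2)
    if abs(L_curr - L_prev) < tau:   # stopping rule
        break
    L_prev = L_curr

print(a_i, b_i)          # should be close to the grid-search values found above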
In [20]:
G(a,b) = L(a,b) # Create a symbolic expression
In [51]:
@interact
def gradient_descent_3d(x_start=input_box(0,type=float,label='x_start'), y_start=input_box(0,type=float,label='y_start'), n_steps=(1,(1,30))):
    P=points([])

    lr = 0.1
    point = (x_start,y_start)
    levels = []
    for i in range(0,n_steps):
        dLdb = G.diff(b).subs(b=point[1]).subs(a=point[0]).n(10)
        dLda = G.diff(a).subs(b=point[1]).subs(a=point[0]).n(10)
        new_a = point[0] - lr*dLda
        new_b = point[1] - lr*dLdb

        P+=arrow((point[0],point[1],L(point[0],point[1])),(new_a,new_b,L(new_a,new_b)),color='red',width=0.5)
        #plt.arrow(point[0],point[1],-lr*dLda*0.7,-lr*dLdb*0.7,head_width=0.05)
        point = (new_a,new_b)
    x_min = min(x_start,new_a)-0.1
    x_max = max(x_start,new_a)+0.1
    y_min = min(y_start,new_b)-0.1
    y_max = max(y_start,new_b)+0.1
    P+=plot3d(L,(a,x_min,x_max),(b,y_min,y_max))
    P.show()
In [44]:
import numpy as np

@interact
def gradient_descent_contours(n_steps=(1,(1,30))):
    # make a contour plot
    P=contour_plot(G, (a, 0.3, 1.5), (b, 1.2, 2.5),fill=False, aspect_ratio=1, contours=20)
    lr = 0.2
    point = (0.3,1.2)

    for i in range(0,n_steps):
        dLdb = G.diff(b).subs(b=point[1]).subs(a=point[0]).n(10)
        dLda = G.diff(a).subs(b=point[1]).subs(a=point[0]).n(10)
        new_a = point[0] - lr*dLda
        new_b = point[1] - lr*dLdb

        P+=arrow((point[0],point[1]),(new_a,new_b),color='red',width=0.3)
        #plt.arrow(point[0],point[1],-lr*dLda*0.7,-lr*dLdb*0.7,head_width=0.05)
        point = (new_a,new_b)

    P.show()

Introduction to R in SageMath Jupyter IPython Notebook

  1. How to run R commands in SageMath
    • doing linear regression regression using R's builtin lm (linear model) package in SageMath/R
    • installing non-builtin packages, loading libraries and data

Running R in SageMath is "easy as":

  • Use %%r to denote that the Code cell is of language R

First note that SageMath/Python and R kernels will be available in the SageMath Jupyter notebook.

In [20]:
# this is x and y available as numpy arrays in SageMath/Python
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]
print (x)
print (y)
[5.23 5.74 4.93 5.74 5.19 5.46 5.27 5.57 5.12 5.45 5.42 4.05 4.26 4.58
 3.94 4.18 4.18 4.38 4.22 4.42 4.85 5.02 4.66 4.66 4.9  4.39 4.42 5.1
 5.22 4.34 5.62 5.1  5.22 5.18 5.57 4.62 5.06 5.34 5.34 5.54 4.98 4.5 ]
[4.37 4.56 4.26 4.56 4.3  4.46 4.57 4.26 4.37 4.43 4.48 4.01 4.29 4.42
 4.23 4.42 4.23 4.29 4.29 4.42 4.49 4.38 4.42 4.29 4.38 4.22 4.38 4.56
 4.45 4.23 4.62 4.53 4.45 4.53 4.43 4.38 4.45 4.5  4.45 4.55 4.45 4.42]

Assigning to x and y in SageMath/R

We use the assignment operator, <-, in R, as follows:

In [21]:
%%r
x <- c(5.23,  5.74,  4.93,  5.74,  5.19,  5.46,  5.27,  5.57,  5.12,
         5.45,  5.42,  4.05,  4.26,  4.58,  3.94,  4.18,  4.18,  4.38,
         4.22,  4.42,  4.85,  5.02,  4.66,  4.66,  4.9 ,  4.39,  4.42,
         5.1 ,  5.22,  4.34,  5.62,  5.1 ,  5.22,  5.18,  5.57,  4.62,
         5.06,  5.34,  5.34,  5.54,  4.98,  4.5)
y <- c(4.37,  4.56,  4.26,  4.56,  4.3 ,  4.46,  4.57,  4.26,  4.37,
         4.43,  4.48,  4.01,  4.29,  4.42,  4.23,  4.42,  4.23,  4.29,
         4.29,  4.42,  4.49,  4.38,  4.42,  4.29,  4.38,  4.22,  4.38,
         4.56,  4.45,  4.23,  4.62,  4.53,  4.45,  4.53,  4.43,  4.38,
         4.45,  4.5 ,  4.45,  4.55,  4.45,  4.42)
 [1] 4.37 4.56 4.26 4.56 4.30 4.46 4.57 4.26 4.37 4.43 4.48 4.01 4.29 4.42 4.23
[16] 4.42 4.23 4.29 4.29 4.42 4.49 4.38 4.42 4.29 4.38 4.22 4.38 4.56 4.45 4.23
[31] 4.62 4.53 4.45 4.53 4.43 4.38 4.45 4.50 4.45 4.55 4.45 4.42

Doing Linear Regression in SageMath/R

In [22]:
%%r
linearRegressionModel <- lm(formula = y ~ x + I(x^2))

summary(linearRegressionModel)
Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22916 -0.05145  0.01121  0.06263  0.16072 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.87480    1.44471   1.298    0.202
x            0.87443    0.59759   1.463    0.151
I(x^2)      -0.07272    0.06128  -1.187    0.243

Residual standard error: 0.09108 on 39 degrees of freedom
Multiple R-squared:  0.484,	Adjusted R-squared:  0.4576 
F-statistic: 18.29 on 2 and 39 DF,  p-value: 2.491e-06

Running R in SageMath is "easy as":

Sometimes you need additional R packages.

  • Installing R packages with install.packages(...)

Note: once a package has been installed on a particular machine using install.packages("wantedpackage"), you only need to load that library using library(wantedpackage) when you are using the same machine.

Additional Packages

One often needs several additional packages to run certain desired R commands. Let's get some such packages.

In other words, you don't have to install packages that are already installed, since R automatically finds them in the default location where packages are installed. In the case below, you can see where the package was installed from the following line:

  • Installing package into ‘/some_path_to_where_the_package_is_installed’
In [23]:
%%r
# there will be further dependencies, you may need to recursively install...
#install.packages("Flury")
#library(Flury)
#data(dead.beetles)
<rpy2.rinterface_lib.sexp.NULLType object at 0x33b8d4aa0> [RTYPES.NILSXP]

SageMath/R docs

For example, you can find in the docs a more systematic/programmatic way to assign SageMath/Python objects to SageMath/R objects.