©2021 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)
Regression is a method for studying the relationship between a response variable $Y$ and a covariate $X$. The covariate is also called a feature or a predictor variable.
A simple way to summarise the relationship between $X$ and $Y$ is through the regression function $r(x)$:
$$ r(x) = E(Y | X=x) = \int y \, f(y|x) dy $$Our objective is to estimate the regression function $r(x)$ from data of the form:
$$ (Y_1,X_1),(Y_2,X_2),\ldots,(Y_n,X_n) \overset{IID}{\sim} F_{X,Y} $$We assume that $F_{X,Y}$, the joint distribution of $X$ and $Y$, is parametric and $r$ is linear.
The simple linear regression model is when $X_i$ is real-valued (one-dimensional) and $r(x)$ is assumed to be linear:
$$ r(x) = \beta_0 + \beta_1 x, \qquad \text{and } \quad V(Y | X=x)=\sigma^2 \, \text{ is independent of } x $$Thus simple linear regression model is the following:
$$ \boxed{ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad \text{ where, } \quad E(\epsilon_i | X_i)=0 \text{ and } V(\epsilon_i | X_i)=\sigma^2 } $$The unknown parameters and their estimates in the model are:
The fitted line is: $$ \widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x $$
The fitted or predicted values are: $$ \widehat{Y}_i = \widehat{r}(X_i) $$
The residuals are: $$ \widehat{\epsilon}_i = Y_i-\widehat{Y}_i=Y_i-\left(\widehat{\beta}_0 + \widehat{\beta}_1 X_i\right) $$
The residual sum of squares or RSS, that measures how well the line fits the data, is defined by $$ RSS = \sum_{i=1}^n \widehat{\epsilon}_i^2 $$
The least squares estimates are the values $\widehat{\beta}_0$ and $\widehat{\beta}_1$ that minimise $RSS$ and they are given by:
$$ \boxed{ \widehat{\beta}_1 = \displaystyle{\frac{\sum_{i=1}^n(X_i-\overline{X}_n)(Y_i-\overline{Y}_n)}{\sum_{i=1}^n(X_i-\overline{X}_n)^2}} \, , \qquad \widehat{\beta}_0 = \displaystyle{\overline{Y}_n - \widehat{\beta}_1 \overline{X}_n} \, , \qquad \widehat{\sigma}^2 = \displaystyle{\left(\frac{1}{n-2}\right) \sum_{i=1}^n \widehat{\epsilon}_i^2} } $$Check out:
Suppose we add the assumption about the model's noise that
$$\boxed{\displaystyle{\epsilon_i | X_i \sim Normal(0,\sigma^2) \quad \text{ i.e., }\quad Y_i|X_i \sim Normal(\mu_i,\sigma^2), \quad \text{ where } \quad \mu_i = \beta_0+\beta_1 X_i }}$$Then, the total likelihood function is:
$$ \begin{aligned} \displaystyle{\prod_{i=1}^n f(X_i,Y_i)} \, &= \displaystyle{\prod_{i=1}^n f_X(X_i) \, f_{Y|X}(Y_i|X_i)}\\ &= \displaystyle{\prod_{i=1}^n f_X(X_i) \, \prod_{i=1}^n f_{Y|X}(Y_i|X_i)}\\ &=: L_{n,X} \, L_{n,Y|X} \end{aligned} $$where, $L_{n,X}:=\prod_{i=1}^n f_X(X_i)$ is the marginal likelihood of $X_1,\ldots,X_n$ that does not depend on the parameters $(\beta_0,\beta_1,\sigma)$, and $L_{n,Y|X}:=\prod_{i=1}^n f_{Y|X}(Y_i|X_i)$ is the conditional likelihood that does depend on the parameters. Therefore the conditional likelihood is the one we consider:
$$ \begin{aligned} L(\beta_0,\beta_1,\sigma) \quad &\propto \quad \displaystyle{\prod_{i=1}^n f(X_i,Y_i)} \\ &\propto \quad L_{n,Y|X} = \displaystyle{\prod_{i=1}^n f_{Y|X}(Y_i|X_i)}\\ &\propto \quad \displaystyle{\sigma^{-n} \exp\left(-\frac{1}{2 \sigma^2}\sum_{i=1}^n\left(Y_i-\mu_i\right)^2 \right)}\\ \end{aligned} $$and the conditional log-likelihood is:
$$ \boxed{ l(\beta_0,\beta_1,\sigma) \quad =\quad \displaystyle{-n \log(\sigma) -\frac{1}{2 \sigma^2} \sum_{i=1}^n\left(Y_i-\mu_i\right)^2 } } $$To find the MLE of $(\beta_0,\beta_1)$ we need to maximise $\ell(\beta_0,\beta_1,\sigma)$ for a given $\sigma$. From the above expresion it is clear that maximising the log-likelihood is equivalent to minimising the residual sum of squares or RSS given by
$$ \boxed{ \sum_{i=1}^n\left(Y_i-\mu_i\right)^2 } $$Therefore, we have shown the following Theorem.
Under the assumption of normally distributed noise, the maximum likelihood estimator (MLE) is the least squares estimator (LSE).
We can maximise $l(\beta_0,\beta_1,\sigma)$ over $\sigma$ and obtain the MLE for $\sigma$ as follows:
$$ \widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \ \widehat{\epsilon}^2 \, . $$But it is more common in practise to use the unbiased estimator, with $E(\widehat{\sigma}^2)=\sigma^2$, that we saw earlier for sample size $n>2$:
$$ \widehat{\sigma}^2 = \displaystyle{\left(\frac{1}{n-2}\right) \sum_{i=1}^n \widehat{\epsilon}_i^2} \, . $$However the bias is usually quite small if $n$ is large.
It's finally time to obtain the standard errors and limititng distribution of the least quares estimator (also the MLE).
In regression we are interested in the properties of the estimators conditional on the covariates
$$X_{1:n}:= (X_1,X_2,\ldots,X_n)$$Let $\widehat{\beta}^T=(\widehat{\beta}_0,\widehat{\beta}_1)^T$ denote the least squares estimators (which is also the MLE). Then
$$ \begin{aligned} E \left(\widehat{\beta} \, | \, X_{1:n} \right) &= \displaystyle{\left( {\begin{array}{c} \beta_0 \\ \beta_1 \\ \end{array} } \right)}\\ V \left(\widehat{\beta} \, | \, X_{1:n} \right) &= \displaystyle{\frac{\sigma^2}{n s_X^2} \left( {\begin{array}{cc} \frac{1}{n}\sum_{i=1}^n X_i^2 & -\overline{X}_n \\ -\overline{X}_n & 1\\ \end{array} } \right)} \end{aligned} $$where,
$$ s_X^2 = \frac{1}{n} \sum_{i=1}^n \left(X_i -\overline{X}_n\right)^2 $$The estimated standard errors for $\widehat{\beta}_0$ and $\widehat{\beta}_1$, or more precisely, the estimated standard errors conditional on the covariates, are given by the square-root of the diagonal terms of the variance-covariance matrix $V \left(\widehat{\beta} \, | \, X_{1:n} \right)$ and substituting the estimate $\widehat{\sigma}$ for $\sigma$, as follows:
$$ \begin{aligned} \widehat{se}\left(\widehat{\beta}_0\right) := \widehat{se}\left(\widehat{\beta}_0 \, | \, X_{1:n} \right) \, &= \, \frac{\widehat{\sigma}}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^nX_i^2}{n}}\\ \widehat{se}\left(\widehat{\beta}_1\right) := \widehat{se}\left(\widehat{\beta}_0 \, | \, X_{1:n}\right) \, &= \, \frac{\widehat{\sigma}}{s_X \sqrt{n}} \end{aligned} $$Thus under appropriate modeling assumptions in simple leinear regression we have the following four properties.
As $n \to \infty$, the LSE, i.e. $\widehat{\beta}_0$ and $\widehat{\beta}_1$, converges in probability to the parameters, i.e., $\beta_0,\beta_1$, generating the data $(Y_1,X_1),(Y_2,X_2),\ldots,(Y_n,X_n)$ as summarised below.
$$ \boxed{ \widehat{\beta}_0 \overset{P}{\to} \beta_0 \quad \text{ and } \quad \widehat{\beta}_1 \overset{P}{\to} \beta_1 } $$As $n \to \infty$, the LSE, i.e. $\widehat{\beta}_0$ and $\widehat{\beta}_1$, converges in distribution to the parameters, i.e., $\beta_0,\beta_1$, generating the data $(Y_1,X_1),(Y_2,X_2),\ldots,(Y_n,X_n)$ as summarised below.
$$ \boxed{ \frac{\widehat{\beta}_0 - \beta_0}{\widehat{se}\left(\widehat{\beta}_0\right)} \overset{d}{\to} Normal(0,1) \quad \text{ and } \quad \frac{\widehat{\beta}_1 - \beta_1}{\widehat{se}\left(\widehat{\beta}_1\right)} \overset{d}{\to} Normal(0,1) } $$Note: in the above the exact distribution is actually the Student-t distribution, but nevertheless the normal distribution is a good approximation if $n > 30$.
The $1-\alpha$ confidence interval for $\beta_0$ and $\beta_1$ that is obtained from the approximately normal distribution as $n$ gets large is:
$$ \boxed{ \widehat{\beta}_0 \, \pm \, z_{\alpha/2} \, \widehat{se}\left(\widehat{\beta}_0\right) \quad \text{ and } \quad \widehat{\beta}_1 \, \pm \, z_{\alpha/2} \, \widehat{se}\left(\widehat{\beta}_1\right) } $$Recall Wald test statistic for testing the null hypothesis with the null value $\beta^{(0)}$:
$$ H_0: \beta = \beta^{(0)} \quad \text{ versus } \quad H_1: \beta \neq \beta^{(0)} \quad { is } \quad W = \frac{\left(\widehat{\beta}-\beta^{(0)}\right)}{\widehat{se}\left(\widehat{\beta}\right)} $$Thus the Wald test for testing $H_0: \beta_1=0$ versus $H_1: \beta_1 \neq 0$ is to reject $H_0$ if $|W| > z_{\alpha/2}$ where $W=\frac{\widehat{\beta}_1}{\widehat{se}\left(\widehat{\beta}_1\right)}$.
Using the above formulas we can implement Python functions to calculate the least squares estimates, $\widehat{\beta}_0$ and $\widehat{\beta}_1$, that minimise $RSS$.
import numpy as np
import matplotlib.pyplot as plt
def estimate_coefficients(x, y):
# size of the dataset
n = np.size(x)
# mean of x and y
mean_x, mean_y = np.mean(x), np.mean(y)
# xy cross-deviation and xx deviatio
SS_xy = np.sum(y*x) - n*mean_y*mean_x
SS_xx = np.sum(x*x) - n*mean_x*mean_x
# calculating LSE of regression coefficients
b1_hat = SS_xy / SS_xx
b0_hat = mean_y - b1_hat*mean_x
sigma_hat2 = np.mean((y - (b0_hat + b1_hat * x))^2)
if n>2:
sigma_hat2 = sigma_hat2*n/(n-2)
sigma_hat=np.sqrt(sigma_hat2)
return(b0_hat, b1_hat, sigma_hat)
def standard_errors(x,y):
n = np.size(x)
b0_hat,b1_hat,s_hat = estimate_coefficients(x,y)
mean_x = np.mean(x)
s2X = np.mean( (x-mean_x)^2 )
se_b1 = s_hat/np.sqrt(s2X*n)
se_b0 = se_b1*np.sqrt(np.mean(x^2))
return (se_b0, se_b1)
def plot_regression_line(x, y, b):
# plotting the data points on a graph
plt.scatter(x, y, color = "m",marker = "o", s = 10)
# predicted response vector
y_pred = b[0] + b[1]*x
# plotting the fitted regression line
plt.plot(x, y_pred, color = "b")
# putting generic labels for x and y axis
plt.xlabel('x')
plt.ylabel('y')
# function to show plotted graph
plt.show()
def SimpleLinearRegression(x,y):
# estimating coefficients
b = estimate_coefficients(x, y)
print("Estimated coefficients:\nb0_hat = {} \nb1_hat = {}\nsigma_hat = {}".format(b[0], b[1],b[2]))
# plotting fitted regression line with data
plot_regression_line(x, y, b)
# Datasets for x and y
LSAT=np.array([576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]) # LSAT data
GPA=np.array([3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 3.96]) # GPA data
SimpleLinearRegression(LSAT,GPA)
Estimated coefficients: b0_hat = 0.5998612958803515 b1_hat = 0.004267223518635576 sigma_hat = 0.28404612916999544
We can look at the residuals of the fitted line as follows.
predictedGPA = 0.5998612958803515 + 0.004267223518635576*LSAT
residuals = GPA - predictedGPA
plt.scatter(LSAT, residuals, color = "k",marker = "o", s = 10)
plt.axhline()
# putting generic labels for x and y axis
plt.ylabel('$\\epsilon_i$')
plt.xlabel('LSAT') # draw a y=0 line
plt.show()
# in general we want residuals to be Normally distributes about 0 with the same variance
Looking at the residuals $\epsilon_i$'s in the above plot we can notice how just $4$ of the $15$ datapoints are abov $0$. If $\epsilon_i$ were truly IID $Normal(0,\sigma^2)$, we would expect roughly the same number of points to be spread above and below zero, i.e., the $x$-axis, in an equally likely manner. Also, we would expect more points to be closer to zero and fewer points to be further away.
In conclusion, the residuals of the linear regression of LSAT and GPA do not look like they are normall distributed.
We could try different approaches to improve the model. For example, we could try to increase the sample size or standardise the scales by subtracting the sample mean and dividing by the the sample standard deviation for the $x$ and $y$ values separately and doing regression with the standardised data, etc.
The real wiki page has some simple examples of residual plots and they are useful for insights:
Examples of residual plots are shown in the following figure. (a) is a satisfactory plot with the residuals falling in a horizontal band with no systematic pattern. Such a plot indicates an appropriate regression model. (b) shows residuals falling in a funnel shape. Such a plot indicates increase in variance of residuals and the assumption of constant variance is violated here. Transformation on may be helpful in this case (see Transformations). If the residuals follow the pattern of (c) or (d), then this is an indication that the linear regression model is not adequate. Addition of higher order terms to the regression model or transformation on or may be required in such cases. A plot of residuals may also show a pattern as seen in (e), indicating that the residuals increase (or decrease) as the run order sequence or time progresses. This may be due to factors such as operator-learning or instrument-creep and should be investigated further.)
We can finally obtain 95% confidence intervals for the fitted parameters in the simple linear regression model and do a Wald test as follows.
b0_hat, b1_hat, s_hat = estimate_coefficients(LSAT,GPA)
se_b0,se_b1 = standard_errors(LSAT,GPA)
print ("Estimated standard errors for beta_0_hat and beta_1_hat are:")
print (se_b0,se_b1 )
print ("and the approximate 95% confidence intervals for beta_0_hat is:")
print (" [ ", b0_hat-2*se_b0," , ", b0_hat+2*se_b0, " ]")
print ("and the approximate 95% confidence intervals for beta_1_hat is:")
print (" [ ", b1_hat-2*se_b1," , ", b1_hat+2*se_b1, " ]")
print ("The Wald test for the null hypothesis H0 that beta_1 = 0 is:")
W = (b1_hat-0)/se_b1
if abs(W > 2):
print ("Reject H0 that beta_1=0 at alpha=0.05, since W = ",W)
else:
print ("fail to reject H0 that beta_1=0 at alpha=0.05, since W = ",W)
Estimated standard errors for beta_0_hat and beta_1_hat are: 1.0927735237389298 0.0018163754932013147 and the approximate 95% confidence intervals for beta_0_hat is: [ -1.585685751597508 , 2.785408343358211 ] and the approximate 95% confidence intervals for beta_1_hat is: [ 0.0006344725322329466 , 0.007899974505038206 ] The Wald test for the null hypothesis H0 that beta_1 = 0 is: Reject H0 that beta_1=0 at alpha=0.05, since W = 2.34930692172834
Now, let's suppose that the covariate, feature, predictor or dependent variable is a vector of length $k$. So our data for regression is of the following form:
$$ (Y_1,X_1), (Y_2,X_2), \ldots, (Y_i,X_i), \ldots, (Y_n,X_n) $$where, $X_i$ is a vector of length $k$ for the $i$-th observation or datapoint $(Y_i,X_i)$.
$$ X_i = (X_{i,1},X_{i,2},\ldots,X_{i,k}) \, . $$Then the linear regression model is:
$$ Y_i = \displaystyle{\sum_{j=0}^k \beta_j X_{i,j} + \epsilon_i, \quad \text{ for } i \in \{1,2,\ldots,n\} } $$where $\beta_0$ is the intercept term with $X_{i,0}=1$ for each $i \in \{1,2,\ldots,n\}$ and
$$ E \left( \epsilon_i | X_{1,i}, X_{2,i}, \ldots, X_{k,i} \right) = 0. $$We can denote the model using matrices and vectors more conveniently as follows:
$$ Y = \displaystyle{\left( {\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array} } \right)} \, , \qquad X = \displaystyle{\left( {\begin{array}{cccc} 1& X_{1,1}& \ldots& X_{1,k} \\ 1& X_{2,1}& \ldots& X_{2,k} \\ \vdots & \vdots & \vdots & \vdots\\ 1& X_{n,1}& \ldots& X_{n,k} \end{array} } \right)} \, , \qquad \beta = \displaystyle{\left( {\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{array} } \right)} \, , \qquad \epsilon = \displaystyle{\left( {\begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{array} } \right)} \, . $$With $X \in \mathbb{R}^{n \times (k+1)}$, i.e., $X$ being a $n \times (k+1)$ matrix, $\beta \in \mathbb{R}^{(k+1) \times 1}$, i.e., $\beta$ being a a column vector with $k+1$ rows, and $\epsilon \in \mathbb{R}^{n \times 1}$, i.e., $\epsilon$ being a column vector with $n$ rows, we obtain the multiple regression model:
$$ \boxed{ Y = X \beta + \epsilon \, . } $$Just as in the 1D case with $k=1$, the least sqaures estimate is as follows, under the assumption that $X^T X$ is invertible:
$$ \boxed{ \begin{aligned} \widehat{\beta} &= \left( X^T X\right)^{-1} X^T Y\\ V\left(\widehat{\beta} | X_{1:n} \right) &= \sigma^2 \left( X^T X \right)^{-1} \\ \widehat{\beta} &\approx Normal \left(\beta, \sigma^2 \left( X^T X\right)^{-1} \right) \end{aligned} \, . } $$The estimate of the regression function is:
$$ \boxed{ \widehat{r}(x) = \sum_{j=0}^k \widehat{\beta}_j \, x_j \, . } $$An unbiased estimate of $\sigma^2$ is:
$$ \widehat{\sigma}^2 = \left( \frac{1}{n-(k+1)} \right) \sum_{i=1}^n \widehat{\epsilon}_i^2 \, $$where $\widehat{\epsilon}$ is the vector of residuals:
$$ \boxed{ \widehat{\epsilon}=X \widehat{\beta} - Y } \ , \text{ i.e.,} \quad \widehat{\epsilon} = \displaystyle{\left( {\begin{array}{c} \widehat{\epsilon}_1 \\ \widehat{\epsilon}_2 \\ \vdots \\ \widehat{\epsilon}_n \end{array} } \right)} = \displaystyle{\left( {\begin{array}{cccc} 1& X_{1,1}& \ldots& X_{1,k} \\ 1& X_{2,1}& \ldots& X_{2,k} \\ \vdots & \vdots & \vdots & \vdots\\ 1& X_{n,1}& \ldots& X_{n,k} \end{array} } \right)} \ \displaystyle{\left( {\begin{array}{c} \widehat{\beta}_0 \\ \widehat{\beta}_1 \\ \vdots \\ \widehat{\beta}_k \end{array} } \right)} \ - \displaystyle{\left( {\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array} } \right)} $$An approximate $1-\alpha$ confidence interval for $\beta_j$ is
$$ \boxed{ \widehat{\beta}_j \pm z_{\alpha/2} \widehat{se}(\widehat{\beta}_j) } $$where $\left(\widehat{se}(\widehat{\beta}_j)\right)^2$ is the $j$-th diagonal entry of the matrix $\widehat{\sigma}^2 (X^T X)^{-1}$.
We can use scipy.linalg.lstsq to get the least squares solution to our regression problems quite easily, including generalisation to multiple linear regression when the covariates are in more than 1 dimension.
Let us try to understand the code in the previous cell by learning how to do a least squares fit by setting up the right design matrix.
from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np
# suppose we have the following data
x = np.array([1, 2.5, 3.5, 4, 5, 7, 8.5])
y = np.array([0.3, 1.1, 1.5, 2.0, 3.2, 6.6, 8.6])
#We want to fit a line of the form y = a + b*x to this data. We first form the
#“design matrix” M, with a constant column of 1s and a column containing x
M1 = x[:, np.newaxis]^[0, 1]
M1
array([[1. , 1. ], [1. , 2.5], [1. , 3.5], [1. , 4. ], [1. , 5. ], [1. , 7. ], [1. , 8.5]])
#We want to find the least-squares solution to
#M1.dot(p) = y, where p is a vector with length 2 that holds the parameters a and b.
p, res, rnk, s = lstsq(M1, y)
p
array([-1.93080357, 1.16875 ])
plt.plot(x, y, 'o', label='data')
xx = np.linspace(0, 9, 101)
yy = p[0] + p[1]*xx
plt.plot(xx, yy, label='least squares fit, $y = a + bx$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.show()
Suppose we want to fit a quadratic polynomial of the form $y = a + b*x^2$ to the same data.
Then we first form the design matrix M2
, with a constant column of 1
s and a column containing x^2
as follows:
M2 = x[:, np.newaxis]^[0, 2]
M2
array([[ 1. , 1. ], [ 1. , 6.25], [ 1. , 12.25], [ 1. , 16. ], [ 1. , 25. ], [ 1. , 49. ], [ 1. , 72.25]])
# least square solution with M2
p, res, rnk, s = lstsq(M2, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(0, 9, 101)
yy = p[0] + p[1]*xx^2
plt.plot(xx, yy, label='least squares fit, $y = a + bx$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.show()
Suppose we want to fit a degree-3 polynomial of the form $y = \beta_0 + \beta_1 x + \beta_2 x^2+ \beta_3 x^3$ to the same data.
Then we first form the design matrix M3
, with a constant column of 1
s with x^0
and three additional columns containing x^1
, x^2
and x^3
as follows:
# Fitting a cubic polynolial is the same idea
M3 = x[:, np.newaxis]^[0, 1, 2, 3]
M3
array([[ 1. , 1. , 1. , 1. ], [ 1. , 2.5 , 6.25 , 15.625], [ 1. , 3.5 , 12.25 , 42.875], [ 1. , 4. , 16. , 64. ], [ 1. , 5. , 25. , 125. ], [ 1. , 7. , 49. , 343. ], [ 1. , 8.5 , 72.25 , 614.125]])
p, res, rnk, s = lstsq(M3, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(0, 9, 101)
yy = p[0] + p[1]*xx + p[2]*xx^2 + p[3]*xx^3
plt.plot(xx, yy, label='least squares fit, $y = a + bx$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.show()
Using the lstsq
method shown above, and data arrays x
and y
in the next cell that contain log light intensity and log surface temperature in a give range of measurements from nearby stars, compute the least squares estimates of $\beta_0$ and $\beta_1$ under the simple linear regression model with an intercept and a slope term. Make a plot similar to the one above with the data points and the fitted regression line.
# Sample Exam Problem 8
# do not change this import and data block ########################
from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]
########### end of import and data block ##########################
# Replace only ZZZ by the right values
M1 = ZZZ # design matrix M1
b, res, rnk, s = lstsq(ZZZ,ZZZ)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(ZZZ, ZZZ, 101)
yy = ZZZ *xx
plt.plot(xx, yy, label='least squares fit')
plt.xlabel('log light intensity (X)')
plt.ylabel('log surface temperature (Y)')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.text(4, 4.7, r'$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x, \quad \
\widehat{\beta}_0 = $ %(b0)0.3f , $\widehat{\beta}_1 = $ %(b1)0.3f' % {'b0': b[0], 'b1': b[1]} )
plt.show()
# Sample Exam Problem 8 Solution
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]
from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np
M1 = x[:, np.newaxis]^[0, 1]
b, res, rnk, s = lstsq(M1, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(3.9, 5.8, 101)
yy = b[0] + b[1]*xx
plt.plot(xx, yy, label='least squares fit')
plt.xlabel('log light intensity (X)')
plt.ylabel('log surface temperature (Y)')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.text(4, 4.7, r'$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x, \quad \
\widehat{\beta}_0 = $ %(b0)0.3f , $\widehat{\beta}_1 = $ %(b1)0.3f' % {'b0': b[0], 'b1': b[1]} )
plt.show()
Let's consider the 1D setting for simplicity of notation. Suppose we have estimated a regression model: $$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x $$ from data $(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n)$.
Now suppose we observe the value $X=x_*$ of the covariate of a new observarion but do not observe the response $Y_*$ and want to predict it. An estimate of $Y_*$ is
$$ \boxed{ \widehat{Y}_* = \widehat{\beta}_0 + \widehat{\beta}_1 x_* \, . } $$By the formula for the variance of the sum of two random variables:
$$ V(\widehat{Y}_*) = V(\widehat{\beta}_0 + \widehat{\beta}_1 x_*) = V(\widehat{\beta}_0) + x_*^2 V(\widehat{\beta}_1 ) + 2 x_* Cov (\widehat{\beta}_0,\widehat{\beta}_1) $$We have all the needed terms to compute $V(\widehat{Y}_*)$ from the earlier result on the conditional variance of the least squares estimate:
$$ V \left( \widehat{\beta} \, | \, X_{1:n} \right) = \frac{\sigma^2}{n s_X^2} \left( {\begin{array}{cc} \frac{1}{n}\sum_{i=1}^n X_i^2 & -\overline{X}_n \\ -\overline{X}_n & 1\\ \end{array}} \right) $$The estimated standard error $\widehat{se}(\widehat{Y}_*)$ is just $\sqrt{V(\widehat{Y}_*)}$ with $\widehat{\sigma}^2$ substituted in for $\sigma^2$. An approximate $1-\alpha$ confidence interval for $Y^*$ is called an approximate $1-\alpha$ prediction interval for $Y_*$ and is given by
$$ \boxed{ \widehat{Y}_* \pm z_{\alpha/2} \widehat{\xi}_n \, , \quad \text{ where } \quad \widehat{\xi}^2_n = \widehat{\sigma}^2 \left( \frac{\sum_{i=1}^n (X_i-X_*)^2}{n \sum_{i=1}^n (X_i-\overline{X})^2} + 1 \right) } $$If you are interested, you already have the basic skills to look at the data from Swedish election using these ideas.
Try to model, say the $\log$ of the number of district-level votes for the top two most voted parties.
You can introduce latitude of the district centres (if you have such information from geospatial database you could join), distance of the district to one of the four largest cities in Sweden, or the socio-economic indicators of the district for Swedish Central Statistical Bureau, etc., as covariates.
But this is a good project and beyond current scope (mainly due to time limitations).
Here, we just start you off on the path to more statistical modeling for purposes of prediction. Now statistical learning from the 1970s is needed to mathematically justify the methods.
In the above example with linear regression we wanted to minimize the vertical distance between the fitted line and the data, this vertical distance is a prime example of a loss function. In general when we are faced with a regression problem we want a way of measure how good our model is, this quantity that we want to minimise is called the loss function and its expectation (over different sample data-points is called the risk). The mathematical statistical justification for this approach towards minimising expected loss or risk is called empirical risk minimisation, as we will see in the sequel in more detail.
Let us circle back to linear regression once again. The way the np.argmin
method searched for the minimum of:
was by simply evaluating $L(a,b)$ for each value of $a$ in the array prop_a
with our guessed values for $a$ and picking the $a$ that minimised $L(a,b)$. Recall we fixed $b$ in the search.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
x = np.linspace(-1,1,10)
epsilon = np.random.normal(loc=0,scale=1,size=len(x),)
y = 2*x + 1 + epsilon
# These are our guesses for b
prop_b = np.linspace(0,4,100)
# Compute the residual if we use these b's, remember this is with a = 0,
# as noted above the a can be taken to be the mean of the unadjusted residual for each b
unadj_residual = y.reshape(-1,1)-prop_b.reshape(1,-1)*x.reshape(-1,1)
# Compute the variance of the unadjusted residual
variance_residual = np.std(unadj_residual,axis=0)**2
# Plot the variance for each proposed value of b
plt.plot(prop_b,variance_residual)
# Find the minimum variance
min_index = np.argmin(variance_residual)
# Set a and b based on this minimum
a = np.mean(unadj_residual[:,min_index])
b = prop_b[min_index]
print("Optimal a: %f, optimal b: %f" % (a,b))
Optimal a: 0.902859, optimal b: 1.818182
np.argmin? # see the docstring for np.argmin and other functions/methods we are using throughout if you need to know right away.
This approaching of evaluating the loss at a set of parameter values quickly becomes infeasible when the dimension of the problem is larger than $1$.
Even if we just have two guess for each dimension of the parameter space with $d$ dimensions, then we will need to evaluate the loss at $2^d$ parameter values. When $d=10,100,1000$ the number of evaluation points become $1024$, $1.268e30, 1.072e301$, respectively.
Often in big-data settings, the number of dimensions for the regression problem can easily extend over a few thousands. Thus, we need a systematic way to find the optimal parameters, i.e., the parameters that minimise the loss function.
def L(a,b):
return np.mean((y-(a+b*x))^2)
plot3d(L,(a,0.5,1.5),(b,1.5,2.5))
The iterative solution is called gradient descent and it goes like this:
G(a,b) = L(a,b) # Create a symbolic expression
@interact
def gradient_descent_3d(x_start=input_box(0,type=float,label='x_start'), y_start=input_box(0,type=float,label='y_start'), n_steps=(1,(1,30))):
P=points([])
lr = 0.1
point = (x_start,y_start)
levels = []
for i in range(0,n_steps):
dLdb = G.diff(b).subs(b=point[1]).subs(a=point[0]).n(10)
dLda = G.diff(a).subs(b=point[1]).subs(a=point[0]).n(10)
new_a = point[0] - lr*dLda
new_b = point[1] - lr*dLdb
P+=arrow((point[0],point[1],L(point[0],point[1])),(new_a,new_b,L(new_a,new_b)),color='red',width=0.5)
#plt.arrow(point[0],point[1],-lr*dLda*0.7,-lr*dLdb*0.7,head_width=0.05)
point = (new_a,new_b)
x_min = min(x_start,new_a)-0.1
x_max = max(x_start,new_a)+0.1
y_min = min(y_start,new_b)-0.1
y_max = max(y_start,new_b)+0.1
P+=plot3d(L,(a,x_min,x_max),(b,y_min,y_max))
P.show() # increase the number of steps by toggling n_steps or play with initial values x_start and y_start
import numpy as np
@interact
def gradient_descent_contours(n_steps=(1,(1,30))):
# make a contour plot
P=contour_plot(G, (a, 0.3, 1.5), (b, 1.2, 2.5),fill=False, aspect_ratio=1, contours=20)
lr = 0.2
point = (0.3,1.2)
for i in range(0,n_steps):
dLdb = G.diff(b).subs(b=point[1]).subs(a=point[0]).n(10)
dLda = G.diff(a).subs(b=point[1]).subs(a=point[0]).n(10)
new_a = point[0] - lr*dLda
new_b = point[1] - lr*dLdb
P+=arrow((point[0],point[1]),(new_a,new_b),color='red',width=0.3)
#plt.arrow(point[0],point[1],-lr*dLda*0.7,-lr*dLdb*0.7,head_width=0.05)
point = (new_a,new_b)
P.show()
lm
(linear model) package in SageMath/R%%r
to denote that the Code
cell is of language R
First note that SageMath/Python and R kernels will be available in the SageMath Jupyter notebook.
# this is x and y available as numpy arrays in SageMath/Python
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]
print (x)
print (y)
[5.23 5.74 4.93 5.74 5.19 5.46 5.27 5.57 5.12 5.45 5.42 4.05 4.26 4.58 3.94 4.18 4.18 4.38 4.22 4.42 4.85 5.02 4.66 4.66 4.9 4.39 4.42 5.1 5.22 4.34 5.62 5.1 5.22 5.18 5.57 4.62 5.06 5.34 5.34 5.54 4.98 4.5 ] [4.37 4.56 4.26 4.56 4.3 4.46 4.57 4.26 4.37 4.43 4.48 4.01 4.29 4.42 4.23 4.42 4.23 4.29 4.29 4.42 4.49 4.38 4.42 4.29 4.38 4.22 4.38 4.56 4.45 4.23 4.62 4.53 4.45 4.53 4.43 4.38 4.45 4.5 4.45 4.55 4.45 4.42]
x
and y
in SageMath/R¶We use the assignment operator, <-
, in R, as follows:
%%r
x <- c(5.23, 5.74, 4.93, 5.74, 5.19, 5.46, 5.27, 5.57, 5.12,
5.45, 5.42, 4.05, 4.26, 4.58, 3.94, 4.18, 4.18, 4.38,
4.22, 4.42, 4.85, 5.02, 4.66, 4.66, 4.9 , 4.39, 4.42,
5.1 , 5.22, 4.34, 5.62, 5.1 , 5.22, 5.18, 5.57, 4.62,
5.06, 5.34, 5.34, 5.54, 4.98, 4.5)
y <- c(4.37, 4.56, 4.26, 4.56, 4.3 , 4.46, 4.57, 4.26, 4.37,
4.43, 4.48, 4.01, 4.29, 4.42, 4.23, 4.42, 4.23, 4.29,
4.29, 4.42, 4.49, 4.38, 4.42, 4.29, 4.38, 4.22, 4.38,
4.56, 4.45, 4.23, 4.62, 4.53, 4.45, 4.53, 4.43, 4.38,
4.45, 4.5 , 4.45, 4.55, 4.45, 4.42)
[1] 4.37 4.56 4.26 4.56 4.30 4.46 4.57 4.26 4.37 4.43 4.48 4.01 4.29 4.42 4.23 [16] 4.42 4.23 4.29 4.29 4.42 4.49 4.38 4.42 4.29 4.38 4.22 4.38 4.56 4.45 4.23 [31] 4.62 4.53 4.45 4.53 4.43 4.38 4.45 4.50 4.45 4.55 4.45 4.42
%%r
linearRegressionModel <- lm(formula = y ~ x + I(x^2))
summary(linearRegressionModel)
Call: lm(formula = y ~ x + I(x^2)) Residuals: Min 1Q Median 3Q Max -0.22916 -0.05145 0.01121 0.06263 0.16072 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.87480 1.44471 1.298 0.202 x 0.87443 0.59759 1.463 0.151 I(x^2) -0.07272 0.06128 -1.187 0.243 Residual standard error: 0.09108 on 39 degrees of freedom Multiple R-squared: 0.484, Adjusted R-squared: 0.4576 F-statistic: 18.29 on 2 and 39 DF, p-value: 2.491e-06
Sometimes you need additional R
packages.
install.packages(...)
Note: Once a package is installed on a particular machine using install.packages("wantedpackage")
then you only need to load that library using library(wantedpackage)
when you are using the same machine.
One often needs several additional packages to run certain desired R
commands. Let's get some such packages.
In other words, you don't have to install packages that are already installed and thus can be automatically found by R
in the default location it will be installed at. In the case below, you can see where the package was installed from the following line:
Installing package into ‘/some_path_to_where_the_package_is_installed’
%%r
# there will be further dependencies, you may need to recursively install...
#install.packages("Flury")
#library(Flury)
#data(dead.beetles)
<rpy2.rinterface_lib.sexp.NULLType object at 0x7f29b952dd08> [RTYPES.NILSXP]
For example, you can find in the docs more systematic/programmatic way to assign SageMath/Python objects to SageMath/R objects.