Simple linear regression

```Simple linear regression
January 20, 2015
1
Simple linear regression
The first type of model, which we will spend a lot of time on, is the simple linear regresssion model. One
simple way to think of it is via scatter plots. Below are heights of mothers and daughters collected by Karl
Pearson in the late 19th century.
In [2]: heights_fig = plt.figure(figsize=(10,6))
axes = heights_fig.gca()
axes.scatter(M, D, c=’red’)
axes.set_xlabel("Mother’s height (inches)", size=20)
axes.set_ylabel("Daughter’s height (inches)", size=20)
Out[2]: <matplotlib.text.Text at 0x10cf1ce10>
A simple linear regression model fits a line through the above scatter plot in a particular way. Specifically,
it tries to estimate the height of a new daughter in this population, say Dnew , whose mother had height
Hnew . It does this by considering each slice of the data. Here is a slice of the data near M = 66, the slice is
taken over a window of size 1 inch.
1
In [3]: X = 66
from matplotlib import mlab
xf, yf = mlab.poly_between([X-.5,X+.5], [50,50], [75, 75])
selected_points = (M <= X+.5) * (M >= X-.5)
mean_within_slice = D[selected_points].mean()
scatterplot_slice = axes.fill(xf, yf, facecolor=’blue’, alpha=0.1, hatch=’/’)[0]
axes.scatter([X],[mean_within_slice], s=130, c=’yellow’, marker=’^’)
heights_fig
Out[3]:
In [4]: print mean_within_slice
65.1733333333
We see that, in our sample, the average height of daughters whose height fell within our slice is about
65.2 inches. Of course this height varies by slice. For instance, at 60 inches:
In [5]: X = 60
selected_points = (M <= X+.5) * (M >= X-.5)
mean_within_slice = D[selected_points].mean()
print mean_within_slice
62.4282894737
The regression model puts a line through this scatter plot in an optimal fashion.
In [6]: %%R -o slope,intercept
parameters = lm(D ~ M)\$coef
print(parameters)
intercept = parameters[1]
slope = parameters[2]
2
(Intercept)
29.917437
M
0.541747
In [7]: axes.plot([M.min(), M.max()], [intercept + slope * M.min(), intercept + slope * M.max()],
linewidth=3)
heights_fig
Out[7]:
1.0.1
What is a “regression” model?
A regression model is a model of the relationships between some covariates (predictors) and an outcome.
Specifically, regression is a model of the average outcome given the covariates.
1.0.2
Mathematical formulation
For height of couples data: a mathematical model:
Daughter = f (Mother) + ε
where f gives the average height of the daughter of a mother of height Mother and ε is the random variation
within the slice.
1.0.3
Linear regression models
• A linear regression model says that the function f is a sum (linear combination) of functions of Mother.
3
• Simple linear regression model:
f (Mother) = β0 + β1 · Mother
for some unknown parameter vector (β0 , β1 ).
• Could also be a sum (linear combination) of fixed functions of Mother:
f (Mother) = β0 + β1 · Mother + β2 · Mother2
1.0.4
Simple linear regression model
• Simple linear regression is the case when there is only one predictor:
f (Mother) = β0 + β1 · Mother.
• Let Yi be the height of the i-th daughter in the sample, Xi be the height of the i-th mother.
• Model:
Yi =
β0 + β1 Xi
| {z }
regression equation
+ εi
|{z}
error
2
where εi ∼ N (0, σ ) are independent.
• This specifies a distribution for the Y ’s given the X’s, i.e. it is a statistical model.
1.0.5
Fitting the model
• We will be using least squares regression. This measures the goodness of fit of a line by the sum of
squared errors, SSE.
• Least squares regression chooses the line that minimizes
SSE(β0 , β1 ) =
n
X
(Yi − β0 − β1 · Xi )2 .
i=1
• In principle, we might measure goodness of fit differently:
n
X
|Yi − β0 − β1 · Xi |.
i=1
• For some loss function L we might try to minimize
L(β0 , β1 ) =
n
X
L(Yi − β0 − β1 Xi )
i=1
1.0.6
Why least squares?
• With least squares, the minimizers have explicit formulae – not so important with today’s computer
power – especially when L is convex.
• Resulting formulae are linear in the outcome Y . This is important for inferential reasons. For only
predictive power, this is also not so important.
• If assumptions are correct, then this is maximum likelihood estimation.
• Statistical theory tells us the maximum likelihood estimators (MLEs) are generally good estimators.
4
1.0.7
Choice of loss function
The choice of the function we use to measure goodness of fit, or the loss function, has an outcome on what
sort of estimates we get out of our procedure. For instance, if, instead of fitting a line to a scatterplot, we
were estimating a center of a distribution, which we denote by µ, then we might consider minimizing several
loss functions.
• If we choose the sum of squared errors:
SSE(µ) =
n
X
(Yi − µ)2 .
i=1
Then, we know that the minimizer of SSE(µ) is the sample mean.
• On the other hand, if we choose the sum of the absolute errors
n
X
|Yi − µ|.
i=1
Then, the resulting minimizer is the sample median.
• Both of these minimization problems also have population versions as well. For instance, the population
mean minimizes, as a function of µ
E((Y − µ)2 )
while the population median minimizes
E(|Y − µ|).
1.0.8
Visualizing the loss function
Let’s take some a random scatter plot and view the loss function.
In [8]: X = np.random.standard_normal(50)
Y = np.random.standard_normal(50) * 2 + 1.5 + 0.1 * X
plt.scatter(X, Y)
5
We’ve briefly seen how to fit a model in R. Let’s take a look at how we might fit it in python directly.
Now let’s plot the loss as a function of the parameters. Note that the true intercept is 1.5 while the true
slope is 0.1.
In [11]: squared_error_loss
Out[11]:
Let’s contrast this with the sum of absolute errors.
In [13]: absolute_error_loss
Out[13]:
6
1.0.9
Geometry of least squares
The following picture will be with us, in various guises, throughout much of the course. It depicts the
geometric picture involved in least squares regression.
It requires some imagination but the picture should be thought as representing vectors in n-dimensional
space, l where n is the number of points in the scatterplot. In our height data, n = 1375. The bottom
two axes should be thought of as 2-dimensional, while the axis marked “⊥” should be thought of as (n − 2)
dimensional, or, 1373 in this case.
1.1
Important lengths
The (squared) lengths of the above vectors are important quantities in what follows.
7
There are three to note:
SSE =
n
n
X
X
(Yi − Ybi )2 =
(Yi − βb0 − βb1 Xi )2
i=1
i=1
n
n
X
X
(Y − βb0 − βb1 Xi )2
SSR =
(Y − Ybi )2 =
i=1
i=1
n
X
SST =
(Yi − Y )2 = SSE + SSR
i=1
R2 =
SSE
SSR
d X , Y )2 .
=1−
= Cor(X
SST
SST
An important summary of the fit is the ratio
R2 =
SSE
SSR
=1−
SST
SST
which measures how much variability in Y is explained by X.
1.2
Example: wages vs. experience
In this example, we’ll look at the output of lm for the wage data and verify that some of the equations
we present for the least squares solutions agree with the output. The data was compiled from a study in
econometrics Learning about Heterogeneity in Returns to Schooling.
In [14]: %%R
url = ’http://stats191.stanford.edu/data/wage.csv’
mean(logwage)
education logwage
1 16.75000 2.845000
2 15.00000 2.446667
3 10.00000 1.560000
4 12.66667 2.099167
5 15.00000 2.490000
6 15.00000 2.330833
In order to access the variables in wages we attach it so that the variables are in the toplevel namespace.
In [15]: %%R
attach(wages)
mean(logwage)
[1] 2.240279
Let’s fit the linear regression model.
In [16]: %%R
wages.lm = lm(logwage ~ education)
print(wages.lm)
8
Call:
lm(formula = logwage ~ education)
Coefficients:
(Intercept)
1.2392
education
0.0786
As in the mother-daughter data, we might want to plot the data and add the regression line.
In [17]: %%R -h 800 -w 800
plot(education, logwage, pch=23, bg=’red’, cex=2, cex.lab=3)
abline(wages.lm, lwd=4, col=’black’)
9
1.2.1
Least squares estimators
There are explicit formulae for the least squares estimators, i.e. the minimizers of the error sum of squares.
For the slope, βˆ1 , it can be shown that
Pn
d
(Xi − X)(Yi − Y )
Cov(X,
Y)
b
=
β1 = i=1
.
Pn
2
d
V ar(X)
i=1 (Xi − X)
Knowing the slope estimate, the intercept estimate can be found easily:
βb0 = Y − βb1 · X.
Wages example
In [18]: %%R
beta.1.hat = cov(education, logwage) / var(education)
beta.0.hat = mean(logwage) - beta.1.hat * mean(education)
print(c(beta.0.hat, beta.1.hat))
print(coef(wages.lm))
[1] 1.23919433 0.07859951
(Intercept)
education
1.23919433 0.07859951
1.2.2
Estimate of σ 2
There is one final quantity needed to estimate all of our parameters in our (statistical) model for the
scatterplot. This is σ 2 , the variance of the random variation within each slice (the regression model assumes
this variance is constant within each slice. . . ).
The estimate most commonly used is
n
σ
ˆ2 =
SSE
1 X
(Yi − βˆ0 − βˆ1 Xi )2 =
= M SE
n − 2 i=1
n−2
Above, note the practice of replacing the quantity SSE(βˆ0 , βˆ1 ), i.e. the minimum of this function, with
just SSE.
The term MSE above refers to mean squared error: a sum of squares divided by what we call its degrees
of freedom. The degrees of freedom of SSE, the error sum of squares is therefore n − 2. Remember this n − 2
corresponded to ⊥ in the picture above. . .
Using some statistical calculations that we will not dwell on, if our simple linear regression model is
correct, then we can see that
χ2n−2
σ
ˆ2
∼
σ2
n−2
where the right hand side denotes a chi-squared distribution with n − 2 degrees of freedom.
Wages example
In [19]: %%R
sigma.hat = sqrt(sum(resid(wages.lm)^2) / wages.lm\$df.resid)
sigma.hat
[1] 0.4037828
10
The summary from R also contains this estimate of σ:
In [20]: %%R
summary(wages.lm)
Call:
lm(formula = logwage ~ education)
Residuals:
Min
1Q
-1.78239 -0.25265
Median
0.01636
3Q
0.27965
Max
1.61101
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.239194
0.054974
22.54
<2e-16 ***
education
0.078600
0.004262
18.44
<2e-16 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4038 on 2176 degrees of freedom
F-statistic:
340 on 1 and 2176 DF, p-value: < 2.2e-16
2
Inference for the simple linear regression model
2.0.3
What do we mean by inference?
• Generally, by inference, we mean “learning something about the relationship between the sample
(X1 , . . . , Xn ) and (Y1 , . . . , Yn ).”
• In the simple linear regression model, this often means learning about β0 , β1 . Particular forms of
inference are confidence intervals or hypothesis tests. More on these later.
• Most of the questions of inference in this course can be answered in terms of t-statistics or F -statistics.
• First we will talk about t-statistics, later F -statistics.
2.0.4
Examples of (statistical) hypotheses
• One sample problem: given an independent sample X = (X1 , . . . , Xn ) where Xi ∼ N (µ, σ 2 ), the null
hypothesis H0 : µ = µ0 says that in fact the population mean is some specified value µ0 .
• Two sample problem: given two independent samples Z = (Z1 , . . . , Zn ), W = (W1 , . . . , Wm ) where
Zi ∼ N (µ1 , σ 2 ) and Wi ∼ N (µ2 , σ 2 ), the null hypothesis H0 : µ1 = µ2 says that in fact the population
means from which the two samples are drawn are identical.
2.0.5
Testing a hypothesis
We test a null hypothesis, H0 based on some test statistic T whose distribution is fully known when H0 is
true.
11
¯ is the sample mean of our sample (X1 , . . . , Xn ) and
For example, in the one-sample problem, if X
n
S2 =
1 X
¯ 2
(Xi − X)
n − 1 i=1
is the sample variance. Then
T =
¯ − µ0
X
√
S/ n
has what is called a [Student’s t](http://en.wikipedia.org/wiki/Student’s t-distribution) distribution with
n − 1 degrees of freedom when H0 : µ = µ0 is true. When the null hypothesis is not true, it does not have
this distribution!
2.0.6
General form of a (Student’s) T statistic
• A t statistic with k degrees of freedom, has a form that becomes easy to recognize after seeing it several
times.
p
• It has two main parts: a numerator and a denominator. The numerator Z ∼ N (0, 1) while D ∼ χ2k /k
that is assumed independent of Z.
• The t-statistic has the form
T =
Z
.
D
• Another form of the t-statistic is
T =
estimate of parameter − true parameter
.
accuracy of the estimate
• In more formal terms, we write this as
T =
θˆ − θ
.
ˆ
SE(θ)
Note that the denominator is the accuracy of the estimate and not the true parameter (which is usually
assumed fixed, at least for now). The term SE or standard error will, in this course, usually refer to
an estimate of the accuracy of estimator. Therefore, it is often the square root of an estimate of the
variance of an estimator.
• In our simple linear regression model, a natural t-statistic is
βˆ1 − β1
.
SE(βˆ1 )
We’ve seen how to compute βˆ1 , we never get to see the true β1 , so the only quantity we have anything
left to say about is the standard error SE(βˆ1 ).
• How many degrees of freedom would this T have?
2.0.7
Comparison of Student’s t to normal distribution
As the degrees of freedom increases, the population histogram, or density, of the Tk distribution looks more
and more like the standard normal distribution usually denoted by N (0, 1).
In [22]: density_fig
Out[22]:
12
This change in the density has an effect on the rejection rule for hypothesis tests based on the Tk
distribution. For instance, for the standard normal, the 5% rejection rule is to reject if the so-called Z-score
is larger than about 2 in absolute value.
In [24]: density_fig
Out[24]:
13
For the T10 distribution, however, this rule must be modified.
In [26]: density_fig
Out[26]:
2.0.8
One sample problem revisited
Above, we used the one sample problem as an example of a t-statistic. Let’s be a little more specific.
• Given an independent sample X = (X1 , . . . , Xn ) where Xi ∼ N (µ, σ 2 ) we can test H0 : µ = 0 using a
T -statistic.
• We can prove that the random variables
X ∼ N (µ, σ 2 /n),
2
χ2
SX
∼ n−1
2
σ
n−1
are independent.
• Therefore, whatever the true µ is
√
X −µ
(X − µ)/(σ/ n)
√ =
∼ tn−1 .
SX /σ
SX / n
• Our null hypothesis specifies a particular value for µ, i.e. 0. Therefore, under H0 : µ = 0 (i.e. assuming
that H0 is true),
√
X/(SX / n) ∼ tn−1 .
14
2.0.9
Confidence interval
The following are examples of confidence intervals you may have already seen.
• One sample problem: instead of deciding whether µ = 0, we might want to come up with an (random)
interval [L, U ] based on the sample X such that the probability the true (nonrandom) µ is contained
in [L, U ] equal to 1 − α, i.e. 95%.
• Two sample problem: find a (random) interval [L, U ] based on the sampl es Z and W such that the
probability the true (nonrandom) µ1 − µ2 is contained in [L, U ] is equal to 1 − α, i.e. 95%.
2.0.10
Confidence interval for one sample problem
• In the one sample problem, we might be interested in a confidence interval for the unknown µ.
• Given an independent sample (X1 , . . . , Xn ) where Xi ∼ N (µ, σ 2 ) we can test construct a (1 − α) ∗ 100%
using the numerator and denominator of the t-statistic.
• Let q = tn−1,(1−α/2)
µ−X
√ ≤q
1 − α = P −q ≤
SX / n
√ √
= P −q · SX / n ≤ µ − X ≤ q · SX / n
√
√ = P X − q · SX / n ≤ µ ≤ X + q · SX / n
√
• Therefore, the interval X ± q · SX / n is a (1 − α) ∗ 100% confidence interval for µ.
2.0.11
Inference for β0 or β1
• Recall our model
Yi = β0 + β1 Xi + εi ,
errors εi are independent N (0, σ 2 ).
• In our heights example, we might want to now if there really is a linear association between Daughter =
Y and Mother = X. This can be answered with a hypothesis test of the null hypothesis H0 : β1 = 0.
This assumes the model above is correct, but that β1 = 0.
• Alternatively, we might want to have a range of values that we can be fairly certain β1 lies within.
This is a confidence interval for β1 .
2.0.12
Geometric picture of test
The hypothesis test has a geometric interpretation which we will revisit later for other models. It is a
comparison of two models. The first model is our original model.
The second model is the null model in which we have set β1 = 0. This model says that
Yi = β0 + εi .
This model says that the mean of the Y ’s is unrelated to that of X.
Strictly speaking, we should write Yi |X on the left hand side as this is a model of the Yi ’s given the entire
set of X observations. If the pairs of mothers and daughters are drawn from some population independently
than we may write Yi |Xi .
15
2.0.13
Setup for inference
• Let L be the subspace of Rn spanned 1 = (1, . . . , 1) and X = (X1 , . . . , X n ).
• Then,
Y = PL Y + (Y − PL Y ) = Yb + (Y − Yb ) = Yb + e
• In our model µ = β01 + β1 X ∈ L so that
Yb = µ + PL ε,
e = PL ⊥ Y = PL ⊥ ε
• Our assumption that εi ’s are independent N (0, σ 2 ) tells us that: e and Yb are independent; σ
b2 =
2
2
2
kek /(n − 2) ∼ σ · χn−2 /(n − 2).
• In turn, this implies
βb1 ∼ N
σ2
β1 , Pn
2
i=1 (Xi − X)
.
• Therefore,
βb1 − β1
σ
q
1
2
i=1 (Xi −X)
∼ N ( 0, 1).
Pn
• The other quantity we need is the standard error or SE of βˆ1 . This is obtained from estimating the
variance of βb1 , which, in this case means simply plugging in our estimate of σ, yielding
s
1
SE(βb1 ) = σ
b Pn
independent of βb1
2
(X
i − X)
i=1
2.0.14
Testing H0 : β1 = β10
• Suppose we want to test that β1 is some pre-specified value, β10 (this is often 0: i.e. is there a linear
association)
• Under H0 : β1 = β10
βb1 − β10
σ
b
q
1
2
i=1 (Xi −X)
=
σ
b
σ
Pn
·σ
βb − β10
q1
1
2
i=1 (Xi −X)
∼ tn−2 .
Pn
• Reject H0 : β1 = β10 if |T | > tn−2,1−α/2 .
Wage example Let’s perform this test for the wage data.
In [27]: %%R
SE.beta.1.hat = (sigma.hat * sqrt(1 / sum((education - mean(education))^2)))
Tstat = beta.1.hat / SE.beta.1.hat
data.frame(beta.1.hat, SE.beta.1.hat, Tstat)
beta.1.hat SE.beta.1.hat
Tstat
1 0.07859951
0.004262471 18.43989
Let’s look at the output of the lm function again.
In [28]: %%R
summary(wages.lm)
16
Call:
lm(formula = logwage ~ education)
Residuals:
Min
1Q
-1.78239 -0.25265
Median
0.01636
3Q
0.27965
Max
1.61101
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.239194
0.054974
22.54
<2e-16 ***
education
0.078600
0.004262
18.44
<2e-16 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4038 on 2176 degrees of freedom
F-statistic:
340 on 1 and 2176 DF, p-value: < 2.2e-16
We see that R performs this test in the second row of the Coefficients table. It is clear that wages are
correlated with education.
2.0.15
Why reject for large |T|?
• Observing a large |T | is unlikely if β1 = β10 : reasonable to conclude that H0
is false.
• Common to report p-value:
P(|Tn−2 | > |T |) = 2P(Tn−2 > |T |)
In [29]: %%R
2*(1 - pt(Tstat, wages.lm\$df.resid))
[1] 0
In [30]: %%R
detach(wages)
2.0.16
Confidence interval based on Student’s t distribution
b such that
• Suppose we have a parameter estimate θb ∼ N (θ, σθ2 ), and standard error SE(θ)
θb − θ
∼ tν .
b
SE(θ)
• We can find a (1 − α) · 100% confidence interval by:
b · tν,1−α/2 .
θb ± SE(θ)
• To prove this, expand the absolute value as we did for the one-sample CI
!
θb − θ 1−α=P < tν,1−α/2 .
b
SE(θ)
17
2.0.17
Confidence interval for regression parameters
• Applying the above to the parameter β1 yields a confidence interval of the form
βˆ1 ± SE(βˆ1 ) · tn−2,1−α/2 .
• We will need to compute SE(βˆ1 ). This can be computed using this formula
v
u 2
ua
(a0 X − a1 )2
ˆ
ˆ
SE(a0 β0 + a1 β1 ) = σ
ˆ t 0 + Pn
2 .
n
Xi − X
i=1
We also need to find the quantity tn−2,1−α/2 . This is defined by
P(Tn−2 ≥ tn−2,1−α/2 ) = α/2.
In R, this is computed by the function qt.
In [31]: %%R
alpha = 0.05
n = length(M)
qt(1-0.5*alpha,n-2)
[1] 1.961693
Not surprisingly, this is close to that of the normal distribution, which is a Student’s t with ∞ for degrees
of freedom.
In [32]: %%R
qnorm(1-0.5*alpha)
[1] 1.959964
We will not need to use these explicit formulae all the time, as R has some built in functions to compute
confidence intervals.
In [33]: %%R
L = beta.1.hat - qt(0.975, wages.lm\$df.resid) * SE.beta.1.hat
U = beta.1.hat + qt(0.975, wages.lm\$df.resid) * SE.beta.1.hat
data.frame(L, U)
L
U
1 0.07024057 0.08695845
In [34]: %%R
confint(wages.lm)
2.5 %
97.5 %
(Intercept) 1.13138690 1.34700175
education
0.07024057 0.08695845
18
2.0.18
Predicting the mean
Once we have estimated a slope (βˆ1 ) and an intercept (βˆ0 ), we can predict the height of the daughter born to
a mother of any particular height by the plugging-in the height of the new mother, Mnew into our regression
equation:
ˆ new = βˆ0 + βˆ1 Mnew .
D
This equation says that our best guess at the height of the new daughter born to a mother of height Mnew is
ˆ new . Does this say that the height will be exactly this value? No, there is some random variation in each
D
slice, and we would expect the same random variation for this new daughter’s height as well.
We might also want a confidence interval for the average height of daughters born to a mother of height
Mnew = 66 inches:
βˆ0 + 66 · βˆ1 ± SE(βˆ0 + 66 · βˆ1 ) · tn−2,1−α/2 .
Recall that the parameter of interest is the average within the slice. Let’s look at our picture again:
In [35]: heights_fig
Out[35]:
In [36]: %%R
height.lm = lm(D~M)
predict(height.lm, list(M=c(66,60)), interval=’confidence’, level=0.90)
fit
lwr
upr
1 65.67274 65.49082 65.85466
2 62.42226 62.27698 62.56753
19
2.0.19
Forecasting intervals
There is yet another type of interval we might consider: can we find an interval that covers the height of a
particular daughter knowing only that her mother’s height as 66 inches?
This interval has to cover the variability of the new random variation with our slice at 66 inches. So, it
must be at least as wide as σ, and we estimate its width to be at least as wide as σ
ˆ.
In [37]: %%R
predict(height.lm, list(M=66), interval=’prediction’, level=0.90)
fit
lwr
upr
1 65.67274 61.93804 69.40744
In [38]: %%R
(69.41-61.94)
[1] 7.47
With so much data in our heights example, this 90% interval will have width roughly 2 * qnorm(0.95)
* sigma.hat.height.
In [39]: %%R
sigma.hat.height = sqrt(sum(resid(height.lm)^2) / height.lm\$df.resid)
2 * qnorm(0.95) * sigma.hat.height
[1] 7.455501
The actual width will depend on how accurately we have estimated (β0 , β1 ) as well as σ
ˆ . Here is the full
formula. Again it is based on the t distribution, the only thing we need to change is what we use for the SE.
v
u
u
1
(X − Xnew )2
b
b
SE(β0 + β1 Xnew + εnew ) = σ
bt1 + + Pn
2 .
n
Xi − X
i=1
The final interval is
βˆ0 + βˆ1 Xnew ± tn−2,1−α/2 · SE(βˆ0 + βˆ1 Xnew + εnew ).
In [40]:
20
```