4 Tutorial 3: Simple OLS — Residuals, Assumptions, Unbiasedness, Variance, and Interpretation
We observe an i.i.d. sample \(\{(Y_i, X_i)\}_{i=1}^n\) with \(\sum_{i=1}^n (X_i-\bar X)^2>0\). The simple linear regression model is
\[ Y = \beta_0 + \beta_1 X + u. \]
In the sample, OLS chooses \((\hat\beta_0,\hat\beta_1)\) to minimize the sum of squared residuals:
\[ \min_{\beta_0,\beta_1}\ \sum_{i=1}^n (Y_i-\beta_0-\beta_1X_i)^2. \]
Define the fitted values and residuals:
\[ \hat Y_i = \hat\beta_0+\hat\beta_1X_i, \qquad \hat u_i = Y_i-\hat Y_i. \]
4.1 Part A. What residuals are and what OLS forces them to satisfy
Narrative idea. OLS picks the “best” line (in squared-error sense). Once the line is chosen, each residual \(\hat u_i\) is the part of \(Y_i\) that the line does not explain. The key point is: OLS does not leave residuals arbitrary. The first-order conditions imply exact sample moment conditions—mechanical identities that hold in any dataset whenever you run OLS with an intercept.
We will use the normal equations as facts (we derived them last tutorial), and focus on what they imply.
4.1.1 A1. Normal equations: two sample moment conditions
The OLS normal equations imply:
\[ \boxed{\sum_{i=1}^n \hat u_i = 0} \qquad\text{and}\qquad \boxed{\sum_{i=1}^n X_i \hat u_i = 0.} \]
Interpretation.
- \(\sum \hat u_i=0\) means residuals average to zero: OLS does not systematically over- or under-predict \(Y\) in the sample.
- \(\sum X_i\hat u_i=0\) means residuals are “orthogonal” to \(X\) in the sample: once the slope is chosen, there is no remaining linear association between \(X\) and the residuals.
4.1.2 A2. Orthogonality to centered \(X\): what it really means
4.1.2.1 Question
Show that residuals are also orthogonal to deviations of \(X\) around its mean:
\[ \boxed{\sum_{i=1}^n (X_i-\bar X)\hat u_i = 0.} \]
4.1.2.2 Solution
Start from the identity:
\[ \sum_{i=1}^n (X_i-\bar X)\hat u_i = \sum_{i=1}^n X_i\hat u_i - \bar X\sum_{i=1}^n \hat u_i. \]
By the normal equations (A1), \(\sum X_i\hat u_i=0\) and \(\sum \hat u_i=0\). Therefore the right-hand side is \(0-\bar X\cdot 0=0\), so:
\[ \boxed{\sum_{i=1}^n (X_i-\bar X)\hat u_i = 0.} \]
Interpretation. After fitting the line, observations with above-average \(X\) are not systematically above/below the line relative to observations with below-average \(X\).
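All three identities are easy to verify numerically. A minimal sketch on simulated data (the data-generating values are illustrative; any dataset fit by `lm` with an intercept works):

```r
set.seed(1)
n <- 50
X <- rnorm(n)
Y <- 1 + 2 * X + rnorm(n)

fit  <- lm(Y ~ X)          # OLS with an intercept
uhat <- resid(fit)

# All three sums should be zero up to floating-point error
c(
  sum_u        = sum(uhat),
  sum_Xu       = sum(X * uhat),
  sum_centered = sum((X - mean(X)) * uhat)
)
```

Each entry should print as something on the order of `1e-15`, i.e. zero up to machine precision.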
4.1.3 A3 Prove that the OLS regression line passes through \((\bar{X}, \bar{Y})\)
Step 1: Write the fitted value equation
For each observation \(i\), the fitted value is:
\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \]
Step 2: Compute the mean of the fitted values
Sum all \(\hat{Y}_i\) and divide by \(n\):
\[ \bar{\hat{Y}} = \frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\beta}_0 + \hat{\beta}_1 X_i \right) \]
Step 3: Split the summation
\[ \bar{\hat{Y}} = \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_0 + \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_1 X_i \]
Since \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are constants, they factor out:
\[ \bar{\hat{Y}} = \hat{\beta}_0 + \hat{\beta}_1 \frac{1}{n} \sum_{i=1}^{n} X_i = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} \]
Step 4: Substitute \(\hat{\beta}_0\)
Recall that the OLS intercept estimator is:
\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]
Substituting:
\[ \bar{\hat{Y}} = \left( \bar{Y} - \hat{\beta}_1 \bar{X} \right) + \hat{\beta}_1 \bar{X} \]
Step 5: Simplify
The terms \(-\hat{\beta}_1 \bar{X}\) and \(+\hat{\beta}_1 \bar{X}\) cancel out:
\[ \bar{\hat{Y}} = \bar{Y} \]
Step 6: Geometric interpretation
Evaluating the fitted line at \(X = \bar{X}\):
\[ \hat{Y}(\bar{X}) = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} = \bar{Y} \]
That is, when \(X\) equals its mean, the line predicts exactly \(\bar{Y}\).
Conclusion: The OLS regression line always passes through the point \((\bar{X}, \bar{Y})\). \(\blacksquare\)
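A quick numerical confirmation: evaluating the fitted line at \(\bar X\) via `predict` should return \(\bar Y\) exactly (a sketch on simulated data):

```r
set.seed(1)
n <- 60
X <- rnorm(n, mean = 5)
Y <- 2 + 3 * X + rnorm(n)

fit <- lm(Y ~ X)

# The fitted line evaluated at Xbar should equal Ybar (up to rounding)
pred_at_mean <- predict(fit, newdata = data.frame(X = mean(X)))
c(prediction = unname(pred_at_mean), Ybar = mean(Y))
```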
4.1.4 A4. Prove the ANOVA decomposition: \(TSS = ESS + RSS\)
4.1.4.1 Context
In regression analysis we want to know how much of the total variation in \(Y\) is explained by the model. To answer this, we decompose the total variation into two parts: one attributed to the fitted line and one to the residuals. This decomposition is the foundation of the \(R^2\) statistic and the \(F\)-test in regression.
We define:
\[ TSS=\sum_{i=1}^n (Y_i-\bar Y)^2,\quad ESS=\sum_{i=1}^n (\hat Y_i-\bar Y)^2,\quad RSS=\sum_{i=1}^n \hat u_i^2. \]
- TSS (Total Sum of Squares): measures the total variability of \(Y\) around its mean.
- ESS (Explained Sum of Squares): measures how much of that variability is captured by the fitted values \(\hat{Y}_i\).
- RSS (Residual Sum of Squares): measures the leftover variability not captured by the model.
The key identity we start from is:
\[ Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + \hat{u}_i \]
This simply says: the deviation of each observation from the mean equals the deviation explained by the model plus the residual.
4.1.4.2 Question
(a) Square both sides of the identity above and sum over all \(i = 1, \dots, n\). Write the result in terms of \(TSS\), \(ESS\), \(RSS\), and a cross-term.
(b) Using the result from A2 (\(\sum (X_i - \bar{X})\hat{u}_i = 0\)) and from A3 (\(\hat{Y}_i - \bar{Y} = \hat{\beta}_1(X_i - \bar{X})\)), show that the cross-term equals zero and conclude:
\[ \boxed{TSS = ESS + RSS.} \]
4.1.4.3 Solution
(a) Squaring and summing:
\[ \sum_{i=1}^n (Y_i-\bar Y)^2 = \sum_{i=1}^n (\hat Y_i-\bar Y)^2 + \sum_{i=1}^n \hat u_i^2 + 2\sum_{i=1}^n (\hat Y_i-\bar Y)\hat u_i \]
That is:
\[ TSS = ESS + RSS + 2\sum_{i=1}^n (\hat Y_i-\bar Y)\hat u_i \]
(b) We need to show the cross-term is zero.
Step 1. From A3, the regression line passes through \((\bar{X}, \bar{Y})\), so:
\[ \hat Y_i - \bar Y = (\hat\beta_0 + \hat\beta_1 X_i) - (\hat\beta_0 + \hat\beta_1 \bar X) = \hat\beta_1(X_i - \bar X) \]
Step 2. Substitute into the cross-term:
\[ \sum_{i=1}^n (\hat Y_i - \bar Y)\hat u_i = \hat\beta_1 \sum_{i=1}^n (X_i - \bar X)\hat u_i \]
Step 3. By the result from A2, \(\sum_{i=1}^n (X_i - \bar X)\hat u_i = 0\), so the entire cross-term vanishes.
Step 4. Substituting back:
\[ TSS = ESS + RSS + 2 \cdot 0 = ESS + RSS \]
\[ \boxed{TSS = ESS + RSS} \]
4.1.4.4 Interpretation
This result tells us that the total variation in \(Y\) can be cleanly split into two non-overlapping parts: what the model explains (\(ESS\)) and what it does not (\(RSS\)). This clean split is what makes the coefficient of determination meaningful:
\[ R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} \]
Without the cross-term being zero, this decomposition would not hold, and \(R^2\) would lose its interpretation as the proportion of variance explained.
4.1.4.5 Computational check in R
This chunk verifies the ANOVA decomposition numerically in a simulated dataset.
set.seed(123)
n <- 80
X <- rnorm(n, mean = 2, sd = 1)
u <- rnorm(n, mean = 0, sd = 2)
Y <- 1 + 1.5 * X + u
fit <- lm(Y ~ X)
uhat <- resid(fit)
yhat <- fitted(fit)
TSS <- sum((Y - mean(Y))^2)
ESS <- sum((yhat - mean(Y))^2)
RSS <- sum(uhat^2)
c(
cross_term = sum((yhat - mean(Y)) * uhat),
TSS_minus_ESS_RSS = TSS - ESS - RSS
)
## cross_term TSS_minus_ESS_RSS
## 2.914335e-15 1.136868e-13
Both values should be essentially zero (up to floating-point precision), confirming the decomposition.
4.2 Part B. The Population Regression Function (PRF)
4.2.1 Why do we need this?
In Part A we worked entirely with sample data: we found formulas for \(\hat{\beta}_0\) and \(\hat{\beta}_1\), proved algebraic properties of residuals, and showed the ANOVA decomposition \(TSS = ESS + RSS\). All of that was purely mechanical — it holds for any dataset, with no assumptions about where the data came from.
But the ultimate goal of regression is not just to draw a line through a particular sample. We want to learn something about the underlying population. This raises fundamental questions:
- What exactly is OLS trying to estimate?
- Under what conditions do our sample estimates \(\hat{\beta}_0, \hat{\beta}_1\) actually tell us something true about the world?
- When can we trust these estimates, and when might they mislead us?
To answer these questions, we need to define the population counterpart of the sample regression — the Population Regression Function. Part B builds the theoretical framework that will allow us, in later sections, to prove that OLS is unbiased and to understand when it can fail.
The figure below makes this concrete. The grey cloud is your data. The red bell curves show how \(Y\) is distributed within each slice of \(X\) — these distributions shift as \(X\) changes. The blue line connects their centres. That line is the PRF.
![The Population Regression Function (blue line) connects the conditional means E[Y|X=x] at every value of x. The red curves show the conditional distribution of Y given X. OLS estimates this line from sample data.](bookdownproj_files/figure-html/prf-diagram-1.png)
Figure 4.1: The Population Regression Function (blue line) connects the conditional means E[Y|X=x] at every value of x. The red curves show the conditional distribution of Y given X. OLS estimates this line from sample data.
4.2.2 B1.a Definition (concept)
The Population Regression Function is:
\[ m(x) \equiv \mathbb{E}[Y \mid X = x]. \]
Interpretation. For each value of \(x\), \(m(x)\) is the average of \(Y\) among all units in the population with \(X = x\). This is the “true” relationship between \(X\) and \(Y\) — the signal we are trying to recover from noisy data.
In many applications we approximate this conditional expectation with a linear function:
\[ \mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x. \]
This is the linear PRF assumption. Here \(\beta_0\) and \(\beta_1\) are fixed, unknown population parameters — the quantities that the sample estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) from Part A are trying to approximate.
4.2.3 B2. Zero conditional mean: what it is and where it comes from
4.2.3.1 Context
In Part A we defined residuals \(\hat{u}_i = Y_i - \hat{Y}_i\) and showed they satisfy convenient algebraic properties (summing to zero, being orthogonal to \(X\)). Those were sample properties that hold by construction.
Now we ask: does the population error term \(u\) satisfy analogous properties? The answer is yes — but not by construction. It follows from the linear PRF assumption. This result, called the zero conditional mean condition, is the single most important assumption in OLS theory: it is what makes our estimators unbiased.
4.2.3.2 Question
Assume the PRF is linear:
\[ \mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x. \]
Define the population error term:
\[ u \equiv Y - (\beta_0 + \beta_1 X). \]
Show that the linear PRF implies:
\[ \boxed{\mathbb{E}[u \mid X] = 0.} \]
4.2.3.3 Solution
Step 1. Apply the conditional expectation to the definition of \(u\):
\[ \mathbb{E}[u \mid X] = \mathbb{E}[Y - (\beta_0 + \beta_1 X) \mid X] = \mathbb{E}[Y \mid X] - (\beta_0 + \beta_1 X) \]
Step 2. Substitute the linear PRF, \(\mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X\):
\[ \mathbb{E}[u \mid X] = (\beta_0 + \beta_1 X) - (\beta_0 + \beta_1 X) = 0 \]
\[ \boxed{\mathbb{E}[u \mid X] = 0} \]
Interpretation. After accounting for the linear effect of \(X\), the remaining component \(u\) has no systematic pattern left — on average, it is zero regardless of the value of \(X\). Compare this with Part A, where we showed \(\sum \hat{u}_i = 0\) and \(\sum X_i \hat{u}_i = 0\) by algebra. Here, the analogous population property holds because of the PRF assumption, not by construction.
4.2.4 B3. What zero conditional mean implies (useful corollaries)
4.2.4.1 Context
The condition \(\mathbb{E}[u \mid X] = 0\) is a conditional statement — it says something about \(u\) for every possible value of \(X\). This is a strong requirement, and it automatically implies weaker unconditional properties. These unconditional properties connect back to the moment conditions we saw in Part A (recall: \(\sum \hat{u}_i = 0\) and \(\sum X_i \hat{u}_i = 0\) were the sample analogues).
4.2.4.2 Question
Assuming \(\mathbb{E}[u \mid X] = 0\), show that:
- \(\boxed{\mathbb{E}[u] = 0}\)
- \(\boxed{\operatorname{Cov}(X, u) = 0}\)
4.2.4.3 Solution
1) By the law of iterated expectations:
\[ \mathbb{E}[u] = \mathbb{E}\big[\mathbb{E}[u \mid X]\big] = \mathbb{E}[0] = 0 \]
2) Start from the definition:
\[ \operatorname{Cov}(X, u) = \mathbb{E}[Xu] - \mathbb{E}[X]\,\mathbb{E}[u] \]
Since \(\mathbb{E}[u] = 0\) from part 1, it suffices to show \(\mathbb{E}[Xu] = 0\):
\[ \mathbb{E}[Xu] = \mathbb{E}\big[\mathbb{E}[Xu \mid X]\big] = \mathbb{E}\big[X \cdot \mathbb{E}[u \mid X]\big] = \mathbb{E}[X \cdot 0] = 0 \]
Therefore \(\operatorname{Cov}(X, u) = 0\).
Interpretation. These are the population analogues of the Part A results:
| Part A (sample, by construction) | Part B (population, by assumption) |
|---|---|
| \(\sum \hat{u}_i = 0\) | \(\mathbb{E}[u] = 0\) |
| \(\sum X_i \hat{u}_i = 0\) | \(\operatorname{Cov}(X, u) = 0\) |
The sample properties hold automatically for any OLS fit. The population properties require the zero conditional mean assumption. This parallel is not a coincidence — OLS is designed so that its sample moment conditions mimic the population ones.
4.2.5 B4. Mean independence vs zero conditional mean (and why we care)
4.2.5.1 Context
Students sometimes encounter different versions of the “no relationship between \(u\) and \(X\)” assumption. Here we clarify the two most common ones and their logical relationship.
4.2.5.2 B4.a Definitions
Zero conditional mean: \[ \mathbb{E}[u \mid X] = 0 \]
Mean independence: \[ \mathbb{E}[u \mid X] = \mathbb{E}[u] \]
Mean independence says the conditional mean of \(u\) does not depend on \(X\) at all; zero conditional mean additionally pins that constant to zero.
4.2.5.3 Question
- If \(\mathbb{E}[u] = 0\), show that mean independence implies zero conditional mean.
- In 2 lines: which is stronger, and why?
4.2.5.4 Solution
1) If mean independence holds, then \(\mathbb{E}[u \mid X] = \mathbb{E}[u]\). If also \(\mathbb{E}[u] = 0\):
\[ \mathbb{E}[u \mid X] = 0 \]
2) Zero conditional mean is stronger: it implies mean independence (with \(\mathbb{E}[u]=0\)), while mean independence alone allows \(\mathbb{E}[u \mid X]\) to equal any constant, not necessarily zero. In a model with an intercept the distinction is minor, since a nonzero constant mean of \(u\) can be absorbed into \(\beta_0\); zero conditional mean is the form needed for OLS unbiasedness.
4.2.7 Quick simulation intuition in R
This chunk is just to build intuition: it shows that when we generate data with \(\mathbb{E}[u\mid X]=0\), sample correlation between \(X\) and residual-like noise fluctuates around zero.
set.seed(123)
n <- 500
X <- rnorm(n)
u <- rnorm(n) # independent of X => E[u|X]=0
Y <- 1 + 2*X + u
# Check sample correlation between X and u (should be close to 0 on average)
cor(X, u)
## [1] -0.05193691
4.3 Part C. Unbiasedness of OLS (with solutions)
Narrative idea. Part A gave sample identities (orthogonality) that hold mechanically.
Part B defined the population object we care about (the PRF) and introduced the key assumption \(\mathbb{E}[u\mid X]=0\).
Part C connects the two: we use an algebraic representation of the OLS slope and show that under random sampling and zero conditional mean, OLS is unbiased.
We work with the population model (for each observation \(i\)):
\[ Y_i = \beta_0 + \beta_1 X_i + u_i. \]
Assume i.i.d. sampling and the zero conditional mean assumption:
\[ \mathbb{E}[u_i\mid X_i] = 0. \]
4.3.1 C1. Key representation of the OLS slope
4.3.1.1 Question
Using the known formula for the OLS slope,
\[ \hat\beta_1=\frac{\sum_{i=1}^n (X_i-\bar X)(Y_i-\bar Y)}{\sum_{i=1}^n (X_i-\bar X)^2}, \]
show that:
\[ \boxed{\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n (X_i-\bar X)u_i}{\sum_{i=1}^n (X_i-\bar X)^2}.} \]
4.3.1.2 Solution
Start from the decomposition:
\[ Y_i-\bar Y = \beta_1(X_i-\bar X) + (u_i-\bar u), \]
because \(\bar Y=\beta_0+\beta_1\bar X+\bar u\).
Multiply both sides by \((X_i-\bar X)\) and sum over \(i\):
\[ \sum (X_i-\bar X)(Y_i-\bar Y) = \beta_1\sum (X_i-\bar X)^2 + \sum (X_i-\bar X)(u_i-\bar u). \]
But \(\sum (X_i-\bar X)\bar u = \bar u\sum (X_i-\bar X)=0\), so
\[ \sum (X_i-\bar X)(u_i-\bar u)=\sum (X_i-\bar X)u_i. \]
Divide both sides by \(S_{xx}=\sum (X_i-\bar X)^2\):
\[ \hat\beta_1 = \beta_1 + \frac{\sum (X_i-\bar X)u_i}{S_{xx}}. \]
Hence the desired representation holds.
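Because a simulation lets us observe the true errors \(u_i\), we can confirm the representation numerically; a minimal sketch (parameter values are illustrative):

```r
set.seed(1)
n <- 100
beta0 <- 1; beta1 <- 2
X <- rnorm(n)
u <- rnorm(n)
Y <- beta0 + beta1 * X + u

b1_hat <- coef(lm(Y ~ X))[2]

# Representation: beta1 + sum((Xi - Xbar) ui) / Sxx
Sxx    <- sum((X - mean(X))^2)
b1_rep <- beta1 + sum((X - mean(X)) * u) / Sxx

c(b1_hat = unname(b1_hat), b1_rep = b1_rep)
```

The two numbers agree up to floating-point error, since the representation is an algebraic identity.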
4.3.2 C2. Unbiasedness of the slope: conditional then unconditional
Narrative idea. Unbiasedness is a statement about repeated sampling. We first show unbiasedness conditional on the observed \(X\) sample, then take expectations again to get unconditional unbiasedness.
4.3.2.1 Question
Show:
\[ \boxed{\mathbb{E}[\hat\beta_1\mid X_1,\dots,X_n]=\beta_1} \quad\Rightarrow\quad \boxed{\mathbb{E}[\hat\beta_1]=\beta_1.} \]
4.3.2.2 Solution
From C1:
\[ \hat\beta_1-\beta_1 = \frac{1}{S_{xx}}\sum_{i=1}^n (X_i-\bar X)u_i, \qquad S_{xx}=\sum (X_i-\bar X)^2. \]
Condition on \(X=(X_1,\dots,X_n)\). The weights \((X_i-\bar X)/S_{xx}\) are constants given \(X\), so
\[ \mathbb{E}[\hat\beta_1-\beta_1\mid X] = \frac{1}{S_{xx}}\sum (X_i-\bar X)\,\mathbb{E}[u_i\mid X]. \]
Under i.i.d. sampling and \(\mathbb{E}[u_i\mid X_i]=0\), we have \(\mathbb{E}[u_i\mid X]=0\) for each \(i\). Therefore
\[ \mathbb{E}[\hat\beta_1-\beta_1\mid X]=0 \quad\Rightarrow\quad \boxed{\mathbb{E}[\hat\beta_1\mid X]=\beta_1.} \]
Finally, apply the Law of Iterated Expectations:
\[ \mathbb{E}[\hat\beta_1] = \mathbb{E}[\mathbb{E}[\hat\beta_1\mid X]] = \mathbb{E}[\beta_1]=\beta_1. \]
4.3.3 C3. Unbiasedness of the intercept
Narrative idea. Once we have unbiasedness of the slope, unbiasedness of the intercept follows from the identity \(\hat\beta_0=\bar Y-\hat\beta_1\bar X\).
4.3.3.1 Question
Show:
\[ \boxed{\mathbb{E}[\hat\beta_0\mid X_1,\dots,X_n]=\beta_0} \quad\Rightarrow\quad \boxed{\mathbb{E}[\hat\beta_0]=\beta_0.} \]
4.3.3.2 Solution
Start from:
\[ \hat\beta_0 = \bar Y - \hat\beta_1\bar X. \]
Using \(\bar Y=\beta_0+\beta_1\bar X+\bar u\),
\[ \hat\beta_0-\beta_0 = \bar u - (\hat\beta_1-\beta_1)\bar X. \]
Condition on \(X\):
- By zero conditional mean and iterated expectations, \(\mathbb{E}[\bar u\mid X]=0\).
- From C2, \(\mathbb{E}[\hat\beta_1-\beta_1\mid X]=0\).
Thus:
\[ \mathbb{E}[\hat\beta_0-\beta_0\mid X]=0 \quad\Rightarrow\quad \boxed{\mathbb{E}[\hat\beta_0\mid X]=\beta_0.} \]
Unconditional unbiasedness follows by iterated expectations as before.
4.3.5 Quick simulation intuition in R
This chunk illustrates unbiasedness in repeated samples when \(\mathbb{E}[u\mid X]=0\). The sample average of \(\hat\beta_1\) across many simulations should be close to the true \(\beta_1\).
set.seed(123)
B <- 2000
n <- 200
beta0 <- 1
beta1 <- 2
b1_hat <- numeric(B)
for (b in 1:B) {
X <- rnorm(n)
u <- rnorm(n) # independent => E[u|X]=0
Y <- beta0 + beta1 * X + u
b1_hat[b] <- coef(lm(Y ~ X))[2]
}
c(
mean_b1_hat = mean(b1_hat),
true_beta1 = beta1
)
## mean_b1_hat true_beta1
## 1.999375 2.000000
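The same repeated-sampling check applies to the intercept (C3); a self-contained sketch with the same design:

```r
set.seed(123)
B <- 2000
n <- 200
beta0 <- 1
beta1 <- 2
b0_hat <- numeric(B)
for (b in 1:B) {
  X <- rnorm(n)
  u <- rnorm(n)                 # independent => E[u|X] = 0
  Y <- beta0 + beta1 * X + u
  b0_hat[b] <- coef(lm(Y ~ X))[1]
}
c(mean_b0_hat = mean(b0_hat), true_beta0 = beta0)
```

The average of \(\hat\beta_0\) across simulations should be very close to the true \(\beta_0 = 1\).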
4.4 Part D. Sampling variance of OLS and estimating \(\sigma^2\) (with solutions)
Narrative idea. In Part C we showed OLS is unbiased under i.i.d. sampling and zero conditional mean.
Part D asks a different question: how variable is OLS across samples? That is, what is the variance of \(\hat\beta_1\) and \(\hat\beta_0\)?
To get clean formulas, we add a variance assumption (homoskedasticity) and a condition ruling out correlation of the errors across observations.
We maintain the model:
\[ Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad \mathbb{E}[u_i\mid X_i]=0. \]
4.4.1 D1. Assumptions for the classical variance formulas
To derive simple variance expressions, assume:
Homoskedasticity \[ \operatorname{Var}(u_i\mid X_i)=\sigma^2 \quad \text{(constant in } X_i\text{)}. \]
No conditional correlation across observations \[ \operatorname{Cov}(u_i,u_j\mid X_1,\dots,X_n)=0 \quad (i\neq j). \]
Define: \[ S_{xx} \equiv \sum_{i=1}^n (X_i-\bar X)^2. \]
4.4.2 D2. Conditional variance of the slope
4.4.2.1 Question
Using the representation from Part C,
\[ \hat\beta_1-\beta_1 = \frac{\sum_{i=1}^n (X_i-\bar X)u_i}{S_{xx}}, \]
show that:
\[ \boxed{\operatorname{Var}(\hat\beta_1\mid X_1,\dots,X_n)=\frac{\sigma^2}{S_{xx}}.} \]
4.4.2.2 Solution
Condition on the full regressor sample \(X=(X_1,\dots,X_n)\). Then \(S_{xx}\) and \((X_i-\bar X)\) are constants. Compute:
\[ \operatorname{Var}(\hat\beta_1\mid X) = \operatorname{Var}\left(\frac{1}{S_{xx}}\sum (X_i-\bar X)u_i \Bigm| X\right) = \frac{1}{S_{xx}^2}\operatorname{Var}\left(\sum (X_i-\bar X)u_i \Bigm| X\right). \]
Using conditional uncorrelatedness across \(i\):
\[ \operatorname{Var}\left(\sum (X_i-\bar X)u_i \mid X\right) = \sum (X_i-\bar X)^2\operatorname{Var}(u_i\mid X) = \sum (X_i-\bar X)^2\sigma^2 = \sigma^2 S_{xx}. \]
Therefore:
\[ \operatorname{Var}(\hat\beta_1\mid X) = \frac{1}{S_{xx}^2}(\sigma^2 S_{xx}) = \boxed{\frac{\sigma^2}{S_{xx}}.} \]
Interpretation. The slope is more precise when (i) noise is smaller (\(\sigma^2\) small) and/or (ii) \(X\) has more spread (\(S_{xx}\) large).
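The formula \(\sigma^2/S_{xx}\) can be checked by Monte Carlo: hold the regressor sample fixed, redraw homoskedastic errors many times, and compare the simulated variance of \(\hat\beta_1\) with the formula. A sketch (design values are illustrative):

```r
set.seed(123)
n <- 100
B <- 3000
sigma <- 2
X   <- rnorm(n)                   # held fixed across replications
Sxx <- sum((X - mean(X))^2)

b1_hat <- numeric(B)
for (b in 1:B) {
  u <- rnorm(n, sd = sigma)       # homoskedastic, uncorrelated errors
  Y <- 1 + 2 * X + u
  b1_hat[b] <- coef(lm(Y ~ X))[2]
}

c(mc_variance = var(b1_hat), formula = sigma^2 / Sxx)
```

The two values should agree up to Monte Carlo noise (a few percent for this many replications).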
4.4.3 D3. Conditional variance of the intercept
4.4.3.1 Question
Show that:
\[ \boxed{\operatorname{Var}(\hat\beta_0\mid X) = \sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{S_{xx}}\right).} \]
4.4.3.2 Solution
Use: \[ \hat\beta_0 = \bar Y - \hat\beta_1\bar X. \]
From the model, \(\bar Y = \beta_0+\beta_1\bar X+\bar u\), so:
\[ \hat\beta_0-\beta_0 = \bar u - (\hat\beta_1-\beta_1)\bar X. \]
Condition on \(X\). Then \(\bar X\) is constant, and we compute:
\[ \operatorname{Var}(\hat\beta_0\mid X) = \operatorname{Var}(\bar u\mid X) + \bar X^2\operatorname{Var}(\hat\beta_1\mid X) -2\bar X\operatorname{Cov}(\bar u,\hat\beta_1\mid X). \]
Under the classical assumptions, \(\operatorname{Var}(\bar u\mid X)=\sigma^2/n\). Also we already have \(\operatorname{Var}(\hat\beta_1\mid X)=\sigma^2/S_{xx}\).
It remains to show the covariance term is zero. Using \[ \hat\beta_1-\beta_1 = \frac{1}{S_{xx}}\sum (X_i-\bar X)u_i, \qquad \bar u = \frac{1}{n}\sum u_i, \] the covariance is proportional to: \[ \operatorname{Cov}\left(\sum u_i,\ \sum (X_i-\bar X)u_i \mid X\right) = \sum (X_i-\bar X)\operatorname{Var}(u_i\mid X) = \sigma^2 \sum (X_i-\bar X)=0. \]
Hence: \[ \operatorname{Var}(\hat\beta_0\mid X) = \frac{\sigma^2}{n}+\bar X^2\frac{\sigma^2}{S_{xx}} = \boxed{\sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{S_{xx}}\right).} \]
4.4.4 D4. Estimating \(\sigma^2\): the residual variance estimator
Define residuals:
\[ \hat u_i = Y_i-\hat\beta_0-\hat\beta_1X_i. \]
4.4.4.1 Question
Propose an estimator of \(\sigma^2\) based on the residuals, and justify the degrees-of-freedom correction.
4.4.4.2 Solution
The standard estimator is:
\[ \boxed{\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^n \hat u_i^2.} \]
Why \(n-2\)? Two parameters \((\beta_0,\beta_1)\) were estimated. The residuals are constrained by the two normal equations (Part A), so the remaining free variation used to estimate \(\sigma^2\) corresponds to \(n-2\) degrees of freedom.
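R's `lm` uses exactly this \(n-2\) divisor; a sketch confirming that the manual estimator matches `sigma(fit)^2` and that the residual degrees of freedom are \(n-2\):

```r
set.seed(123)
n <- 120
X <- rnorm(n)
Y <- 1 + 2 * X + rnorm(n, sd = 3)

fit  <- lm(Y ~ X)
uhat <- resid(fit)

# Manual estimator with the n - 2 divisor
sigma2_hat <- sum(uhat^2) / (n - 2)

# sigma(fit) is lm's residual standard error, built with the same divisor
c(manual = sigma2_hat, from_lm = sigma(fit)^2, df = df.residual(fit))
```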
4.4.5 D5. Estimated variance and standard errors of OLS
Plug in \(\hat\sigma^2\):
\[ \boxed{\widehat{\operatorname{Var}}(\hat\beta_1\mid X)=\frac{\hat\sigma^2}{S_{xx}}} \qquad\Rightarrow\qquad \boxed{\text{s.e.}(\hat\beta_1)=\sqrt{\frac{\hat\sigma^2}{S_{xx}}}}. \]
and
\[ \boxed{\widehat{\operatorname{Var}}(\hat\beta_0\mid X)=\hat\sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{S_{xx}}\right)} \qquad\Rightarrow\qquad \boxed{\text{s.e.}(\hat\beta_0)=\sqrt{\hat\sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{S_{xx}}\right)}}. \]
Interpretation. Standard errors translate sampling variability into a scale that allows inference (confidence intervals and t-tests).
4.4.6 Quick check in R
set.seed(123)
n <- 200
X <- rnorm(n, mean = 2, sd = 1.5)
u <- rnorm(n, mean = 0, sd = 2)
Y <- 1 + 1.5*X + u
fit <- lm(Y ~ X)
# Manual pieces for the classical formulas
uhat <- resid(fit)
sigma2_hat <- sum(uhat^2)/(n-2)
Sxx <- sum((X - mean(X))^2)
se_b1_manual <- sqrt(sigma2_hat / Sxx)
se_b0_manual <- sqrt(sigma2_hat * (1/n + mean(X)^2 / Sxx))
c(
se_b0_lm = summary(fit)$coef[1,2],
se_b0_manual = se_b0_manual,
se_b1_lm = summary(fit)$coef[2,2],
se_b1_manual = se_b1_manual
)
## se_b0_lm se_b0_manual se_b1_lm se_b1_manual
## 0.2437701 0.2437701 0.1000182 0.1000182
4.5 Part E. Functional form and units: interpreting coefficients correctly (with solutions)
Narrative idea. Even if OLS is unbiased and we know its variance, we still need to interpret coefficients correctly.
A coefficient is a number with units, and the functional form you choose (levels vs logs, scaling) determines the meaning of “one unit increase.”
We use one running example:
- \(Y\) = weekly earnings (dollars)
- \(X\) = hours worked per week
4.5.1 E1. Units of the slope in the level–level model
Consider:
\[ Y = \beta_0 + \beta_1 X + u. \]
4.5.1.1 Question
- What are the units of \(\beta_1\)?
- Give the economic interpretation of \(\beta_1\) in words.
4.5.1.2 Solution
\(Y\) is dollars and \(X\) is hours, so \(\beta_1\) has units dollars per hour.
\(\beta_1\) is the change in expected weekly earnings associated with one additional hour worked per week, holding other unobservables in \(u\) fixed in the conditional-mean sense (under \(\mathbb{E}[u\mid X]=0\)).
4.5.2 E2. Rescaling regressors: why coefficients change mechanically
Define a rescaled regressor:
\[ X^{(10)} \equiv \frac{X}{10}. \]
4.5.2.1 Question
If we regress \(Y\) on \(X^{(10)}\), how does the slope change? Relate \(\beta_1^{(10)}\) to \(\beta_1\).
4.5.2.2 Solution
Since \(X = 10X^{(10)}\), substitute into the original model:
\[ Y = \beta_0 + \beta_1(10X^{(10)}) + u = \beta_0 + (10\beta_1)X^{(10)} + u. \]
So:
\[ \boxed{\beta_1^{(10)} = 10\beta_1.} \]
Interpretation. A one-unit increase in \(X^{(10)}\) is a 10-hour increase in \(X\), so the slope scales accordingly.
4.5.3 E3. Rescaling outcomes: what changes and what does not
Define \(Y^{(1000)} \equiv Y/1000\) (earnings in “thousands of dollars”).
4.5.3.1 Question
If we regress \(Y^{(1000)}\) on \(X\), how do the intercept and slope change? Relate \((\beta_0^{(1000)}, \beta_1^{(1000)})\) to \((\beta_0, \beta_1)\).
4.5.3.2 Solution
Divide the entire equation by 1000:
\[ \frac{Y}{1000} = \frac{\beta_0}{1000} + \frac{\beta_1}{1000}X + \frac{u}{1000}. \]
Thus:
\[ \boxed{\beta_0^{(1000)} = \beta_0/1000, \quad \beta_1^{(1000)} = \beta_1/1000.} \]
Interpretation. Changing the units of the dependent variable rescales coefficients, but does not change the underlying relationship—only the measurement scale.
4.5.4 E4. Log–level model: interpreting semi-elasticities
Consider:
\[ \ln(Y) = \gamma_0 + \gamma_1 X + v. \]
4.5.4.1 Question
Interpret \(\gamma_1\). Give a rule-of-thumb interpretation for small \(\gamma_1\).
4.5.4.2 Solution
\(\gamma_1\) is a semi-elasticity: it measures the change in log earnings from a one-unit increase in \(X\).
For small \(\gamma_1\):
\[ \Delta \ln(Y) \approx \frac{\Delta Y}{Y}. \]
So, approximately:
\[ \boxed{\text{A 1-unit increase in }X\text{ is associated with about }100\gamma_1\%\text{ change in }Y.} \]
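A simulation makes the rule of thumb concrete: generate data with a known semi-elasticity and recover it by regressing \(\ln(Y)\) on \(X\) (a sketch; the true value 0.02 and the earnings setup are illustrative):

```r
set.seed(123)
n <- 500
X <- rnorm(n, mean = 40, sd = 5)             # hours worked (illustrative)
Y <- exp(5 + 0.02 * X + rnorm(n, sd = 0.1))  # true semi-elasticity: 0.02

gamma1 <- coef(lm(log(Y) ~ X))[2]

# Rule of thumb: one extra hour ~ 100 * gamma1 percent more earnings
c(gamma1_hat = unname(gamma1), percent_per_hour = unname(100 * gamma1))
```

The estimated \(\hat\gamma_1\) should be close to 0.02, i.e. roughly a 2% earnings increase per additional hour.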
4.5.5 E5. Log–log model: interpreting elasticities
Consider:
\[ \ln(Y) = \delta_0 + \delta_1 \ln(X) + e. \]
Here \(\delta_1\) is an elasticity: a 1% increase in \(X\) is associated with approximately a \(\delta_1\)% change in \(Y\), since \(\Delta\ln(Y) \approx \delta_1 \Delta\ln(X)\) and log differences approximate proportional changes.
4.5.6 E6. Functional form as a modeling choice (concept check)
4.5.6.1 Question
When and why might you prefer modeling \(\ln(Y)\) rather than \(Y\) in levels?
4.5.6.2 Solution
Logs are often preferred when variation in \(Y\) is roughly proportional to its level (e.g., earnings), which can make relationships closer to linear in logs and can reduce heteroskedasticity. Logs also lead to percent-change interpretations that are often more meaningful than “dollar changes” across very different income levels.
4.5.7 Small R demo: same data, different scales
set.seed(123)
n <- 200
X <- rnorm(n, mean = 40, sd = 5) # hours per week
u <- rnorm(n, mean = 0, sd = 50)
Y <- 200 + 15*X + u # dollars per week
fit_level <- lm(Y ~ X)
X10 <- X/10
fit_rescaleX <- lm(Y ~ X10)
Yk <- Y/1000
fit_rescaleY <- lm(Yk ~ X)
c(
b1_level = coef(fit_level)[2],
b1_rescaleX = coef(fit_rescaleX)[2],
b1_rescaleY = coef(fit_rescaleY)[2]
)
## b1_level.X b1_rescaleX.X10 b1_rescaleY.X
## 14.70745558 147.07455583 0.01470746
You should see approximately:
- b1_rescaleX ≈ 10 * b1_level
- b1_rescaleY ≈ b1_level / 1000