4 Tutorial 3: Simple OLS — Residuals, Assumptions, Unbiasedness, Variance, and Interpretation
We observe an i.i.d. sample \(\{(Y_i, X_i)\}_{i=1}^n\) with \(\sum_{i=1}^n (X_i-\bar X)^2>0\). The simple linear regression model is
\[ Y = \beta_0 + \beta_1 X + u. \]
In the sample, OLS chooses \((\hat\beta_0,\hat\beta_1)\) to minimize the sum of squared residuals:
\[ \min_{\beta_0,\beta_1}\ \sum_{i=1}^n (Y_i-\beta_0-\beta_1X_i)^2. \]
Define the fitted values and residuals:
\[ \hat Y_i = \hat\beta_0+\hat\beta_1X_i, \qquad \hat u_i = Y_i-\hat Y_i. \]
4.1 Part A. What residuals are and what OLS forces them to satisfy
Narrative idea. OLS picks the “best” line (in squared-error sense). Once the line is chosen, each residual \(\hat u_i\) is the part of \(Y_i\) that the line does not explain. The key point is: OLS does not leave residuals arbitrary. The first-order conditions imply exact sample moment conditions—mechanical identities that hold in any dataset whenever you run OLS with an intercept.
We will use the normal equations as facts (we derived them last tutorial), and focus on what they imply.
4.1.1 A1. Normal equations: two sample moment conditions
The OLS normal equations imply:
\[ \boxed{\sum_{i=1}^n \hat u_i = 0} \qquad\text{and}\qquad \boxed{\sum_{i=1}^n X_i \hat u_i = 0.} \]
Interpretation.
- \(\sum \hat u_i=0\) means residuals average to zero: OLS does not systematically over- or under-predict \(Y\) in the sample.
- \(\sum X_i\hat u_i=0\) means residuals are “orthogonal” to \(X\) in the sample: once the slope is chosen, there is no remaining linear association between \(X\) and the residuals.
4.1.2 A2. Orthogonality to centered \(X\): what it really means
4.1.2.1 Question
Show that residuals are also orthogonal to deviations of \(X\) around its mean:
\[ \boxed{\sum_{i=1}^n (X_i-\bar X)\hat u_i = 0.} \]
4.1.2.2 Solution
Start from the identity:
\[ \sum_{i=1}^n (X_i-\bar X)\hat u_i = \sum_{i=1}^n X_i\hat u_i - \bar X\sum_{i=1}^n \hat u_i. \]
By the normal equations (A1), \(\sum X_i\hat u_i=0\) and \(\sum \hat u_i=0\). Therefore the right-hand side is \(0-\bar X\cdot 0=0\), so:
\[ \boxed{\sum_{i=1}^n (X_i-\bar X)\hat u_i = 0.} \]
Interpretation. After fitting the line, observations with above-average \(X\) are not systematically above/below the line relative to observations with below-average \(X\).
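All three identities are easy to verify numerically. A minimal sketch on simulated data (the data-generating values are illustrative; any dataset fit by `lm` with an intercept works):

```r
set.seed(1)
n <- 50
X <- rnorm(n)
Y <- 1 + 2 * X + rnorm(n)

fit  <- lm(Y ~ X)          # OLS with an intercept
uhat <- resid(fit)

# All three sums should be zero up to floating-point error
c(
  sum_u        = sum(uhat),
  sum_Xu       = sum(X * uhat),
  sum_centered = sum((X - mean(X)) * uhat)
)
```

Each entry should print as something on the order of `1e-15`, i.e. zero up to machine precision.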
4.1.3 A3 Prove that the OLS regression line passes through \((\bar{X}, \bar{Y})\)
Step 1: Write the fitted value equation
For each observation \(i\), the fitted value is:
\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \]
Step 2: Compute the mean of the fitted values
Sum all \(\hat{Y}_i\) and divide by \(n\):
\[ \bar{\hat{Y}} = \frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\beta}_0 + \hat{\beta}_1 X_i \right) \]
Step 3: Split the summation
\[ \bar{\hat{Y}} = \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_0 + \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_1 X_i \]
Since \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are constants, they factor out:
\[ \bar{\hat{Y}} = \hat{\beta}_0 + \hat{\beta}_1 \frac{1}{n} \sum_{i=1}^{n} X_i = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} \]
Step 4: Substitute \(\hat{\beta}_0\)
Recall that the OLS intercept estimator is:
\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]
Substituting:
\[ \bar{\hat{Y}} = \left( \bar{Y} - \hat{\beta}_1 \bar{X} \right) + \hat{\beta}_1 \bar{X} \]
Step 5: Simplify
The terms \(-\hat{\beta}_1 \bar{X}\) and \(+\hat{\beta}_1 \bar{X}\) cancel out:
\[ \bar{\hat{Y}} = \bar{Y} \]
Step 6: Geometric interpretation
Evaluating the fitted line at \(X = \bar{X}\):
\[ \hat{Y}(\bar{X}) = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} = \bar{Y} \]
That is, when \(X\) equals its mean, the line predicts exactly \(\bar{Y}\).
Conclusion: The OLS regression line always passes through the point \((\bar{X}, \bar{Y})\). \(\blacksquare\)
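A quick numerical confirmation: evaluating the fitted line at \(\bar X\) via `predict` should return \(\bar Y\) exactly (a sketch on simulated data):

```r
set.seed(1)
n <- 60
X <- rnorm(n, mean = 5)
Y <- 2 + 3 * X + rnorm(n)

fit <- lm(Y ~ X)

# The fitted line evaluated at Xbar should equal Ybar (up to rounding)
pred_at_mean <- predict(fit, newdata = data.frame(X = mean(X)))
c(prediction = unname(pred_at_mean), Ybar = mean(Y))
```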
4.1.4 A4. Prove the ANOVA decomposition: \(TSS = ESS + RSS\)
4.1.4.1 Context
In regression analysis we want to know how much of the total variation in \(Y\) is explained by the model. To answer this, we decompose the total variation into two parts: one attributed to the fitted line and one to the residuals. This decomposition is the foundation of the \(R^2\) statistic and the \(F\)-test in regression.
We define:
\[ TSS=\sum_{i=1}^n (Y_i-\bar Y)^2,\quad ESS=\sum_{i=1}^n (\hat Y_i-\bar Y)^2,\quad RSS=\sum_{i=1}^n \hat u_i^2. \]
- TSS (Total Sum of Squares): measures the total variability of \(Y\) around its mean.
- ESS (Explained Sum of Squares): measures how much of that variability is captured by the fitted values \(\hat{Y}_i\).
- RSS (Residual Sum of Squares): measures the leftover variability not captured by the model.
The key identity we start from is:
\[ Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + \hat{u}_i \]
This simply says: the deviation of each observation from the mean equals the deviation explained by the model plus the residual.
4.1.4.2 Question
(a) Square both sides of the identity above and sum over all \(i = 1, \dots, n\). Write the result in terms of \(TSS\), \(ESS\), \(RSS\), and a cross-term.
(b) Using the result from A2 (\(\sum (X_i - \bar{X})\hat{u}_i = 0\)) and from A3 (\(\hat{Y}_i - \bar{Y} = \hat{\beta}_1(X_i - \bar{X})\)), show that the cross-term equals zero and conclude:
\[ \boxed{TSS = ESS + RSS.} \]
4.1.4.3 Solution
(a) Squaring and summing:
\[ \sum_{i=1}^n (Y_i-\bar Y)^2 = \sum_{i=1}^n (\hat Y_i-\bar Y)^2 + \sum_{i=1}^n \hat u_i^2 + 2\sum_{i=1}^n (\hat Y_i-\bar Y)\hat u_i \]
That is:
\[ TSS = ESS + RSS + 2\sum_{i=1}^n (\hat Y_i-\bar Y)\hat u_i \]
(b) We need to show the cross-term is zero.
Step 1. From A3, the regression line passes through \((\bar{X}, \bar{Y})\), so:
\[ \hat Y_i - \bar Y = (\hat\beta_0 + \hat\beta_1 X_i) - (\hat\beta_0 + \hat\beta_1 \bar X) = \hat\beta_1(X_i - \bar X) \]
Step 2. Substitute into the cross-term:
\[ \sum_{i=1}^n (\hat Y_i - \bar Y)\hat u_i = \hat\beta_1 \sum_{i=1}^n (X_i - \bar X)\hat u_i \]
Step 3. By the result from A2, \(\sum_{i=1}^n (X_i - \bar X)\hat u_i = 0\), so the entire cross-term vanishes.
Step 4. Substituting back:
\[ TSS = ESS + RSS + 2 \cdot 0 = ESS + RSS \]
\[ \boxed{TSS = ESS + RSS} \]
4.1.4.4 Interpretation
This result tells us that the total variation in \(Y\) can be cleanly split into two non-overlapping parts: what the model explains (\(ESS\)) and what it does not (\(RSS\)). This clean split is what makes the coefficient of determination meaningful:
\[ R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} \]
Without the cross-term being zero, this decomposition would not hold, and \(R^2\) would lose its interpretation as the proportion of variance explained.
4.1.4.5 Computational check in R
This chunk verifies the ANOVA decomposition numerically in a simulated dataset.
set.seed(123)
n <- 80
X <- rnorm(n, mean = 2, sd = 1)
u <- rnorm(n, mean = 0, sd = 2)
Y <- 1 + 1.5 * X + u
fit <- lm(Y ~ X)
uhat <- resid(fit)
yhat <- fitted(fit)
TSS <- sum((Y - mean(Y))^2)
ESS <- sum((yhat - mean(Y))^2)
RSS <- sum(uhat^2)
c(
cross_term = sum((yhat - mean(Y)) * uhat),
TSS_minus_ESS_RSS = TSS - ESS - RSS
)
## cross_term TSS_minus_ESS_RSS
## 2.914335e-15 1.136868e-13
Both values should be essentially zero (up to floating-point precision), confirming the decomposition.
4.2 Part B. The Population Regression Function (PRF)
4.2.1 Why do we need this?
In Part A we worked entirely with sample data: we found formulas for \(\hat{\beta}_0\) and \(\hat{\beta}_1\), proved algebraic properties of residuals, and showed the ANOVA decomposition \(TSS = ESS + RSS\). All of that was purely mechanical — it holds for any dataset, with no assumptions about where the data came from.
But the ultimate goal of regression is not just to draw a line through a particular sample. We want to learn something about the underlying population. This raises fundamental questions:
- What exactly is OLS trying to estimate?
- Under what conditions do our sample estimates \(\hat{\beta}_0, \hat{\beta}_1\) actually tell us something true about the world?
- When can we trust these estimates, and when might they mislead us?
To answer these questions, we need to define the population counterpart of the sample regression — the Population Regression Function. Part B builds the theoretical framework that will allow us, in later sections, to prove that OLS is unbiased and to understand when it can fail.
The figure below makes this concrete. The grey cloud is your data. The red bell curves show how \(Y\) is distributed within each slice of \(X\) — these distributions shift as \(X\) changes. The blue line connects their centres. That line is the PRF.
![The Population Regression Function (blue line) connects the conditional means E[Y|X=x] at every value of x. The red curves show the conditional distribution of Y given X. OLS estimates this line from sample data.](bookdownproj_files/figure-html/prf-diagram-1.png)
Figure 4.1: The Population Regression Function (blue line) connects the conditional means E[Y|X=x] at every value of x. The red curves show the conditional distribution of Y given X. OLS estimates this line from sample data.
4.2.2 B1.a Definition (concept)
The Population Regression Function is:
\[ m(x) \equiv \mathbb{E}[Y \mid X = x]. \]
Interpretation. For each value of \(x\), \(m(x)\) is the average of \(Y\) among all units in the population with \(X = x\). This is the “true” relationship between \(X\) and \(Y\) — the signal we are trying to recover from noisy data.
In many applications we approximate this conditional expectation with a linear function:
\[ \mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x. \]
This is the linear PRF assumption. Here \(\beta_0\) and \(\beta_1\) are fixed, unknown population parameters — the quantities that the sample estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) from Part A are trying to approximate.
4.2.3 B2. Zero conditional mean: what it is and where it comes from
4.2.3.1 Context
In Part A we defined residuals \(\hat{u}_i = Y_i - \hat{Y}_i\) and showed they satisfy convenient algebraic properties (summing to zero, being orthogonal to \(X\)). Those were sample properties that hold by construction.
Now we ask: does the population error term \(u\) satisfy analogous properties? The answer is yes — but not by construction. It follows from the linear PRF assumption. This result, called the zero conditional mean condition, is the single most important assumption in OLS theory: it is what makes our estimators unbiased.
4.2.3.2 Question
Assume the PRF is linear:
\[ \mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x. \]
Define the population error term:
\[ u \equiv Y - (\beta_0 + \beta_1 X). \]
Show that the linear PRF implies:
\[ \boxed{\mathbb{E}[u \mid X] = 0.} \]
4.2.3.3 Solution
Step 1. Apply the conditional expectation to the definition of \(u\):
\[ \mathbb{E}[u \mid X] = \mathbb{E}[Y - (\beta_0 + \beta_1 X) \mid X] = \mathbb{E}[Y \mid X] - (\beta_0 + \beta_1 X) \]
Step 2. Substitute the linear PRF, \(\mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X\):
\[ \mathbb{E}[u \mid X] = (\beta_0 + \beta_1 X) - (\beta_0 + \beta_1 X) = 0 \]
\[ \boxed{\mathbb{E}[u \mid X] = 0} \]
Interpretation. After accounting for the linear effect of \(X\), the remaining component \(u\) has no systematic pattern left — on average, it is zero regardless of the value of \(X\). Compare this with Part A, where we showed \(\sum \hat{u}_i = 0\) and \(\sum X_i \hat{u}_i = 0\) by algebra. Here, the analogous population property holds because of the PRF assumption, not by construction.
4.2.4 B3. What zero conditional mean implies (useful corollaries)
4.2.4.1 Context
The condition \(\mathbb{E}[u \mid X] = 0\) is a conditional statement — it says something about \(u\) for every possible value of \(X\). This is a strong requirement, and it automatically implies weaker unconditional properties. These unconditional properties connect back to the moment conditions we saw in Part A (recall: \(\sum \hat{u}_i = 0\) and \(\sum X_i \hat{u}_i = 0\) were the sample analogues).
4.2.4.2 Question
Assuming \(\mathbb{E}[u \mid X] = 0\), show that:
- \(\boxed{\mathbb{E}[u] = 0}\)
- \(\boxed{\operatorname{Cov}(X, u) = 0}\)
4.2.4.3 Solution
1) By the law of iterated expectations:
\[ \mathbb{E}[u] = \mathbb{E}\big[\mathbb{E}[u \mid X]\big] = \mathbb{E}[0] = 0 \]
2) Start from the definition:
\[ \operatorname{Cov}(X, u) = \mathbb{E}[Xu] - \mathbb{E}[X]\,\mathbb{E}[u] \]
Since \(\mathbb{E}[u] = 0\) from part 1, it suffices to show \(\mathbb{E}[Xu] = 0\):
\[ \mathbb{E}[Xu] = \mathbb{E}\big[\mathbb{E}[Xu \mid X]\big] = \mathbb{E}\big[X \cdot \mathbb{E}[u \mid X]\big] = \mathbb{E}[X \cdot 0] = 0 \]
Therefore \(\operatorname{Cov}(X, u) = 0\).
Interpretation. These are the population analogues of the Part A results:
| Part A (sample, by construction) | Part B (population, by assumption) |
|---|---|
| \(\sum \hat{u}_i = 0\) | \(\mathbb{E}[u] = 0\) |
| \(\sum X_i \hat{u}_i = 0\) | \(\operatorname{Cov}(X, u) = 0\) |
The sample properties hold automatically for any OLS fit. The population properties require the zero conditional mean assumption. This parallel is not a coincidence — OLS is designed so that its sample moment conditions mimic the population ones.
4.2.5 B4. Mean independence vs zero conditional mean (and why we care)
4.2.5.1 Context
Students sometimes encounter different versions of the “no relationship between \(u\) and \(X\)” assumption. Here we clarify the two most common ones and their logical relationship.
4.2.5.2 B4.a Definitions
Zero conditional mean: \[ \mathbb{E}[u \mid X] = 0 \]
Mean independence: \[ \mathbb{E}[u \mid X] = \mathbb{E}[u] \]
Mean independence says the conditional mean of \(u\) does not depend on \(X\) at all; zero conditional mean additionally pins that constant to zero.
4.2.5.3 Question
- If \(\mathbb{E}[u] = 0\), show that mean independence implies zero conditional mean.
- In 2 lines: which is stronger, and why?
4.2.5.4 Solution
1) If mean independence holds, then \(\mathbb{E}[u \mid X] = \mathbb{E}[u]\). If also \(\mathbb{E}[u] = 0\):
\[ \mathbb{E}[u \mid X] = 0 \]
2) Zero conditional mean is stronger: it implies mean independence (with \(\mathbb{E}[u]=0\)), while mean independence alone allows \(\mathbb{E}[u \mid X]\) to equal any constant, not necessarily zero. In a model with an intercept the distinction is minor, since a nonzero constant mean of \(u\) can be absorbed into \(\beta_0\); zero conditional mean is the form needed for OLS unbiasedness.
4.2.7 Quick simulation intuition in R
This chunk is just to build intuition: it shows that when we generate data with \(\mathbb{E}[u\mid X]=0\), sample correlation between \(X\) and residual-like noise fluctuates around zero.
set.seed(123)
n <- 500
X <- rnorm(n)
u <- rnorm(n) # independent of X => E[u|X]=0
Y <- 1 + 2*X + u
# Check sample correlation between X and u (should be close to 0 on average)
cor(X, u)
## [1] -0.05193691
4.3 Part C. Unbiasedness of OLS (with solutions)
Narrative idea. Part A gave sample identities (orthogonality) that hold mechanically.
Part B defined the population object we care about (the PRF) and introduced the key assumption \(\mathbb{E}[u\mid X]=0\).
Part C connects the two: we use an algebraic representation of the OLS slope and show that under random sampling and zero conditional mean, OLS is unbiased.
We work with the population model (for each observation \(i\)):
\[ Y_i = \beta_0 + \beta_1 X_i + u_i. \]
Assume i.i.d. sampling and the zero conditional mean assumption:
\[ \mathbb{E}[u_i\mid X_i] = 0. \]
4.3.1 C1. Key representation of the OLS slope
4.3.1.1 Question
Using the known formula for the OLS slope,
\[ \hat\beta_1=\frac{\sum_{i=1}^n (X_i-\bar X)(Y_i-\bar Y)}{\sum_{i=1}^n (X_i-\bar X)^2}, \]
show that:
\[ \boxed{\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n (X_i-\bar X)u_i}{\sum_{i=1}^n (X_i-\bar X)^2}.} \]
4.3.1.2 Solution
Start from the decomposition:
\[ Y_i-\bar Y = \beta_1(X_i-\bar X) + (u_i-\bar u), \]
because \(\bar Y=\beta_0+\beta_1\bar X+\bar u\).
Multiply both sides by \((X_i-\bar X)\) and sum over \(i\):
\[ \sum (X_i-\bar X)(Y_i-\bar Y) = \beta_1\sum (X_i-\bar X)^2 + \sum (X_i-\bar X)(u_i-\bar u). \]
But \(\sum (X_i-\bar X)\bar u = \bar u\sum (X_i-\bar X)=0\), so
\[ \sum (X_i-\bar X)(u_i-\bar u)=\sum (X_i-\bar X)u_i. \]
Divide both sides by \(S_{xx}=\sum (X_i-\bar X)^2\):
\[ \hat\beta_1 = \beta_1 + \frac{\sum (X_i-\bar X)u_i}{S_{xx}}. \]
Hence the desired representation holds.
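Because a simulation lets us observe the true errors \(u_i\), we can confirm the representation numerically; a minimal sketch (parameter values are illustrative):

```r
set.seed(1)
n <- 100
beta0 <- 1; beta1 <- 2
X <- rnorm(n)
u <- rnorm(n)
Y <- beta0 + beta1 * X + u

b1_hat <- coef(lm(Y ~ X))[2]

# Representation: beta1 + sum((Xi - Xbar) ui) / Sxx
Sxx    <- sum((X - mean(X))^2)
b1_rep <- beta1 + sum((X - mean(X)) * u) / Sxx

c(b1_hat = unname(b1_hat), b1_rep = b1_rep)
```

The two numbers agree up to floating-point error, since the representation is an algebraic identity.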
4.3.2 C2. Unbiasedness of the slope: conditional then unconditional
Narrative idea. Unbiasedness is a statement about repeated sampling. We first show unbiasedness conditional on the observed \(X\) sample, then take expectations again to get unconditional unbiasedness.
4.3.2.1 Question
Show:
\[ \boxed{\mathbb{E}[\hat\beta_1\mid X_1,\dots,X_n]=\beta_1} \quad\Rightarrow\quad \boxed{\mathbb{E}[\hat\beta_1]=\beta_1.} \]
4.3.2.2 Solution
From C1:
\[ \hat\beta_1-\beta_1 = \frac{1}{S_{xx}}\sum_{i=1}^n (X_i-\bar X)u_i, \qquad S_{xx}=\sum (X_i-\bar X)^2. \]
Condition on \(X=(X_1,\dots,X_n)\). The weights \((X_i-\bar X)/S_{xx}\) are constants given \(X\), so
\[ \mathbb{E}[\hat\beta_1-\beta_1\mid X] = \frac{1}{S_{xx}}\sum (X_i-\bar X)\,\mathbb{E}[u_i\mid X]. \]
Under i.i.d. sampling and \(\mathbb{E}[u_i\mid X_i]=0\), we have \(\mathbb{E}[u_i\mid X]=0\) for each \(i\). Therefore
\[ \mathbb{E}[\hat\beta_1-\beta_1\mid X]=0 \quad\Rightarrow\quad \boxed{\mathbb{E}[\hat\beta_1\mid X]=\beta_1.} \]
Finally, apply the Law of Iterated Expectations:
\[ \mathbb{E}[\hat\beta_1] = \mathbb{E}[\mathbb{E}[\hat\beta_1\mid X]] = \mathbb{E}[\beta_1]=\beta_1. \]
4.3.3 C3. Unbiasedness of the intercept
Narrative idea. Once we have unbiasedness of the slope, unbiasedness of the intercept follows from the identity \(\hat\beta_0=\bar Y-\hat\beta_1\bar X\).
4.3.3.1 Question
Show:
\[ \boxed{\mathbb{E}[\hat\beta_0\mid X_1,\dots,X_n]=\beta_0} \quad\Rightarrow\quad \boxed{\mathbb{E}[\hat\beta_0]=\beta_0.} \]
4.3.3.2 Solution
Start from:
\[ \hat\beta_0 = \bar Y - \hat\beta_1\bar X. \]
Using \(\bar Y=\beta_0+\beta_1\bar X+\bar u\),
\[ \hat\beta_0-\beta_0 = \bar u - (\hat\beta_1-\beta_1)\bar X. \]
Condition on \(X\):
- By zero conditional mean and iterated expectations, \(\mathbb{E}[\bar u\mid X]=0\).
- From C2, \(\mathbb{E}[\hat\beta_1-\beta_1\mid X]=0\).
Thus:
\[ \mathbb{E}[\hat\beta_0-\beta_0\mid X]=0 \quad\Rightarrow\quad \boxed{\mathbb{E}[\hat\beta_0\mid X]=\beta_0.} \]
Unconditional unbiasedness follows by iterated expectations as before.
4.3.5 Quick simulation intuition in R
This chunk illustrates unbiasedness in repeated samples when \(\mathbb{E}[u\mid X]=0\). The sample average of \(\hat\beta_1\) across many simulations should be close to the true \(\beta_1\).
set.seed(123)
B <- 2000
n <- 200
beta0 <- 1
beta1 <- 2
b1_hat <- numeric(B)
for (b in 1:B) {
X <- rnorm(n)
u <- rnorm(n) # independent => E[u|X]=0
Y <- beta0 + beta1 * X + u
b1_hat[b] <- coef(lm(Y ~ X))[2]
}
c(
mean_b1_hat = mean(b1_hat),
true_beta1 = beta1
)
## mean_b1_hat true_beta1
## 1.999375 2.000000
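The same repeated-sampling check applies to the intercept (C3); a self-contained sketch with the same design:

```r
set.seed(123)
B <- 2000
n <- 200
beta0 <- 1
beta1 <- 2
b0_hat <- numeric(B)
for (b in 1:B) {
  X <- rnorm(n)
  u <- rnorm(n)                 # independent => E[u|X] = 0
  Y <- beta0 + beta1 * X + u
  b0_hat[b] <- coef(lm(Y ~ X))[1]
}
c(mean_b0_hat = mean(b0_hat), true_beta0 = beta0)
```

The average of \(\hat\beta_0\) across simulations should be very close to the true \(\beta_0 = 1\).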
4.4 Part D. Sampling variance of OLS and estimating \(\sigma^2\) (with solutions)
Narrative idea. In Part C we showed OLS is unbiased under i.i.d. sampling and zero conditional mean.
Part D asks a different question: how variable is OLS across samples? That is, what is the variance of \(\hat\beta_1\) and \(\hat\beta_0\)?
To get clean formulas, we add a variance assumption (homoskedasticity) and a condition ruling out correlation of the errors across observations.
We maintain the model:
\[ Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad \mathbb{E}[u_i\mid X_i]=0. \]
4.4.1 D1. Assumptions for the classical variance formulas
To derive simple variance expressions, assume:
Homoskedasticity \[ \operatorname{Var}(u_i\mid X_i)=\sigma^2 \quad \text{(constant in } X_i\text{)}. \]
No conditional correlation across observations \[ \operatorname{Cov}(u_i,u_j\mid X_1,\dots,X_n)=0 \quad (i\neq j). \]
Define: \[ S_{xx} \equiv \sum_{i=1}^n (X_i-\bar X)^2. \]
4.4.2 D2. Conditional variance of the slope
4.4.2.1 Question
Using the representation from Part C,
\[ \hat\beta_1-\beta_1 = \frac{\sum_{i=1}^n (X_i-\bar X)u_i}{S_{xx}}, \]
show that:
\[ \boxed{\operatorname{Var}(\hat\beta_1\mid X_1,\dots,X_n)=\frac{\sigma^2}{S_{xx}}.} \]
4.4.2.2 Solution
Condition on the full regressor sample \(X=(X_1,\dots,X_n)\). Then \(S_{xx}\) and \((X_i-\bar X)\) are constants. Compute:
\[ \operatorname{Var}(\hat\beta_1\mid X) = \operatorname{Var}\left(\frac{1}{S_{xx}}\sum (X_i-\bar X)u_i \Bigm| X\right) = \frac{1}{S_{xx}^2}\operatorname{Var}\left(\sum (X_i-\bar X)u_i \Bigm| X\right). \]
Using conditional uncorrelatedness across \(i\):
\[ \operatorname{Var}\left(\sum (X_i-\bar X)u_i \mid X\right) = \sum (X_i-\bar X)^2\operatorname{Var}(u_i\mid X) = \sum (X_i-\bar X)^2\sigma^2 = \sigma^2 S_{xx}. \]
Therefore:
\[ \operatorname{Var}(\hat\beta_1\mid X) = \frac{1}{S_{xx}^2}(\sigma^2 S_{xx}) = \boxed{\frac{\sigma^2}{S_{xx}}.} \]
Interpretation. The slope is more precise when (i) noise is smaller (\(\sigma^2\) small) and/or (ii) \(X\) has more spread (\(S_{xx}\) large).
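The formula \(\sigma^2/S_{xx}\) can be checked by Monte Carlo: hold the regressor sample fixed, redraw homoskedastic errors many times, and compare the simulated variance of \(\hat\beta_1\) with the formula. A sketch (design values are illustrative):

```r
set.seed(123)
n <- 100
B <- 3000
sigma <- 2
X   <- rnorm(n)                   # held fixed across replications
Sxx <- sum((X - mean(X))^2)

b1_hat <- numeric(B)
for (b in 1:B) {
  u <- rnorm(n, sd = sigma)       # homoskedastic, uncorrelated errors
  Y <- 1 + 2 * X + u
  b1_hat[b] <- coef(lm(Y ~ X))[2]
}

c(mc_variance = var(b1_hat), formula = sigma^2 / Sxx)
```

The two values should agree up to Monte Carlo noise (a few percent for this many replications).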
4.4.3 D3. Conditional variance of the intercept
4.4.3.1 Question
Show that:
\[ \boxed{\operatorname{Var}(\hat\beta_0\mid X) = \sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{S_{xx}}\right).} \]
4.4.3.2 Solution
Use: \[ \hat\beta_0 = \bar Y - \hat\beta_1\bar X. \]
From the model, \(\bar Y = \beta_0+\beta_1\bar X+\bar u\), so:
\[ \hat\beta_0-\beta_0 = \bar u - (\hat\beta_1-\beta_1)\bar X. \]
Condition on \(X\). Then \(\bar X\) is constant, and we compute:
\[ \operatorname{Var}(\hat\beta_0\mid X) = \operatorname{Var}(\bar u\mid X) + \bar X^2\operatorname{Var}(\hat\beta_1\mid X) -2\bar X\operatorname{Cov}(\bar u,\hat\beta_1\mid X). \]
Under the classical assumptions, \(\operatorname{Var}(\bar u\mid X)=\sigma^2/n\). Also we already have \(\operatorname{Var}(\hat\beta_1\mid X)=\sigma^2/S_{xx}\).
It remains to show the covariance term is zero. Using \[ \hat\beta_1-\beta_1 = \frac{1}{S_{xx}}\sum (X_i-\bar X)u_i, \qquad \bar u = \frac{1}{n}\sum u_i, \] the covariance is proportional to: \[ \operatorname{Cov}\left(\sum u_i,\ \sum (X_i-\bar X)u_i \mid X\right) = \sum (X_i-\bar X)\operatorname{Var}(u_i\mid X) = \sigma^2 \sum (X_i-\bar X)=0. \]
Hence: \[ \operatorname{Var}(\hat\beta_0\mid X) = \frac{\sigma^2}{n}+\bar X^2\frac{\sigma^2}{S_{xx}} = \boxed{\sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{S_{xx}}\right).} \]
4.4.4 D4. Estimating \(\sigma^2\): the residual variance estimator
Define residuals:
\[ \hat u_i = Y_i-\hat\beta_0-\hat\beta_1X_i. \]
4.4.4.1 Question
Propose an estimator of \(\sigma^2\) based on the residuals, and justify the degrees-of-freedom correction.
4.4.4.2 Solution
The standard estimator is:
\[ \boxed{\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^n \hat u_i^2.} \]
Why \(n-2\)? Two parameters \((\beta_0,\beta_1)\) were estimated. The residuals are constrained by the two normal equations (Part A), so the remaining free variation used to estimate \(\sigma^2\) corresponds to \(n-2\) degrees of freedom.
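R's `lm` uses exactly this \(n-2\) divisor; a sketch confirming that the manual estimator matches `sigma(fit)^2` and that the residual degrees of freedom are \(n-2\):

```r
set.seed(123)
n <- 120
X <- rnorm(n)
Y <- 1 + 2 * X + rnorm(n, sd = 3)

fit  <- lm(Y ~ X)
uhat <- resid(fit)

# Manual estimator with the n - 2 divisor
sigma2_hat <- sum(uhat^2) / (n - 2)

# sigma(fit) is lm's residual standard error, built with the same divisor
c(manual = sigma2_hat, from_lm = sigma(fit)^2, df = df.residual(fit))
```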
4.4.5 D5. Estimated variance and standard errors of OLS
Plug in \(\hat\sigma^2\):
\[ \boxed{\widehat{\operatorname{Var}}(\hat\beta_1\mid X)=\frac{\hat\sigma^2}{S_{xx}}} \qquad\Rightarrow\qquad \boxed{\text{s.e.}(\hat\beta_1)=\sqrt{\frac{\hat\sigma^2}{S_{xx}}}}. \]
and
\[ \boxed{\widehat{\operatorname{Var}}(\hat\beta_0\mid X)=\hat\sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{S_{xx}}\right)} \qquad\Rightarrow\qquad \boxed{\text{s.e.}(\hat\beta_0)=\sqrt{\hat\sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{S_{xx}}\right)}}. \]
Interpretation. Standard errors translate sampling variability into a scale that allows inference (confidence intervals and t-tests).
4.4.6 Quick check in R
set.seed(123)
n <- 200
X <- rnorm(n, mean = 2, sd = 1.5)
u <- rnorm(n, mean = 0, sd = 2)
Y <- 1 + 1.5*X + u
fit <- lm(Y ~ X)
# Manual pieces for the classical formulas
uhat <- resid(fit)
sigma2_hat <- sum(uhat^2)/(n-2)
Sxx <- sum((X - mean(X))^2)
se_b1_manual <- sqrt(sigma2_hat / Sxx)
se_b0_manual <- sqrt(sigma2_hat * (1/n + mean(X)^2 / Sxx))
c(
se_b0_lm = summary(fit)$coef[1,2],
se_b0_manual = se_b0_manual,
se_b1_lm = summary(fit)$coef[2,2],
se_b1_manual = se_b1_manual
)
## se_b0_lm se_b0_manual se_b1_lm se_b1_manual
## 0.2437701 0.2437701 0.1000182 0.1000182
4.5 Part E. Functional form and units: interpreting coefficients correctly (with solutions)
Narrative idea. Even if OLS is unbiased and we know its variance, we still need to interpret coefficients correctly.
A coefficient is a number with units, and the functional form you choose (levels vs logs, scaling) determines the meaning of “one unit increase.”
We use one running example:
- \(Y\) = weekly earnings (dollars)
- \(X\) = hours worked per week
4.5.1 E1. Units of the slope in the level–level model
Consider:
\[ Y = \beta_0 + \beta_1 X + u. \]
4.5.1.1 Question
- What are the units of \(\beta_1\)?
- Give the economic interpretation of \(\beta_1\) in words.
4.5.1.2 Solution
\(Y\) is dollars and \(X\) is hours, so \(\beta_1\) has units dollars per hour.
\(\beta_1\) is the change in expected weekly earnings associated with one additional hour worked per week, holding other unobservables in \(u\) fixed in the conditional-mean sense (under \(\mathbb{E}[u\mid X]=0\)).
4.5.2 E2. Rescaling regressors: why coefficients change mechanically
Define a rescaled regressor:
\[ X^{(10)} \equiv \frac{X}{10}. \]
4.5.2.1 Question
If we regress \(Y\) on \(X^{(10)}\), how does the slope change? Relate \(\beta_1^{(10)}\) to \(\beta_1\).
4.5.2.2 Solution
Since \(X = 10X^{(10)}\), substitute into the original model:
\[ Y = \beta_0 + \beta_1(10X^{(10)}) + u = \beta_0 + (10\beta_1)X^{(10)} + u. \]
So:
\[ \boxed{\beta_1^{(10)} = 10\beta_1.} \]
Interpretation. A one-unit increase in \(X^{(10)}\) is a 10-hour increase in \(X\), so the slope scales accordingly.
4.5.3 E3. Rescaling outcomes: what changes and what does not
Define \(Y^{(1000)} \equiv Y/1000\) (earnings in “thousands of dollars”).
4.5.3.1 Question
If we regress \(Y^{(1000)}\) on \(X\), how do the intercept and slope change? Relate \((\beta_0^{(1000)}, \beta_1^{(1000)})\) to \((\beta_0, \beta_1)\).
4.5.3.2 Solution
Divide the entire equation by 1000:
\[ \frac{Y}{1000} = \frac{\beta_0}{1000} + \frac{\beta_1}{1000}X + \frac{u}{1000}. \]
Thus:
\[ \boxed{\beta_0^{(1000)} = \beta_0/1000, \quad \beta_1^{(1000)} = \beta_1/1000.} \]
Interpretation. Changing the units of the dependent variable rescales coefficients, but does not change the underlying relationship—only the measurement scale.
4.5.4 E4. Log–level model: interpreting semi-elasticities
Consider:
\[ \ln(Y) = \gamma_0 + \gamma_1 X + v. \]
4.5.4.1 Question
Interpret \(\gamma_1\). Give a rule-of-thumb interpretation for small \(\gamma_1\).
4.5.4.2 Solution
\(\gamma_1\) is a semi-elasticity: it measures the change in log earnings from a one-unit increase in \(X\).
For small \(\gamma_1\):
\[ \Delta \ln(Y) \approx \frac{\Delta Y}{Y}. \]
So, approximately:
\[ \boxed{\text{A 1-unit increase in }X\text{ is associated with about }100\gamma_1\%\text{ change in }Y.} \]
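A simulation makes the rule of thumb concrete: generate data with a known semi-elasticity and recover it by regressing \(\ln(Y)\) on \(X\) (a sketch; the true value 0.02 and the earnings setup are illustrative):

```r
set.seed(123)
n <- 500
X <- rnorm(n, mean = 40, sd = 5)             # hours worked (illustrative)
Y <- exp(5 + 0.02 * X + rnorm(n, sd = 0.1))  # true semi-elasticity: 0.02

gamma1 <- coef(lm(log(Y) ~ X))[2]

# Rule of thumb: one extra hour ~ 100 * gamma1 percent more earnings
c(gamma1_hat = unname(gamma1), percent_per_hour = unname(100 * gamma1))
```

The estimated \(\hat\gamma_1\) should be close to 0.02, i.e. roughly a 2% earnings increase per additional hour.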
4.5.5 E5. Log–log model: interpreting elasticities
Consider:
\[ \ln(Y) = \delta_0 + \delta_1 \ln(X) + e. \]
Here \(\delta_1\) is an elasticity: a 1% increase in \(X\) is associated with approximately a \(\delta_1\)% change in \(Y\), since \(\Delta\ln(Y) \approx \delta_1 \Delta\ln(X)\) and log differences approximate proportional changes.
4.5.6 E6. Functional form as a modeling choice (concept check)
4.5.6.1 Question
When and why might you prefer modeling \(\ln(Y)\) rather than \(Y\) in levels?
4.5.6.2 Solution
Logs are often preferred when variation in \(Y\) is roughly proportional to its level (e.g., earnings), which can make relationships closer to linear in logs and can reduce heteroskedasticity. Logs also lead to percent-change interpretations that are often more meaningful than “dollar changes” across very different income levels.
4.5.7 Small R demo: same data, different scales
set.seed(123)
n <- 200
X <- rnorm(n, mean = 40, sd = 5) # hours per week
u <- rnorm(n, mean = 0, sd = 50)
Y <- 200 + 15*X + u # dollars per week
fit_level <- lm(Y ~ X)
X10 <- X/10
fit_rescaleX <- lm(Y ~ X10)
Yk <- Y/1000
fit_rescaleY <- lm(Yk ~ X)
c(
b1_level = coef(fit_level)[2],
b1_rescaleX = coef(fit_rescaleX)[2],
b1_rescaleY = coef(fit_rescaleY)[2]
)
## b1_level.X b1_rescaleX.X10 b1_rescaleY.X
## 14.70745558 147.07455583 0.01470746
You should see approximately:
- b1_rescaleX ≈ 10 * b1_level
- b1_rescaleY ≈ b1_level / 1000