3 Tutorial 2: Simple Linear Regression

3.1 Problem 1: Simple Linear Regression with Intercept

In Tutorial 1 we ran lm() and got a slope and an intercept. This tutorial asks: where do those numbers come from, and what do they optimize?

The goal of simple linear regression is to summarize the relationship between two variables — education and wages, match rates and 401(k) participation — with a straight line. That line cannot fit every data point perfectly. The question is: which line is best? OLS answers this by minimizing the sum of squared prediction errors. Squaring penalizes large mistakes more than small ones, produces a unique closed-form solution, and connects directly to variance decomposition and inference.

The error term \(\varepsilon_i\) (written \(u_i\) in Tutorial 3 onward, following Wooldridge — both denote the same object) captures everything that moves \(Y\) beyond \(X\): unobserved ability, luck, measurement error.


A picture of what we are doing. Each point is one observation. The blue line is a candidate fit. The red segments are the residuals — the vertical gaps between the data and the line. OLS chooses the line that makes those squared gaps as small as possible.

Figure 3.1: OLS minimizes the sum of squared vertical distances (red segments). The slope and intercept are chosen so no other line achieves a smaller total.


We study the simple linear regression with an intercept:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \ldots, n \]

with the regularity condition \(\sum_{i=1}^n (X_i - \bar{X})^2 > 0\) (i.e., not all \(X_i\) are equal).
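To see why the regularity condition matters, here is a minimal base-R sketch (illustrative simulated data, not the tutorial's): when all \(X_i\) are equal, the slope denominator \(\sum_{i=1}^n (X_i - \bar{X})^2\) is zero, and lm() reports NA for the slope because the regressor is collinear with the intercept.

```r
# Degenerate case: no variation in X, so the slope is not identified.
set.seed(1)
x <- rep(3, 10)      # all X_i equal
y <- rnorm(10)
coef(lm(y ~ x))      # intercept is mean(y); slope on x is NA
```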

Sample means:

\[ \bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i, \qquad \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \]

OLS objective: Minimize the sum of squared residuals:

\[ S(\beta_0, \beta_1) = \sum_{i=1}^n \left( Y_i - \beta_0 - \beta_1 X_i \right)^2 \]

Definitions:

  • Fitted values: \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\)
  • Residuals: \(\hat{u}_i = Y_i - \hat{Y}_i\)

3.1.1 Derive the OLS Normal Equations

Differentiate \(S(\beta_0, \beta_1)\) with respect to \(\beta_0\) and \(\beta_1\) and set equal to zero.

First-Order Condition for \(\beta_0\):

\[ \frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^n \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0 \]

This simplifies to:

\[ \sum_{i=1}^n \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0 \]

Expanding:

\[ \sum_{i=1}^n Y_i = n \beta_0 + \beta_1 \sum_{i=1}^n X_i \]

Dividing by \(n\) and solving for \(\beta_0\) (hats mark the values at the optimum):

\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]

First-Order Condition for \(\beta_1\):

\[ \frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^n X_i \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0 \]

This simplifies to:

\[ \sum_{i=1}^n X_i \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0 \]

Derive the Slope in Centered Form:

Substitute \(\beta_0 = \bar{Y} - \beta_1 \bar{X}\) into the residual expression:

\[ Y_i - \beta_0 - \beta_1 X_i = Y_i - (\bar{Y} - \beta_1 \bar{X}) - \beta_1 X_i = (Y_i - \bar{Y}) - \beta_1 (X_i - \bar{X}) \]

Minimizing \(\sum_{i=1}^n \left[ (Y_i - \bar{Y}) - \beta_1 (X_i - \bar{X}) \right]^2\) with respect to \(\beta_1\) gives:

\[ \sum_{i=1}^n (X_i - \bar{X}) \left[ (Y_i - \bar{Y}) - \beta_1 (X_i - \bar{X}) \right] = 0 \]

Rearranging:

\[ \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}) = \beta_1 \sum_{i=1}^n (X_i - \bar{X})^2 \]

OLS Estimators:

\[ \boxed{\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}} \]

\[ \boxed{\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}} \]
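The boxed formulas can be checked numerically against lm(). A minimal sketch on simulated data (the names x, y, b0_hat, b1_hat are illustrative, not from the tutorial's datasets):

```r
# Compute the OLS slope and intercept by hand and compare with lm().
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)    # true beta_0 = 2, beta_1 = 0.5

b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0_hat, b1_hat))   # TRUE
```

The agreement is exact up to floating-point tolerance: lm() solves the same normal equations, just via a numerically stabler QR decomposition.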

3.1.2 Residual Orthogonality Properties

These are the key properties that follow directly from the first-order conditions.

3.1.2.1 Residuals Sum to Zero

From the FOC for \(\beta_0\) evaluated at \((\hat{\beta}_0, \hat{\beta}_1)\):

\[ \sum_{i=1}^n \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right) = 0 \]

Therefore:

\[ \sum_{i=1}^n \hat{u}_i = 0 \]

3.1.2.2 Residuals are Orthogonal to \(X\)

From the FOC for \(\beta_1\) evaluated at \((\hat{\beta}_0, \hat{\beta}_1)\):

\[ \sum_{i=1}^n X_i \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right) = 0 \]

Therefore:

\[ \sum_{i=1}^n X_i \hat{u}_i = 0 \]

3.1.2.3 Residuals are Orthogonal to \((X_i - \bar{X})\)

Using the result \(\sum_{i=1}^n \hat{u}_i = 0\):

\[ \sum_{i=1}^n (X_i - \bar{X}) \hat{u}_i = \sum_{i=1}^n X_i \hat{u}_i - \bar{X} \sum_{i=1}^n \hat{u}_i = 0 - \bar{X} \cdot 0 = 0 \]

Therefore:

\[ \sum_{i=1}^n (X_i - \bar{X}) \hat{u}_i = 0 \]

3.1.2.4 Residuals are Orthogonal to Fitted Values

3.1.2.5 Question

Show that:

\[ \sum_{i=1}^n \hat Y_i \hat u_i = 0 \]

3.1.2.6 Solution

Write fitted values as \(\hat Y_i=\hat\beta_0+\hat\beta_1X_i\) and expand:

\[ \sum_{i=1}^n \hat Y_i\hat u_i = \sum_{i=1}^n (\hat\beta_0+\hat\beta_1X_i)\hat u_i = \hat\beta_0\sum_{i=1}^n \hat u_i + \hat\beta_1\sum_{i=1}^n X_i\hat u_i. \]

By the two results just shown (residuals sum to zero and are orthogonal to \(X\)), both sums are zero. Therefore:

\[ \sum_{i=1}^n \hat Y_i \hat u_i = 0 \]

Interpretation. The part OLS explains (\(\hat Y\)) and the part it cannot explain (\(\hat u\)) do not move together in the sample.
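All three orthogonality properties can be verified in one pass. A short sketch on simulated data (illustrative, not the tutorial's datasets):

```r
# Check: residuals sum to zero, are orthogonal to X, and to fitted values.
set.seed(7)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit  <- lm(y ~ x)
u    <- resid(fit)
yhat <- fitted(fit)

# Each sum is zero up to floating-point error:
c(sum_u = sum(u), sum_xu = sum(x * u), sum_yhat_u = sum(yhat * u))
```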



3.1.3 Properties of Fitted Values

3.1.3.1 The Regression Line Passes Through \((\bar{X}, \bar{Y})\)

Taking the average of fitted values:

\[ \overline{\hat{Y}} = \frac{1}{n} \sum_{i=1}^n \hat{Y}_i = \frac{1}{n} \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 X_i) = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} \]

Since \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\):

\[ \overline{\hat{Y}} = (\bar{Y} - \hat{\beta}_1 \bar{X}) + \hat{\beta}_1 \bar{X} = \bar{Y} \]

Equivalently, at \(X = \bar{X}\):

\[ \hat{Y}(\bar{X}) = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} = \bar{Y} \]

Conclusion: The OLS fitted line passes through the point \((\bar{X}, \bar{Y})\).

3.1.3.2 Mean Residual is Zero

Since \(\sum_{i=1}^n \hat{u}_i = 0\):

\[ \bar{\hat{u}} = \frac{1}{n} \sum_{i=1}^n \hat{u}_i = 0 \]
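Both fitted-value properties are easy to confirm numerically. A sketch on simulated data (illustrative names throughout):

```r
# Check: mean of fitted values equals mean of Y, mean residual is zero,
# and the fitted line passes through (mean(x), mean(y)).
set.seed(11)
x <- runif(50)
y <- 3 + x + rnorm(50)
fit <- lm(y ~ x)

mean(fitted(fit)) - mean(y)                                # ~ 0
mean(resid(fit))                                           # ~ 0
predict(fit, newdata = data.frame(x = mean(x))) - mean(y)  # ~ 0
```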


3.1.4 Decomposition of Variation (TSS = ESS + RSS)

Definitions:

| Quantity | Name | Formula |
|----------|------|---------|
| TSS | Total Sum of Squares | \(\sum_{i=1}^n (Y_i - \bar{Y})^2\) |
| ESS | Explained Sum of Squares | \(\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2\) |
| RSS | Residual Sum of Squares | \(\sum_{i=1}^n \hat{u}_i^2\) |

3.1.4.1 Decomposition

Start from:

\[ Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i) = (\hat{Y}_i - \bar{Y}) + \hat{u}_i \]

Square both sides and sum:

\[ \sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^n \hat{u}_i^2 + 2 \sum_{i=1}^n (\hat{Y}_i - \bar{Y}) \hat{u}_i \]

3.1.4.2 The Cross-Term Vanishes

Note that:

\[ \hat{Y}_i - \bar{Y} = (\hat{\beta}_0 + \hat{\beta}_1 X_i) - \bar{Y} = (\bar{Y} - \hat{\beta}_1 \bar{X} + \hat{\beta}_1 X_i) - \bar{Y} = \hat{\beta}_1 (X_i - \bar{X}) \]

Therefore:

\[ \sum_{i=1}^n (\hat{Y}_i - \bar{Y}) \hat{u}_i = \hat{\beta}_1 \sum_{i=1}^n (X_i - \bar{X}) \hat{u}_i = \hat{\beta}_1 \cdot 0 = 0 \]

3.1.4.3 Result

\[ \boxed{\text{TSS} = \text{ESS} + \text{RSS}} \]
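The decomposition, and the fact that \(R^2 = \text{ESS}/\text{TSS}\), can be checked directly. A sketch on simulated data (illustrative, not the tutorial's datasets):

```r
# Check: TSS = ESS + RSS, and R^2 reported by summary() equals ESS/TSS.
set.seed(5)
x <- rnorm(80)
y <- 2 - x + rnorm(80)
fit <- lm(y ~ x)

TSS <- sum((y - mean(y))^2)
ESS <- sum((fitted(fit) - mean(y))^2)
RSS <- sum(resid(fit)^2)

all.equal(TSS, ESS + RSS)                     # TRUE
all.equal(ESS / TSS, summary(fit)$r.squared)  # TRUE
```

Note that the decomposition relies on the cross-term vanishing, which in turn relies on the regression including an intercept; without one, TSS need not equal ESS + RSS.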


3.1.5 Sample Covariance Implications

3.1.5.1 Sample Covariance Between \(X\) and \(\hat{u}\) is Zero

Since \(\sum_{i=1}^n \hat{u}_i = 0\), the sample covariance between \(X\) and \(\hat{u}\) is proportional to:

\[ \sum_{i=1}^n (X_i - \bar{X}) \hat{u}_i = 0 \]

Therefore:

\[ \widehat{\text{Cov}}(X, \hat{u}) = 0 \]
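A one-line numerical check (simulated data, for illustration):

```r
# The sample covariance between the regressor and the OLS residuals is zero
# up to floating-point error.
set.seed(3)
x <- rnorm(50)
y <- 3 - x + rnorm(50)
cov(x, resid(lm(y ~ x)))   # ~ 0
```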

3.1.6 Fitted Values are Orthogonal to Residuals (Alternative Derivation)

\[ \sum_{i=1}^n \hat{Y}_i \hat{u}_i = \sum_{i=1}^n (\hat{Y}_i - \bar{Y}) \hat{u}_i + \bar{Y} \sum_{i=1}^n \hat{u}_i = 0 + \bar{Y} \cdot 0 = 0 \]

Therefore:

\[ \sum_{i=1}^n \hat{Y}_i \hat{u}_i = 0 \]


3.1.7 Summary of Key Results

3.1.7.1 OLS Estimators

| Estimator | Formula |
|-----------|---------|
| Slope | \(\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}\) |
| Intercept | \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\) |

3.1.7.2 Residual Properties

| Property | Result |
|----------|--------|
| Residuals sum to zero | \(\sum_{i=1}^n \hat{u}_i = 0\) |
| Residuals orthogonal to \(X\) | \(\sum_{i=1}^n X_i \hat{u}_i = 0\) |
| Residuals orthogonal to centered \(X\) | \(\sum_{i=1}^n (X_i - \bar{X}) \hat{u}_i = 0\) |
| Mean residual is zero | \(\bar{\hat{u}} = 0\) |

3.1.7.3 Additional Properties

| Property | Result |
|----------|--------|
| Regression line | Passes through \((\bar{X}, \bar{Y})\) |
| Variance decomposition | TSS = ESS + RSS |
| Sample covariance | \(\widehat{\text{Cov}}(X, \hat{u}) = 0\) |
| Fitted values orthogonal to residuals | \(\sum_{i=1}^n \hat{Y}_i \hat{u}_i = 0\) |

3.2 Problem 2: Wooldridge Computer Exercise C1 (401K): Participation and Match Rate

3.2.1 Overview

This notebook reproduces Wooldridge, Computer Exercise C1 (401K). The goal is to use plan-level data to study whether a more generous employer match rate is associated with higher 401(k) participation.

  • Outcome: prate = percentage of eligible workers with an active 401(k) account.
  • Regressor: mrate = match rate (average firm contribution per $1 worker contribution).

We estimate the simple linear regression: \[ prate = \beta_0 + \beta_1 \, mrate + u. \]

3.2.2 Load packages and data

# If you do not have the package installed, uncomment the next line:
# install.packages("wooldridge")
library(wooldridge)

# Load the dataset used in this exercise.
data("k401k")
df <- k401k

# Quick check: dimensions and variable names
dim(df)
## [1] 1534    8
names(df)
## [1] "prate"   "mrate"   "totpart" "totelg"  "age"     "totemp"  "sole"   
## [8] "ltotemp"

3.2.3 (i) Compute the sample averages of participation and match rates

# In this dataset:
# - prate is the plan participation rate (in percentage points)
# - mrate is the match rate

mean_prate <- mean(df$prate, na.rm = TRUE)
mean_mrate <- mean(df$mrate, na.rm = TRUE)

mean_prate
## [1] 87.36291
mean_mrate
## [1] 0.7315124

Interpretation.

  • mean_prate is the average percentage of eligible workers participating across plans.
  • mean_mrate is the average employer match rate across plans.

3.2.4 (ii) Estimate the simple regression: prate on mrate

# Estimate the simple OLS regression model
m1 <- lm(prate ~ mrate, data = df)

# Summary includes coefficient estimates, standard errors, t-stats, and R^2
s1 <- summary(m1)
s1
## 
## Call:
## lm(formula = prate ~ mrate, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.303  -8.184   5.178  12.712  16.807 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.0755     0.5633  147.48   <2e-16 ***
## mrate         5.8611     0.5270   11.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.09 on 1532 degrees of freedom
## Multiple R-squared:  0.0747, Adjusted R-squared:  0.0741 
## F-statistic: 123.7 on 1 and 1532 DF,  p-value: < 2.2e-16
# Report the sample size used by the regression (after dropping any missing values)
n <- nobs(m1)

# Extract R-squared (fraction of sample variation in prate explained by mrate)
r2 <- s1$r.squared

n
## [1] 1534
r2
## [1] 0.0747031

3.2.5 (iii) Interpret the intercept and the slope

# Extract coefficients
b0 <- coef(m1)[1]  # intercept
b1 <- coef(m1)[2]  # slope on mrate

b0
## (Intercept) 
##    83.07546
b1
##    mrate 
## 5.861079

How to read these coefficients (economic meaning).

  • Intercept (\(\hat\beta_0\)): Predicted participation rate when mrate = 0.
    • This is the fitted prate for a plan with no employer match.
    • Note: Interpretation is most meaningful if mrate = 0 is in (or near) the support of the data.
  • Slope (\(\hat\beta_1\)): Predicted change in participation (in percentage points) for a one-unit increase in mrate.
    • Since mrate measures how many dollars the firm contributes per $1 the worker contributes, a one-unit change is economically large (e.g., from 0.5 to 1.5).
    • Practically, you may also interpret smaller changes: a 0.10 increase in mrate changes predicted participation by \(0.10 \times \hat\beta_1\).
# Example: predicted change in prate for a 0.10 increase in mrate
delta_mrate <- 0.10
pred_change_prate <- delta_mrate * b1
pred_change_prate
##     mrate 
## 0.5861079

3.2.6 (iv) Predicted participation when mrate = 3.5. Is it reasonable?

# Prediction at mrate = 3.5
pred_35 <- predict(m1, newdata = data.frame(mrate = 3.5))
pred_35
##        1 
## 103.5892

To assess whether this prediction is reasonable, check whether mrate = 3.5 lies within the observed range of the data.
Predictions far outside the support of mrate are extrapolations, which can be unreliable.

# Range of mrate in the sample
range_mrate <- range(df$mrate, na.rm = TRUE)
range_mrate
## [1] 0.01 4.91
# Also helpful: a few quantiles to understand typical values
quantile(df$mrate, probs = c(0, .05, .25, .5, .75, .95, 1), na.rm = TRUE)
##     0%     5%    25%    50%    75%    95%   100% 
## 0.0100 0.1100 0.3000 0.4600 0.8300 2.3635 4.9100

Discussion prompt (what is happening if it looks unreasonable):

  • If 3.5 is much larger than typical match rates in the data, then the fitted value is based on extending a linear trend beyond where you have information.
  • In addition, prate is a percentage and should generally lie between 0 and 100; a linear model can produce fitted values outside this range, especially under extrapolation.
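One way to quantify the extrapolation concern, plugging in the coefficient estimates reported above (83.0755 and 5.8611):

```r
# At what match rate does the fitted line predict 100% participation?
b0 <- 83.0755   # intercept estimate from the regression above
b1 <- 5.8611    # slope estimate from the regression above
mrate_at_100 <- (100 - b0) / b1
mrate_at_100    # ~ 2.89
```

So any mrate above roughly 2.89 produces a fitted participation rate above 100%, even though such values lie inside the sample range (max 4.91). The quantiles above show fewer than 5% of plans have mrate above about 2.36, so the line is fit mostly by much lower match rates.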

3.2.7 (v) How much of the variation in prate is explained by mrate?

The share of variation in prate explained by mrate in this simple model is the R-squared:

r2
## [1] 0.0747031

Interpretation.

  • \(R^2\) is the fraction of sample variation in participation that is accountedted for by variation in the match rate.
  • Whether it is “a lot” depends on context; in cross-sectional data, modest \(R^2\) values are common.

3.2.8 Optional: quick plot

# A simple scatter plot with the fitted regression line.
plot(df$mrate, df$prate,
     xlab = "Match rate (mrate)",
     ylab = "Participation rate (prate)",
     main = "401(k) Participation vs Match Rate")
abline(m1, lwd = 2)