Module 10 · Quantitative Methods

Simple Linear Regression

EN: OLS regression of one dependent variable on one independent variable — coefficients, R², F-test, hypothesis tests, and confidence intervals.
VN: Hồi quy tuyến tính đơn — hệ số, R², kiểm định F, t, khoảng tin cậy.

In this module
  1. Regression Equation
  2. OLS Slope & Intercept
  3. Sum-of-Squares Decomposition
  4. Coefficient of Determination (R²)
  5. Standard Error of the Estimate
  6. F-test (Overall Significance)
  7. Hypothesis Test on Slope
  8. Confidence Interval for Slope
  9. Prediction Interval
  10. Regression Assumptions

1. Simple Linear Regression Equation Core

About: Models Y = b0 + b1·X + ε. Estimates how much Y changes when X changes by one unit. Workhorse of factor models, beta estimation, and forecasting.
Tóm tắt: Mô hình Y = b0 + b1·X + ε. Ước lượng Y thay đổi bao nhiêu khi X tăng 1 đơn vị. Công cụ cốt lõi cho factor model, beta, dự báo.
\[ Y_i = b_0 + b_1\,X_i + \varepsilon_i \] \[ \hat{Y}_i = \hat{b}_0 + \hat{b}_1\,X_i \]

Components / Thành phần

  • \(Y_i\) Dependent variable / biến phụ thuộc.
  • \(X_i\) Independent variable / biến độc lập.
  • \(b_0,\,b_1\) Population intercept and slope.
  • \(\hat{b}_0,\,\hat{b}_1\) OLS sample estimates.
  • \(\varepsilon_i\) Error term (residual = \(Y_i - \hat{Y}_i\)).
Practice problem

Fitted equation: \(\hat{Y} = 2 + 0.8 X\). Predict Y when X = 10 and interpret the slope.

Show solution
\(\hat{Y} = 2 + 0.8(10) = 10\)
Slope 0.8: a 1-unit increase in X is associated with a 0.8-unit increase in Y.
Predicted Y = 10; slope = 0.8
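The practice computation above can be sketched in a few lines of Python (the helper name `predict` is just for illustration):

```python
# Evaluate a fitted simple regression line Y_hat = b0 + b1 * X.
def predict(b0, b1, x):
    """Return the fitted value b0 + b1 * x."""
    return b0 + b1 * x

# Practice-problem numbers: Y_hat = 2 + 0.8 * X at X = 10.
y_hat = predict(2.0, 0.8, 10.0)
print(y_hat)  # 10.0
```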

2. OLS Slope & Intercept Estimators Core

About: OLS picks the line that minimizes the sum of squared residuals. The fitted line always passes through the means \((\bar{X}, \bar{Y})\).
Tóm tắt: OLS chọn đường tối thiểu tổng bình phương phần dư. Đường khớp luôn đi qua (\(\bar{X}, \bar{Y}\)).
\[ \hat{b}_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^{2}} \] \[ \hat{b}_0 = \bar{Y} - \hat{b}_1\,\bar{X} \]

OLS minimizes the sum of squared residuals \(\sum \hat{\varepsilon}_i^{2}\). The fitted line passes through \((\bar{X}, \bar{Y})\).

Practice problem

Cov(X,Y) = 12, Var(X) = 8, \(\bar{X}\) = 5, \(\bar{Y}\) = 20. Compute OLS slope and intercept.

Show solution
\(\hat{b}_1 = 12/8 = 1.5\)
\(\hat{b}_0 = 20 - 1.5(5) = 12.5\)
Slope = 1.5; Intercept = 12.5
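A minimal Python sketch of the OLS formulas above, both from raw data and from the practice problem's summary statistics (the helper name `ols_fit` is illustrative):

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) from the OLS formulas above."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # proportional to Cov(X, Y)
    var_x = sum((x - mx) ** 2 for x in xs)                     # proportional to Var(X)
    b1 = cov_xy / var_x          # the common scaling factor cancels in the ratio
    b0 = my - b1 * mx            # fitted line passes through (Xbar, Ybar)
    return b0, b1

# Practice-problem route, straight from summary statistics:
b1 = 12 / 8                      # Cov(X, Y) / Var(X) = 1.5
b0 = 20 - b1 * 5                 # Ybar - b1 * Xbar = 12.5
```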

3. Sum-of-Squares Decomposition Core

About: SST = total variation in Y, decomposed into SSR (explained by X) + SSE (unexplained / residual). Key identity for R² and ANOVA.
Tóm tắt: SST = tổng biến thiên Y, phân rã thành SSR (giải thích) + SSE (phần dư). Đẳng thức cốt lõi cho R² và ANOVA.
\[ \underbrace{\sum (Y_i - \bar{Y})^{2}}_{SST} = \underbrace{\sum (\hat{Y}_i - \bar{Y})^{2}}_{SSR} + \underbrace{\sum (Y_i - \hat{Y}_i)^{2}}_{SSE} \]

Components / Thành phần

  • SST Total sum of squares (total variation in Y).
  • SSR Regression sum of squares (variation explained by the model).
  • SSE Sum of squared errors / residual SS (unexplained).
Practice problem

SST = 200, SSE = 60. Compute SSR and R².

Show solution
SSR = SST − SSE = 140
R² = SSR/SST = 140/200
SSR = 140; R² = 0.70
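The identity SST = SSR + SSE holds exactly for an OLS fit with an intercept, and is easy to verify numerically. A small sketch (the data points are made up for illustration):

```python
# Verify SST = SSR + SSE on an OLS fit with intercept (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx
fits = [b0 + b1 * x for x in xs]

sst = sum((y - my) ** 2 for y in ys)                 # total variation in Y
ssr = sum((f - my) ** 2 for f in fits)               # explained by the model
sse = sum((y - f) ** 2 for y, f in zip(ys, fits))    # residual / unexplained
assert abs(sst - (ssr + sse)) < 1e-9                 # decomposition identity
```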

4. Coefficient of Determination (R²) Core

About: R² = % of Y's variance explained by the regression. In simple regression R² = r² (correlation squared). 0 ≤ R² ≤ 1; higher = tighter fit.
Tóm tắt: R² = % phương sai Y được giải thích bởi hồi quy. Trong simple regression R² = r². Cao hơn = khớp tốt hơn.
\[ R^{2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]

In simple regression, \(R^{2} = r^{2}\) (square of the Pearson correlation between X and Y).

Practice problem

Pearson correlation r = 0.6 in a simple regression. What is R²?

Show solution
In simple regression R² = r²
= 0.36
R² = 36%
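Both routes to R² can be sketched as one-liners (helper names are illustrative; the practice-problem inputs r = 0.6 and SSR/SST = 140/200 come from the examples above):

```python
def r2_from_sums(ssr, sst):
    """R² from the sum-of-squares decomposition."""
    return ssr / sst

def r2_from_corr(r):
    """R² as the squared Pearson correlation (simple regression only)."""
    return r * r
```

For the practice problems: `r2_from_corr(0.6)` gives 0.36 and `r2_from_sums(140, 200)` gives 0.70, matching the two worked answers.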

5. Standard Error of the Estimate (SEE) Core

About: SEE = typical size of a residual (in units of Y). Smaller SEE → tighter fit. Used to build prediction intervals.
Tóm tắt: SEE = độ lớn điển hình của phần dư (cùng đơn vị Y). SEE nhỏ = khớp tốt. Dùng dựng prediction interval.
\[ SEE = \sqrt{\frac{SSE}{n - 2}} = \sqrt{MSE} \]

Smaller SEE → tighter fit. Units of SEE = units of Y.

Practice problem

SSE = 80, n = 22. Compute SEE.

Show solution
\(SEE = \sqrt{80/(22-2)} = \sqrt{4}\)
SEE = 2.0
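The SEE formula, sketched with the practice-problem figures (helper name `see` is illustrative):

```python
import math

def see(sse, n):
    """Standard error of the estimate: sqrt(SSE / (n - 2)) = sqrt(MSE)."""
    return math.sqrt(sse / (n - 2))

print(see(80, 22))  # sqrt(80 / 20) = sqrt(4) = 2.0
```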

6. F-test for Overall Significance Core

About: Tests joint significance of all slopes (b1 = 0 vs ≠ 0). In simple regression F = t² of the slope test — same conclusion. Becomes essential in multiple regression.
Tóm tắt: Kiểm định ý nghĩa chung của tất cả slope. Trong simple regression F = t². Quan trọng hơn ở multiple regression.
\[ F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n - k - 1)} \]

Components / Thành phần

  • k Number of independent variables (= 1 in simple regression).
  • df Numerator = k, denominator = n − k − 1 = n − 2.

Tests H₀: b₁ = 0 vs Hₐ: b₁ ≠ 0. In simple regression, F = t² of the slope test.

Practice problem

SSR = 120, SSE = 60, n = 32. Compute F.

Show solution
MSR = 120/1 = 120
MSE = 60/(32−2) = 2
F = MSR/MSE = 120/2
F = 60 (df = 1, 30)
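The F computation, sketched in Python with the practice-problem figures (helper name `f_stat` is illustrative; `k` defaults to 1 for simple regression):

```python
def f_stat(ssr, sse, n, k=1):
    """F = MSR / MSE with df = (k, n - k - 1)."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse

print(f_stat(120, 60, 32))  # MSR = 120, MSE = 60/30 = 2 -> F = 60.0
```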

7. Hypothesis Test on Slope Core

About: t-test on whether b1 differs from a hypothesized value (usually 0). Most common: 'is there a relationship?' — reject H0: b1 = 0 if t-stat exceeds critical value.
Tóm tắt: T-test b1 có khác giá trị giả thuyết không (thường là 0). Câu hỏi phổ biến: có quan hệ không? Bác bỏ H0 nếu |t| lớn.
\[ t = \frac{\hat{b}_1 - b_{1,\,\text{H}_0}}{s_{\hat{b}_1}}, \quad df = n - 2 \]
Practice problem

An OLS regression returns \(\hat{b}_1\) = 0.85 with standard error 0.20, n = 32. Test H₀: b₁ = 0 at α = 5% (two-tailed).

Show solution
t = 0.85/0.20 = 4.25, df = 30
Critical t (α/2 = 0.025, df = 30) ≈ ±2.042
|4.25| > 2.042 → Reject H₀. Slope is highly significant.
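The slope t-test, sketched with the practice-problem figures (helper name `slope_t` is illustrative; the critical value 2.042 is taken as given, as in the solution):

```python
def slope_t(b1_hat, se_b1, b1_null=0.0):
    """t-statistic for H0: b1 = b1_null, with df = n - 2."""
    return (b1_hat - b1_null) / se_b1

t_stat = slope_t(0.85, 0.20)      # ~ 4.25
reject = abs(t_stat) > 2.042      # two-tailed critical t at alpha = 5%, df = 30
```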

8. Confidence Interval for Slope Core

About: Range of plausible values for the population slope b1. If 0 is inside the CI, you cannot reject 'no relationship' at that confidence level.
Tóm tắt: Khoảng giá trị hợp lý cho slope tổng thể. Nếu 0 nằm trong CI, không bác bỏ được 'không quan hệ'.
\[ \hat{b}_1 \pm t_{\alpha/2,\,n - 2} \cdot s_{\hat{b}_1} \]
Practice problem

\(\hat{b}_1 = 0.85\), SE(\(\hat{b}_1\)) = 0.20, n = 32, t(α/2, df=30) = 2.042. Build the 95% CI.

Show solution
CI = 0.85 ± 2.042(0.20)
95% CI ≈ [0.442, 1.258]
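The same CI, sketched in Python (helper name `slope_ci` is illustrative; inputs are the practice-problem figures):

```python
def slope_ci(b1_hat, se_b1, t_crit):
    """Confidence interval for the slope: b1_hat +/- t_crit * SE(b1_hat)."""
    half_width = t_crit * se_b1
    return b1_hat - half_width, b1_hat + half_width

lo, hi = slope_ci(0.85, 0.20, 2.042)   # ~ (0.442, 1.258)
```

Since 0 lies outside this interval, the conclusion matches the t-test in the previous section: reject H₀: b₁ = 0 at the 5% level.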

9. Prediction Interval for Y given X Core

About: Range for a NEW observation Y_f at a given X. Wider than the confidence interval for the conditional mean — includes both line uncertainty AND new error variance.
Tóm tắt: Khoảng cho một quan sát Y_f mới tại X cho trước. Rộng hơn CI vì thêm bất định của error mới.
\[ \hat{Y}_f \pm t_{\alpha/2,\,n - 2} \cdot s_f \]

Prediction SE \(s_f\) is wider than SEE because it includes uncertainty in both the regression line and the new error.

Practice problem

Predicted Y = 50, prediction SE = 4, t(α/2, df=28) = 2.048. Build the 95% prediction interval.

Show solution
PI = 50 ± 2.048(4)
95% PI ≈ [41.81, 58.19]
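The prediction interval has the same shape as the slope CI, just centered on the forecast (helper name `prediction_interval` is illustrative; inputs are the practice-problem figures):

```python
def prediction_interval(y_hat_f, s_f, t_crit):
    """PI for a new observation: y_hat_f +/- t_crit * s_f."""
    half_width = t_crit * s_f
    return y_hat_f - half_width, y_hat_f + half_width

lo, hi = prediction_interval(50.0, 4.0, 2.048)   # ~ (41.81, 58.19)
```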

10. OLS Regression Assumptions Concept

About: Six classical assumptions (linearity, independence, homoskedasticity, normality, X non-stochastic, no perfect collinearity). Violations bias coefficients or standard errors — diagnostic plots and tests catch them.
Tóm tắt: 6 giả định OLS cổ điển. Vi phạm làm lệch hệ số hoặc SE — biểu đồ chẩn đoán và test phát hiện.

Six classical assumptions

  • 1. Linearity — true relationship is linear in parameters.
  • 2. Independence — errors are independent (no autocorrelation).
  • 3. Homoskedasticity — errors have constant variance.
  • 4. Normality — errors are normally distributed.
  • 5. X is non-stochastic (or at least uncorrelated with the error).
  • 6. No perfect collinearity (trivial in simple regression).
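Some of these assumptions can be eyeballed from the residuals. A minimal, illustrative sketch with made-up data (crude checks only, not formal tests such as Breusch–Pagan or Durbin–Watson):

```python
# Illustrative residual checks on an OLS fit (made-up data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.9, 4.2, 5.8, 8.1, 9.9, 12.2]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# With an intercept, OLS residuals sum to zero by construction.
mean_resid = sum(resid) / n

# Crude homoskedasticity hint: compare residual variance in the low-X
# half vs the high-X half; a ratio far from 1 suggests heteroskedasticity.
half = n // 2
var_low = sum(e ** 2 for e in resid[:half]) / half
var_high = sum(e ** 2 for e in resid[half:]) / (n - half)
ratio = var_low / var_high
```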