Geometrically interpreting the OLS estimator as a projection operator

2025-04-02

The goal of this writeup: I dimly remember my econometrics professor describing the OLS estimator as a “projection” operator with geometric meaning. I also dimly remember taking a linear algebra class (i.e. vector spaces) in undergrad. This is my attempt to finally pin down the geometric interpretation of the OLS estimator…

First, I describe the OLS estimator as a “projection operator” that, given a fixed set of observations \(X\), maps each possible \(Y\) to a point on the subspace of “possible \(\hat\beta\)” values embedded in \(Y\)-space.

Of course, we also need to motivate our estimator of the variance \(\sigma^2\) of the per-case residual, and in particular its normalization constant (the famous \(1/(n - p)\) factor). I motivate our estimator of \(\sigma^2\) as a geometric representation of the uncertainty around the projection of \(Y\) onto \(\hat\beta\)-space. Finally, I compute the OLS estimates of the standard errors of each coefficient.

Caveat: I don’t have the chops to do a lot of the derivations myself. At the end, in an appendix, I attach the ChatGPT o1 prompts that helped me write this up.

The usual linear model setup

We must begin by fixing the size of our sample, \(n\). Only then can we set everything up, from linear model to estimators. So let’s say we sample \(n\) cases of \(p\) covariates each.

Of course, we will take the constant 1 as our intercept predictor (a column of 1s in \(X\)), so we ultimately have \(p + 1\) predictors/coefficients. The standard equation for our linear model in matrix form is \(Y = X \beta + \epsilon\), where \(Y\) is the \(n \times 1\) response vector, \(X\) is the \(n \times (p + 1)\) design matrix (first column all 1s), \(\beta\) is the \((p + 1) \times 1\) coefficient vector, and \(\epsilon\) is the \(n \times 1\) error vector.

Recall the standard distributional assumptions: the errors \(\epsilon_1, \dots, \epsilon_n\) are iid \(N(0, \sigma^2)\); equivalently, \(\epsilon \sim N(0, \sigma^2 I_n)\).

In this setup, the objective of the game is to estimate the parameters \(\beta\) (a vector) and \(\sigma^2\) (a scalar).

Deriving the geometric interpretation of the OLS estimator

Get ready for some vector spaces.

Recall that OLS gives us a closed-form formula for an optimal estimate of \(\beta\), namely \(\hat \beta = (X^T X)^{-1} X^T Y\). We take the factor \(P_X = (X^T X)^{-1} X^T\) as an operator that, intuitively, is

\[\begin{align} P_X : (\mathbb R^n \equiv \textrm{all possible $n$-sized samples of $Y$}) & \longrightarrow (\textrm{fixing $X$, the space of all possible OLS estimates for $\beta$ for all possible $Y$}) \\ \textrm{Our sample of $Y$} & \longmapsto \textrm{OLS estimate of $\beta$ associated with $X$ given $Y$.} \\ \end{align}\]

(I’m sure) this operator can be fully characterized by the “least squares” criterion – i.e. it minimizes the sum of squared differences between \(Y\) and \(X \hat\beta\).
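To make the operator concrete, here is a small numpy sketch (the toy data and seed are my own, not from the text): \(P_X\) sends a sample \(Y\) to \(\hat\beta\), and \(H = X P_X\) is the matrix that projects \(Y\) onto the column space of \(X\).

```python
import numpy as np

# Toy data (assumed, for illustration): n observations, p columns incl. intercept.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept column of 1s
Y = rng.normal(size=n)

P_X = np.linalg.inv(X.T @ X) @ X.T   # maps Y-space (R^n) -> coefficient space (R^p)
beta_hat = P_X @ Y                   # the OLS estimate for this sample of Y

H = X @ P_X                          # the "hat" matrix: projects Y onto col(X)
Y_hat = H @ Y                        # fitted values, i.e. X @ beta_hat

# H is an orthogonal projection: idempotent and symmetric.
assert np.allclose(H @ H, H)
assert np.allclose(H, H.T)
# beta_hat agrees with the library's least-squares solution.
assert np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
```

The two checks at the end are exactly the “least squares” characterization: the projection is orthogonal, and the coefficients it produces minimize the squared residuals.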

Of course, I don’t actually think about it in terms of that criterion. Here is the way I think about it in terms of geometric projections.

But wait, what about \(\sigma^2\)?

Of course, the OLS estimator only estimates the coefficient vector \(\beta\). We now need to estimate the parameter \(\sigma^2\), the variance of each iid normally distributed per-case error \(\epsilon_1, \dots, \epsilon_n\). Well, assume that we have already derived the OLS estimates of our coefficients. The associated variance estimator is given by the formula: \(\hat \sigma^2 = {1 \over n - (p + 1)} \textrm{(sum of squares of all observed residuals)}\) \(= {1 \over n - (p + 1)} ||Y - X \hat \beta||^{2}\). If you are reading this section out of context, note that \(p\) has been defined here as the number of non-intercept covariates; when \(p\) counts all coefficients, the usual normalizing factor is \({1 \over n - p}\).
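As a quick numerical sketch (simulated data, my own toy setup, not from the text): with \(p\) non-intercept covariates plus an intercept, the estimator divides the residual sum of squares by \(n - (p + 1)\), and the residuals themselves are orthogonal to the columns of \(X\).

```python
import numpy as np

# Hypothetical simulated data: p non-intercept covariates plus an intercept.
rng = np.random.default_rng(1)
n, p, sigma = 200, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p + 1)
beta = rng.normal(size=p + 1)
Y = X @ beta + rng.normal(scale=sigma, size=n)               # true sigma^2 = 4

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - (p + 1))           # normalized sum of squares
```

On this seeded draw `sigma2_hat` lands near the true value of 4; the next sections explain why the \(n - (p+1)\) normalization is exactly what makes it unbiased.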

Now here is more about residual vectors.

As a fun aside, the usual justification used to motivate introduction of the normalizing term \({1 \over n - (p + 1)}\) in \(\hat \sigma^2 = {1 \over n - (p + 1)} ||Y - X\hat \beta||^2\) is to argue that it makes the estimator \(\hat \sigma^2\) unbiased. I asked ChatGPT to show why not including the normalizing term makes the unbiasedness proof fail.

ChatGPT prompt: Give a proof of why the estimator associated with \(\sigma^2\) is unbiased. Perhaps do so in a way that justifies the “normalizing” term 1/(n - p) by attempting to show unbiasedness without that term, observe how the reasoning fails, and then note how introducing 1/(n - p) makes the unbiasedness proof go through. Does this lead to an intuitive interpretation of the role of 1/(n - p) in unbiasedness?

Below is a step-by-step explanation of why the usual OLS estimator for \(\sigma^2\),

\[ \hat{\sigma}^2 \;=\; \frac{1}{n-p}\,\|\,y - X\hat{\beta}\|^2, \]

is unbiased, and why the factor \(\tfrac{1}{n-p}\) (rather than \(\tfrac{1}{n}\)) is needed.

We’ll start by trying \(\tfrac{1}{n}\) in the denominator and seeing what goes wrong, then show how \(\tfrac{1}{n-p}\) fixes the issue.


1. The Usual Setup

We have the linear model:

\[ y \;=\; X\,\beta \;+\; \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,\;\sigma^2 I_n), \]

where \(y\) is the \(n \times 1\) response vector, \(X\) is the \(n \times p\) design matrix (assumed full column rank), \(\beta\) is the \(p \times 1\) coefficient vector, and \(\varepsilon\) is the \(n \times 1\) error vector.

The OLS estimator of \(\beta\) is

\[ \hat{\beta} \;=\; (X^\mathsf{T}X)^{-1}\,X^\mathsf{T}\,y. \]

Then the residual vector is

\[ \hat{\varepsilon} \;=\; y \;-\; X\hat{\beta}. \]

We want an unbiased estimator for \(\sigma^2\).


2. Trying (Naively) to Use \(\tfrac{1}{n}\,\|\hat{\varepsilon}\|^2\)

A “natural” first guess might be

\[ \hat{\sigma}^2_{\text{naive}} \;=\; \frac{1}{n}\,\|\hat{\varepsilon}\|^2 \;=\; \frac{1}{n}\,\sum_{i=1}^n \hat{\varepsilon}_i^2. \]

Let’s see if that is unbiased. We check

\[ \mathbb{E}\bigl[\hat{\sigma}^2_{\text{naive}}\bigr] \;=\; \frac{1}{n}\, \mathbb{E}\bigl[\|\hat{\varepsilon}\|^2\bigr]. \]

2.1 Compute \(\|\hat{\varepsilon}\|^2\) in Expectation

We know \(\hat{\varepsilon} = M\,\varepsilon\), where \(M = I_n - P\) and \(P = X(X^\mathsf{T}X)^{-1}X^\mathsf{T}\) is the hat matrix. Indeed,

\[ \hat{\varepsilon} \;=\; y - X\hat{\beta} \;=\; (X\beta + \varepsilon) - X\bigl((X^\mathsf{T}X)^{-1}X^\mathsf{T}y\bigr) \;=\; \varepsilon - P\,\varepsilon \;=\; (I - P)\,\varepsilon \;=\; M\,\varepsilon. \]

One can show \(M\) is idempotent and symmetric, i.e. \(M = M^2 = M^\mathsf{T}\). Then

\[ \|\hat{\varepsilon}\|^2 \;=\; \varepsilon^\mathsf{T}\,M^\mathsf{T}\,M\,\varepsilon \;=\; \varepsilon^\mathsf{T}\,M\,\varepsilon. \]

Taking expectation:

\[ \mathbb{E}\bigl[\|\hat{\varepsilon}\|^2\bigr] \;=\; \mathbb{E}\bigl[\varepsilon^\mathsf{T} M \varepsilon\bigr] \;=\; \mathbb{E}\bigl[\mathrm{trace}(\varepsilon^\mathsf{T} M \varepsilon)\bigr] \;=\; \mathbb{E}\bigl[\mathrm{trace}(M\,\varepsilon\,\varepsilon^\mathsf{T})\bigr]. \]

Since \(\mathrm{E}[\varepsilon \varepsilon^\mathsf{T}] = \sigma^2 I_n\), we get

\[ \mathbb{E}\bigl[\|\hat{\varepsilon}\|^2\bigr] \;=\; \mathrm{trace}\bigl(M\,\sigma^2 I_n\bigr) \;=\; \sigma^2\,\mathrm{trace}(M). \]

But \(\mathrm{trace}(I_n - P) = \mathrm{trace}(I_n) - \mathrm{trace}(P) = n - p\). Indeed, since \(P\) is symmetric and idempotent, \(\mathrm{trace}(P) = \mathrm{rank}(P)\), which equals \(p\) when \(X\) has full column rank. Hence,

\[ \mathbb{E}\bigl[\|\hat{\varepsilon}\|^2\bigr] \;=\; \sigma^2 \,(n - p). \]
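The three facts used above – \(M\) symmetric, \(M\) idempotent, and \(\mathrm{trace}(M) = n - p\) – are easy to check numerically on a toy design matrix (my own simulated \(X\), assumed full column rank):

```python
import numpy as np

# Toy design matrix, full column rank with probability 1.
rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.normal(size=(n, p))

P = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: projects onto col(X)
M = np.eye(n) - P                      # "residual-maker" matrix: projects onto the complement

assert np.allclose(M, M.T)             # symmetric
assert np.allclose(M @ M, M)           # idempotent
assert np.isclose(np.trace(P), p)      # trace(P) = rank(P) = p
assert np.isclose(np.trace(M), n - p)  # hence trace(M) = n - p
```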

2.2 Expectation of the Naive Estimator

\[ \mathbb{E}\bigl[\hat{\sigma}^2_{\text{naive}}\bigr] \;=\; \frac{1}{n}\,\mathbb{E}\bigl[\|\hat{\varepsilon}\|^2\bigr] \;=\; \frac{1}{n}\,\sigma^2\,(n - p) \;=\; \sigma^2\;\frac{n-p}{n}. \]

This is less than \(\sigma^2\). Therefore, \(\hat{\sigma}^2_{\text{naive}}\) underestimates \(\sigma^2\) on average.
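A small Monte Carlo sketch (my own toy simulation, not from the response) makes the shrinkage visible: averaged over many samples, the naive estimator concentrates near \(\sigma^2 (n-p)/n\) rather than \(\sigma^2\).

```python
import numpy as np

# Monte Carlo sketch with true sigma^2 = 1, so the naive estimator's
# expectation is (n - p) / n = 0.75 here, not 1.
rng = np.random.default_rng(3)
n, p, reps = 20, 5, 5000
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix, fixed across replications

naive = []
for _ in range(reps):
    eps = rng.normal(size=n)               # true sigma^2 = 1
    y = X @ rng.normal(size=p) + eps
    resid = y - H @ y                      # = (I - H) eps
    naive.append(resid @ resid / n)        # naive 1/n normalization

mean_naive = float(np.mean(naive))         # concentrates near 0.75, well below 1
```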


3. Fixing the Bias: Use \(\tfrac{1}{n-p}\,\|\hat{\varepsilon}\|^2\)

Now let’s define

\[ \hat{\sigma}^2 \;=\; \frac{1}{\,n-p\,}\;\|\hat{\varepsilon}\|^2. \]

Then

\[ \mathbb{E}\bigl[\hat{\sigma}^2\bigr] \;=\; \frac{1}{\,n-p\,}\,\mathbb{E}\bigl[\|\hat{\varepsilon}\|^2\bigr] \;=\; \frac{1}{\,n-p\,}\;\sigma^2\,(n - p) \;=\; \sigma^2. \]

So this estimator is unbiased.


3.1 Intuitive Interpretation of \(n - p\)

Geometric View

  1. The column space of \(X\) is \(p\)-dimensional.
  2. You “project” \(y\) into that space to get \(\hat{y} = X\hat{\beta}\).
  3. The orthogonal complement (where the residuals live) has dimension \(n - p\).
  4. Summing squared residuals over that \((n-p)\)-dimensional subspace, you then divide by \(n-p\) to estimate the variance.
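This geometric picture translates directly into code (toy data of my own): the residual vector lands in the orthogonal complement of \(\mathrm{col}(X)\), so it is orthogonal to every column of \(X\), and the squared lengths obey Pythagoras.

```python
import numpy as np

# Toy data for the decomposition y = y_hat + resid.
rng = np.random.default_rng(4)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat                    # lives in the p-dimensional col(X)
resid = y - y_hat                       # lives in the (n - p)-dimensional complement

assert np.allclose(X.T @ resid, 0.0, atol=1e-8)           # resid is orthogonal to col(X)
assert np.isclose(y @ y, y_hat @ y_hat + resid @ resid)   # ||y||^2 = ||y_hat||^2 + ||resid||^2
```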

4. Conclusion

  1. Without the factor \(\tfrac{1}{n-p}\), the estimator is biased.
  2. Including \(\tfrac{1}{n-p}\) makes \(\hat{\sigma}^2\) exactly unbiased (under the usual linear model assumptions).
  3. Interpretation: The model “uses up” \(p\) degrees of freedom to fit \(\beta\), so only \(n - p\) degrees of freedom remain to estimate the leftover noise variance. That is why we divide the residual sum of squares by \(n-p\) rather than by \(n\).

Skippable: some more random thoughts on projection and residual space.

Finally: but wait, where do the standard errors come from?

The standard errors for each coefficient in fact fall out of the estimate of a more general gadget, \(Cov(\hat \beta)\). The rough idea is that, since \(Cov(\hat \beta)\) collects the sampling variances and covariances of our estimates, its estimate \(\widehat{Cov(\hat \beta)}\) provides estimated sampling variances and covariances, and the square roots of the diagonal entries of \(\widehat{Cov(\hat \beta)}\) are estimated sampling standard deviations of our coefficient estimates – that is, standard errors.

But what is \(\widehat{Cov(\hat \beta)}\)? Roughly, we will compute \(Cov(\hat \beta)\) and “take its estimate”, like so.

Note that, for fixed \(X\), the OLS estimator \(\hat \beta\) is itself a random vector, since it is defined in terms of \(\epsilon\):

\[\begin{align} \hat \beta & = (X^T X)^{-1} X^T Y \\ & = (X^T X)^{-1} X^T (X\beta + \epsilon) \\ & = (X^T X)^{-1} X^T X\beta + (X^T X)^{-1} X^T \epsilon \\ & = \beta + (X^T X)^{-1} X^T \epsilon. \end{align}\]

Therefore, to take the covariance matrix of \(\hat\beta\), recall that we can define the “cross-covariance” matrix of two random vectors \(U \in \mathbb R^j, V \in \mathbb R^k\) as the matrix of pairwise element covariances:

\[Cov(U, V) = \begin{bmatrix} \mathrm{Cov}(U_1, V_1) & \mathrm{Cov}(U_1, V_2) & \cdots & \mathrm{Cov}(U_1, V_k) \\ \mathrm{Cov}(U_2, V_1) & \mathrm{Cov}(U_2, V_2) & \cdots & \mathrm{Cov}(U_2, V_k) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(U_j, V_1) & \mathrm{Cov}(U_j, V_2) & \cdots & \mathrm{Cov}(U_j, V_k) \end{bmatrix}\]

Then the covariance matrix of \(\hat\beta\) is:

\[\begin{align} Cov(\hat \beta) &= Cov(\beta + (X^T X)^{-1} X^T \epsilon) \\ &= Cov(\beta) + Cov((X^T X)^{-1} X^T \epsilon) + Cov(\beta, (X^T X)^{-1} X^T \epsilon) + Cov((X^T X)^{-1} X^T \epsilon, \beta) \\ &= Cov((X^T X)^{-1} X^T \epsilon) \quad \textrm{since $\beta$ is a nonrandom constant} \\ &= ((X^T X)^{-1} X^T)\,Cov(\epsilon)\,((X^T X)^{-1} X^T)^T. \end{align}\]

“Passing to the estimate”, we get \(\widehat{Cov(\hat \beta)} = ((X^T X)^{-1} X^T)\widehat{Cov(\epsilon)}((X^T X)^{-1} X^T)^T\), where \(\widehat{Cov(\epsilon)} = \hat \sigma^2 I\). This simplifies to \(\widehat{Cov(\hat \beta)} = \hat \sigma^2 (X^T X)^{-1}\). Again, to read off \(SE(\hat\beta_i)\), we simply take the square root of the diagonal entry \(\widehat{Cov(\hat \beta)}_{i,i} = \hat \sigma^2 \,[(X^T X)^{-1}]_{i,i}\).
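Putting the whole pipeline together in a short numpy sketch (simulated data; the variable names are my own): estimate \(\hat\beta\), form \(\hat\sigma^2\), build the estimated covariance matrix \(\hat\sigma^2 (X^TX)^{-1}\), and read the standard errors off its diagonal.

```python
import numpy as np

# Simulated data; p here counts all columns of X, including the intercept.
rng = np.random.default_rng(5)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)           # unbiased variance estimate

cov_hat = sigma2_hat * np.linalg.inv(X.T @ X)  # estimated Cov(beta_hat)
se = np.sqrt(np.diag(cov_hat))                 # one standard error per coefficient
```

These are exactly the standard errors a regression package would report next to each coefficient.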

Appendix: ChatGPT prompts

– Can we express the OLS estimator in a more intuitive way – perhaps geometrically as a projection matrix, for example?

– In the context of multivariate linear models, the OLS estimator is an estimator for the coefficients associated with predictors. However, another parameter we must estimate is the variance of the per-case residuals (following the homoskedasticity assumption.) What is the estimator of the residual variance associated with the OLS estimator for coefficients?

– Is there an intuitive interpretation of the definition, or terms, of this estimator?

– Can a plane embedded in R^3 be described in terms of a matrix?

– Can the ambient n-dimensional subspace be realized as a “product” of some form of the embedded p-dimension subspace of R^n given by the matrix X (dimension n x p), and its orthogonal complement?

– What do you mean by “design matrix” in this setup?

– How is the residual variance estimated in the above setup?

– In the standard OLS context, what is the relationship between one of the parameters to estimate, the variance of the error (i.e. residual term), and the covariance matrix of the OLS estimates?

– Why is the theoretical covariance of the vector \(\hat\beta\) given like so?

– Recall that, taking \(\hat\epsilon\) to be the error vector, \(\hat\epsilon = Y - X\hat\beta\). What is the covariance matrix for \(\hat\epsilon\)?

– More generally, given that \(J\) is an (a x 1) matrix, \(K\) is an (a x b) matrix, and \(\epsilon\) is a (b x 1) matrix, observe that \(J - K\epsilon\) is an (a x 1) random vector. What is the covariance matrix of the random vector \(J - K\epsilon\)?

– Give a proof of why the estimator associated with \(\sigma^2\) is unbiased. Perhaps do so in a way that justifies the “normalizing” term 1/(n - p) by attempting to show unbiasedness without that term, observe how the reasoning fails, and then note how introducing 1/(n - p) makes the unbiasedness proof go through. Does this lead to an intuitive interpretation of the role of 1/(n - p) in unbiasedness?

– Here is an Markdown file that gives an exposition of geometric intuition for the OLS estimator as a projection matrix, the estimator of the standard error of the residual term (and in particular why it is a normalized sum of squares of residuals), and an account for how the standard errors for particular coefficients in the OLS estimator arise. It summarizes a few of the conversation and responses listed above. Provide your evaluation of the factuality of mathematical details mentioned, and mathematical reasoning, taking into account the fact that most of the arguments provided are in fact heuristic and geometric and do not necessarily lead to fact-checking.

– Outline the document, and the goal of each section, in one prose paragraph.