Just to recall some basic stuff: Suppose $Y_i = \alpha + \beta x_i + \varepsilon_i$ for $i=1,\dots,n$ and $\varepsilon_i \sim N(0,\sigma^2)$ and these are independent. We will observe the $x$'s and $y$'s but not $\alpha$, $\beta$, or $\varepsilon_i$. The $Y$'s are random only because the $\varepsilon$'s are random. These assumptions are weaker than assuming that we have $(x,y)$ pairs that are jointly normally distributed.
Let
$$
X = \begin{bmatrix}
1 & x_1 \\ \vdots & \vdots \\ 1 & x_n
\end{bmatrix}
$$
be the "design matrix" (so called because if the experimenter can choose the $x$ values, then this is how the experiment is designed).
Then the least-squares estimates of $\alpha$ and $\beta$ are given by
$$
\begin{bmatrix} \hat\alpha \\ \hat\beta \end{bmatrix} = (X^T X)^{-1}X^T Y
$$
and therefore the probability distribution of the least-squares estimators is given by
$$
\begin{bmatrix} \hat\alpha \\ \hat\beta \end{bmatrix} \sim N\left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \sigma^2 (X^T X)^{-1} \right)
$$
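As a sanity check on these last two displays, here is a rough simulation sketch in Python/NumPy (the "true" parameter values and the $x$'s are made up, and the variable names are mine): it computes $(X^T X)^{-1}X^T Y$ directly and compares the empirical covariance of the estimates, over many simulated data sets, with $\sigma^2 (X^T X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 2.0, 0.5, 1.0            # "true" values, chosen arbitrarily
x = np.linspace(0.0, 10.0, 20)
X = np.column_stack([np.ones_like(x), x])     # design matrix

def fit(y):
    # least-squares estimates (alpha-hat, beta-hat) = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

# refit over many simulated data sets from the model
estimates = np.array([fit(alpha + beta * x + sigma * rng.standard_normal(x.size))
                      for _ in range(20000)])

print(estimates.mean(axis=0))                 # close to [alpha, beta]
print(np.cov(estimates.T))                    # empirical covariance of the estimates
print(sigma**2 * np.linalg.inv(X.T @ X))      # sigma^2 (X^T X)^{-1}
```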
(The variance in the display above is of course a $2\times 2$ positive-definite covariance matrix.) The predicted $y$-value for a given $x$ is $\hat y = \hat\alpha + \hat\beta x$, which therefore has a probability distribution given by
$$
\hat y = \hat\alpha + \hat\beta x = \begin{bmatrix} 1 & x \end{bmatrix} \begin{bmatrix} \hat\alpha \\ \hat\beta \end{bmatrix} \sim N\left( \begin{bmatrix} 1 & x \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \ \sigma^2 \begin{bmatrix} 1 & x \end{bmatrix} (X^T X)^{-1} \begin{bmatrix} 1 \\ x \end{bmatrix} \right)
$$
$$
= N\left( \alpha + \beta x, \ \frac{\sigma^2}{n}\frac{\sum_i (x_i - x)^2}{\sum_i (x_i - \overline{x})^2} \right).
$$
(OK, check my algebra here; it's trivial but laborious.)
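Since that algebra is being taken on faith, here is the laborious-but-trivial step written out (my own filling in, using the usual formula for the inverse of a $2\times 2$ matrix). We have
$$
X^T X = \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix},
\qquad
(X^T X)^{-1} = \frac{1}{n\sum_i (x_i - \overline{x})^2}
\begin{bmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{bmatrix},
$$
since $\det(X^T X) = n\sum_i x_i^2 - \left(\sum_i x_i\right)^2 = n\sum_i (x_i - \overline{x})^2$, and therefore
$$
\begin{bmatrix} 1 & x \end{bmatrix} (X^T X)^{-1} \begin{bmatrix} 1 \\ x \end{bmatrix}
= \frac{\sum_i x_i^2 - 2x\sum_i x_i + n x^2}{n\sum_i (x_i - \overline{x})^2}
= \frac{1}{n}\,\frac{\sum_i (x_i - x)^2}{\sum_i (x_i - \overline{x})^2}.
$$
Writing $\sum_i (x_i - x)^2 = \sum_i (x_i - \overline{x})^2 + n(x - \overline{x})^2$ puts the variance in the perhaps more familiar form $\sigma^2\left(\frac{1}{n} + \frac{(x - \overline{x})^2}{\sum_i (x_i - \overline{x})^2}\right)$, which also makes it obvious that the variance is smallest when $x = \overline{x}$.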
There's a simple geometric intuition behind the dependence of the variance on $x$ and in particular the fact that the variance is smallest when $x=\overline{x}$, so think about that too.
Now $\sigma^2$ must be estimated based on the data. The errors $y_i - (\alpha + \beta x_i)$ are unobservable, but the residuals $y_i - (\hat\alpha + \hat\beta x_i)$ (i.e. the estimated errors) are the components of the random vector
$$
\hat\varepsilon = (I - H)Y = (I - X(X^T X)^{-1} X^T)Y.
$$
("H" stands for "hat", for reasons that should be apparent.) It is easy to see that the $ntimes n$ hat matirx $H = X(X^T X)^{-1} X^T$ is the matrix of the orthogonal projection of rank $2$ onto the 2-dimensional column space of $X$. And $I - H$ is the rank-$(n-2)$ projection onto the orthogonal complement of that space. Diagonalized, this matrix just has $n-2$ instances of 1 on the diagonal and 0 in the other two positions. Therefore
$$
\frac{(n-2)\hat\sigma^2}{\sigma^2} = \frac{\| \hat\varepsilon \|^2}{\sigma^2}
$$
(where $\hat\sigma^2 = \|\hat\varepsilon\|^2/(n-2)$ is the usual estimate of $\sigma^2$) is distributed like a sum of squares of $n - 2$ independent $N(0,1)$ random variables. It therefore has a chi-square distribution with $n-2$ degrees of freedom.
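And a rough Monte Carlo check of the chi-square claim (again with invented parameter values): the mean and variance of $\|\hat\varepsilon\|^2/\sigma^2$ over many simulated data sets should come out near $n-2$ and $2(n-2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, sigma = 2.0, 0.5, 1.5
x = np.linspace(0.0, 10.0, 15)                     # n = 15
X = np.column_stack([np.ones_like(x), x])
n = len(x)
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # I - H

def scaled_rss():
    y = alpha + beta * x + sigma * rng.standard_normal(n)
    e = M @ y                                      # residual vector
    return e @ e / sigma**2

samples = np.array([scaled_rss() for _ in range(20000)])
print(samples.mean(), samples.var())               # roughly 13 and 26, i.e. n-2 and 2(n-2)
```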
Finally, we need this: $\hat\varepsilon$ and $\begin{bmatrix}\hat\alpha \\ \hat\beta \end{bmatrix}$ are probabilistically independent. This is true because both are linear transformations of the same jointly normal random vector $Y$, so they are jointly normal, and their covariance vanishes:
$$
\operatorname{cov}\left(\hat\varepsilon, \begin{bmatrix}\hat\alpha \\ \hat\beta \end{bmatrix} \right) = \operatorname{cov}\left( (I - H)Y ,\ (X^T X)^{-1}X^T Y \right)
$$
$$
= (I - H) \operatorname{cov}(Y,Y) X(X^T X)^{-1} = \sigma^2 (I - H) X(X^T X)^{-1}
$$
and this is the $n\times 2$ zero matrix, since $HX = X$ by definition of $H$.
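That key identity, $(I - H)X = 0$, is also easy to confirm numerically (same sort of made-up $x$ values as above):

```python
import numpy as np

x = np.linspace(0.0, 10.0, 12)
X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T

# (I - H) X is the n-by-2 zero matrix, since H X = X
print(np.allclose((np.eye(len(x)) - H) @ X, 0.0))   # True
```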
Now all our lemmas are in place and we can draw some conclusions:
Firstly
$$
\frac{\hat y - (\alpha + \beta x)}{\sqrt{ \frac{\sigma^2}{n}\frac{\sum_i (x_i - x)^2}{\sum_i (x_i - \overline{x})^2} }} \sim N(0,1).
$$
Hence if $sigma$ were miraculously known, we could say that
$$
\hat y \pm A \sqrt{ \frac{\sigma^2}{n}\frac{\sum_i (x_i - x)^2}{\sum_i (x_i - \overline{x})^2} }
$$
are the endpoints of a 90% confidence interval for $\alpha + \beta x$, provided $\pm A$ are chosen so that 90% of the area under the standard normal bell curve lies between $-A$ and $A$.
But $\sigma$ is not known. Since $\hat\sigma^2$ is independent of the random variable in the numerator, and $(n-2)\hat\sigma^2/\sigma^2$ has a chi-square distribution with $n-2$ degrees of freedom, we can put $\hat\sigma$ in place of $\sigma$ and, instead of the normal distribution, use Student's t-distribution with $n-2$ degrees of freedom (a standard normal divided by the square root of an independent $\chi^2_{n-2}/(n-2)$ is, by definition, $t_{n-2}$).
That's the conventional frequentist confidence interval.
For the prediction interval, just remember that the new value of $Y$ is independent of those we used above, so the variance of the difference between it and the predicted value is $\sigma^2$ plus the variance of the predicted value found above.
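To make this concrete, here is a sketch that puts the pieces together (Python with NumPy and SciPy; the simulated data, the 90% level, and the point $x_0$ are arbitrary choices of mine, not anything coming from the derivation itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, beta, sigma = 2.0, 0.5, 1.0            # "true" values, used only to simulate data
x = np.linspace(0.0, 10.0, 25)
n = len(x)
y = alpha + beta * x + sigma * rng.standard_normal(n)

X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)
a_hat, b_hat = XtX_inv @ X.T @ y              # least-squares estimates

resid = y - (a_hat + b_hat * x)
sigma2_hat = resid @ resid / (n - 2)          # hat-sigma^2 = ||hat-epsilon||^2 / (n - 2)

x0 = 4.0                                      # the x at which we want the intervals
y0_hat = a_hat + b_hat * x0
v = np.array([1.0, x0])
var_mean = sigma2_hat * (v @ XtX_inv @ v)     # estimated variance of alpha-hat + beta-hat * x0

A = stats.t(df=n - 2).ppf(0.95)               # two-sided 90% => 5% in each tail

ci = (y0_hat - A * np.sqrt(var_mean),                      # 90% confidence interval
      y0_hat + A * np.sqrt(var_mean))                      # for alpha + beta * x0
pi = (y0_hat - A * np.sqrt(var_mean + sigma2_hat),         # 90% prediction interval
      y0_hat + A * np.sqrt(var_mean + sigma2_hat))         # for a new Y at x0
print(ci)
print(pi)
```

The only difference between the two intervals is the extra $\hat\sigma^2$ under the square root, exactly as described above.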