Measures of Fit and Model Variation
The uncertainty associated with predictions derives from two parts of the
mathematical model. Two facts matter: the parts behave differently as the model
grows (one decreases while the other increases), and they are stochastically
independent. We shall look more closely at the two parts.
Parts of the prediction variance. We now examine the parts that enter the
prediction variance. Assuming a standard linear regression model, $y \sim N(X\beta, \sigma^2 I)$,
the variance of the estimated response, $y(x_0) = b_1^T x_0$,
associated with a new sample $x_0$ is given by
$$\mathrm{Var}(y(x_0))_A \approx \frac{\left[\,y^T y - y^T X_1 (X_1^T X_1)^+ X_1^T y\,\right]\left[\,1 + x_0^T (X_1^T X_1)^+ x_0\,\right]}{N - A}$$
Here we assume that $A$ components have been selected, and $X_1$
is the part of $X$ that has been selected. When the model is expanded by
adding new components, one can show that
i) the error of fit, $[\,y^T y - y^T X_1 (X_1^T X_1)^+ X_1^T y\,]$, always decreases;
ii) the model variation, $(1 + x_0^T (X_1^T X_1)^+ x_0)$, always increases.
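The two factors of the prediction variance can be evaluated numerically. The following is a minimal sketch on simulated data, assuming (for illustration only) that "the part of X that has been selected" is the first A columns of X; the helper name `variance_parts` and all data are hypothetical.

```python
import numpy as np

def variance_parts(X1, y, x0):
    """Return (error of fit, model variation) for the selected part X1."""
    G_pinv = np.linalg.pinv(X1.T @ X1)                  # (X1^T X1)^+
    fit_error = y @ y - y @ X1 @ G_pinv @ X1.T @ y      # y^T y - y^T X1 (X1^T X1)^+ X1^T y
    model_variation = 1.0 + x0 @ G_pinv @ x0            # 1 + x0^T (X1^T X1)^+ x0
    return fit_error, model_variation

# Simulated data (an assumption for illustration, not from the text).
rng = np.random.default_rng(0)
N, K = 30, 5
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=N)
x0 = rng.normal(size=K)

results = []
for A in range(1, K + 1):
    # Use the first A columns as the selected part, and the matching part of x0.
    fit_error, model_variation = variance_parts(X[:, :A], y, x0[:A])
    pred_var = fit_error * model_variation / (N - A)
    results.append((A, fit_error, model_variation, pred_var))

for A, fe, mv, pv in results:
    print(f"A={A}  error of fit={fe:.4f}  model variation={mv:.4f}  Var={pv:.5f}")
```

In this sketch the error of fit never grows and the model variation never shrinks as A increases, while their product divided by N - A need not be monotone.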
(In theory these measures may remain unchanged when new components are added, but in
practice they always change.) It is useful to look at these measures
when using orthogonal components:
$$f(A) = y^T y - y^T X_1 (X_1^T X_1)^+ X_1^T y = y^T y - \{\, d_1 c_1^2 + \dots + d_A c_A^2 \,\}$$

$$g(A) = 1 + x_0^T (X_1^T X_1)^+ x_0 = 1 + \frac{(x_0^T r_1)^2}{d_1} + \dots + \frac{(x_0^T r_A)^2}{d_A}$$
Here $d_i = t_i^T t_i$ and $c_i = (y^T t_i)/(t_i^T t_i)$.
These equations show that, when the $(A+1)$th component is added, the
measure of fit is reduced by $d_{A+1} c_{A+1}^2$, while
the model variation is increased by $(x_0^T r_{A+1})^2/d_{A+1}$.
Note that $d_i$ is, up to a scaling factor and for centered data, the sample
variance of the score vector $t_i$ and, in the case of Principal Component
Regression (PCR), the corresponding eigenvalue of the covariance matrix.
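The decomposition of f(A) and g(A) can be checked numerically for the PCR case. The sketch below rests on assumptions that the text does not state: the vectors r_i are taken to be the unit eigenvectors of X^T X, so that t_i = X r_i, and X_1 is taken to be the rank-A part of X spanned by the first A components. The data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 40, 6
X = rng.normal(size=(N, K))
X -= X.mean(axis=0)                      # centre the columns
y = rng.normal(size=N)
y -= y.mean()
x0 = rng.normal(size=K)

# PCR components: right singular vectors of X (assumed to play the role of r_i).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
R = Vt.T                                 # columns r_i
T = X @ R                                # orthogonal score vectors t_i = X r_i
d = np.sum(T**2, axis=0)                 # d_i = t_i^T t_i
c = (T.T @ y) / d                        # c_i = (y^T t_i) / (t_i^T t_i)

def f(A):
    """Error of fit via the sum over components."""
    return y @ y - np.sum(d[:A] * c[:A]**2)

def g(A):
    """Model variation via the sum over components."""
    return 1.0 + np.sum((x0 @ R[:, :A])**2 / d[:A])

def f_g_direct(A):
    """Same quantities from the matrix expressions, with X1 the rank-A part of X."""
    X1 = T[:, :A] @ R[:, :A].T
    # An explicit cutoff keeps numerically zero eigenvalues out of the pseudoinverse.
    G_pinv = np.linalg.pinv(X1.T @ X1, rcond=1e-10)
    return y @ y - y @ X1 @ G_pinv @ X1.T @ y, 1.0 + x0 @ G_pinv @ x0

f_values = np.array([f(A) for A in range(K + 1)])
g_values = np.array([g(A) for A in range(K + 1)])
print("f(A):", np.round(f_values, 4))    # non-increasing in A
print("g(A):", np.round(g_values, 4))    # non-decreasing in A
```

Each step from A to A+1 changes f by -d_{A+1} c_{A+1}^2 and g by (x0^T r_{A+1})^2 / d_{A+1}, which is what `np.diff(f_values)` and `np.diff(g_values)` return.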
Optimizing a product of two functions. We are interested in optimizing the product $f(A)g(A)$. Suppose that the
functions depend on a continuous parameter $t$. Then the derivative of $f(t)g(t)$
is $(f(t)g(t))' = [\,f(t)g'(t)/g(t) + f'(t)\,]\,g(t)$.
If we consider the discrete analogue, we get
$$\Delta\bigl(f(A)g(A)\bigr) \approx \left[\frac{f(A)}{g(A)}\,(x_0^T r_{A+1})^2 - (y^T t_{A+1})^2\right]\frac{g(A)}{t_{A+1}^T t_{A+1}}$$
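The step from the continuous derivative to the discrete expression can be written out using the increments stated earlier for adding the (A+1)th component. This is one reading of the approximation sign, under the assumption that it stands for dropping the second-order term:

```latex
\begin{aligned}
\Delta f &= f(A+1) - f(A) = -\frac{(y^T t_{A+1})^2}{t_{A+1}^T t_{A+1}}, \qquad
\Delta g = g(A+1) - g(A) = \frac{(x_0^T r_{A+1})^2}{t_{A+1}^T t_{A+1}}, \\
\Delta(fg) &= f(A+1)\,g(A+1) - f(A)\,g(A)
            = f(A)\,\Delta g + g(A)\,\Delta f + \Delta f\,\Delta g \\
           &\approx \left[\frac{f(A)}{g(A)}\,\Delta g + \Delta f\right] g(A) \\
           &= \left[\frac{f(A)}{g(A)}\,(x_0^T r_{A+1})^2 - (y^T t_{A+1})^2\right]
              \frac{g(A)}{t_{A+1}^T t_{A+1}}.
\end{aligned}
```

The expression for the change in f uses d_{A+1} c_{A+1}^2 = (y^T t_{A+1})^2 / (t_{A+1}^T t_{A+1}), which follows directly from the definitions of d_i and c_i.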
When we select the $(A+1)$th component,
we want this change to be as negative as possible, so that the product decreases
as much as possible. One way to approach the task is to maximize the covariance, i.e., to find $w$ that
maximizes $(y^T t_{A+1})^2$,
for $t_{A+1} = X_A w$,
subject to $|w| = 1$.
Here $X_A$ is the reduced data matrix. Thus, maximizing
the covariance is one way to balance the decrease in the error of fit against the increase
in model variation, with the aim of reducing the prediction variance as much
as possible.
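The maximization above has a closed-form solution: by the Cauchy-Schwarz inequality, (y^T X_A w)^2 with |w| = 1 is largest for w proportional to X_A^T y. The sketch below checks this on simulated data; the matrix `XA` simply stands in for the reduced data matrix, and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 50, 5
XA = rng.normal(size=(N, K))             # stand-in for the reduced data matrix X_A
y = rng.normal(size=N)

# Closed-form maximiser of (y^T X_A w)^2 subject to |w| = 1.
w_star = XA.T @ y
w_star /= np.linalg.norm(w_star)
t_next = XA @ w_star                     # score vector t_{A+1} = X_A w
best = (y @ t_next) ** 2

# Compare against many random unit-length weight vectors.
W = rng.normal(size=(1000, K))
W /= np.linalg.norm(W, axis=1, keepdims=True)
random_values = (W @ (XA.T @ y)) ** 2

print("closed form :", best)
print("best random :", random_values.max())
```

The maximum attained equals the squared norm of X_A^T y, and no random unit vector exceeds it.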
Independence of the two parts.
It is a standard result from multivariate statistical theory that the error of
fit, $[\,y^T y - y^T X_1 (X_1^T X_1)^+ X_1^T y\,]$, is stochastically independent of the model variation,
$(1 + x_0^T (X_1^T X_1)^+ x_0)$, if we assume a multivariate normal distribution. The situation is
similar to the estimation of the mean and standard
deviation in the univariate normal distribution: we have no information on the
standard deviation if we only know the estimated mean value. Similarly, we have
no information on the model variation if we only know the value of the error of fit.
It is instructive to use the functions $f(A)$ and $g(A)$ above to illustrate the
situation more closely. We can write
$$f(A) = f(A-1) - \frac{(y^T t_A)^2}{t_A^T t_A}$$

$$g(A) = g(A-1) + \frac{(x_0^T r_A)^2}{d_A} = g(A-1) + (d_A c_A^2)\,\frac{(x_0^T r_A)^2}{(y^T t_A)^2}$$
The term $d_A c_A^2 = (y^T t_A)^2/(t_A^T t_A)$
may be large and give a significant improvement in the error of fit. But $(y^T t_A)^2$
itself may be large or small, so we cannot tell whether the improvement in
fit comes with a large or small model variation, that is, a large or small value of $(x_0^T r_A)^2/(y^T t_A)^2$.
When working with industrial data, we typically have many variables but
relatively small dimension. If we search the data for improvement in the
error of fit, we typically get results where the model variation is (too) large.
The reason is that we find many score vectors that are close to zero, $t_A^T t_A \approx 0$, and some of them will blow up the value of
$(y^T t_A)^2/(t_A^T t_A)$
even if $y^T t_A \approx 0$.
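The effect of near-zero score vectors can be illustrated with a small simulation. This is a sketch under assumptions that are not from the text: ten variables driven by two latent factors plus tiny noise stand in for "many variables but small dimension", principal components are used as the score vectors, and r_i is taken as the corresponding unit eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 25
z = rng.normal(size=(N, 2))                       # two latent factors
# Ten variables of effective dimension 2, plus very small noise.
X = z @ rng.normal(size=(2, 10)) + 1e-3 * rng.normal(size=(N, 10))
X -= X.mean(axis=0)
y = z[:, 0] + 0.1 * rng.normal(size=N)
y -= y.mean()
x0 = rng.normal(size=10)                          # hypothetical new sample

_, _, Vt = np.linalg.svd(X, full_matrices=False)
R = Vt.T                                          # assumed r_i
T = X @ R                                         # score vectors t_i
d = np.sum(T**2, axis=0)                          # d_i = t_i^T t_i

cov_sq = (T.T @ y) ** 2                           # (y^T t_i)^2
fit_gain = cov_sq / d                             # reduction in error of fit
var_cost = (x0 @ R) ** 2 / d                      # increase in model variation

for i in range(len(d)):
    print(f"comp {i + 1:2d}  d={d[i]:.2e}  (y't)^2={cov_sq[i]:.2e}  "
          f"fit gain={fit_gain[i]:.2e}  variation cost={var_cost[i]:.2e}")
```

In this setup the components after the second have score vectors close to zero. Their (y^T t)^2 is tiny, yet dividing by the equally tiny d_i keeps the apparent gain in fit from vanishing, while their contribution to the model variation dominates that of the first two components.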