Measures of fit and Model Variation

The uncertainty of a prediction comes from two parts of the mathematical model. Two properties of these parts matter: they behave differently as the model grows (one decreases while the other increases), and they are stochastically independent. We look more closely at both parts below.

Parts of the prediction variance. We now look more closely at the parts that enter the prediction variance. Assuming a standard linear regression model, $y \sim N(X\beta, \sigma^2 I)$, the variance of the estimated response, $y(x_0) = b_1^T x_0$, associated with a new sample $x_0$ is given by

$$\mathrm{Var}(y(x_0))_A \approx \frac{[\,y^T y - y^T X_1 (X_1^T X_1)^+ X_1^T y\,]\,[\,1 + x_0^T (X_1^T X_1)^+ x_0\,]}{N-A}$$

Here we assume that A components have been selected, that $X_1$ is the selected part of X, that N is the number of samples, and that the superscript + denotes a generalized inverse. When the model is expanded by adding new components, one can show that the measures

i)   the error of fit, $[\,y^T y - y^T X_1 (X_1^T X_1)^+ X_1^T y\,]$, always decreases

ii)  the model variation, $(1 + x_0^T (X_1^T X_1)^+ x_0)$, always increases

(In theory these measures may remain unchanged when new components are added, but in practice they always change.) It is useful to look at these measures when using orthogonal components,
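As an illustration of properties (i) and (ii), the following minimal sketch computes both measures, and the resulting prediction variance, for nested column subsets of a synthetic data matrix. The data, the random seed, and the helper name `fit_and_variation` are assumptions made for this example only; taking the first A columns of X is a simple stand-in for selecting A components.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not from the text).
rng = np.random.default_rng(0)
N, K = 30, 6
X = rng.normal(size=(N, K))
y = X @ rng.normal(size=K) + 0.5 * rng.normal(size=N)
x0 = rng.normal(size=K)          # new sample


def fit_and_variation(X1, y, x01):
    """Return (error of fit, model variation) for the selected part X1.

    Hypothetical helper: uses the Moore-Penrose inverse for (X1' X1)^+.
    """
    P = np.linalg.pinv(X1.T @ X1)
    f = y @ y - y @ X1 @ P @ X1.T @ y      # error of fit
    g = 1.0 + x01 @ P @ x01                # model variation
    return f, g


fs, gs, pred_var = [], [], []
for A in range(1, K + 1):
    # Select the first A columns of X and the matching entries of x0.
    f, g = fit_and_variation(X[:, :A], y, x0[:A])
    fs.append(f)
    gs.append(g)
    pred_var.append(f * g / (N - A))       # estimated prediction variance

for A, (f, g, v) in enumerate(zip(fs, gs, pred_var), start=1):
    print(f"A={A}: fit={f:9.3f}  variation={g:6.3f}  pred.var={v:7.3f}")
```

In a run of this sketch the fit column should shrink and the variation column should grow as A increases; the prediction variance is the product of the two divided by N - A, so it need not be monotone.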

$$f(A) = y^T y - y^T X_1 (X_1^T X_1)^+ X_1^T y = y^T y - \{ d_1 c_1^2 + \dots + d_A c_A^2 \}$$

$$g(A) = 1 + x_0^T (X_1^T X_1)^+ x_0 = 1 + (x_0^T r_1)^2/d_1 + \dots + (x_0^T r_A)^2/d_A.$$

Here $d_i = t_i^T t_i$ and $c_i = (y^T t_i)/(t_i^T t_i)$. These equations show that, when the (A+1)th component is added, the error of fit is reduced by $d_{A+1} c_{A+1}^2$, while the model variation is increased by $(x_0^T r_{A+1})^2/d_{A+1}$. Note that $d_i$ is, up to scaling by the number of samples, the sample variance of the (centered) score vector $t_i$ and, in the case of Principal Component Regression (PCR), the corresponding eigenvalue of the covariance matrix.
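The orthogonal-component formulas can be sketched numerically for the PCR case. The example below assumes that the score vectors are $t_i = X r_i$ with $r_i$ the right singular vectors of a centered X; this reading of $r_i$, the synthetic data, and all variable names are assumptions for illustration rather than definitions taken from the text.

```python
import numpy as np

# Synthetic, centered data (an assumption for illustration).
rng = np.random.default_rng(0)
N, K = 40, 5
X = rng.normal(size=(N, K))
X -= X.mean(axis=0)
y = X @ rng.normal(size=K) + 0.3 * rng.normal(size=N)
y -= y.mean()
x0 = rng.normal(size=K)

# PCR components: r_i are right singular vectors, t_i = X r_i are the scores.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
R = Vt.T
T = X @ R

d = np.sum(T**2, axis=0)        # d_i = t_i' t_i (equals s_i**2 here)
c = (y @ T) / d                 # c_i = (y' t_i) / (t_i' t_i)

# f(A) and g(A) for A = 1..K as cumulative sums over the components.
f = y @ y - np.cumsum(d * c**2)
g = 1.0 + np.cumsum((x0 @ R)**2 / d)

for A in range(K):
    print(f"A={A + 1}: d={d[A]:8.3f}  f(A)={f[A]:8.3f}  g(A)={g[A]:6.3f}")
```

Because the scores are orthogonal, each component contributes one independent term to f and one to g, which is why cumulative sums are enough; with all K components the values agree with the full least-squares expressions.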

Optimizing a product of two functions. We are interested in optimizing the product f(A)g(A). Suppose first that the functions depend on a continuous parameter t. Then the derivative of f(t)g(t) is $(f(t)g(t))' = [f(t)g'(t)/g(t) + f'(t)]\,g(t)$. The discrete analogue is

$$\Delta(f(A)g(A)) \approx \left[\frac{f(A)}{g(A)}\,(x_0^T r_{A+1})^2 - (y^T t_{A+1})^2\right]\frac{g(A)}{t_{A+1}^T t_{A+1}}$$
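The discrete analogue can be obtained in a few lines from the per-component changes stated earlier. The sketch below writes $\Delta f$ and $\Delta g$ for the changes caused by the (A+1)th component (notation introduced here for illustration) and drops the second-order term $\Delta f\,\Delta g$, which is one natural reading of the approximation sign.

```latex
\begin{align*}
\Delta f &= f(A+1) - f(A) = -\frac{(y^T t_{A+1})^2}{t_{A+1}^T t_{A+1}}, \qquad
\Delta g = g(A+1) - g(A) = \frac{(x_0^T r_{A+1})^2}{t_{A+1}^T t_{A+1}}, \\
\Delta(fg) &= f(A+1)\,g(A+1) - f(A)\,g(A)
            = f(A)\,\Delta g + g(A)\,\Delta f + \Delta f\,\Delta g \\
           &\approx f(A)\,\Delta g + g(A)\,\Delta f
            = \left[\frac{f(A)}{g(A)}\,(x_0^T r_{A+1})^2 - (y^T t_{A+1})^2\right]
              \frac{g(A)}{t_{A+1}^T t_{A+1}}.
\end{align*}
```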

When we select the (A+1)th component, we want this change to be as negative as possible, so that the product decreases as much as possible. One way to approach the task is to maximize the covariance, i.e., to find w,

$$\text{maximize } (y^T t_{A+1})^2, \quad \text{for } t_{A+1} = X_A w, \quad \text{subject to } |w| = 1.$$

Here $X_A$ is the reduced data matrix. Thus, maximizing the covariance is one way to balance the decrease in the error of fit against the increase in model variation, with the aim of reducing the prediction variance as much as possible.
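The maximization problem has a closed-form solution: since $y^T X_A w = (X_A^T y)^T w$, the Cauchy-Schwarz inequality gives $w = X_A^T y / |X_A^T y|$, which is the weight vector familiar from PLS regression. The following minimal sketch checks this against random unit vectors; the data, the name `XA`, and the helper `sq_cov` are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for the reduced data matrix X_A (an assumption).
rng = np.random.default_rng(0)
N, K = 30, 4
XA = rng.normal(size=(N, K))
y = rng.normal(size=N)


def sq_cov(w):
    """Objective (y' t)^2 with t = XA w (hypothetical helper)."""
    t = XA @ w
    return (y @ t) ** 2


# Closed-form maximizer under |w| = 1, from Cauchy-Schwarz.
w_star = XA.T @ y
w_star /= np.linalg.norm(w_star)

# Compare against many random unit-length weight vectors.
W = rng.normal(size=(1000, K))
W /= np.linalg.norm(W, axis=1, keepdims=True)
best_random = max(sq_cov(w) for w in W)

print("closed form :", sq_cov(w_star))
print("best random :", best_random)
```

The closed-form value equals $|X_A^T y|^2$, so no random direction can exceed it.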

Independence of the two parts. It is a standard result from multivariate statistical theory that the error of fit, $[\,y^T y - y^T X_1 (X_1^T X_1)^+ X_1^T y\,]$, is stochastically independent of the model variation, $(1 + x_0^T (X_1^T X_1)^+ x_0)$, if we assume a multivariate normal distribution. The situation is similar to the estimation of the mean and the standard deviation in the univariate normal distribution: knowing only the estimated mean gives no information on the standard deviation. Similarly, knowing the value of the error of fit gives no information on the model variation. It is instructive to illustrate this with the functions f(A) and g(A) above. We can write

$$f(A) = f(A-1) - \frac{(y^T t_A)^2}{t_A^T t_A} = f(A-1) - d_A c_A^2$$

$$g(A) = g(A-1) + \frac{(x_0^T r_A)^2}{d_A} = g(A-1) + (d_A c_A^2)\,\frac{(x_0^T r_A)^2}{(y^T t_A)^2}.$$

The term $d_A c_A^2 = (y^T t_A)^2/(t_A^T t_A)$ may be large and give a significant improvement in the error of fit. But $(y^T t_A)^2$ itself may be large or small, so we cannot tell whether the improvement in fit comes with a large or a small increase in model variation, that is, a large or small value of $(x_0^T r_A)^2/(y^T t_A)^2$. When working with industrial data, we typically have many variables but a relatively low dimension. If we search the data for improvements in the error of fit, we typically get results where the model variation is (too) large. The reason is that we find many score vectors that are close to zero, $t_A^T t_A \approx 0$, and some of them will blow up the value of $(y^T t_A)^2/(t_A^T t_A)$ even if $y^T t_A \approx 0$.
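The effect of a near-zero score vector can be made concrete with a constructed example. The sketch below builds a data matrix with one almost degenerate direction and compares, per component, the gain in fit with the increase in model variation. The singular values, the coefficients of y, and the choice of x0 are all assumptions picked so that the numbers are easy to read; as above, $t_i = X r_i$ with $r_i$ the right singular vectors is assumed.

```python
import numpy as np

# Constructed data X = U diag(s) V' with one near-degenerate direction.
rng = np.random.default_rng(1)
N, K = 20, 3
U, _ = np.linalg.qr(rng.normal(size=(N, K)))   # orthonormal columns
V, _ = np.linalg.qr(rng.normal(size=(K, K)))   # orthogonal matrix
s = np.array([10.0, 5.0, 1e-3])                # third direction nearly vanishes
X = U @ np.diag(s) @ V.T

y = U @ np.array([3.0, 2.0, 1.0])              # response (illustrative choice)
x0 = V @ np.ones(K)                            # new sample with x0' r_i = 1

T = X @ V                                      # scores t_i = X r_i, with r_i = v_i
d = np.sum(T**2, axis=0)                       # d_i = t_i' t_i = s_i**2
yt = y @ T                                     # y' t_i

fit_gain = yt**2 / d                           # reduction in error of fit
var_gain = (x0 @ V)**2 / d                     # increase in model variation

for i in range(K):
    print(f"component {i + 1}: y't={yt[i]:10.4g}  "
          f"fit gain={fit_gain[i]:6.3f}  variation increase={var_gain[i]:10.4g}")
```

By construction the third component has $y^T t \approx 0.001$, yet its fit gain is about 1, the same order as the other components, while its contribution to the model variation is about $10^6$ against 0.01 and 0.04 for the first two.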
