Significance testing and the prediction variance
Here we consider the prediction variance of new samples. A standard linear regression model, $y \sim N(Xb, s^2 I)$, is assumed, and we assume that $A$ components have been selected. The response value of a new sample, $x_0$, is estimated as $y(x_0) = b_1^T x_0$. The variance of $y(x_0)$ is given by

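$$
\begin{aligned}
\mathrm{Var}(y(x_0)) &= s^2\, x_0^T (X_1^T X_1)^{+} x_0 \\
&\cong \frac{y^T y - \left\{ \dfrac{(y^T t_1)^2}{t_1^T t_1} + \dots + \dfrac{(y^T t_A)^2}{t_A^T t_A} \right\}}{N-A} \times \left[ \frac{(x_0^T r_1)^2}{t_1^T t_1} + \dots + \frac{(x_0^T r_A)^2}{t_A^T t_A} \right]
\end{aligned}
$$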

Here we have neglected the possible bias. The residual variance is given by

$$
s_A^2 = \frac{y^T y - \left\{ \dfrac{(y^T t_1)^2}{t_1^T t_1} + \dots + \dfrac{(y^T t_A)^2}{t_A^T t_A} \right\}}{N-A}
$$
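As a numerical illustration, the residual variance and the approximate prediction variance can be computed directly from the score vectors. The sketch below is not the author's procedure: it assumes that the scores are obtained from a singular value decomposition of a centred X (so that each t_a = X r_a, with r_a a right singular vector), and the function name `prediction_variance` and the simulated data are hypothetical.

```python
import numpy as np

def prediction_variance(X, y, x0, A):
    """Return (s2_A, approximate Var(y(x0))) for A components.

    Assumption (not from the text): the score vectors are principal
    component scores, t_a = X r_a, with r_a the a-th right singular
    vector of X. They are mutually orthogonal by construction.
    """
    N = X.shape[0]
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    R = Vt[:A].T                       # columns r_1, ..., r_A
    T = X @ R                          # columns t_1, ..., t_A
    tt = np.sum(T * T, axis=0)         # t_a^T t_a for each a
    yt = T.T @ y                       # y^T t_a for each a
    # Residual variance: [y^T y - sum (y^T t_a)^2 / (t_a^T t_a)] / (N - A)
    s2_A = (y @ y - np.sum(yt ** 2 / tt)) / (N - A)
    # Model variation: sum (x0^T r_a)^2 / (t_a^T t_a)
    model_variation = np.sum((R.T @ x0) ** 2 / tt)
    return s2_A, s2_A * model_variation

# Simulated, centred data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X -= X.mean(axis=0)
y = X @ rng.normal(size=5) + rng.normal(size=30)
y -= y.mean()
x0 = rng.normal(size=5)

for A in (1, 3, 5):
    s2_A, var = prediction_variance(X, y, x0, A)
    print(A, s2_A, var)
```

Under this choice of scores, the sum over components equals the quadratic form in the generalized inverse exactly. Adding a component never decreases the model variation term, while the residual sum of squares never increases.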

These formulas assume that $X$ has been decomposed as $X = t_1 p_1^T + \dots + t_A p_A^T + X_0$, and we have $R = (P^T)^{-1}$. If the task is variable selection, the matrix $P$ will be a lower triangular matrix. When finding the next score vector, $t_{A+1}$, we have two aspects to take into consideration:

i) Reduction in fit: $(y^T t_{A+1})^2 / (t_{A+1}^T t_{A+1})$
ii) Increase in model variation: $(x_0^T r_{A+1})^2 / (t_{A+1}^T t_{A+1})$

When handling these two terms, we may treat $(x_0^T r_{A+1})^2$ as a constant, even though $r_{A+1}$ depends on the score vector $t_{A+1}$ that we are trying to find. The reason is that $r_{A+1}$ typically has length close to one; in addition, the value of $x_0^T r_{A+1}$ varies with the new data, $x_0$. The H-principle suggests considering the product of i) and the inverse of ii). In this way we find an optimal balance between the improvement in fit and the associated increase in model variation.

The procedure for obtaining an optimal balance can be based on other measures of reduction in fit and other measures of increase in model variation. The main point is to take the increase in model variation into account when finding new score vectors. Furthermore, we have considered a standard linear regression model here. Other mathematical models will suggest different measures of fit (improvement of the modeling criterion) and different measures of prediction associated with the solution at each step.
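The balance between i) and ii) can be illustrated for a first component. The product of i) and the inverse of ii) is $(y^T t)^2 / (x_0^T r)^2$; with the denominator treated as a constant, what remains is $(y^T t)^2$. The sketch below assumes, beyond what the text states, that the candidate score is t = Xw with a unit-length weight vector w playing the role of r. By the Cauchy-Schwarz inequality, $(y^T X w)^2$ is then largest for w proportional to $X^T y$. The function name `h_terms` and the simulated data are hypothetical.

```python
import numpy as np

def h_terms(X, y, x0, w):
    """Terms i) and ii) for a candidate first score vector t = X w.

    Assumption (not from the text): w is scaled to unit length and is
    used as r, so that t = X r.
    """
    w = w / np.linalg.norm(w)
    t = X @ w
    tt = t @ t                          # t^T t
    fit = (y @ t) ** 2 / tt             # i)  reduction in fit
    variation = (x0 @ w) ** 2 / tt      # ii) increase in model variation
    return fit, variation, tt

# Simulated, centred data with columns of very different scale.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
X -= X.mean(axis=0)
y = X @ rng.normal(size=6) + rng.normal(size=40)
y -= y.mean()
x0 = rng.normal(size=6)

# Direction that maximises i) alone: the least-squares coefficients.
w_fit = np.linalg.lstsq(X, y, rcond=None)[0]
# Direction that maximises (y^T t)^2 for unit-length w.
w_cov = X.T @ y

for name, w in (("fit only", w_fit), ("balanced", w_cov)):
    fit, variation, tt = h_terms(X, y, x0, w)
    print(name, fit, variation, tt)
```

The direction chosen for fit alone gives the largest value of i), but the balanced direction always gives a score vector with at least as large $t^T t$, which is the quantity that sits in the denominator of ii).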

Significance testing. Traditional significance testing is concerned only with the question of fit. We shall look at the procedure more closely. The regression coefficient associated with a new score vector $t_{A+1}$ is $c_{A+1} = (y^T t_{A+1}) / (t_{A+1}^T t_{A+1})$. Its significance is assessed by comparing it to its approximate variance, $s_{A+1}^2 / (t_{A+1}^T t_{A+1})$, by an F-statistic,

$$
F = \frac{\left[ (y^T t_{A+1}) / (t_{A+1}^T t_{A+1}) \right]^2}{s_{A+1}^2 / (t_{A+1}^T t_{A+1})}.
$$

There is a useful way to look at this test by considering the residual values of the response variable. Let $u_0 = y$ and define

$$
u_i = u_{i-1} - \frac{y^T t_i}{t_i^T t_i}\, t_i, \qquad i = 1, 2, \dots, A.
$$

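With this notation, $u_i$ is the residual after $i$ steps. Using $y^T t_{A+1} = u_A^T t_{A+1}$, which holds when the score vectors are mutually orthogonal, we can write

$$
s_{A+1}^2 = \frac{u_A^T u_A - (u_A^T t_{A+1})^2 / (t_{A+1}^T t_{A+1})}{N-A-1}.
$$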

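This gives

$$
\begin{aligned}
F &= \frac{(u_A^T t_{A+1})^2 / (t_{A+1}^T t_{A+1})}{\left[ u_A^T u_A - (u_A^T t_{A+1})^2 / (t_{A+1}^T t_{A+1}) \right] / (N-A-1)} \\
&= \frac{[r(u_A, t_{A+1})]^2}{\left( 1 - [r(u_A, t_{A+1})]^2 \right) / (N-A-1)}.
\end{aligned}
$$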

In this expression, $[r(u_A, t_{A+1})]^2$ is the squared simple correlation coefficient between the residual of the response variable after $A$ components, $u_A$, and the next score vector, $t_{A+1}$. We see that F is a monotone function of the squared correlation coefficient. Thus the F-test is equivalent to testing the significance of the correlation coefficient between the present residual response values and the next score vector. The correlation coefficient is invariant to the size of the score vector, $t_{A+1}$. Thus, significance testing is concerned only with the improvement in fit, i) above. This is unfortunate, because the terms in ii) are stochastically independent of the terms in i). In practice this means that the terms in ii) can be large or small. Significance testing therefore provides no information on whether the prediction variance, $\mathrm{Var}(y(x_0))$, is large or small.

Program packages that use significance testing to find variables or dimensions generally lead to results that overfit the data, in the sense that the prediction variance is too large. One reason is that a significance level of, e.g., 5% is used, whereas the level should be much lower than 5% because the methods search through the data. Another reason is that, as mentioned above, significance testing pays no attention to the model variation.
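The equivalence between the F-statistic and the correlation coefficient, and the invariance to the size of the score vector, can be checked numerically. In this minimal sketch the vector `u` stands in for the residual $u_A$ and `t` for the next score vector $t_{A+1}$; both are simulated and mean-centred, so the uncentred correlation computed here coincides with the ordinary correlation coefficient. The function names and the values of N and A are hypothetical.

```python
import numpy as np

def f_statistic(u, t, N, A):
    """F for adding score t, given the residual u after A components."""
    tt = t @ t
    gain = (u @ t) ** 2 / tt                   # (u^T t)^2 / (t^T t)
    s2_next = (u @ u - gain) / (N - A - 1)     # residual variance s^2_{A+1}
    return gain / s2_next

def f_from_correlation(u, t, N, A):
    """The same F written in terms of the squared correlation r(u, t)^2."""
    r2 = (u @ t) ** 2 / ((u @ u) * (t @ t))
    return (N - A - 1) * r2 / (1.0 - r2)

# Simulated stand-ins for u_A and t_{A+1} (assumed values, centred).
rng = np.random.default_rng(2)
N, A = 25, 2
u = rng.normal(size=N)
u -= u.mean()
t = 0.5 * u + rng.normal(size=N)
t -= t.mean()

F_direct = f_statistic(u, t, N, A)
F_corr = f_from_correlation(u, t, N, A)
F_scaled = f_statistic(u, 100.0 * t, N, A)    # same direction, larger score

print(F_direct, F_corr, F_scaled)
# The factor 1 / (t^T t), which enters the model variation term ii),
# changes by a factor of 10^4 under the same rescaling.
print(1.0 / (t @ t), 1.0 / ((100.0 * t) @ (100.0 * t)))
```

The three F values agree, while the factor $1/(t^T t)$ changes with the scale of the score vector. This is the sense in which the test does not see the model variation.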
