Significance testing and the prediction variance
Here we shall consider the prediction variances of new samples.
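A standard linear regression model, $y \sim N(X\beta, \sigma^2 I)$, is assumed. Furthermore, we assume that we have selected $A$ components. The response value of the new sample, $x_0$, is estimated as $y(x_0) = b_1^T x_0$. The variance of $y(x_0)$ is given by

$$
\begin{aligned}
\operatorname{Var}(y(x_0)) &= \sigma^2 \, x_0^T (X_1^T X_1)^{+} x_0 \\
&\cong \frac{y^T y - \left\{ \dfrac{(y^T t_1)^2}{t_1^T t_1} + \dots + \dfrac{(y^T t_A)^2}{t_A^T t_A} \right\}}{N - A} \times \left[ \frac{(x_0^T r_1)^2}{t_1^T t_1} + \dots + \frac{(x_0^T r_A)^2}{t_A^T t_A} \right]
\end{aligned}
$$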
Here we have neglected the possible bias. The residual variance is given by

$$
s_A^2 = \frac{y^T y - \left\{ \dfrac{(y^T t_1)^2}{t_1^T t_1} + \dots + \dfrac{(y^T t_A)^2}{t_A^T t_A} \right\}}{N - A}
$$
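The variance estimate above can be sketched numerically. The following is a minimal illustration, not a definitive implementation: the function name `prediction_variance`, the simulated data, and the choice of a QR factorization to obtain orthogonal score vectors are all assumptions made for this example. With a QR factorization $X = Q R_q$, the scores are $T = Q$, $P^T = R_q$, and $R = R_q^{-1}$, which corresponds to taking the variables in their given order.

```python
import numpy as np


def prediction_variance(y, T, R, x0):
    """Estimate Var(y(x0)) from A mutually orthogonal score vectors.

    T  : (N, A) array, score vectors t_1..t_A as columns
    R  : (K, A) array, vectors r_1..r_A as columns, with T = X @ R
    x0 : (K,) array, the new sample
    """
    N, A = T.shape
    tt = np.sum(T * T, axis=0)               # t_a^T t_a for each component
    fit = (y @ T) ** 2 / tt                  # (y^T t_a)^2 / (t_a^T t_a)
    s2 = (y @ y - fit.sum()) / (N - A)       # residual variance s_A^2
    model_var = ((x0 @ R) ** 2 / tt).sum()   # sum of (x0^T r_a)^2 / (t_a^T t_a)
    return s2 * model_var


# Hypothetical data, used only for illustration.
rng = np.random.default_rng(0)
N, K, A = 50, 4, 2
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=N)
x0 = rng.normal(size=K)

# Assumed decomposition for this sketch: X = Q Rq, so T = Q and R = Rq^{-1}.
Q, Rq = np.linalg.qr(X)
R_full = np.linalg.inv(Rq)
T_A, R_A = Q[:, :A], R_full[:, :A]

var_x0 = prediction_variance(y, T_A, R_A, x0)
```

For this particular decomposition the result coincides with the usual least-squares prediction variance when regressing on the first $A$ variables only, which gives a simple way to check the sketch.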
Here it is assumed that $X$ has been decomposed as $X = t_1 p_1^T + \dots + t_A p_A^T + X_0$. We have $R = (P^T)^{-1}$, whose columns are the vectors $r_i$. If the task is variable selection, the matrix $P$ will be a lower triangular matrix. When we are finding the next score vector, $t_{A+1}$, we have two aspects to take into consideration:
i) Reduction in fit: $(y^T t_{A+1})^2 / (t_{A+1}^T t_{A+1})$
ii) Increase in model variation: $(x_0^T r_{A+1})^2 / (t_{A+1}^T t_{A+1})$
When handling these two terms, we may treat $(x_0^T r_{A+1})^2$ as a constant, even though $r_{A+1}$ depends on the score vector, $t_{A+1}$, that we are trying to find. The reason is that $r_{A+1}$ typically has length around one. Also, the value of $x_0^T r_{A+1}$ will vary with new data, $x_0$. The H-principle suggests considering the product of i) and the inverse of ii). In this way we find an optimal balance between the improvement in fit and the associated increase in the model variation.

The procedure to obtain an optimal balance can also be based on other measures of reduction in fit and other measures of increase in model variation. The main issue is to take the increase in model variation into account when finding new score vectors. Furthermore, we have considered here a standard linear regression model. Other mathematical models will suggest different measures of fit (improvement of the modeling criterion) and different measures of prediction associated with the solution at each step.
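The balance between terms i) and ii) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names `fit_reduction` and `h_criterion`, the two hand-made candidate vectors, and the constant value 1.0 used for $(x_0^T r_{A+1})^2$ are all hypothetical. With that term held constant, the product of i) and the inverse of ii) reduces to $(y^T t_{A+1})^2$ divided by the constant, so the size of the score vector matters, whereas term i) alone ignores it.

```python
import numpy as np


def fit_reduction(y, t):
    """Term i): (y^T t)^2 / (t^T t)."""
    return (y @ t) ** 2 / (t @ t)


def h_criterion(y, t, x0r_sq=1.0):
    """Product of i) and the inverse of ii).

    x0r_sq stands for (x0^T r)^2, held constant here (an assumption
    of this sketch, following the simplification in the text).
    """
    model_variation = x0r_sq / (t @ t)             # term ii)
    return fit_reduction(y, t) / model_variation   # equals (y^T t)^2 / x0r_sq


# Hypothetical, mutually orthogonal candidate score vectors.
big = np.array([10.0, -10.0, 10.0, -10.0])     # large score vector
small = np.array([0.1, 0.1, -0.1, -0.1])       # small score vector
noise = np.array([0.5, -0.5, -0.5, 0.5])       # orthogonal to both candidates
y = 0.05 * big + 10.0 * small + noise

candidates = {"big": big, "small": small}
best_by_fit = max(candidates, key=lambda k: fit_reduction(y, candidates[k]))
best_by_h = max(candidates, key=lambda k: h_criterion(y, candidates[k]))
```

In this constructed example the fit criterion alone prefers the small score vector (fit reduction 4 against 1), while the combined criterion prefers the large one, because a small $t^T t$ inflates the model variation in ii).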
Significance testing.
Traditional significance testing is only concerned with the question of fit. We shall look more closely at the procedure. The regression coefficient associated with a new score vector $t_{A+1}$ is $c_{A+1} = (y^T t_{A+1}) / (t_{A+1}^T t_{A+1})$. Its significance is assessed by comparing it to its approximate variance, $s_{A+1}^2 / (t_{A+1}^T t_{A+1})$, by an F-statistic,

$$
F = \frac{\left[ (y^T t_{A+1}) / (t_{A+1}^T t_{A+1}) \right]^2}{s_{A+1}^2 / (t_{A+1}^T t_{A+1})}.
$$
There is a useful way to look at this test by considering the residual values of the response variable. Let $u_0 = y$ and define

$$
u_i = u_{i-1} - \frac{y^T t_i}{t_i^T t_i} \, t_i, \qquad i = 1, 2, \dots, A.
$$
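With this notation $u_i$ is the residual after $i$ steps. Using $y^T t_{A+1} = u_A^T t_{A+1}$, which holds because the score vectors are mutually orthogonal, we can write

$$
s_{A+1}^2 = \frac{u_A^T u_A - (u_A^T t_{A+1})^2 / (t_{A+1}^T t_{A+1})}{N - A - 1}.
$$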
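This gives

$$
F = \frac{(u_A^T t_{A+1})^2 / (t_{A+1}^T t_{A+1})}{\left[ u_A^T u_A - (u_A^T t_{A+1})^2 / (t_{A+1}^T t_{A+1}) \right] / (N - A - 1)}
= \frac{[r(u_A, t_{A+1})]^2}{\left( 1 - [r(u_A, t_{A+1})]^2 \right) / (N - A - 1)}.
$$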
Here $[r(u_A, t_{A+1})]^2$ is the squared simple correlation coefficient between the residual of the response variable after $A$ components, $u_A$, and the next score vector, $t_{A+1}$. We see that $F$ is a monotone function of the squared correlation coefficient. Thus the F-test is equivalent to testing the significance of the correlation coefficient between the present residual response values and the next score vector. The correlation coefficient is invariant to the size of the score vector, $t_{A+1}$. Thus, significance testing is only concerned with improvement in fit, i) above.

This is unfortunate, because the terms in ii) are stochastically independent of the terms in i). In practice this means that the terms in ii) can be large or small, whatever the outcome of the test. Thus significance testing does not tell us whether the prediction variance, $\operatorname{Var}(y(x_0))$, is large or small. Program packages that use significance testing to select variables or dimensions generally lead to results that overfit the data, in the sense that the prediction variance is too large. One reason is that a significance level such as 5% is used, whereas the level should be much lower than 5% because the methods are searching in the data. Another reason is that, as mentioned above, significance testing does not take the model variation into account.
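The equivalence between the F-statistic and the squared correlation coefficient, and its invariance to the size of the score vector, can be checked numerically. This is a minimal sketch: the function names `f_statistic` and `f_from_correlation` and the random vectors standing in for $u_A$ and $t_{A+1}$ are assumptions for illustration. The correlation is computed as a normalized inner product, which equals the usual correlation coefficient when the data are centered.

```python
import numpy as np


def f_statistic(u, t, dof):
    """F for a new score vector t, given the current residual u = u_A.

    dof is N - A - 1.
    """
    fit = (u @ t) ** 2 / (t @ t)        # reduction in fit, term i)
    s2_next = (u @ u - fit) / dof       # residual variance s_{A+1}^2
    return fit / s2_next


def f_from_correlation(u, t, dof):
    """The same statistic written through the squared correlation r(u, t)^2."""
    r2 = (u @ t) ** 2 / ((u @ u) * (t @ t))
    return r2 / ((1.0 - r2) / dof)


# Hypothetical vectors, used only for illustration.
rng = np.random.default_rng(2)
N, A = 30, 2
u = rng.normal(size=N)    # stands in for the residual u_A
t = rng.normal(size=N)    # stands in for the next score vector t_{A+1}
dof = N - A - 1

F_direct = f_statistic(u, t, dof)
F_corr = f_from_correlation(u, t, dof)
F_scaled = f_statistic(u, 100.0 * t, dof)   # rescaling t leaves F unchanged
```

The three values agree up to rounding, which illustrates why the test says nothing about the size of $t_{A+1}^T t_{A+1}$ and hence nothing about term ii).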