The uncertainties of predictions in linear regression
In industry, the primary interest is in the uncertainty of predictions for new samples. We take a closer look at the parts that enter the prediction variance under a standard linear regression model, $y \sim N(X\beta, \sigma^2 I)$, where $X$ is an $N \times K$ design matrix. The variance of the estimated response, $y(x_0) = b^T x_0$, where $b$ is the estimated coefficient vector, associated with a new sample $x_0$ is given by
$$\mathrm{Var}(y(x_0)) \cong \frac{\left[\, y^T y - y^T X (X^T X)^{-1} X^T y \,\right] \left[\, 1 + x_0^T (X^T X)^{-1} x_0 \,\right]}{N - K}$$
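The two factors of the formula above can be computed directly. The following is a minimal NumPy sketch, not part of the original text: the function name `prediction_variance` and the simulated data (sample size, coefficients, noise level) are assumptions chosen purely for illustration.

```python
import numpy as np

def prediction_variance(X, y, x0):
    """Return (prediction variance, error of fit, model variation) at x0."""
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    # Error of fit: y'y - y'X(X'X)^{-1}X'y, i.e. the residual sum of squares
    error_of_fit = y @ y - y @ X @ XtX_inv @ X.T @ y
    # Model variation: 1 + x0'(X'X)^{-1}x0
    model_variation = 1.0 + x0 @ XtX_inv @ x0
    return error_of_fit * model_variation / (N - K), error_of_fit, model_variation

# Hypothetical simulated data, used only to exercise the formula
rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)
x0 = rng.normal(size=K)

var_pred, fit, variation = prediction_variance(X, y, x0)
print(var_pred, fit, variation)
```

The explicit inverse is used here only to mirror the notation of the formula; for ill-conditioned $X$ a solver such as `np.linalg.lstsq` would be the more careful choice.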
When the model is expanded by adding new components or variables, one can show that
i) the error of fit, $y^T y - y^T X (X^T X)^{-1} X^T y$, always decreases;
ii) the model variation, $1 + x_0^T (X^T X)^{-1} x_0$, always increases.
(In theory these measures may remain unchanged when new components are added, but in practice they always change.)
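The opposite movement of the two measures can be observed numerically by fitting a sequence of nested models. The sketch below is illustrative only: the simulated data, the number of candidate variables, and the order in which columns are added are all assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K_max = 40, 6

# Hypothetical data: only the first two columns actually drive y
X_full = rng.normal(size=(N, K_max))
y = X_full[:, :2] @ np.array([1.5, -1.0]) + rng.normal(scale=0.5, size=N)
x0_full = rng.normal(size=K_max)

fits, variations = [], []
for k in range(1, K_max + 1):
    # Nested model using the first k variables
    X, x0 = X_full[:, :k], x0_full[:k]
    XtX_inv = np.linalg.inv(X.T @ X)
    fits.append(y @ y - y @ X @ XtX_inv @ X.T @ y)   # error of fit
    variations.append(1.0 + x0 @ XtX_inv @ x0)       # model variation
    print(k, fits[-1], variations[-1])
```

Each added column lowers the error of fit and raises the model variation, so the prediction variance is governed by the product of a decreasing and an increasing factor.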
From a modelling point of view, the following consideration is important. If the data values $(y, X)$ follow a multivariate normal distribution, then a) and b), given by
a) the error of fit, $y^T y - y^T X (X^T X)^{-1} X^T y$,
b) the model precision, $(X^T X)^{-1}$,
are stochastically independent. This means that knowledge of a) gives no knowledge of the value of b). To judge the quality of predictions, b) must therefore be computed as well; only then can we find out how well the model is performing.
The situation is similar to the one where we want to estimate the mean and standard deviation of a normal distribution from the data values $(x_1, \ldots, x_N)$. If only the mean is estimated, there is no information on the value of the standard deviation.
The term $x_0^T (X^T X)^{-1} x_0$ can be viewed as having a distribution that is approximately proportional to a $\chi^2$ distribution with $f$ degrees of freedom ($E(\chi^2) = f$ and $\mathrm{Var}(\chi^2) = 2f$). Typically, the H-method reduces this term by 50% or more while obtaining the same value for a) as standard methods that base their results on traditional significance testing.
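The approximate $\chi^2$ behaviour can be checked by simulation. The sketch below rests on assumptions that are ours and not stated in the text: the rows of $X$ and the new sample $x_0$ are drawn independently from a standard normal distribution, and the term is scaled by $N$. Under these assumptions the scaled term should have a mean close to $f = K$ and a variance close to $2f$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, reps = 200, 4, 4000   # hypothetical sizes chosen for illustration

scaled = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=(N, K))
    x0 = rng.normal(size=K)
    # Solve (X'X) z = x0 instead of forming the inverse explicitly,
    # then scale x0'(X'X)^{-1}x0 by N
    scaled[r] = N * (x0 @ np.linalg.solve(X.T @ X, x0))

# Compare with the chi-square moments E = K and Var = 2K
print(scaled.mean(), scaled.var())
```

For small $N$ relative to $K$ the agreement deteriorates, which is consistent with the text calling the relationship approximate.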
The reason is that the standard methods are concerned only with improving a). Thus they provide no information on the size of b). Industrial data are often of reduced statistical rank, which means that the relevant part of $X$ lies in a subspace of the column space of $X$. By neglecting part b) of the prediction variance, the standard methods typically give results that are difficult or impossible for the company to use, because the resulting models predict poorly.
It is often difficult to identify the model that should be used for industrial data. Furthermore, even where the sciences are advanced, they have not been developed with a view to telling us about future developments. The established tradition is not sufficient to provide satisfactory solutions.