Some background analysis for the H-method

When the parameters of a model are estimated, uncertainty is introduced, because the parameter estimates depend on the data used. When there are many variables, it is important to take into account the uncertainty that these estimates bring into the model. This can be explained more closely in terms of standard linear regression analysis.

The prediction variance associated with a new sample tells us how uncertain the prediction is when the model is applied to that sample. Unfortunately, this variance increases the further the sample deviates from the centre of the data. The changes in the prediction variance can, however, be studied more closely. The result is that the term $(y^T t_{A+1})^2$ should be maximized when a new score vector $t_{A+1}$ is to be found.
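The growth of the prediction variance away from the centre can be illustrated for ordinary least squares with an intercept, where the variance of a prediction at a new point x0 is sigma^2 times the factor 1 + 1/n + d^T (Xc^T Xc)^{-1} d, with d the offset of x0 from the data mean and Xc the centred data. The sketch below is only an illustration under that standard regression assumption; the function name `prediction_variance_factor` and the simulated data are hypothetical and not part of the H-method itself.

```python
import numpy as np

def prediction_variance_factor(X, x0):
    """Factor multiplying sigma^2 in the OLS prediction variance at x0
    (model with intercept): 1 + 1/n + d^T (Xc^T Xc)^{-1} d."""
    X = np.asarray(X, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    n = X.shape[0]
    centre = X.mean(axis=0)
    Xc = X - centre          # centred data
    d = x0 - centre          # offset of the new sample from the centre
    leverage = 1.0 / n + d @ np.linalg.solve(Xc.T @ Xc, d)
    return 1.0 + leverage

# Simulated data (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
centre = X.mean(axis=0)
direction = np.array([1.0, 0.0, 0.0])

# Move the new sample step by step away from the centre
steps = (0.0, 1.0, 2.0, 4.0)
factors = [prediction_variance_factor(X, centre + s * direction) for s in steps]
for s, f in zip(steps, factors):
    print(f"distance {s:3.1f}: variance factor {f:.3f}")
```

The factor is smallest at the centre and grows quadratically with the distance from it.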

The prediction variance can be considered as a product of two functions, one related to the fit, $f_A$, and the other related to the model variation, $g_A$. It is natural to look at the change $\Delta(f_A \times g_A)$ when a new component is added. For this product to decrease as much as possible, the term $(y^T t_{A+1})^2$ should be as large as possible. This is an approximate consideration, by analogy with results from differential calculus.
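The role of this term in the fit part can be checked numerically. The sketch below assumes that the fit function is the residual sum of squares and that the score vectors are mutually orthogonal; both are assumptions made here for illustration, and the text may define the fit function differently. Under these assumptions, adding a new score vector t lowers the residual sum of squares by exactly (y^T t)^2 / (t^T t), which is (y^T t)^2 when t has unit length. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
y = rng.normal(size=n)
y = y - y.mean()

# Orthonormal columns stand in for mutually orthogonal score vectors
T, _ = np.linalg.qr(rng.normal(size=(n, 4)))
T_A = T[:, :3]            # the A = 3 score vectors already in the model
t_new = 2.5 * T[:, 3]     # candidate score vector, arbitrary length

def rss(y, T):
    """Residual sum of squares after least-squares regression of y on T."""
    coef, *_ = np.linalg.lstsq(T, y, rcond=None)
    r = y - T @ coef
    return float(r @ r)

f_A = rss(y, T_A)
f_A1 = rss(y, np.column_stack([T_A, t_new]))

decrease = f_A - f_A1
predicted = (y @ t_new) ** 2 / (t_new @ t_new)
print(decrease, predicted)  # the two numbers agree up to rounding
```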

The subject of variable selection and model dimension has been studied intensively. The theory of Mallows' $C_p$ is an example. It shows that the mean squared error increases with the dimension. The main result is that it is important to keep the dimension of the model as low as possible. A basic aspect of the H-method is to keep the dimension as low as possible while still obtaining predictions that are as good as possible.
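For reference, the usual form of the statistic is C_p = SSE_p / s^2 - n + 2p, where SSE_p is the residual sum of squares of a submodel with p parameters and s^2 is the residual variance estimate from the full model. The sketch below computes it for nested submodels on simulated data; the helper names `sse` and `mallows_cp` and the data are hypothetical and only meant as an illustration.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return float(r @ r)

def mallows_cp(X_full, y, k):
    """Mallows' Cp for the submodel using the first k columns of X_full."""
    n, p_full = X_full.shape
    s2 = sse(X_full, y) / (n - p_full)   # variance estimate from the full model
    return sse(X_full[:, :k], y) / s2 - n + 2 * k

# Simulated data (assumption): only the first two regressors matter
rng = np.random.default_rng(2)
n = 60
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
y = 1.0 + 2.0 * X_full[:, 1] - 1.5 * X_full[:, 2] + rng.normal(size=n)

cp_values = {k: mallows_cp(X_full, y, k) for k in range(1, X_full.shape[1] + 1)}
for k, cp in cp_values.items():
    print(f"p = {k}: Cp = {cp:.2f}")
```

Once the relevant regressors are included, each further column adds 2 to the penalty term while reducing the residual sum of squares only slightly, which is the sense in which extra dimension is penalized.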

In traditional statistical analysis the results are judged by a significance test. It is common to formulate a model for the present data and to use significance testing to judge whether some variables or factors should be removed from the model. In many (most) cases this is not the best procedure to use. Data have their own identity, and it is a general mistake made by many statisticians and scientists to force models on data. It is in general better to use the H-method to build up the model. But what is wrong with traditional significance testing? What is wrong is that it focuses only on the 'fit' part of the results. It does not take into account the model uncertainties that are being introduced. This is a problem, because the model uncertainties are (almost) independent of the 'fit' part. It can be shown that traditional significance testing is equivalent to using a correlation coefficient for testing the effect. Thus the significance test does not take into account the size of the associated score vector that is introduced into the model.
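The equivalence with a correlation coefficient can be checked in the simplest case, a regression of y on a single score vector. There the t statistic of the slope equals r * sqrt(n - 2) / sqrt(1 - r^2), where r is the correlation between the score vector and y, so it does not change when the score vector is rescaled. The sketch below is a minimal illustration on simulated data; the function name `slope_t_statistic` and the variables are hypothetical.

```python
import numpy as np

def slope_t_statistic(score, y):
    """t statistic of the slope when y is regressed on one score vector
    (both centred, so the intercept is handled implicitly)."""
    score = score - score.mean()
    y = y - y.mean()
    n = len(y)
    b = (score @ y) / (score @ score)
    resid = y - b * score
    s2 = (resid @ resid) / (n - 2)           # residual variance estimate
    return b / np.sqrt(s2 / (score @ score))

# Simulated data (assumption, for illustration only)
rng = np.random.default_rng(3)
n = 30
score = rng.normal(size=n)
y = 0.5 * score + rng.normal(size=n)

r = np.corrcoef(score, y)[0, 1]
t_from_r = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# The same score vector at two very different sizes
t_small = slope_t_statistic(0.001 * score, y)
t_large = slope_t_statistic(1000.0 * score, y)
print(t_from_r, t_small, t_large)  # all three agree up to rounding
```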
