Mean squared error and squared bias

Mean squared error. The so-called theory of Mallows Cp is concerned with the theoretical properties of the mean squared error and the squared bias. We summarize the results here. The theoretical expected values of the response values are Xβ, but we are using only ŷ_A = X_1 b_1. Here X_1 is the part of X that we are using, and b_1 is our estimate of the parameters, β, when A terms are used in the decomposition of X. The estimated value of the ith response is ŷ_i = b_1^T x_i. The residual sum of squares, RSS_A, is given by

                   RSS_A = |y - ŷ_A|^2 = |y_1 - b_1^T x_1|^2 + … + |y_N - b_1^T x_N|^2.

 It has the mean value

(1)               E(RSS_A) = (N-A)σ^2 + g_2^T D_2 g_2

The squared bias, J_A = |X_1 b_1 - Xβ|^2, has the mean value

(2)               E(J_A) = Aσ^2 + [g_{A+1}^2 (t_{A+1}^T t_{A+1}) + … + g_K^2 (t_K^T t_K)] = Aσ^2 + g_2^T D_2 g_2

The term g_2^T D_2 g_2 is estimated by c_2^T D_2 c_2. Eliminating the term g_2^T D_2 g_2 from (1) and (2) gives

(3)               E(J_A)/σ^2 = E(RSS_A)/σ^2 - N + 2A.
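
The elimination step can be written out in full. The following is only a restatement of (1) and (2) in LaTeX notation, with the shared bias term cancelling when the two expectations are subtracted:

```latex
\begin{align*}
E(\mathrm{RSS}_A) &= (N-A)\,\sigma^2 + g_2^T D_2\, g_2, \\
E(J_A)            &= A\,\sigma^2 + g_2^T D_2\, g_2, \\
E(J_A) - E(\mathrm{RSS}_A) &= A\,\sigma^2 - (N-A)\,\sigma^2 = (2A - N)\,\sigma^2, \\
\frac{E(J_A)}{\sigma^2} &= \frac{E(\mathrm{RSS}_A)}{\sigma^2} - N + 2A .
\end{align*}
```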

This equation leads to the so-called Mallows Cp criterion, which is given by

(4)               C_A = RSS_A/s^2 - N + 2A,

where C_A is an estimate of E(J_A)/σ^2 and s^2 is a ‘good’ estimate of σ^2. These equations are important from a methodological point of view:

i) Equation (2) shows that it is important to keep the dimension as low as possible.

ii) The amount of squared bias, c_2^T D_2 c_2, that should be allowed can be found as a trade-off from (1) or (2).

iii) Equations (2) and (4) show that C_A should be as small as possible and as close to A as possible.

iv) The penalty for one extra dimension in the model is approximately 2σ^2.

As pointed out by Chatterjee [1], the results i)-iv) are valid in general for both linear and non-linear models, independently of the question of overfitting or underfitting. Unfortunately, these formulae cannot be used to identify the dimension of the model, because they are concerned only with the fit of the given model. It is instructive to look at an example.

Fig. 1. Value of Mallows Cp, C_A in equation (4), plotted against the dimension of the model (x-axis).

In a typical application, a maximal value for the dimension is chosen, and the residual variance at that dimension is used as the estimate s^2 of the variance σ^2.
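
The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the method used for the furnace data: it assumes ordinary least squares on the first A columns of X as the nested models, and the function name `mallows_cp` and the synthetic data are hypothetical. The variance estimate s^2 is taken as the residual variance at the chosen maximal dimension.

```python
import numpy as np


def mallows_cp(X, y, max_dim):
    """Return C_A from equation (4) for A = 1..max_dim.

    Assumption for this sketch: the model of dimension A is the
    least-squares fit on the first A columns of X.
    """
    n = X.shape[0]
    rss = []
    for a in range(1, max_dim + 1):
        X1 = X[:, :a]                                  # part of X in use
        b1, *_ = np.linalg.lstsq(X1, y, rcond=None)    # estimate b_1
        resid = y - X1 @ b1
        rss.append(float(resid @ resid))               # RSS_A
    rss = np.array(rss)
    # s^2: residual variance at the maximal dimension
    s2 = rss[-1] / (n - max_dim)
    dims = np.arange(1, max_dim + 1)
    return rss / s2 - n + 2 * dims


# Synthetic data (hypothetical): only the first three columns matter.
rng = np.random.default_rng(0)
N, K = 60, 10
X = rng.normal(size=(N, K))
beta = np.zeros(K)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=N)

cp = mallows_cp(X, y, max_dim=K)
for a, c in enumerate(cp, start=1):
    print(f"A = {a:2d}   C_A = {c:10.3f}")
```

With this choice of s^2, the value of C_A at the maximal dimension equals that dimension exactly, since RSS_A/s^2 = N - A there. The criterion is therefore always "as close to A as possible" at the largest model considered, which is one reason the minimum tends to sit at a high dimension.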

In Fig. 1 the values of C_A in equation (4) are shown for the furnace data. The appropriate dimension for these data is 5 or 6, but the minimal value of C_A is obtained at dimension 12. (The results would be the same if, e.g., the maximal dimension were chosen as 15.)

The general experience is that the application of Mallows Cp leads to overfitting of the data. The reason is that the measures are very weak. This can be seen by looking at (2). The modelling should stop if E(J_A) < E(J_{A+1}), or equivalently if g_{A+1}^2 (t_{A+1}^T t_{A+1}) < σ^2. The F-value for testing the significance of the regression coefficient c_{A+1} is F = c_{A+1}^2 (t_{A+1}^T t_{A+1})/s^2. Thus the modelling should stop if the F-value is less than 1, F < 1. This is a much weaker requirement than a conventional significance test, where the 5% critical value for a single coefficient is close to 4 when the residual degrees of freedom are large.
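
The equivalence used in the stopping rule follows directly from (2). As a short worked step in LaTeX notation, using only the symbols defined above:

```latex
\begin{align*}
E(J_A)     &= A\,\sigma^2 + \sum_{k=A+1}^{K} g_k^2\,(t_k^T t_k), \\
E(J_{A+1}) &= (A+1)\,\sigma^2 + \sum_{k=A+2}^{K} g_k^2\,(t_k^T t_k), \\
E(J_{A+1}) - E(J_A) &= \sigma^2 - g_{A+1}^2\,(t_{A+1}^T t_{A+1}).
\end{align*}
```

The difference is positive, so that adding term A+1 increases the expected squared bias measure, exactly when g_{A+1}^2 (t_{A+1}^T t_{A+1}) < σ^2.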

On the other hand, the equations point out the principal issue in modelling and they can be used to compare one method with others.
