Mean squared error and squared bias
The theory of Mallows' $C_p$ is concerned with the theoretical properties of the mean squared error and the squared bias. We summarize the results here. The theoretical expected values of the response values are $X\beta$, but we are only using $\hat{y}_A = X_1 b_1$. Here $X_1$ is the part of $X$ that we are using and $b_1$ is our estimate of the parameters, $\beta$, when $A$ terms are used in the decomposition of $X$. The estimated value of the $i$th response is $\hat{y}_i = b_1^T x_i$.
The residual sum of squares, $\mathrm{RSS}_A$, is given by
$\mathrm{RSS}_A = |y - \hat{y}_A|^2 = |y_1 - b_1^T x_1|^2 + \dots + |y_N - b_1^T x_N|^2$.
It has the mean value
(1)  $E(\mathrm{RSS}_A) = (N - A)\sigma^2 + g_2^T D_2 g_2$.
The squared bias, $J_A = |X_1 b_1 - X\beta|^2$, has the mean value
(2)  $E(J_A) = A\sigma^2 + [g_{A+1}^2 (t_{A+1}^T t_{A+1}) + \dots + g_K^2 (t_K^T t_K)] = A\sigma^2 + g_2^T D_2 g_2$.
The term $g_2^T D_2 g_2$ is estimated by $c_2^T D_2 c_2$. Eliminating the term $g_2^T D_2 g_2$ from (1) and (2) gives
(3)  $E(J_A)/\sigma^2 = E(\mathrm{RSS}_A)/\sigma^2 - N + 2A$.
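The elimination step can be written out explicitly. The lines below use only (1) and (2) and introduce no new symbols; they are a sketch of the algebra and end at (3):

```latex
\begin{align*}
g_2^T D_2 g_2 &= E(\mathrm{RSS}_A) - (N-A)\sigma^2 && \text{from (1)} \\
E(J_A) &= A\sigma^2 + E(\mathrm{RSS}_A) - (N-A)\sigma^2 && \text{substituted into (2)} \\
       &= E(\mathrm{RSS}_A) - N\sigma^2 + 2A\sigma^2 \\
E(J_A)/\sigma^2 &= E(\mathrm{RSS}_A)/\sigma^2 - N + 2A && \text{which is (3)}
\end{align*}
```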
This equation leads to Mallows' $C_p$ criterion, which is given by
(4)  $C_A = \mathrm{RSS}_A/s^2 - N + 2A$.
$C_A$ is an estimate of $E(J_A)/\sigma^2$. Here $s^2$ is a 'good' estimate of $\sigma^2$. These equations are important from a methodological point of view:
i) (2) shows that it is important to keep the dimension as low as possible.
ii) The amount of squared bias, $c_2^T D_2 c_2$, that should be allowed can be found as a trade-off from (1) or (2).
iii) (2) and (4) show that $C_A$ should be as small as possible and as close to $A$ as possible.
iv) The penalty for one extra dimension in the model is approximately $2\sigma^2$.
As pointed out by Chatterjee [1], the results i)-iv) are valid in general for
both linear and non-linear models, and independently of the question of
overfitting or underfitting.
Unfortunately, these formulae cannot be used to identify the dimension of the model, because they are concerned only with the fit of the given model. It is instructive to look at an example.
In a typical application a certain maximal value for the dimension is chosen, and the residual variance at that dimension is used as the estimate of the variance $\sigma^2$.
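The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which decomposition of X is used, so plain Gram-Schmidt orthogonalisation of the columns stands in for it here; the function names (`dot`, `orthogonal_scores`, `mallows_cp`) and the synthetic data are hypothetical, and no intercept is fitted.

```python
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def orthogonal_scores(columns):
    """Assumed decomposition: Gram-Schmidt turns the columns of X into
    mutually orthogonal score vectors t_1, ..., t_K."""
    scores = []
    for col in columns:
        t = list(col)
        for s in scores:
            coef = dot(s, t) / dot(s, s)
            t = [ti - coef * si for ti, si in zip(t, s)]
        scores.append(t)
    return scores

def mallows_cp(columns, y, max_dim):
    """Return [C_1, ..., C_max_dim] from equation (4).

    s^2 is the residual variance at the maximal dimension.
    """
    n = len(y)
    scores = orthogonal_scores(columns[:max_dim])
    # Sum of squares explained by component a: c_a^2 (t_a^T t_a),
    # with c_a = t_a^T y / (t_a^T t_a).
    explained = [dot(t, y) ** 2 / dot(t, t) for t in scores]
    rss = []
    remaining = dot(y, y)
    for e in explained:
        remaining -= e          # RSS_A = |y|^2 - sum over a <= A
        rss.append(remaining)
    s2 = rss[-1] / (n - max_dim)
    return [rss[a - 1] / s2 - n + 2 * a for a in range(1, max_dim + 1)]

# Synthetic example (hypothetical data): only the first two columns matter.
random.seed(0)
n, k = 40, 8
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]
y = [2.0 * columns[0][i] - 1.0 * columns[1][i] + random.gauss(0, 0.5)
     for i in range(n)]
cp = mallows_cp(columns, y, max_dim=k)
for a, value in enumerate(cp, start=1):
    print(a, round(value, 2))
```

Because $s^2$ is taken from the maximal dimension, $C_A$ at that dimension equals the dimension itself by construction, so the last value carries no information about the fit.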
In Fig. 1 the values of $C_A$ in equation (4) are shown for the furnace data. The dimension for these data is 5 or 6, but the minimal value of $C_A$ is obtained at dimension 12. (The results would be the same if, e.g., the maximal dimension were chosen as 15.)
The general experience is that applying Mallows' $C_p$ leads to overfitting of the data. The reason is that the measures are very weak. This can be seen by looking at (2). The modelling should stop if $E(J_A) < E(J_{A+1})$, or equivalently if $g_{A+1}^2 (t_{A+1}^T t_{A+1}) < \sigma^2$. The F-value for testing the significance of the regression coefficient $c_{A+1}$ is $F = c_{A+1}^2 (t_{A+1}^T t_{A+1})/s^2$. Thus the modelling should stop if the F-value is less than 1, $F < 1$.
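The equivalence used in this argument follows directly from (2). Written out as a sketch, with the summation index $a$ as the only added notation:

```latex
\begin{align*}
E(J_A) - E(J_{A+1})
  &= \Bigl[A\sigma^2 + \sum_{a=A+1}^{K} g_a^2\,(t_a^T t_a)\Bigr]
   - \Bigl[(A+1)\sigma^2 + \sum_{a=A+2}^{K} g_a^2\,(t_a^T t_a)\Bigr] \\
  &= g_{A+1}^2\,(t_{A+1}^T t_{A+1}) - \sigma^2 .
\end{align*}
```

The difference is negative exactly when $g_{A+1}^2 (t_{A+1}^T t_{A+1}) < \sigma^2$. Replacing the unknown $g_{A+1}$ and $\sigma^2$ by their estimates $c_{A+1}$ and $s^2$ turns this condition into $F < 1$.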