On the practice of modelling data

When a modelling task needs to be carried out, we automatically try out simple and comprehensive methods. The choice of models and the analysis of results depend a lot on one's experience.

Industrial data are typically large. A 'normal' number of variables may be, say, 1000. Furthermore, the number of samples may be small relative to the number of variables: only 100 samples or fewer may be available. The practical problem is often that we should not use all variables or all samples of the given data. The purpose of modelling data is typically prediction, and for that purpose a considerable reduction in the data may be needed. An example:

It belongs to the scientific tradition of the natural sciences to formulate an overall model, estimate its parameters and then carry out significance testing to find out which parameters are not significant. This approach does not work in the case of industrial data. One can say that data have an 'identity of their own', which is typically much different from that of 'laboratory data', where the scientific tradition can be applied. When working with industrial data it is necessary to build up the model from parts that are as simple as possible and then expand it. The expansion of the model should stop when the data 'say stop'. Significance testing and other evaluations can then be carried out in a model that is valid for the given data.
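The idea of expanding a model until the data 'say stop' can be sketched in a few lines. The sketch below is only one possible reading, not the method of the text: it assumes principal component regression as the model that is expanded, 5-fold cross-validation as the way the data 'speak', and a 1% relative improvement as the stopping threshold. The data are synthetic, and all function names are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Regress y on the first k principal components of X (both centred)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    # Coefficients expressed in the original variables, using k components only.
    beta = Vt[:k].T @ ((U[:, :k].T @ (y - y_mean)) / s[:k])
    return x_mean, y_mean, beta

def cv_error(X, y, k, folds=5):
    """Mean squared prediction error from a simple interleaved k-fold split."""
    idx = np.arange(len(y))
    errors = []
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        x_mean, y_mean, beta = pcr_fit(X[train], y[train], k)
        pred = y_mean + (X[test] - x_mean) @ beta
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

def expand_until_stop(X, y, max_k=15, tol=0.01):
    """Add one component at a time; stop when the error no longer drops by tol."""
    best_k, best_err = 0, float(np.var(y))   # k = 0: predict the mean (rough baseline)
    for k in range(1, max_k + 1):
        err = cv_error(X, y, k)
        if err > best_err * (1 - tol):       # assumed stopping rule: under 1% improvement
            break
        best_k, best_err = k, err
    return best_k, best_err

# Synthetic data (an assumption): 100 samples, 60 variables, 5 underlying factors.
rng = np.random.default_rng(0)
T = rng.normal(size=(100, 5)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0])
P = np.linalg.qr(rng.normal(size=(60, 5)))[0]          # orthonormal loadings
X = T @ P.T + 0.05 * rng.normal(size=(100, 60))
y = T.sum(axis=1) + 0.1 * rng.normal(size=100)

best_k, best_err = expand_until_stop(X, y)
print("components kept:", best_k, "cross-validated error:", round(best_err, 4))
```

The design choice here is that the stopping decision rests on predictive error for samples left out of the fit, not on significance tests of individual parameters; such tests can then be carried out within the model that was retained.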

The program packages in statistics have created procedures and views that are not sound. As a standard, the packages automatically compute the full-rank solution to the model specified, although this might not be desirable. Only in the case of numerical singularity are users informed of possible problems with the solution obtained. But there may be problems with the solution even though it can easily be computed. Another problem is that users of the packages tend to look at the parameter estimates and their standard deviations. If the parameters do not differ significantly from zero, they are typically removed, although a closer study may indicate that they are good to use. In the example above the company may work with 60 variables and a model dimension of, say, 5. In this case it is possible to find 5 variables from which the other 55 can be computed mathematically. Thus the parameters of the majority of the variables do not deviate significantly from zero, but the 60 together are the collection that gives the best predictions.
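The situation with 60 variables and a model dimension of 5 can be imitated on synthetic data. The sketch below is an illustration under assumptions, not the company data referred to in the text: the data are generated from 5 latent factors, the first 5 columns are arbitrarily taken as the '5 variables', and |t| > 2 is used as a rough significance threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 100, 60, 5                     # samples, variables, model dimension

# Synthetic stand-in for the data: 60 variables driven by 5 latent factors.
T = rng.normal(size=(n, r))
P = rng.normal(size=(p, r))
X = T @ P.T + 0.001 * rng.normal(size=(n, p))
y = T @ np.ones(r) + 0.1 * rng.normal(size=n)

Xc, yc = X - X.mean(axis=0), y - y.mean()

# 1. Five components carry almost all of the variation in X.
s = np.linalg.svd(Xc, compute_uv=False)
explained = np.cumsum(s**2) / np.sum(s**2)

# 2. Five of the variables reproduce the other 55 almost exactly.
coef, *_ = np.linalg.lstsq(Xc[:, :r], Xc[:, r:], rcond=None)
resid = Xc[:, r:] - Xc[:, :r] @ coef
r2 = 1 - resid.var(axis=0) / Xc[:, r:].var(axis=0)

# 3. The full-rank least-squares solution can be computed without any warning,
#    yet few of its 60 parameters look significant.
beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
sigma2 = np.sum((yc - Xc @ beta) ** 2) / (n - p - 1)
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xc.T @ Xc)))
t = beta / se
n_significant = int(np.sum(np.abs(t) > 2))

print("variation explained by 5 components:", explained[r - 1])
print("median R^2 of the 55 variables given 5:", np.median(r2))
print("parameters with |t| > 2:", n_significant, "of", p)
```

No singularity is reported in step 3, which matches the point that a solution may be problematic even though it is easily computed. Removing the variables whose parameters look insignificant would here discard most of the 60.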

The modern tasks of predictions derived from mathematical models require new ways of thinking in applying models to data.
