Requirements for data analysis in research and industry
In research and industry, certain basic features and objectives recur whenever data analysis is needed. The main ones are discussed below.
Many variables. It is common for many variables to enter the analysis. An example is a NIR (near-infrared) instrument, which may automatically give 1056 data values each time an object is measured. This results in 1056 variables that enter the model. Other optical measurement instruments (Raman and others) may give 3200 measurement values. Digitization of images may similarly result in thousands or tens of thousands of variables.
Few samples (objects). It is common that only a few objects are measured. There can be many reasons for this. It may be time-consuming or expensive to measure an object. It may also be company policy not to measure too many objects. For example, a company producing measurement equipment wants the data to reflect the situation at the customer, and the customer may typically work with only a small number of objects at a time.
Important to find appropriate variables. Even though the instrument gives many variables, it may not be appropriate to use all of them in a given modelling task. For example, when NIR data are used to analyse the chemical composition of materials (oil, milk, etc.), it is often best to use only around 10% of the variables. The remaining 90% or so do not contain ‘predictive’ information: if they were included, the model would give worse predictions than the model without them.
Reduced rank for inference from data. Typically the useful rank of the data is much smaller than the number of variables. A company that works intensively with NIR data reports the following experience: out of 1056 variables, it is adequate to work with only 60 to 140 variables and to use only 4 to 8 latent variables. This is also the typical situation when the H-method is applied to the case where the X-data (design or instrumental data) are NIR data and the Y-data (the response or output data) are chemical concentrations. It is important to note that obtaining the full inverse or generalized inverse is usually not a numerical problem.
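The idea of a useful rank can be illustrated with a minimal sketch. It is not the H-method; it only counts how many principal directions carry 99% of the total sum of squares in a small, made-up data set. The data (4 samples, 8 variables, two strong and two weak latent directions), the 99% threshold, and the names `gram` and `leading_eigenvalues` are assumptions chosen for illustration. Only the Python standard library is used; in practice a library SVD routine would normally replace the hand-written power iteration.

```python
import math

# Hypothetical toy data: 4 samples x 8 variables built from two strong latent
# directions (strengths 10 and 5) and two weak, noise-like ones (0.1 and 0.05).
strengths = [10.0, 5.0, 0.1, 0.05]
scores = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
loadings = [[1, 1, 1, 1, 1, 1, 1, 1], [1, -1, 1, -1, 1, -1, 1, -1],
            [1, 1, -1, -1, 1, 1, -1, -1], [1, 1, 1, 1, -1, -1, -1, -1]]
X = [[sum(s * t[i] / 2 * p[j] / math.sqrt(8)
          for s, t, p in zip(strengths, scores, loadings))
      for j in range(8)] for i in range(4)]


def gram(X):
    """Sample-by-sample matrix X X^T (small when there are few samples)."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in X] for ri in X]


def leading_eigenvalues(G, k, iters=200):
    """Largest k eigenvalues of a symmetric positive semi-definite matrix,
    found by power iteration with deflation.
    Assumes G has at least k non-zero eigenvalues."""
    n = len(G)
    G = [row[:] for row in G]  # work on a copy
    values = []
    for _ in range(k):
        v = [2.0 ** i for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))
        values.append(lam)
        for i in range(n):  # remove the direction just found
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return values


G = gram(X)
total = sum(G[i][i] for i in range(len(G)))  # total sum of squares in X
eigenvalues = leading_eigenvalues(G, len(G))

# Count the directions needed to reach 99% of the total sum of squares.
explained, useful_rank = 0.0, 0
for lam in eigenvalues:
    explained += lam
    useful_rank += 1
    if explained / total >= 0.99:
        break

print("eigenvalues:", [round(lam, 4) for lam in eigenvalues])
print("directions needed for 99%:", useful_rank, "of", len(X[0]), "variables")
```

The eigenvalues of X X^T are the squared singular values of X, so the count is the same as one obtained from an SVD of X. With real instrument data the number of latent variables would be chosen by prediction performance rather than by a fixed percentage.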
Prediction is the objective of primary interest. In research and industry the primary interest is how good the predictions obtained from the estimated model are. Statisticians often emphasize the importance of interpreting the solution values. In many cases, perhaps in most, the interpretation of parameter values that is standard in statistical program packages is not reliable. One example is the following. Typically, parameter estimates are listed together with their standard deviations. From this list of pairs, people judge which variables are significant and which are not. People tend to forget that these interpretations are marginal ones. Two variables may both be significant, but if one of them is removed from the model, the other may turn out not to be significant. Stepwise search procedures are another example: they are very popular, but their results depend heavily on the data and may not be reliable when new data become available.
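The dependence of significance on the other variables in the model can be shown with a small constructed example. The sketch below is an illustration under stated assumptions, not output from any statistics package: the eight data points are invented so that y equals x1 minus x2 plus a small disturbance, and the helper names `t_values_two` and `t_value_one` are hypothetical. It computes ordinary least-squares t-values (estimate divided by standard error) with and without x1 in the model.

```python
import math


def center(v):
    """Subtract the mean, which is equivalent to fitting an intercept."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def t_values_two(x1, x2, y):
    """t-values of both slopes in the model y ~ 1 + x1 + x2."""
    x1, x2, y = center(x1), center(x2), center(y)
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * dot(x1, y) - s12 * dot(x2, y)) / det
    b2 = (s11 * dot(x2, y) - s12 * dot(x1, y)) / det
    resid = [yi - b1 * a - b2 * b for yi, a, b in zip(y, x1, x2)]
    s2 = dot(resid, resid) / (len(y) - 3)  # residual variance
    return b1 / math.sqrt(s2 * s22 / det), b2 / math.sqrt(s2 * s11 / det)


def t_value_one(x, y):
    """t-value of the slope in the model y ~ 1 + x."""
    x, y = center(x), center(y)
    sxx = dot(x, x)
    b = dot(x, y) / sxx
    resid = [yi - b * xi for yi, xi in zip(y, x)]
    s2 = dot(resid, resid) / (len(y) - 2)
    return b / math.sqrt(s2 / sxx)


# Invented data: x1 and x2 are strongly correlated, and y = x1 - x2 + noise.
base = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
extra = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
noise = [0.05, 0.05, -0.05, -0.05, -0.05, -0.05, 0.05, 0.05]
x1 = [b + e for b, e in zip(base, extra)]
x2 = base
y = [e + n for e, n in zip(extra, noise)]

t_joint = t_values_two(x1, x2, y)  # both variables in the model
t_alone = t_value_one(x2, y)       # x1 removed from the model

print("t-values with both variables:", t_joint)
print("t-value of x2 alone:", t_alone)
```

In this constructed case both t-values are large in absolute value when the two variables are fitted together, while x2 on its own shows essentially no relation to y. A list of estimates and standard deviations therefore describes the variables only within the particular model that was fitted.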
Difficult to specify a detailed model. It is usually difficult for the experimenter to specify a detailed model for a given data set. It is part of the tradition of the natural sciences that one should formulate a model that reflects the physics, chemistry, mechanics, etc. of the situation. But typically the available knowledge is only vague. Knowledge of the situation is obtained by studying the data and the variation they show. From this information new models are built with the purpose of getting good predictions.
Important to test the solution found. In practice it is important to study the sensitivity of the solution. This is often done by cross-validation, where part of the data is excluded, a model is estimated from the remaining data, and the excluded part is then predicted by that model. An example is the case where the data are geometrically situated like ‘a comet with a long tail’. The model may look good, but if the data that correspond to the ‘tail’ are removed, the new solution may be bad.
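A minimal sketch of both points, assuming a straight-line model and invented data: six ‘head’ points with almost no trend and two distant ‘tail’ points. The function names `fit_line` and `loo_press` are hypothetical. Leave-one-out is used here as the simplest form of cross-validation; the text does not prescribe how large the excluded part should be.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def loo_press(x, y):
    """Leave-one-out cross-validation: each point is excluded in turn,
    the line is fitted to the rest, and the squared prediction errors
    for the excluded points are summed."""
    press = 0.0
    for i in range(len(x)):
        x_train, y_train = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a, b = fit_line(x_train, y_train)
        press += (y[i] - (a + b * x[i])) ** 2
    return press


# Invented 'comet' data: a compact head and a long tail of two points.
head_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
head_y = [0.5, 0.1, 0.6, 0.2, 0.7, 0.3]
tail_x = [6.0, 10.0]
tail_y = [6.0, 10.0]

x_all, y_all = head_x + tail_x, head_y + tail_y
a_all, b_all = fit_line(x_all, y_all)      # fit with the tail
a_head, b_head = fit_line(head_x, head_y)  # fit without the tail

# Error of the fitted line on its own data versus cross-validated error.
rss_all = sum((yi - (a_all + b_all * xi)) ** 2 for xi, yi in zip(x_all, y_all))
press_all = loo_press(x_all, y_all)

# How well does the head-only model predict the far end of the tail?
tail_error = tail_y[1] - (a_head + b_head * tail_x[1])

print("slope with tail:", b_all, " slope without tail:", b_head)
print("fit error:", rss_all, " cross-validated error:", press_all)
print("head-only model, error at the far tail point:", tail_error)
```

With these numbers the slope is close to 1 when the tail is included and close to 0 when it is removed, so the apparent relation rests almost entirely on two objects. The cross-validated error exceeds the error of the fitted line on its own data because the tail points are poorly predicted when left out.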
Graphic illustrations of variation in data. It is important to study the inherent variation in the data graphically. There is often unexpected variation that only appears when the data are studied more closely in plots. An example is the case where the data fall into groups. The main variation might then be the variation between groups. If the data are not analysed graphically, one might not detect the variation within the groups, which might be the important one.
Presentation of results in simple terms. Scientists and people in industry want a simple and clear presentation of the results of the modelling task. The main emphasis should be on the prediction tasks derived from the model. This is the opposite of what statistical program packages offer: in many cases one must be an expert in the specific methods in order to interpret the presented results correctly.
Applications of the H-method satisfy the modelling requirements described above.