- Nonlinear feature selection using the general mutual information (2008)
- In the context of information theory, the term Mutual Information has first been formulated by Claude Elwood Shannon. Information theory is the consistent mathematical description of technical communication systems. To this day, it is the basis of numerous applications in modern communications engineering and yet became indispensable in this field. This work is concerned with the development of a concept for nonlinear feature selection from scalar, multivariate data on the basis of the mutual information. From the viewpoint of modelling, the successful construction of a realistic model depends highly on the quality of the employed data. In the ideal case, high quality data simply consists of the relevant features for deriving the model. In this context, it is important to possess a suitable method for measuring the degree of the, mostly nonlinear, dependencies between input- and output variables. By means of such a measure, the relevant features could be specifically selected. During the course of this work, it will become evident that the mutual information is a valuable and feasible measure for this task and hence the method of choice for practical applications. Basically and without the claim of being exhaustive, there are two possible constellations that recommend the application of feature selection. On the one hand, feature selection plays an important role, if the computability of a derived system model cannot be guaranteed, due to a multitude of available features. On the other hand, the existence of very few data points with a significant number of features also recommends the employment of feature selection. The latter constellation is closely related to the so called "Curse of Dimensionality". The actual statement behind this is the necessity to reduce the dimensionality to obtain an adequate coverage of the data space. In other word, it is important to reduce the dimensionality of the data, since the coverage of the data space exponentially decreases, for a constant number of data points, with the dimensionality of the available data. In the context of mapping between input- and output space, this goal is ideally reached by selecting only the relevant features from the available data set. The basic idea for this work has its origin in the rather practical field of automotive engineering. It was motivated by the goals of a complex research project in which the nonlinear, dynamic dependencies among a multitude of sensor signals should be identified. The final goal of such activities was to derive so called virtual sensors from identified dependencies among the installed automotive sensors. This enables the real-time computability of the required variable without the expenses of additional hardware. The prospect of doing without additional computing hardware is a strong motive force in particular in automotive engineering. In this context, the major problem was to find a feasible method to capture the linear- as well as the nonlinear dependencies. As mentioned before, the goal of this work is the development of a flexibly applicable system for nonlinear feature selection. The important point here is to guarantee the practicable computability of the developed method even for high dimensional data spaces, which are rather realistic in technical environments. The employed measure for the feature selection process is based on the sophisticated concept of mutual information. The property of the mutual information, regarding its high sensitivity and specificity to linear- and nonlinear statistical dependencies, makes it the method of choice for the development of a highly flexible, nonlinear feature selection framework. In addition to the mere selection of relevant features, the developed framework is also applicable for the nonlinear analysis of the temporal influences of the selected features. Hence, a subsequent dynamic modelling can be performed more efficiently, since the proposed feature selection algorithm additionally provides information about the temporal dependencies between input- and output variables. In contrast to feature extraction techniques, the developed feature selection algorithm in this work has another considerable advantage. In the case of cost intensive measurements, the variables with the highest information content can be selected in a prior feasibility study. Hence, the developed method can also be employed to avoid redundance in the acquired data and thus prevent for additional costs.
- Real-valued feature selection for process approximation and prediction (2009)
- The selection of features for classification, clustering and approximation is an important task in pattern recognition, data mining and soft computing. For real-valued features, this contribution shows how feature selection for a high number of features can be implemented using mutual in-formation. Especially, the common problem for mutual information computation of computing joint probabilities for many dimensions using only a few samples is treated by using the Rènyi mutual information of order two as computational base. For this, the Grassberger-Takens corre-lation integral is used which was developed for estimating probability densities in chaos theory. Additionally, an adaptive procedure for computing the hypercube size is introduced and for real world applications, the treatment of missing values is included. The computation procedure is accelerated by exploiting the ranking of the set of real feature values especially for the example of time series. As example, a small blackbox-glassbox example shows how the relevant features and their time lags are determined in the time series even if the input feature time series determine nonlinearly the output. A more realistic example from chemical industry shows that this enables a better ap-proximation of the input-output mapping than the best neural network approach developed for an international contest. By the computationally efficient implementation, mutual information becomes an attractive tool for feature selection even for a high number of real-valued features. Key words: feature selection, process approximation, Rènyi mutual information, classification, clustering, Takens-Grassberger correlation integral