Feature Selection Toolbox

Introduction

Dimensionality Reduction
Feature Selection aims at removing irrelevant or redundant dimensions, which simplifies the construction of decision rules, reduces data acquisition cost, and makes the resulting decision rules more robust.

Pattern recognition systems are steadily gaining importance as the number of scenarios in which they can help reveal important but otherwise inaccessible information grows. This is made possible partly by the ever-growing performance and proliferation of computers, and partly by advances in theory. Pattern recognition is applicable in a vast variety of fields.

Traditionally, one of the key issues in pattern recognition system design has been the identification of a set of distinctive pattern properties, referred to as features, that can be used as a basis for pattern discrimination. Such features are selected from among a set of available measurements by mathematical tools which allow the designer to measure the discriminatory content of a feature set.

The early motivation for feature selection was largely pragmatic. Given the constraints imposed by the primitive computing hardware of the time, it was essential to reduce the dimensionality of pattern representation so as to minimise the complexity of the decision-making process. However, it was later discovered that reducing the dimensionality of the feature space could also have a beneficial effect on recognition performance. In fact, it was noted that adding features would initially enhance system performance by virtue of the complementarity of the information made available to the classification stage. At some point, however, the augmentation of the pattern representation space would start exhibiting the law of diminishing returns, resulting in a gradual performance degradation. This "peaking" phenomenon is a manifestation of an intricate relationship between the dimensionality of the pattern space and the size of the training set. As the dimensionality of pattern representation increases, more degrees of freedom become available. Learning the decision rule of a classifier with many free parameters from a finite training set then gives scope for over-fitting and therefore poor generalisation.
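
The peaking effect is easy to reproduce on synthetic data. The following is a minimal sketch, assuming NumPy and scikit-learn are available; the data set, classifier, and sample sizes are illustrative choices, not taken from the text. With a deliberately small training set, test accuracy typically improves while the first informative features are added and then deteriorates as noisy ones accumulate.

```python
# Minimal sketch of the "peaking" phenomenon: with a finite training set,
# adding more and more (mostly noisy) features first helps, then hurts.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 5 informative features followed by 45 pure-noise features (shuffle=False
# keeps the informative ones first, so growing d adds noise once d > 5).
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=60, random_state=0)

for d in (1, 2, 5, 10, 20, 50):
    clf = KNeighborsClassifier().fit(X_tr[:, :d], y_tr)
    print(f"d={d:2d}  test accuracy={clf.score(X_te[:, :d], y_te):.3f}")
```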

The peaking phenomenon has motivated the development of feature selection methodology over the last two decades. This motivation has been reinforced by the biological or psychophysical equivalent present in human decision making, whereby pattern recognition carried out by the central nervous system appears to be based on the most significant attributes only. It also has a basis in logic, as argued by Watanabe in his seminal text Knowing and Guessing [33]. Accordingly, any two objects would be equally similar or dissimilar unless one applied unequal weighting to the list of attributes that could potentially be observed. The feature selection process effectively imposes zero-one weights on the available attributes and therefore sharpens the distinctiveness of object classes.

We acknowledge the arguments put forward by Vapnik [34] that the latest paradigm in classifier design, based on the Support Vector concept, avoids the problem of peaking, which would suggest that feature selection is not needed. While independent experimental research, e.g. [35, 36], supports Vapnik's theoretical predictions, it has also been found that equally effective pattern recognition systems can be designed using conventional approaches. Such designs are likely to differ from and complement Support Vector Machines, and they will help generate the diversity of classifiers needed for multiple classifier systems, which offer a path to improved performance.

Formulation of the Feature Selection Problem

Floating Search can be considered one of the standard feature selection algorithms due to its good overall properties: it yields results close to the optimum, yet it does not over-fit as much as more complex methods.
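
For illustration, here is a minimal sketch of the sequential forward floating selection (SFFS) idea, not the Toolbox implementation; `criterion` stands for any user-supplied subset evaluation function that returns higher values for better subsets.

```python
# Minimal sketch of Sequential Forward Floating Selection (SFFS).
# `criterion(subset)` is assumed to return a higher value for better subsets.
def sffs(criterion, n_features, target_d):
    selected = []                      # current feature subset
    best = {}                          # best (value, subset) found per size
    while len(selected) < target_d:
        # Inclusion: add the feature that improves the criterion the most.
        remaining = [f for f in range(n_features) if f not in selected]
        f_add = max(remaining, key=lambda f: criterion(selected + [f]))
        selected = selected + [f_add]
        if len(selected) not in best or criterion(selected) > best[len(selected)][0]:
            best[len(selected)] = (criterion(selected), list(selected))
        # Conditional exclusion ("floating" step): remove the least significant
        # feature as long as this yields a better subset of that size than any
        # subset of the same size found so far.
        while len(selected) > 2:
            f_rem = max(selected,
                        key=lambda f: criterion([g for g in selected if g != f]))
            reduced = [g for g in selected if g != f_rem]
            if criterion(reduced) > best[len(reduced)][0]:
                selected = reduced
                best[len(selected)] = (criterion(selected), list(selected))
            else:
                break
    return best[target_d][1]
```

With `criterion` defined, for example, as the cross-validated accuracy of a fixed classifier restricted to the candidate columns, `sffs(criterion, D, d)` returns a d-element subset of {0, ..., D-1}.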

Following the statistical approach to pattern recognition, we assume that a pattern described by a real D-dimensional vector X = (x1, x2, ..., xD)^T, X ∈ R^D, is to be classified into one of a finite set of C different classes Ω = {ω1, ω2, ..., ωC}. The patterns are supposed to occur randomly according to some true class-conditional probability density functions (pdfs) p*(X|ω) and the respective a priori probabilities P*(ω). Since the class-conditional pdfs and the a priori class probabilities are seldom specified in practice, it is necessary to estimate these functions from training sets of samples with known classification.
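
As a concrete illustration of this plug-in estimation step, the sketch below assumes Gaussian class-conditional densities, a modelling assumption made here purely for the example; it estimates P(ω) and the parameters of p(X|ω) from a labelled training set and applies the resulting Bayes decision rule.

```python
# Minimal sketch: plug-in estimates of the priors P(ω) and of Gaussian
# class-conditional densities p(X|ω), followed by the Bayes decision rule.
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classes(X, y):
    """X: (n_samples, D) training patterns, y: known class labels."""
    model = {}
    for label in np.unique(y):
        Xc = X[y == label]
        model[label] = {
            "prior": len(Xc) / len(X),       # estimate of P(ω)
            "mean": Xc.mean(axis=0),         # parameters of p(X|ω)
            "cov": np.cov(Xc, rowvar=False),
        }
    return model

def classify(x, model):
    # Assign x to the class maximizing the estimated P(ω) * p(x|ω).
    return max(model, key=lambda c: model[c]["prior"] *
               multivariate_normal.pdf(x, model[c]["mean"], model[c]["cov"]))
```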

The aim of dimensionality reduction is to find a set of d features, where d < D (if possible d ≪ D), so as to maximize an adopted criterion function. This goal requires: (a) a criterion function capable of evaluating the effectiveness of candidate feature subsets, and (b) a search procedure for finding the feature subset that optimizes the criterion.

While the above-named requirements are treated rather well in pattern recognition theory, it is fair to admit that the problem of determining how many features one should select (the cardinality d) for the problem at hand is not yet satisfactorily solved. Usually our aim is to achieve the lowest misclassification rate with the selected features. Thus, if the probability of error (misclassification rate) is used as the feature subset evaluation criterion, the optimal d is determined immediately when minimizing the criterion. In practice, however, we often have to resort to other criteria, only indirectly related to the probability of error, and the optimal d has to be determined experimentally according to the results on an independent test set.
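
A minimal sketch of this experimental procedure follows, assuming scikit-learn and a hypothetical `select(X, y, d)` helper (any feature selector returning d column indices, e.g. the SFFS sketch above); the k-NN classifier is merely an illustrative choice. Each cardinality is evaluated on the independent test set and the d with the lowest error estimate is retained.

```python
# Minimal sketch: determining the cardinality d experimentally from the error
# on an independent test set. `select(X, y, d)` is any feature selector.
from sklearn.neighbors import KNeighborsClassifier

def pick_cardinality(select, X_tr, y_tr, X_te, y_te, max_d):
    error = {}
    for d in range(1, max_d + 1):
        subset = list(select(X_tr, y_tr, d))
        clf = KNeighborsClassifier().fit(X_tr[:, subset], y_tr)
        error[d] = 1.0 - clf.score(X_te[:, subset], y_te)   # test error rate
    best_d = min(error, key=error.get)
    return best_d, error
```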

The set of d features can either be formed as a subset of the original set of D measured pattern vector components, found by feature selection, or alternatively be derived from them by a certain transformation. The latter process is referred to as feature extraction. In this paper we concentrate on feature selection only. For the practical differences between feature selection and feature extraction see e.g. [37].

Assuming that a suitable criterion function has been chosen to evaluate the effectiveness of feature subsets, feature selection is reduced to a search for the optimal feature subset based on the selected measure.
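
For small D this search can even be exhaustive; a minimal sketch (with `criterion` again standing for any user-supplied subset evaluation measure) is:

```python
# Minimal sketch: optimal feature selection by exhaustive search. Feasible
# only for small D, since the number of d-element subsets grows combinatorially.
from itertools import combinations

def exhaustive_selection(criterion, n_features, d):
    """Return the d-element tuple of feature indices maximizing `criterion`."""
    return max(combinations(range(n_features), d), key=criterion)
```

For larger D one has to resort to sub-optimal strategies such as the sequential and floating searches mentioned above.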

State of the Art

The framework of various approaches to feature selection, as well as the battery of available feature selection criteria and search strategies, has grown rapidly in recent years. It is beyond the scope of this short summary to provide details; see the References section for suggestions for further reading.

Despite all the recent research effort, the problem of feature selection can hardly be considered solved in its entirety. To conclude, let us summarize some of the pending problems: