News
- 9th September 2012: FST 3.1.1 corrects an indexing bug in the Sparse ARFF input filter; the filter is now compliant with the Sparse ARFF specification. The standard ARFF filter is now less sensitive to header formatting and accepts more ARFF files right away.
- 30th March 2011: The most significant update planned for 2011 has been made public. See History for details on what is new in FST 3.1.
- 13th January 2011: Added Exhaustive Search (standard + threaded versions) to enable optimal feature selection. Improved result tracking and LibSVM 3.0 support.
- 2nd November 2010: Improved support for the ARFF data format. All code polished to enable compilation under Visual C++ (in addition to Linux/Cygwin gcc).
Future
- additional criteria + additional search schemes (Simulated Annealing, ...)
- hierarchical sub-space access (to enable FS method chaining)
- regression-based wrappers
- mixture models with embedded feature selection
- ...your suggestions?
Feature Selection Toolbox 3 (FST3) is a standalone, widely applicable C++ library for feature selection (FS, also known as attribute or variable selection). It reduces problem dimensionality to maximize the accuracy of data models and the performance of automatic decision rules, as well as to reduce data acquisition cost. The library can be used both in research and in industry. Less experienced users can experiment with the provided methods and their application to real-life problems; experts can implement their own criteria or search schemes, taking advantage of the toolbox framework.
FST3 key functionality:
- (Threaded) highly effective subset search methods to tackle computational complexity
- Wrappers, filters & hybrid methods, deterministic and/or randomized (a wrapper-based search sketch follows this list)
- Specialized methods for very-high-dimensional feature selection
- Anti-overfitting measures: criteria ensembles, result regularization etc.
- Stability evaluation
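To make the wrapper approach above concrete, here is a minimal C++ sketch of wrapper-based Sequential Forward Selection. This is illustrative only, not FST3 code or its API: the Criterion callable stands in for any subset-quality estimator (e.g., cross-validated k-NN accuracy), and the names sfs and Criterion are invented for this example.

// Minimal sketch of wrapper-based Sequential Forward Selection (SFS).
// NOT FST3's API; all names here are illustrative.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// A wrapper criterion: any callable estimating the quality (e.g.,
// classification accuracy) of a candidate feature subset.
using Criterion = std::function<double(const std::vector<std::size_t>&)>;

// Greedily add the feature that most improves the criterion until the
// requested subset size is reached.
std::vector<std::size_t> sfs(std::size_t dim, std::size_t target,
                             const Criterion& crit) {
    target = std::min(target, dim);
    std::vector<std::size_t> selected;
    std::vector<bool> used(dim, false);
    while (selected.size() < target) {
        std::size_t best = dim;
        double bestValue = -1e300;
        for (std::size_t f = 0; f < dim; ++f) {
            if (used[f]) continue;
            selected.push_back(f);
            double v = crit(selected);   // evaluate the candidate subset
            selected.pop_back();
            if (v > bestValue) { bestValue = v; best = f; }
        }
        used[best] = true;
        selected.push_back(best);
    }
    return selected;
}

int main() {
    // Toy criterion favoring low feature indices; in practice this would
    // be an accuracy estimate from a classifier such as k-NN or SVM.
    Criterion toy = [](const std::vector<std::size_t>& s) {
        double v = 0.0;
        for (std::size_t f : s) v += 1.0 / (1.0 + f);
        return v;
    };
    for (std::size_t f : sfs(10, 3, toy)) std::cout << f << ' ';
    std::cout << '\n';   // prints: 0 1 2
}

A real wrapper re-trains and tests the classifier inside the criterion for every candidate subset, which is why FST3's threaded search variants matter for computational cost.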
FST3 (v3.1) functionality in more detail:
- highly customizable, templated, threaded C++ code, using the Boost library
- feature selection criteria
- wrapper - classification accuracy estimation, see data access options below
- normal Bayes classifier
- k-Nearest Neighbor classifier (based on various L-distances)
- Support Vector Machine (optional; depends on the external LibSVM library)
- filter - normal model based
- Bhattacharyya distance (a simplified sketch appears after this list)
- Divergence
- Generalized Mahalanobis distance
- filter - multinomial model based - Bhattacharyya, Mutual Information
- criteria ensembles
- hybrids
- feature selection methods
- individual ranking (BIF, best individual features)
- DAF, dependency-aware feature ranking
- sequential search (hill-climbing) - standard or generalized
- (G)SFS/SBS, sequential selection (restricted/unrestricted)
- (G)SFFS/SBFS, floating search (restricted/unrestricted; see the SFFS sketch after this list)
- (G)OS, oscillating search (deterministic, randomized, restricted/unrestricted)
- (G)DOS, dynamic oscillating search (deterministic, randomized, restricted/unrestricted)
- (G)SFRS/SBRS, retreating search (restricted/unrestricted)
- in any of the above: threaded, sequential, hybrid or ensemble-based feature preference evaluation
- Branch & Bound algorithms (optimal)
- BBB, Basic Branch & Bound
- IBB, Improved Branch & Bound
- BBPP, Branch & Bound with Partial Prediction
- FBB, Fast Branch & Bound
- exhaustive search (optimal)
- supporting techniques (freely combinable with methods above)
- subset size optimization vs. subset size as user parameter
- result regularization (preferring solutions with a slightly lower criterion value to counter over-fitting)
- feature acquisition cost minimization
- feature selection process stability evaluation
- two-process similarity evaluation (to determine impact of parameter change etc.)
- classifier bias estimation
- methods specifically suitable for very-high-dimensional feature selection
- individual ranking (BIF, best individual features)
- DAF, dependency-aware feature ranking
- OS, oscillating search (set to low oscillation depth)
- flexible data processing
- nested multi-level sampling (splitting into training, validation, test and possibly other data parts)
- sampling through extendable objects (includes re-substitution, cross-validation, hold-out, leave-one-out, random sampling, etc.; see the splitting sketch after this list)
- normalization through extendable objects (interval shrinking, whitening)
- missing data substitution
- support for textual data formats TRN (see FST1) and ARFF
- library is free for non-commercial use
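The Bhattacharyya distance listed above among the normal-model filters can be illustrated with a short, hedged sketch. It is not FST3 code: the general formula B = 1/8 (m1-m2)' S^-1 (m1-m2) + 1/2 ln( det S / sqrt(det S1 det S2) ), with S = (S1+S2)/2, is restricted here to diagonal covariances for brevity, and the name bhattacharyyaDiag is invented for this example. A larger value means better separation of the two class-conditional normal densities on the evaluated features.

// Bhattacharyya distance between two normal densities with diagonal
// covariances (a simplification; FST3 implements the full normal model).
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

double bhattacharyyaDiag(const std::vector<double>& mu1,
                         const std::vector<double>& var1,
                         const std::vector<double>& mu2,
                         const std::vector<double>& var2) {
    double quad = 0.0, logdet = 0.0;
    for (std::size_t i = 0; i < mu1.size(); ++i) {
        const double v = 0.5 * (var1[i] + var2[i]);            // pooled variance
        const double d = mu1[i] - mu2[i];
        quad   += d * d / v;                                   // mean-separation term
        logdet += std::log(v / std::sqrt(var1[i] * var2[i])); // covariance-mismatch term
    }
    return 0.125 * quad + 0.5 * logdet;
}

int main() {
    // Two well-separated 1-D classes with unit variance: B = 9/8.
    std::cout << bhattacharyyaDiag({0.0}, {1.0}, {3.0}, {1.0}) << '\n'; // 1.125
}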
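The floating search referenced above ((G)SFFS) admits a similarly compact sketch. This is a simplified, illustrative version, not FST3's implementation: after each forward step it repeatedly removes the least significant feature as long as the result beats the best criterion value recorded for the smaller subset size; this per-size bookkeeping is what keeps the algorithm from cycling.

// Simplified sketch of Sequential Forward Floating Selection (SFFS).
// Illustrative only; not FST3's interface.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

using Subset = std::vector<std::size_t>;
using Criterion = std::function<double(const Subset&)>;

Subset sffs(std::size_t dim, std::size_t target, const Criterion& crit) {
    target = std::min(target, dim);
    Subset current;
    std::vector<bool> used(dim, false);
    // bestAtSize[k] = best criterion value seen so far for subsets of size k.
    std::vector<double> bestAtSize(target + 1, -1e300);
    while (current.size() < target) {
        // Forward step: add the most significant remaining feature.
        std::size_t add = dim;
        double addVal = -1e300;
        for (std::size_t f = 0; f < dim; ++f) {
            if (used[f]) continue;
            Subset t = current; t.push_back(f);
            const double v = crit(t);
            if (v > addVal) { addVal = v; add = f; }
        }
        used[add] = true;
        current.push_back(add);
        bestAtSize[current.size()] = std::max(bestAtSize[current.size()], addVal);
        // Floating (backward) steps: drop the least significant feature while
        // doing so beats the best value recorded for the smaller size.
        while (current.size() > 2) {
            std::size_t rem = current.size();
            double remVal = -1e300;
            for (std::size_t i = 0; i < current.size(); ++i) {
                Subset t = current; t.erase(t.begin() + i);
                const double v = crit(t);
                if (v > remVal) { remVal = v; rem = i; }
            }
            if (remVal <= bestAtSize[current.size() - 1]) break;
            used[current[rem]] = false;
            current.erase(current.begin() + rem);
            bestAtSize[current.size()] = remVal;
        }
    }
    return current;
}

int main() {
    // Same kind of toy criterion as in the SFS sketch earlier.
    Criterion toy = [](const Subset& s) {
        double v = 0.0;
        for (std::size_t f : s) v += 1.0 / (1.0 + f);
        return v;
    };
    for (std::size_t f : sffs(10, 4, toy)) std::cout << f << ' ';
    std::cout << '\n';
}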
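Likewise, the extendable sampling objects under flexible data processing can be pictured with a small sketch. Only the simplest case is shown, a deterministic k-fold cross-validation split over sample indices; kfoldSplit is a hypothetical name, and FST3's actual sampling objects additionally cover nesting, hold-out, leave-one-out, random sampling, etc.

// Hypothetical k-fold cross-validation index split (illustrative only).
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Return (training indices, test indices) for the given fold (0-based).
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
kfoldSplit(std::size_t nSamples, std::size_t k, std::size_t fold) {
    std::vector<std::size_t> train, test;
    for (std::size_t i = 0; i < nSamples; ++i) {
        if (i % k == fold) test.push_back(i);   // every k-th sample is held out
        else               train.push_back(i);
    }
    return {std::move(train), std::move(test)};
}

int main() {
    auto [train, test] = kfoldSplit(10, 3, 0);  // needs C++17
    std::cout << "test fold 0:";
    for (std::size_t i : test) std::cout << ' ' << i;   // 0 3 6 9
    std::cout << '\n';
}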