Feature Selection Toolbox (FST3) Library / Usage
Basic sub-optimal scenarios - general purpose
Example 10: Basic Filter-based feature selection.
This is a simple example of filter-based feature selection, as described in [4]. Features are selected here using the Sequential Forward Selection (SFS) procedure so as to maximize the Generalized Mahalanobis probabilistic class distance, which assumes normally distributed data. The Generalized Mahalanobis criterion is evaluated on the first 50% of data samples. The selected subset is eventually verified by means of 3-NN classifier accuracy estimation on the second 50% (independent test) part of the data. SFS is called here in the d-parametrized setting, invoked by a nonzero parameter d in search(d,...). In this scenario the user has to decide the target subset size.
Note:
With arbitrary data the assumption of normality may not be fulfilled, which would negatively affect feature selection results based on the Mahalanobis, Bhattacharyya or Divergence criteria.
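For intuition, the underlying two-class quantity is the Mahalanobis distance D^2 = (m1-m2)^T * inv(S) * (m1-m2) between class means m1, m2 under a pooled covariance S; FST3's Criterion_Normal_GMahalanobis computes a generalized multiclass variant internally. The following self-contained sketch, with made-up means and covariance values, illustrates only this underlying computation, independently of FST3:

// Minimal, FST3-independent sketch of the two-class Mahalanobis distance
// D^2 = (m1-m2)^T * inv(S) * (m1-m2) for a 2-dimensional example.
// The means and pooled covariance below are made-up illustration values.
#include <iostream>

int main()
{
    const double m1[2] = {1.0, 2.0};        // class 1 mean (hypothetical)
    const double m2[2] = {3.0, 1.0};        // class 2 mean (hypothetical)
    const double S[2][2] = {{2.0, 0.5},     // pooled covariance (hypothetical)
                            {0.5, 1.0}};
    // invert the 2x2 covariance matrix analytically
    const double det = S[0][0]*S[1][1] - S[0][1]*S[1][0];
    const double Si[2][2] = {{ S[1][1]/det, -S[0][1]/det},
                             {-S[1][0]/det,  S[0][0]/det}};
    const double d[2] = {m1[0]-m2[0], m1[1]-m2[1]}; // mean difference
    // quadratic form d^T * Si * d
    double D2 = 0.0;
    for(int i=0;i<2;i++) for(int j=0;j<2;j++) D2 += d[i]*Si[i][j]*d[j];
    std::cout << "Mahalanobis distance^2 = " << D2 << std::endl;
    return 0;
}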
try{
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR;
    typedef FST::Criterion_Normal_GMahalanobis<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> FILTERCRIT;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,FILTERCRIT> EVALUATOR;
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;

    std::cout << "Starting Example 10: Basic Filter-based feature selection..." << std::endl;
    // in the course of search use the first half of data for feature selection and the second half for testing using 3-NN classifier
    PSPLITTER dsp(new SPLITTER5050());
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up the normal Generalized Mahalanobis criterion
    boost::shared_ptr<FILTERCRIT> crit(new FILTERCRIT);
    crit->initialize(da); // initialization = normal model parameter estimation on training data part
    // set-up the standard sequential search step object (options: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Sequential Forward Selection search procedure
    FST::Search_SFS<RETURNTYPE,DIMTYPE,SUBSET,FILTERCRIT,EVALUATOR> srch(eval);
    srch.set_search_direction(FST::FORWARD); // try FST::BACKWARD
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *crit << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    const DIMTYPE d=7; // request subset of size d; if set to 0, cardinality will be decided in the course of search
    if(!srch.search(d,critval_train,sub,crit,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) the following line is included here just for illustration because srch.search() reports results in itself
    std::cout << std::endl << "Search result: " << std::endl << *sub << std::endl << "Criterion value=" << critval_train << std::endl << std::endl;
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 11: Wrapper-based feature selection with Floating Search.
In this example features are selected using the SFFS algorithm with 3-NN wrapper classification accuracy as the FS criterion. Classification accuracy (i.e., the FS wrapper criterion value) is estimated on the first 50% of data samples by means of 3-fold cross-validation. The final classification performance on the selected subspace is eventually validated on the second 50% of data. SFFS is called here in the d-optimizing setting, invoked by parameter 0 in search(0,...), which is otherwise used to specify the required subset size.
Note:
Floating Search (SFFS or SBFS) can be considered one of the best compromises between versatility, speed, ability to find close-to-optimum results, and robustness against overfitting. If in doubt which feature subset search method to use, and the dimensionality of your problem is not more than roughly several hundred, try SFFS.
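To make the 'floating' idea concrete before the FST3 listing below: after each forward (inclusion) step, SFFS conditionally excludes features for as long as the reduced subset beats the best subset of the same size found so far. The following toy sketch is not FST3 code; the criterion J and all numeric values are made up so that backtracking visibly pays off (plain SFS would end with features {0,1,2}, while the floating exclusion discovers that feature 0 becomes redundant):

// Toy, self-contained illustration of the SFFS control loop. NOT the FST3
// implementation; J is a synthetic subset criterion (individual merits
// minus pairwise redundancy penalties) with hand-picked values.
#include <iostream>
#include <limits>
#include <set>
#include <vector>

typedef std::set<int> Subset;

double J(const Subset &s, const std::vector<double> &w,
         const std::vector<std::vector<double> > &r)
{
    double val = 0.0;
    for(Subset::const_iterator i=s.begin(); i!=s.end(); ++i) {
        val += w[*i];
        for(Subset::const_iterator j=s.begin(); j!=s.end(); ++j)
            if(*i < *j) val -= r[*i][*j];
    }
    return val;
}

int main()
{
    const int n = 4;
    const unsigned target = 3;
    double wi[] = {0.9, 0.8, 0.7, 0.1};            // individual merits (made up)
    std::vector<double> w(wi, wi+n);
    std::vector<std::vector<double> > r(n, std::vector<double>(n, 0.0));
    r[0][1] = r[1][0] = 0.5;                        // feature 0 is redundant
    r[0][2] = r[2][0] = 0.5;                        // given features 1 and 2

    const double NEG = -std::numeric_limits<double>::max();
    std::vector<double> bestJ(n+1, NEG);            // best value per subset size
    std::vector<Subset> bestS(n+1);
    Subset cur;
    while(cur.size() < target) {
        // inclusion: add the single feature that maximizes J
        int addi = -1; double addval = NEG;
        for(int f=0; f<n; f++) if(!cur.count(f)) {
            Subset t(cur); t.insert(f);
            double v = J(t,w,r);
            if(v > addval) { addval = v; addi = f; }
        }
        cur.insert(addi);
        if(addval > bestJ[cur.size()]) { bestJ[cur.size()] = addval; bestS[cur.size()] = cur; }
        // conditional exclusion ("floating"): keep removing the least useful
        // feature while the reduced subset beats the best known of that size
        while(cur.size() > 2) {
            int remi = -1; double remval = NEG;
            for(Subset::const_iterator f=cur.begin(); f!=cur.end(); ++f) {
                Subset t(cur); t.erase(*f);
                double v = J(t,w,r);
                if(v > remval) { remval = v; remi = *f; }
            }
            if(remval > bestJ[cur.size()-1]) {
                cur.erase(remi);
                bestJ[cur.size()] = remval; bestS[cur.size()] = cur;
            } else break;
        }
    }
    std::cout << "best subset of size " << target << " (J=" << bestJ[target] << "):";
    for(Subset::const_iterator f=bestS[target].begin(); f!=bestS[target].end(); ++f)
        std::cout << " " << *f;
    std::cout << std::endl; // prints: best subset of size 3 (J=1.6): 1 2 3
    return 0;
}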
try{
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR;
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> EVALUATOR;

    std::cout << "Starting Example 11: Wrapper-based feature selection with Floating Search..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1);
    if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Sequential Forward Floating Selection search procedure
    FST::Search_SFFS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval);
    srch.set_search_direction(FST::FORWARD); // try FST::BACKWARD
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) the following line is included here just for illustration because srch.search() reports results in itself
    std::cout << std::endl << "Search result: " << std::endl << *sub << std::endl << "Criterion value=" << critval_train << std::endl << std::endl;
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl;
    // (optionally) list the best known solutions for each cardinality as recorded throughout the course of search
    std::cout << "Best recorded solution for subset size:" << std::endl;
    for(DIMTYPE d=1;d<=sub->get_n();d++)
        if(srch.get_result(d,critval_train,sub)) std::cout << d << ": val="<< critval_train << ", "<<*sub << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 11t: Threaded wrapper-based feature selection with Floating Search.
As an alternative to standard sequential evaluation of feature subset candidates, FST3 enables threaded candidate subset evaluation. All FST3 sequential search methods can be easily parallelized by using the Sequential_Step_Straight_Threaded evaluator object instead of Sequential_Step_Straight. The actual search speed gain depends on the particular problem setting. In small-sample, low-dimensional settings, or when criterion evaluation is very fast, the gain can remain negligible or even negative due to thread management overhead (this also depends considerably on the compiler). This example is the threaded version of Example 11: Wrapper-based feature selection with Floating Search, run on the 15-dimensional (i.e., low-dimensional) small-sample speech dataset, and it illustrates the threading performance issue well. When compiled under cygwin with gcc 4.3.x and Boost 1.33.1 it runs slower than the equivalent sequential code of Example 11; in contrast, when compiled under Visual C++ 2010 with Boost 1.44, it runs faster.
Note:
Threading is generally capable of substantially accelerating the search process when applied in computationally more demanding scenarios; see Example 12t: Threaded SVM-wrapper-based feature selection with Dynamic Oscillating Search or Example 25t: Fast pre-selection followed by Dynamic Oscillating Search or Example 33t: Threaded Oscillating Search in very high-dimensional feature selection or Example 40t: Threaded Exhaustive (optimal) feature selection, etc., for such cases.
The maximum permitted number of threads to run at once is to be user-specified with respect to hardware capabilities.
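A practical starting point for choosing max_threads is the number of hardware threads. The sketch below assumes a Boost version that provides boost::thread::hardware_concurrency() (newer than the 1.33.1 mentioned above); since FST3 takes max_threads as a compile-time template argument, the queried value can only guide that choice, e.g., when configuring a build:

// Minimal sketch: query a sensible thread count from the hardware.
// Requires linking Boost.Thread (e.g., -lboost_thread). Note that FST3's
// max_threads is a template parameter, i.e., a compile-time constant, so
// this run-time value can only inform the build-time setting.
#include <iostream>
#include <boost/thread.hpp>

int main()
{
    unsigned int hw = boost::thread::hardware_concurrency();
    unsigned int max_threads = (hw > 0) ? hw : 2; // 0 means "undetectable"
    std::cout << "suggested max_threads = " << max_threads << std::endl;
    return 0;
}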
try{
    const unsigned int max_threads=2;
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR;
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;
    typedef FST::Sequential_Step_Straight_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,max_threads> EVALUATOR;

    std::cout << std::endl << "Starting Example 11t: Threaded wrapper-based feature selection with Floating Search..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1);
    if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Sequential Forward Floating Selection search procedure
    FST::Search_SFFS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval);
    srch.set_search_direction(FST::FORWARD);
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) the following line is included here just for illustration because srch.search() reports results in itself
    std::cout << std::endl << "Search result: " << std::endl << *sub << std::endl << "Criterion value=" << critval_train << std::endl << std::endl;
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl;
    // (optionally) list the best known solutions for each cardinality as recorded throughout the course of search
    std::cout << "Best recorded solution for subset size:" << std::endl;
    for(DIMTYPE d=1;d<=sub->get_n();d++)
        if(srch.get_result(d,critval_train,sub)) std::cout << d << ": val="<< critval_train << ", "<<*sub << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 12t: Threaded SVM-wrapper-based feature selection with Dynamic Oscillating Search.
Dynamic Oscillating Search is a d-optimizing procedure that adjusts the selected subset size in the course of search. It is a generalization of the Oscillating Search idea, which has proved useful in various feature selection contexts. Here we demonstrate it in a multi-threaded configuration (using Sequential_Step_Straight_Threaded instead of the Sequential_Step_Straight evaluator object). Because the feature selection criterion used here (SVM wrapper) is computationally very expensive, the speed gain from multithreading is substantial. In this example features are selected on 40-dimensional waveform data with a 3-fold cross-validated SVM wrapper as criterion, evaluated on the first 50% of data samples. The final classification performance on the selected subspace is eventually validated on the second 50% of data.
Note:
The maximum permitted number of threads to run at once is to be user-specified with respect to hardware capabilities.
try{
    const unsigned int max_threads=2;
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR;
    typedef FST::Classifier_LIBSVM<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERSVM;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERSVM,DATAACCESSOR> WRAPPERSVM;
    typedef FST::Sequential_Step_Straight_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,max_threads> EVALUATOR;

    std::cout << std::endl << "Starting Example 12t: Threaded SVM-wrapper-based feature selection with Dynamic Oscillating Search..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/waveform_40.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1);
    if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    // set-up SVM (interface to external library LibSVM)
    boost::shared_ptr<CLASSIFIERSVM> csvm(new CLASSIFIERSVM);
    csvm->set_kernel_type(RBF); // (option: LINEAR, POLY, SIGMOID)
    csvm->initialize(da);
    // wrap the SVM classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERSVM> wsvm(new WRAPPERSVM);
    wsvm->initialize(csvm,da);
    // set-up the threaded sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Dynamic Oscillating Search procedure
    FST::Search_DOS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,EVALUATOR> srch(eval);
    srch.set_delta(3);
    // first optimize SVM parameters using 3-fold cross-validation on training data on the full set of features
    sub->select_all();
    csvm->optimize_parameters(da,sub);
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wsvm << std::endl << std::endl;
    sub->deselect_all(); // let DOS start from an empty set (any starting subset is permissible)
    RETURNTYPE critval_train, critval_test;
    if(!srch.search(0,critval_train,sub,wsvm,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) the following line is included here just for illustration because srch.search() reports results in itself
    std::cout << std::endl << "Search result: " << std::endl << *sub << std::endl << "Criterion value=" << critval_train << std::endl << std::endl;
    // (optionally) validate result by estimating SVM accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    csvm->train(da,sub);
    csvm->test(critval_test,da);
    std::cout << "Validated SVM accuracy=" << critval_test << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
More sub-optimal search scenarios
Example 20: Retreating Sequential Search.
Retreating Search (SFRS or SBRS) is closely related to SFFS, but more thorough and slower: it evaluates more candidate subsets due to more extensive backtracking. This may be advantageous especially in the case of SFRS, which thus evaluates more lower-dimensional candidate solutions. See Example 52t: (Threaded SFRS) Result regularization using secondary criterion for a result regularizing scenario for which SFRS is particularly suitable. In this example features are selected using the SFRS algorithm with 3-NN wrapper classification accuracy as the FS criterion. Classification accuracy (i.e., the FS wrapper criterion value) is estimated on the first 50% of data samples by means of leave-one-out estimation. The final classification performance on the selected subspace is eventually validated on the second 50% of data. SFRS is primarily a d-optimizing procedure, thus it should be invoked with parameter 0 in search(0,...).
try{
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_Leave1Out<INTERVALLER,IDXTYPE> SPLITTERL1O;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> EVALUATOR;

    std::cout << "Starting Example 20: Retreating Sequential Search..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by leave-one-out estimation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERL1O);
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1);
    if(!da->getFirstSplit()) throw FST::fst_error("leave-one-out data split failed.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by leave-one-out)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Sequential Forward Retreating Selection search procedure
    FST::Search_SFRS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval);
    srch.set_search_direction(FST::FORWARD);
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default)
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl;
    // (optionally) list the best known solutions for each cardinality as recorded throughout the course of search
    std::cout << "Best recorded solution for subset size:" << std::endl;
    for(DIMTYPE d=1;d<=sub->get_n();d++)
        if(srch.get_result(d,critval_train,sub)) std::cout << d << ": val="<< critval_train << ", "<<*sub << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 21: Generalized sequential feature subset search.
All sequential search algorithms (SFS, SFFS, OS, DOS, SFRS) can be extended to operate in a 'generalized' setting (a term coined in [4]). In each step of a generalized sequential search algorithm, not just the single best feature is added to the current subset (or the single worst feature removed from it); instead, g-tuples of features are considered. Searching for the group of g features that improves the current subset most when added (or degrades it least when removed) is computationally more complex, but increases the chance of finding the optimum or a result closer to the optimum (nevertheless, improvement is not guaranteed, and in some cases the result can actually degrade). The value g is to be set by the user; the higher the value of g, the slower the search (time complexity increases exponentially with increasing g; see the sketch after the note below). Note that setting g equal to the number of all features would effectively emulate the operation of exhaustive search. In this example features are selected using the generalized (G)SFFS algorithm (g=2) with 3-NN wrapper classification accuracy as the FS criterion. Classification accuracy (i.e., the FS wrapper criterion value) is estimated on the first 50% of data samples by means of 3-fold cross-validation. The final classification performance on the selected subspace is eventually validated on the second 50% of data. (G)SFFS is called here in the d-optimizing setting, invoked by parameter 0 in search(0,...), which is otherwise used to specify the required subset size.
Note:
Note that in this context the term generalization does not relate to classification performance on independent data.
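The cost claim can be quantified: a single generalized step choosing a g-tuple out of n available features evaluates C(n,g) candidates, i.e., O(n^g) for small g. A small FST3-independent sketch tabulating this growth for n=15 (the dimensionality of the speech data used in this example):

// Sketch: number of candidate g-tuples, C(n,g), examined per generalized
// sequential step, for n=15 remaining features. Note the rapid growth
// toward g close to n/2.
#include <iostream>

unsigned long long binomial(unsigned n, unsigned g)
{
    if(g > n) return 0;
    unsigned long long res = 1;
    // incremental product; the division is exact at each step
    for(unsigned i = 1; i <= g; i++) res = res * (n - g + i) / i;
    return res;
}

int main()
{
    const unsigned n = 15;
    for(unsigned g = 1; g <= n; g++)
        std::cout << "g=" << g << ": " << binomial(n, g)
                  << " candidate tuples" << std::endl;
    return 0;
}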
try{
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> EVALUATOR;

    std::cout << "Starting Example 21: Generalized sequential feature subset search..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1);
    if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Sequential Forward Floating Selection search procedure
    FST::Search_SFFS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval);
    srch.set_search_direction(FST::FORWARD); // try FST::BACKWARD
    // set the size of feature groups to be evaluated for inclusion/removal in each sequential step (can be applied to SFS, SFFS, OS, DOS, SFRS)
    srch.set_generalization_level(2);
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default)
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl;
    // (optionally) list the best known solutions for each cardinality as recorded throughout the course of search
    std::cout << "Best recorded solution for subset size:" << std::endl;
    for(DIMTYPE d=1;d<=sub->get_n();d++)
        if(srch.get_result(d,critval_train,sub)) std::cout << d << ": val="<< critval_train << ", "<<*sub << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 22: Randomized feature selection with Oscillating Search.
Selects features using the randomized OS algorithm with normal Bhattacharyya distance as the feature selection criterion. The target subset size must be set by the user because a) Bhattacharyya is monotonic with respect to subset size (global optimization would yield the full feature set) and b) OS requires an initial subset of the target cardinality. OS is called here repeatedly from various random initial points, as long as no improvement occurs during 5 consecutive runs. This mechanism considerably improves the chances of avoiding local maxima. Bhattacharyya is evaluated on a randomly chosen 75% of data samples. The resulting feature subset is eventually validated by means of 3-NN classification accuracy estimation on the remaining 25% of data samples.
try{
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRANDRAND;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Criterion_Normal_Bhattacharyya<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> BHATTCRIT;
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,BHATTCRIT> EVALUATOR;

    std::cout << "Starting Example 22: Randomized feature selection with Oscillating Search..." << std::endl;
    // use randomly chosen 75% of data for training and keep the other 25% for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTERRANDRAND(1,75,25)); // (there will be one outer randomized split only)
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/sonar_60.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("75/25 data split failed.");
    // initiate the storage for subset to-be-selected (+ one more for storing temporaries)
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    boost::shared_ptr<SUBSET> submax(new SUBSET(da->getNoOfFeatures()));
    // set-up normal Bhattacharyya criterion
    boost::shared_ptr<BHATTCRIT> cb(new BHATTCRIT);
    cb->initialize(da); // initialization = normal model parameter estimation on training data on the current split
    // set-up the standard sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Oscillating Search procedure with the extent of search specified by delta (admissible values 1..da->getNoOfFeatures())
    FST::Search_OS<RETURNTYPE,DIMTYPE,SUBSET,BHATTCRIT,EVALUATOR> srch(eval);
    srch.set_delta(10);
    // target subset size must be set because Bhattacharyya is monotonic with respect to subset size (i.e., evaluates full set as the best)
    DIMTYPE target_subsize=30;
    // repeat random-initialized OS searches until there is no improvement in 5 consecutive runs
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *cb << std::endl << std::endl;
    RETURNTYPE critval_train, critval_trainmax;
    int non_improving_runs=-1; // -1 indicates first run
    const int max_non_improving_runs=5;
    do {
        sub->make_random_subset(target_subsize);
        if(!srch.search(target_subsize,critval_train,sub,cb,std::cout)) throw FST::fst_error("Search not finished.");
        if(non_improving_runs==-1 || critval_train>critval_trainmax) { // improvement: remember the best subset found so far
            critval_trainmax=critval_train;
            submax->stateless_copy(*sub);
            non_improving_runs=0;
        } else non_improving_runs++;
    } while(non_improving_runs<max_non_improving_runs-1);
    std::cout << std::endl << "Final search result: " << std::endl << *submax << std::endl << "Criterion value=" << critval_trainmax << std::endl << std::endl;
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    RETURNTYPE critval_test;
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    cknn->train(da,submax);
    cknn->test(critval_test,da);
    std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 23: Combined feature subset contents, size and SVM parameters optimization.
Support Vector Machine performance strongly depends on its parameters. Moreover, the optimal SVM parameters on a subspace of the original space may differ from those on the full space. Therefore we suggest optimizing both the feature subset and the SVM parameters in a repeated, consecutive process. In this example feature subset search is followed by SVM parameter optimization for the current subset; this sequence of two operations is repeated as long as the criterion value increases. For this purpose we use the DOS procedure, which is capable of starting the search from a previously obtained subset. We illustrate this approach here on an SVM wrapper with sigmoid kernel. 50% of data is randomly chosen to form the training dataset (it remains fixed throughout), and 40% of data is randomly chosen to be used at the end for validating the classification performance on the finally selected subspace (the training and test data parts are disjoint and together cover 90% of the original data). The training data part is accessed by means of 3-fold cross-validation in the course of search. During the search, SVM parameters are optimized on the currently best known feature subset, which is then used to initialize the next DOS search. The calls are repeated as long as better SVM performance (on the training data) is achieved.
try{
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRR;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Classifier_LIBSVM<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERSVM;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERSVM,DATAACCESSOR> WRAPPERSVM;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM> EVALUATOR;

    std::cout << "Starting Example 23: Combined feature subset contents, size and SVM parameters optimization..." << std::endl;
    // randomly sample 50% of data for training and randomly sample (disjoint) 40% for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTERRR(1, 50, 40)); // (there will be one outer randomized split only)
    // in the course of search use the training data part by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/sonar_60.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/40 random data split failed.");
    da->setSplittingDepth(1);
    if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected + another one as temporary storage
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    boost::shared_ptr<SUBSET> sub_temp(new SUBSET(da->getNoOfFeatures()));
    // set-up SVM (interface to external library LibSVM)
    boost::shared_ptr<CLASSIFIERSVM> csvm(new CLASSIFIERSVM);
    csvm->set_kernel_type(SIGMOID); // (option: LINEAR, RBF, POLY)
    csvm->initialize(da);
    // wrap the SVM classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERSVM> wsvm(new WRAPPERSVM);
    wsvm->initialize(csvm,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Dynamic Oscillating Search procedure
    FST::Search_DOS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,EVALUATOR> srch(eval);
    srch.set_delta(3);
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wsvm << std::endl << std::endl;
    RETURNTYPE bestcritval_train, critval_train, critval_test;
    sub->select_all();
    csvm->optimize_parameters(da,sub); // initial SVM parameter optimization on the full feature set
    double best_svm_param_C=csvm->get_parameter_C();
    double best_svm_param_gamma=csvm->get_parameter_gamma();
    double best_svm_param_coef0=csvm->get_parameter_coef0();
    bool stop=false;
    sub->deselect_all();
    if(!srch.search(0,bestcritval_train,sub,wsvm,std::cout)) throw FST::fst_error("Search not finished.");
    sub_temp->stateless_copy(*sub);
    // alternate SVM parameter optimization and DOS search for as long as the criterion value improves
    while(!stop) {
        csvm->optimize_parameters(da,sub);
        if(best_svm_param_C!=csvm->get_parameter_C() || best_svm_param_gamma!=csvm->get_parameter_gamma() || best_svm_param_coef0!=csvm->get_parameter_coef0()) {
            if(!srch.search(0,critval_train,sub_temp,wsvm,std::cout)) throw FST::fst_error("Search not finished.");
            if(critval_train>bestcritval_train) {
                bestcritval_train=critval_train;
                sub->stateless_copy(*sub_temp);
                best_svm_param_C=csvm->get_parameter_C();
                best_svm_param_gamma=csvm->get_parameter_gamma();
                best_svm_param_coef0=csvm->get_parameter_coef0();
            } else stop=true;
        } else stop=true; // parameters unchanged, no further improvement can be expected
    }
    std::cout << std::endl << "Search result: " << std::endl << *sub << std::endl << "Criterion value=" << bestcritval_train << std::endl << std::endl;
    // (optionally) validate result by estimating SVM accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    csvm->set_parameter_C(best_svm_param_C);
    csvm->set_parameter_gamma(best_svm_param_gamma);
    csvm->set_parameter_coef0(best_svm_param_coef0);
    csvm->train(da,sub);
    csvm->test(critval_test,da);
    std::cout << "Validated SVM accuracy=" << critval_test << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 24: Monte Carlo - random feature subset search.
Pure random search gives no guarantees as to the optimality of results. It is unlikely to yield better results than any of the sub-optimal or optimal methods implemented in FST, except in isolated cases. The advantage of random search is nevertheless its quick initial 'convergence'; it is often capable of quickly revealing a solution that is only marginally worse than solutions that would take considerably longer to find using more advanced methods. In this example features are selected using the Monte Carlo algorithm with 3-NN wrapper classification accuracy as the FS criterion. The stopping condition is specified as a time limit in seconds. Classification accuracy (i.e., the FS wrapper criterion value) is estimated on the first 50% of data samples by means of 3-fold cross-validation. The final classification performance on the selected subspace is eventually validated on the second 50% of data. The search is called in the d-optimizing setting, invoked by parameter 0 in search(0,...), which is otherwise used to specify the required subset size.
Note:
Random search requires a user-specified stopping condition.
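Conceptually, the stopping rule later set via set_stopping_condition(0/*max trials*/,100/*seconds*/) behaves like the loop sketched below (mock criterion values, shortened time limit; the actual FST3 internals may differ):

// Sketch of a time/trial-limited randomized search loop: keep drawing
// mock candidate values until either limit is hit. Uses std::clock(),
// i.e., CPU time; a wall-clock timer (e.g., via Boost) would also work.
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    const unsigned long max_trials = 0;   // 0 = no trial limit, as in the example
    const double max_seconds = 2.0;       // shortened here; the example uses 100
    std::srand(42);
    std::clock_t start = std::clock();
    unsigned long trials = 0;
    double best = -1.0;
    for(;;) {
        double val = std::rand() / (double)RAND_MAX; // mock criterion value
        if(val > best) best = val;
        ++trials;
        double elapsed = (std::clock() - start) / (double)CLOCKS_PER_SEC;
        if((max_trials > 0 && trials >= max_trials) || elapsed >= max_seconds) break;
    }
    std::cout << "trials=" << trials << " best=" << best << std::endl;
    return 0;
}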
try{
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;

    std::cout << "Starting Example 24: Monte Carlo - random feature subset search..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1);
    if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up Monte Carlo search procedure
    FST::Search_Monte_Carlo<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> srch;
    srch.set_cardinality_randomization(4,10); // select only subsets within specified cardinality limits
    srch.set_stopping_condition(0/*max trials*/,100/*seconds*/); // one or both values must have positive value
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default)
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 24t: Threaded Monte Carlo - random feature subset search.
Pure random search gives no guarantees as to the optimality of results. It is unlikely to yield better results than any of the sub-optimal or optimal methods implemented in FST, except in isolated cases. The advantage of random search is nevertheless its quick initial 'convergence'; it is often capable of quickly revealing a solution that is only marginally worse than solutions that would take considerably longer to find using more advanced methods. In this example features are selected using the Monte Carlo algorithm with 3-NN wrapper classification accuracy as the FS criterion. The stopping condition is specified as a time limit in seconds. Classification accuracy (i.e., the FS wrapper criterion value) is estimated on the first 50% of data samples by means of 3-fold cross-validation. The final classification performance on the selected subspace is eventually validated on the second 50% of data. The search is called in the d-optimizing setting, invoked by parameter 0 in search(0,...), which is otherwise used to specify the required subset size.
Note:
Random search requires a user-specified stopping condition.
In addition to threaded execution, this example differs from Example 24: Monte Carlo - random feature subset search in its different setting of the random subset cardinality generator.
try{
    const unsigned int max_threads=4;
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;

    std::cout << "Starting Example 24t: Threaded Monte Carlo - random feature subset search..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1);
    if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up threaded Monte Carlo search procedure
    FST::Search_Monte_Carlo_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,max_threads> srch;
    srch.set_cardinality_randomization(0.5); // probability of inclusion of each particular feature (~implies also the expected subset size)
    srch.set_stopping_condition(0/*max trials*/,100/*seconds*/); // one or both values must have positive value
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default)
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 25t: Fast pre-selection followed by Dynamic Oscillating Search.
Dynamic Oscillating Search is a d-optimizing procedure that adjusts the selected subset size in the course of search. It is a generalization of the Oscillating Search idea, which has proved useful in various feature selection contexts. If started from an empty set, DOS may need considerable time to reach higher cardinalities, or may not reach them at all. If there is reason to expect the optimal subset size to be at least some known minimum value, it may be practical to 'accelerate' DOS by starting it from a subset of roughly the desired cardinality. Here we first build an initial subset of 20 features by selecting features according to their best individual ranking. The initial subset is then used as the starting point of the DOS search. We illustrate threaded feature selection on 64-dimensional optdigits data (see the data file in the listing below) with a 3-fold cross-validated SVM wrapper as criterion.
Note:
With many higher-dimensional problems the FST3 threading capability can be key to making the feature selection task tractable. Note that the maximum permitted number of threads to run at once is to be user-specified depending on hardware capabilities.
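The pre-selection stage is plain individual-feature (BIF) ranking: evaluate the criterion on every single-feature subset, sort in descending order, and keep the top d. FST3's Search_BIF encapsulates this, as the listing below shows; the following toy sketch with made-up merit values illustrates just the ranking step:

// Toy sketch of best-individual-feature (BIF) pre-selection: rank features
// by a per-feature merit and keep the top d. Merit values here are made up;
// in the example below they come from the SVM wrapper criterion.
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

int main()
{
    double merit[] = {0.61, 0.72, 0.55, 0.80, 0.66, 0.59}; // hypothetical values
    const unsigned n = sizeof(merit)/sizeof(merit[0]);
    const unsigned d = 3; // how many features to pre-select

    std::vector<std::pair<double,unsigned> > rank; // (merit, feature index)
    for(unsigned f=0; f<n; f++) rank.push_back(std::make_pair(merit[f], f));
    std::sort(rank.rbegin(), rank.rend()); // descending by merit

    std::cout << "pre-selected:";
    for(unsigned i=0; i<d && i<rank.size(); i++)
        std::cout << " " << rank[i].second << " (val=" << rank[i].first << ")";
    std::cout << std::endl; // prints: pre-selected: 3 (val=0.8) 1 (val=0.72) 4 (val=0.66)
    return 0;
}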
try{
    const unsigned int max_threads=4;
    typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE;
    typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Classifier_LIBSVM<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERSVM;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERSVM,DATAACCESSOR> WRAPPERSVM;
    typedef FST::Sequential_Step_Straight_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,max_threads> EVALUATOR;

    std::cout << std::endl << "Starting Example 25t: Fast pre-selection followed by Dynamic Oscillating Search..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/optdigits_64.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0);
    if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1);
    if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    // set-up SVM (interface to external library LibSVM)
    boost::shared_ptr<CLASSIFIERSVM> csvm(new CLASSIFIERSVM);
    csvm->set_kernel_type(SIGMOID); // (option: LINEAR, POLY, RBF)
    csvm->initialize(da);
    // wrap the SVM classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERSVM> wsvm(new WRAPPERSVM);
    wsvm->initialize(csvm,da);
    // set-up Best Individual Features procedure
    FST::Search_BIF<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM> bif;
    // first optimize SVM parameters using 3-fold cross-validation on training data on the full set of features
    sub->select_all();
    csvm->optimize_parameters(da,sub);
    // use BIF to pre-select 20 features according to their individual merit, to be later passed to DOS for adjustment
    const DIMTYPE d=20;
    std::cout << std::endl << "Individual feature ranking setup:" << std::endl << *da << std::endl << bif << std::endl << *wsvm << std::endl << std::endl;
    RETURNTYPE critval;
    bif.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default)
    bif.evaluate_individuals(da->getNoOfFeatures(), wsvm, std::cout);
    DIMTYPE i=0, feature;
    sub->deselect_all();
    bool found=bif.getFirstBIF(critval,feature);
    while(i++<d && found) {
        sub->select(feature);
        std::cout << "Pre-selected feature "<<feature<<", value=" << critval << std::endl;
        found=bif.getNextBIF(critval,feature);
    }
    std::cout << std::endl;
    // set-up the threaded sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Dynamic Oscillating Search procedure
    FST::Search_DOS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,EVALUATOR> srch(eval);
    srch.set_delta(3);
    // re-optimize SVM parameters using 3-fold cross-validation on training data on the selected feature subset
    csvm->optimize_parameters(da,sub);
    // run the search
    std::cout << std::endl << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wsvm << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default)
    if(!srch.search(0,critval_train,sub,wsvm,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) validate result by estimating SVM accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    csvm->train(da,sub);
    csvm->test(critval_test,da);
    std::cout << "Validated SVM accuracy=" << critval_test << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 26: Wrapper-based (normal Bayes) feature selection with Floating Search, verified by 2 classifiers.
Floating Search (SFFS or SBFS) can be considered one of the best compromises between versatility, speed, ability to find close-to-optimum results, and robustness against overfitting. If in doubt which feature subset search method to use, and the dimensionality of your problem is not more than roughly several hundred, try SFFS. In this example features are selected using the SFFS algorithm with normal Bayes wrapper classification accuracy as the FS criterion. Classification accuracy (i.e., the FS wrapper criterion value) is estimated on the first 50% of data samples by means of 3-fold cross-validation. The final classification performance on the selected subspace is eventually validated on the second 50% of data using two different classifiers. SFFS is called in the d-optimizing setting, invoked by parameter 0 in search(0,...), which is otherwise used to specify the required subset size.
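The first argument of search() switches between the two modes while everything else stays the same; a sketch (srch, sub and wnb as in the listing below):

RETURNTYPE critval_train;
// d-optimizing: the subset size is decided in the course of search
if(!srch.search(0,critval_train,sub,wnb,std::cout)) throw FST::fst_error("Search not finished.");
// d-parametrized alternative: request a subset of exactly 7 features
// if(!srch.search(7,critval_train,sub,wnb,std::cout)) throw FST::fst_error("Search not finished.");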
Note:
Normal Bayes classifier accuracy is unlikely to be comparable to that of kNN or SVM when the data distribution is far from normal, which is unfortunately very often the case. Moreover, the need to learn a normal model may make normal Bayes unusable when data is insufficient (ill-conditioned covariance matrices, inversion problems, etc.).
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Classifier_Normal_Bayes<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERNB; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERNB,DATAACCESSOR> WRAPPERNB; typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERNB> EVALUATOR; std::cout << "Starting Example 26: Wrapper-based (normal Bayes) feature selection with Floating Search, verified by 2 classifiers..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050()); // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up the normal Bayes classifier boost::shared_ptr<CLASSIFIERNB> cnb(new CLASSIFIERNB); cnb->initialize(da); // wrap the normal Bayes classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) boost::shared_ptr<WRAPPERNB> wnb(new WRAPPERNB); wnb->initialize(cnb,da); // set-up the standard sequential search step object (option: hybrid, ensemble, etc.) 
boost::shared_ptr<EVALUATOR> eval(new EVALUATOR); // set-up Sequential Forward Floating Selection search procedure FST::Search_SFFS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERNB,EVALUATOR> srch(eval); srch.set_search_direction(FST::FORWARD); // try FST::BACKWARD // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wnb << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wnb,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating normal Bayes accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); cnb->train(da,sub); cnb->test(critval_test,da); std::cout << "Validated normal Bayes accuracy=" << critval_test << std::endl << std::endl; // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); da->setSplittingDepth(0); cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; // (optionally) list the best known solutions for each cardinality as recorded throughout the course of search std::cout << "Best recorded solution for subset size:" << std::endl; for(DIMTYPE d=1;d<=sub->get_n();d++) if(srch.get_result(d,critval_train,sub)) std::cout << d << ": val="<< critval_train << ", "<<*sub << std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
(Very) high-dimensional scenarios
Example 30: Feature selection on binary and/or natural-valued data.
The mushroom dataset from the UCI repository is originally 22-dimensional, with categorical features. Here we use the transformed dataset with features expanded to 125 binary ones. Features are selected using the DOS algorithm (search extent restricted by delta=10) with Multinomial Naive Bayes wrapper classification accuracy as criterion. 50% of data is randomly chosen to form the training dataset (it remains fixed throughout), and 40% of data is randomly chosen to be used at the end for validating the classification performance on the finally selected subspace (the selected training and test data parts are disjoint and together cover 90% of the original data). Classification accuracy (i.e., the FS wrapper criterion value) is estimated on the training part of data by means of 5-fold cross-validation.
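The nested splitting described above is expressed by stacking two splitters: depth 0 is the outer random 50/40 split, depth 1 the inner 5-fold cross-validation on the training part. A condensed sketch (typedefs and the scaler dsc as in the listing below):

PSPLITTER dsp_outer(new SPLITTERRR(1, 50, 40)); // one randomized trial: 50% training, 40% test, disjoint
PSPLITTER dsp_inner(new SPLITTERCV(5)); // 5-fold cross-validation inside the training part
boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
splitters->push_back(dsp_outer); splitters->push_back(dsp_inner);
boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/mushroom_125.trn",splitters,dsc));
da->initialize();
da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/40 random data split failed.");
da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("5-fold cross-validation failure.");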
Note:
This approach is applicable to problems of high dimensionality, but may prove too slow once dimensionality reaches thousands or tens of thousands, due to the direct use of DOS starting from zero cardinality. For approaches more suitable for very high-dimensional problems see Example 31: Individual ranking (BIF) in very high-dimensional feature selection, Example 32t: Threaded individual ranking (BIF) with SVM wrapper in very high-dimensional feature selection, Example 33: Oscillating Search in very high-dimensional feature selection, Example 33t: Threaded Oscillating Search in very high-dimensional feature selection, Example 34: Dependency-Aware Feature Ranking (DAF0) and Example 35t: Dependency-Aware Feature Ranking (DAF1) to enable Wrapper based FS on very-high-dimensional data.
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRR; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Classifier_Multinomial_NaiveBayes<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERMULTINOMIAL; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERMULTINOMIAL,DATAACCESSOR> WRAPPERMULTINOMIAL; typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERMULTINOMIAL> EVALUATOR; std::cout << "Starting Example 30: Feature selection on binary and/or natural-valued data..." << std::endl; // randomly sample 50% of data for training and randomly sample (disjoint) 40% for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTERRR(1, 50, 40)); // (there will be one outer randomized split only) // in the course of search use the training part of data by 5-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(5)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/mushroom_125.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/40 random data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("5-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up Multinomial Naive Bayes classifier boost::shared_ptr<CLASSIFIERMULTINOMIAL> cmultinom(new CLASSIFIERMULTINOMIAL); cmultinom->initialize(da); // wrap the Multinomial Naive Bayes classifier to enable its usage as FS criterion (criterion value will be estimated by 5-fold cross-val.) boost::shared_ptr<WRAPPERMULTINOMIAL> wmultinom(new WRAPPERMULTINOMIAL); wmultinom->initialize(cmultinom,da); // set-up the standard sequential search step object (option: hybrid, ensemble, etc.) 
boost::shared_ptr<EVALUATOR> eval(new EVALUATOR); // set-up Dynamic Oscillating Search procedure FST::Search_DOS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERMULTINOMIAL,EVALUATOR> srch(eval); srch.set_delta(10); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wmultinom << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wmultinom,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating classifier accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); cmultinom->train(da,sub); cmultinom->test(critval_test,da); std::cout << "Validated Multinomial NaiveBayes accuracy=" << critval_test << std::endl << std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 31: Individual ranking (BIF) in very high-dimensional feature selection.
Very high-dimensional feature selection is applied, e.g., in text categorization, with dimensionality in the order of 10000 or 100000. Individual feature ranking (or Best Individual Feature, BIF) is the most commonly applied approach because of its key advantages -- speed and high stability. In this example we illustrate a less common but very effective approach based on the Multinomial Bhattacharyya distance feature selection criterion. Multinomial Bhattacharyya has been shown capable of outperforming traditional tools like Information Gain, cf. [25]. A randomly sampled 50% of data is used here for multinomial model parameter estimation in the actual feature selection process; another (disjoint) 40% of data is randomly sampled for testing. The selected subset is eventually used for validation: a multinomial Naive Bayes classifier is trained on the training data on the selected subset and classification accuracy is finally estimated on the test data.
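Because Bhattacharyya grows monotonically with subset size, BIF must be run here in the d-parametrized mode with a user-chosen target size. The essential calls, condensed from the listing below (da, sub and dmultinom as set up there):

FST::Search_BIF<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST> srch;
DIMTYPE target_subsize=500; // must be chosen by the user; d-optimization is not feasible at this dimensionality
RETURNTYPE critval_train;
if(!srch.search(target_subsize,critval_train,sub,dmultinom,std::cout)) throw FST::fst_error("Search (BIF) not finished.");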
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRR; //typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Criterion_Multinomial_Bhattacharyya<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> BHATTMULTINOMIALDIST; typedef FST::Classifier_Multinomial_NaiveBayes<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERMULTINOMIAL; std::cout << "Starting Example 31: Individual ranking (BIF) in very high-dimensional feature selection..." << std::endl; // randomly sample 50% of data for training and randomly sample (disjoint) 40% for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTERRR(1, 50, 40)); // (there will be one outer randomized split only) // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/reuters_apte.arff",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/40 random data split failed."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); // set-up multinomial Bhattacharyya distance criterion boost::shared_ptr<BHATTMULTINOMIALDIST> dmultinom(new BHATTMULTINOMIALDIST); dmultinom->initialize(da); // (initialization = multinomial model parameter estimation on training data) // set-up individual feature ranking (BIF) search procedure FST::Search_BIF<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST> srch; // target subset size must be set because a) Bhattacharyya is monotonic with respect to subset size, // b) in very-high-dimensional problem d-optimizing search is not feasible due to search complexity DIMTYPE target_subsize=500; // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *dmultinom << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(target_subsize,critval_train,sub,dmultinom,std::cout)) throw FST::fst_error("Search (BIF) not finished."); // (optionally) validate result by estimating Naive Multinomial Bayes classifier accuracy on selected feature sub-space on independent test data boost::shared_ptr<CLASSIFIERMULTINOMIAL> cmultinom(new CLASSIFIERMULTINOMIAL); cmultinom->initialize(da); cmultinom->train(da,sub); cmultinom->test(critval_test,da); std::cout << "Validated Multinomial NaiveBayes accuracy=" << critval_test << std::endl << std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 32t: Threaded individual ranking (BIF) with SVM wrapper in very high-dimensional feature selection.
Very high-dimensional feature selection is applied, e.g., in text categorization, with dimensionality in the order of 10000 or 100000. Individual feature ranking (or Best Individual Feature, BIF) is the most commonly applied approach because of its key advantages -- speed and high stability. In this example we illustrate a less common but effective approach based on a Support Vector Machine feature selection wrapper. A randomly sampled 50% of data is used in the actual feature selection process; another (disjoint) 40% of data is randomly sampled for testing. The selected subset is eventually used for validation: the SVM classifier is trained on the training data on the selected subspace and classification accuracy is finally estimated on the test data.
Warning:
Threaded search on very high-dimensional data based on a complex wrapper, as presented here, may need a large amount of memory.
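Structurally the only differences from Example 31 are the threaded ranking object and the SVM wrapper criterion; a sketch (types and objects as in the listing below):

const unsigned int max_threads=4;
// threaded BIF: individual feature evaluations are distributed over max_threads worker threads
FST::Search_BIF_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,max_threads> srch;
DIMTYPE target_subsize=500;
RETURNTYPE critval_train;
if(!srch.search(target_subsize,critval_train,sub,wsvm,std::cout)) throw FST::fst_error("Search (BIF) not finished.");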
try{ const unsigned int max_threads=4; typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRR; //typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Classifier_LIBSVM<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERSVM; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERSVM,DATAACCESSOR> WRAPPERSVM; std::cout << "Starting Example 32t: Threaded individual ranking (BIF) with SVM wrapper in very high-dimensional feature selection..." << std::endl; // randomly sample 50% of data for training and randomly sample (disjoint) 40% for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTERRR(1, 50, 40)); // (there will be one outer randomized split only) // in the course of search use the training part of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/reuters_apte.arff",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/40 random data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); // set-up SVM (interface to external library LibSVM) boost::shared_ptr<CLASSIFIERSVM> csvm(new CLASSIFIERSVM); csvm->set_kernel_type(LINEAR); csvm->initialize(da); // wrap the SVM classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) 
boost::shared_ptr<WRAPPERSVM> wsvm(new WRAPPERSVM); wsvm->initialize(csvm,da); // set-up threaded individual feature ranking (BIF) search procedure FST::Search_BIF_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,max_threads> srch; // target subset size must be set because in very-high-dimensional problem d-optimizing search is not feasible due to search complexity DIMTYPE target_subsize=500; // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wsvm << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(target_subsize,critval_train,sub,wsvm,std::cout)) throw FST::fst_error("Search (BIF) not finished."); // (optionally) validate result by estimating classifier accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); csvm->train(da,sub); csvm->test(critval_test,da); std::cout << "Validated SVM accuracy=" << critval_test << std::endl << std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 33: Oscillating Search in very high-dimensional feature selection.
This example deals with very high-dimensional feature selection in text categorization, with dimensionality in the order of 10000 or 100000. The standard approach is BIF, yet we show here that a non-trivial search procedure (OS) can be feasible. Here OS is applied in its fastest form (delta=1), initialized by means of BIF. We use the Multinomial Bhattacharyya distance as the feature selection criterion (it has been shown capable of outperforming traditional tools like Information Gain, cf. [25]). A randomly sampled 50% of data is used for multinomial model parameter estimation in the actual feature selection process; another (disjoint) 40% of data is randomly sampled for testing. The selected subset is eventually used for validation: a multinomial Naive Bayes classifier is trained on the training data on the selected subset and classification accuracy is finally estimated on the test data.
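The two-stage structure amounts to two consecutive search() calls over the same subset object: BIF fills sub with an initial solution, and OS then refines it at the fixed target cardinality. A condensed sketch (objects as in the listing below):

FST::Search_BIF<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST> srch_bif;
FST::Search_OS<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST,EVALUATOR> srch(eval);
srch.set_delta(1); // fastest OS setting
DIMTYPE target_subsize=500;
RETURNTYPE critval_train;
if(!srch_bif.search(target_subsize,critval_train,sub,dmultinom,std::cout)) throw FST::fst_error("Search (BIF) not finished.");
if(!srch.search(target_subsize,critval_train,sub,dmultinom,std::cout)) throw FST::fst_error("Search (OS) not finished.");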
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRR; //typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Criterion_Multinomial_Bhattacharyya<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> BHATTMULTINOMIALDIST; typedef FST::Classifier_Multinomial_NaiveBayes<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERMULTINOMIAL; typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST> EVALUATOR; std::cout << "Starting Example 33: Oscillating Search in very high-dimensional feature selection..." << std::endl; // randomly sample 50% of data for training and randomly sample (disjoint) 40% for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTERRR(1, 50, 40)); // (there will be one outer randomized split only) // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/reuters_apte.arff",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/40 random data split failed."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); // set-up multinomial Bhattacharyya distance criterion boost::shared_ptr<BHATTMULTINOMIALDIST> dmultinom(new BHATTMULTINOMIALDIST); dmultinom->initialize(da); // (initialization = multinomial model parameter estimation on training data) // set-up individual feature ranking to serve as OS initialization FST::Search_BIF<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST> srch_bif; // set-up the standard sequential search step object (option: hybrid, ensemble, etc.) 
boost::shared_ptr<EVALUATOR> eval(new EVALUATOR); // set-up the Oscillating Search procedure in its fastest setting FST::Search_OS<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST,EVALUATOR> srch(eval); srch.set_delta(1); // target subset size must be set because a) Bhattacharyya is monotonic with respect to subset size, // b) in very-high-dimensional problem d-optimizing search is not feasible due to search complexity DIMTYPE target_subsize=500; // run the search - first find the initial subset by means of BIF, then improve it by means of OS std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch_bif << std::endl << srch << std::endl << *dmultinom << std::endl << std::endl; RETURNTYPE critval_train, critval_test; if(!srch_bif.search(target_subsize,critval_train,sub,dmultinom,std::cout)) throw FST::fst_error("Search (BIF) not finished."); std::cout << std::endl << "Initialization result: " << std::endl << *sub << "Criterion value=" << critval_train << std::endl << std::endl; if(!srch.search(target_subsize,critval_train,sub,dmultinom,std::cout)) throw FST::fst_error("Search (OS) not finished."); std::cout << std::endl << "Search result: " << std::endl << *sub << "Criterion value=" << critval_train << std::endl; // (optionally) validate result by estimating Naive Multinomial Bayes classifier accuracy on selected feature sub-space on independent test data boost::shared_ptr<CLASSIFIERMULTINOMIAL> cmultinom(new CLASSIFIERMULTINOMIAL); cmultinom->initialize(da); cmultinom->train(da,sub); cmultinom->test(critval_test,da); std::cout << "Validated Multinomial NaiveBayes accuracy=" << critval_test << std::endl << std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 33t: Threaded Oscillating Search in very high-dimensional feature selection.
As an alternative to standard sequential evaluation of feature subset candidates, FST3 enables threaded candidate subset evaluation. All FST3 sequential search methods can be easily parallelized by using the Sequential_Step_Straight_Threaded evaluator object instead of Sequential_Step_Straight. The actual search speed gain depends on the particular problem setting. In small-sample, low-dimensional settings, or when criterion evaluation is very fast, the gain may remain negligible or even negative due to thread management overhead (see Example 11: Wrapper-based feature selection with Floating Search). However, in computationally more complex cases, as illustrated here on the threaded version of a very high-dimensional problem (see Example 33: Oscillating Search in very high-dimensional feature selection), the gain becomes substantial.
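Parallelizing Example 33 thus amounts to exchanging a single typedef; all search set-up code stays untouched. The only changed lines (max_threads defined as in the listing below):

// sequential evaluation as in Example 33:
//typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST> EVALUATOR;
// threaded evaluation as used in this example:
typedef FST::Sequential_Step_Straight_Threaded<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST,max_threads> EVALUATOR;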
Note:
With many higher-dimensional problems the FST3 threading capability can become key to making the feature selection task tractable. Note that the maximum permitted number of threads to run at once must be specified by the user according to hardware capabilities.
try{ const unsigned int max_threads=8; typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRR; //typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Criterion_Multinomial_Bhattacharyya<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> BHATTMULTINOMIALDIST; typedef FST::Classifier_Multinomial_NaiveBayes<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERMULTINOMIAL; typedef FST::Sequential_Step_Straight_Threaded<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST,max_threads> EVALUATOR; std::cout << "Starting Example 33t: Threaded Oscillating Search in very high-dimensional feature selection..." << std::endl; // randomly sample 50% of data for training and randomly sample (disjoint) 40% for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTERRR(1, 50, 40)); // (there will be one outer randomized split only) // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/reuters_apte.arff",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("Random data split failed."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); // set-up multinomial Bhattacharyya distance criterion boost::shared_ptr<BHATTMULTINOMIALDIST> dmultinom(new BHATTMULTINOMIALDIST); dmultinom->initialize(da); // (initialization = multinomial model parameter estimation on training data) // set-up individual feature ranking to serve as OS initialization FST::Search_BIF<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST> srch_bif; // set-up the threaded sequential search step object (option: hybrid, ensemble, etc.) 
boost::shared_ptr<EVALUATOR> eval(new EVALUATOR); // set-up the Oscillating Search procedure in its fastest setting FST::Search_OS<RETURNTYPE,DIMTYPE,SUBSET,BHATTMULTINOMIALDIST,EVALUATOR> srch(eval); srch.set_delta(1); // target subset size must be set because a) Bhattacharyya is monotonic with respect to subset size, // b) in very-high-dimensional problem d-optimizing search is not feasible due to search complexity DIMTYPE target_subsize=500; // run the search - first find the initial subset by means of BIF, then improve it by means of OS std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch_bif << std::endl << srch << std::endl << *dmultinom << std::endl << std::endl; RETURNTYPE critval_train, critval_test; if(!srch_bif.search(target_subsize,critval_train,sub,dmultinom,std::cout)) throw FST::fst_error("Search (BIF) not finished."); if(!srch.search(target_subsize,critval_train,sub,dmultinom,std::cout)) throw FST::fst_error("Search (OS) not finished."); // (optionally) validate result by estimating Naive Multinomial Bayes classifier accuracy on selected feature sub-space on independent test data boost::shared_ptr<CLASSIFIERMULTINOMIAL> cmultinom(new CLASSIFIERMULTINOMIAL); cmultinom->initialize(da); cmultinom->train(da,sub); cmultinom->test(critval_test,da); std::cout << "Validated Multinomial NaiveBayes accuracy=" << critval_test << std::endl << std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 34: Dependency-Aware Feature Ranking (DAF0).
Dependency-Aware Feature Ranking (DAF) is a novel approach to feature selection especially suitable for very high-dimensional problems and over-fitting-prone feature selection scenarios. DAF evaluates a chosen criterion on a series of probe subsets to eventually rank features according to their estimated contextual quality. DAF has been shown capable of significantly outperforming BIF in many cases in terms of the quality of selected feature subsets, yet its stability and resistance against over-fitting remain on par with BIF. For details see [12]. We demonstrate two slightly different forms of DAF (DAF0 and DAF1) in Example 34: Dependency-Aware Feature Ranking (DAF0) and Example 35t: Dependency-Aware Feature Ranking (DAF1) to enable Wrapper based FS on very-high-dimensional data. Example 34 illustrates the approach with a k-NN accuracy wrapper criterion on 64-dimensional image data. Example 35t illustrates DAF with an SVM wrapper applied to a very high-dimensional text categorization problem.
Note:
DAF (like BIF) provides a feature ranking but does not determine the final subset size.
This approach makes it possible to apply even complex wrapper feature selection criteria in problems of very high dimensionality.
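The user-visible DAF machinery has two parts: a Monte Carlo probe subset generator whose probe evaluations are recorded by a statistics tracker, and the subsequent ranking query; the last argument of getFirstDAF() selects DAF0 or DAF1. A condensed sketch (objects as in the listing below):

FST::Search_Monte_Carlo<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> srch;
srch.set_cardinality_randomization(5,100); // random probe subset sizes drawn from [5,100]
srch.set_stopping_condition(0/*max trials*/,600/*seconds*/);
boost::shared_ptr<TRACKERSTATS> trackerstats(new TRACKERSTATS);
srch.enable_result_tracking(trackerstats); // record the criterion value of each probe
RETURNTYPE critval_train;
if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
trackerstats->compute_stats(); // turn the recorded probes into DAF feature ranks
RETURNTYPE critval; DIMTYPE feature;
bool found=trackerstats->getFirstDAF(critval,feature,0/*DAF0*/); // then iterate with getNextDAF()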
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Result_Tracker_Feature_Stats<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKERSTATS; std::cout << "Starting Example 34: Dependency-Aware Feature Ranking (DAF0)..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050()); // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/optdigits_64.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up 3-Nearest Neighbor classifier based on Euclidean distances boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) 
boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // Dependency-Aware Feature ranking computation settings const unsigned long max_search_time=600; // in seconds (the more search time can be afforded the better; for very-high-dimensional problems set better to hours or more) const DIMTYPE min_probe_cardinality=5; // lower limit on random probe subset cardinality (the default value is generally applicable) const DIMTYPE max_probe_cardinality=100; // upper limit on random probe subset cardinality (the default value is generally applicable) // set-up the probe subset generation procedure FST::Search_Monte_Carlo<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> srch; srch.set_cardinality_randomization(min_probe_cardinality,max_probe_cardinality); srch.set_stopping_condition(0/*max trials*/,max_search_time/*seconds*/); // one or both values must have positive value // set-up tracker to gather data for eventual DAF rank computation boost::shared_ptr<TRACKERSTATS> trackerstats(new TRACKERSTATS); srch.enable_result_tracking(trackerstats); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished."); // compute DAF ranking trackerstats->compute_stats(); // (optionally) print DAF computation statistics trackerstats->print_stats(std::cout); // select user-specified number of features according to highest DAF feature rank values // + validate result by estimating classifier accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); const DIMTYPE d=60; unsigned int DAF=0; // DAF0 is the simplest and generally best performing option; DAF1 as a normalized version of DAF0 may occasionally yield better results RETURNTYPE critval; DIMTYPE i=0, feature; sub->deselect_all(); bool found=trackerstats->getFirstDAF(critval,feature,DAF); while(i++<d && found) { sub->select(feature); std::cout << "Added feature "<<feature<<", DAF"<<DAF<<"=" << critval << std::endl; if(i%5==0) { // (optionally) validate result by estimating classifier accuracy on selected feature sub-space on independent test data cknn->train(da,sub); cknn->test(critval_test,da); std::cout << *sub << std::endl << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; } found=trackerstats->getNextDAF(critval,feature,DAF); } } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 35t: Dependency-Aware Feature Ranking (DAF1) to enable Wrapper based FS on very-high-dimensional data.
Dependency-Aware Feature Ranking (DAF) is a novel approach to feature selection especially suitable for very high-dimensional problems and over-fitting-prone feature selection scenarios. DAF evaluates a chosen criterion on a series of probe subsets to eventually rank features according to their estimated contextual quality. Note that this approach makes it possible to apply even complex wrapper feature selection criteria in problems of very high dimensionality. DAF has been shown capable of significantly outperforming BIF in many cases in terms of the quality of selected feature subsets, yet its stability and resistance against over-fitting remain on par with BIF. For details see [12]. We demonstrate two slightly different forms of DAF (DAF0 and DAF1) in Example 34: Dependency-Aware Feature Ranking (DAF0) and Example 35t: Dependency-Aware Feature Ranking (DAF1) to enable Wrapper based FS on very-high-dimensional data. Example 34 illustrates the approach with a k-NN accuracy wrapper criterion. This example, 35t, illustrates DAF with an SVM wrapper applied to a very high-dimensional (greater than 10000-dimensional) text categorization problem.
Note:
DAF (like BIF) ranks features but does not determine the final subset size.
To achieve reasonable results at extreme dimensionality such as here, DAF requires at least hours of computation. (Standard wrapper-based methods would need several orders of magnitude more time in a similar setting.) It is beneficial to allow for as many probes as possible. For instance, setting max_search_time to 20 hours instead of the 200 minutes seen below improves the final accuracy on independent test data by roughly 3%.
Warning:
This example needs a large amount of RAM (4 GB may not be enough).
try{ const unsigned int max_threads=16; typedef double RETURNTYPE; typedef float DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRANDRAND; //typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Classifier_LIBSVM<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERSVM; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERSVM,DATAACCESSOR> WRAPPERSVM; typedef FST::Result_Tracker_Feature_Stats<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKERSTATS; std::cout << "Starting Example 35t: Dependency-Aware Feature Ranking (DAF1) enabling Wrapper based feature selection on very-high-dimensional data..." << std::endl; // randomly sample 70% of data for training and keep the remaining (disjoint) 30% for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTERRANDRAND(1/*trials*/,70,30)); // in the course of search use the training part of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/reuters_apte.arff",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("70/30 data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up SVM (interface to external library LibSVM) boost::shared_ptr<CLASSIFIERSVM> csvm(new CLASSIFIERSVM); csvm->initialize(da); csvm->set_kernel_type(LINEAR); // first optimize SVM parameters using 3-fold cross-validation on training data on the full set of features sub->select_all(); csvm->optimize_parameters(da,sub); // wrap the SVM classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) 
boost::shared_ptr<WRAPPERSVM> wsvm(new WRAPPERSVM); wsvm->initialize(csvm,da); // Dependency-Aware Feature ranking computation settings const unsigned long max_search_time=200*60; // in seconds (the more search time can be afforded the better) const DIMTYPE min_probe_cardinality=1; // lower limit on random probe subset cardinality (the default value of 1 is generally applicable) const DIMTYPE max_probe_cardinality=200; // upper limit on random probe subset cardinality (the default value of 100 is generally applicable) // set-up the threaded Monte Carlo probe subset generation procedure FST::Search_Monte_Carlo_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,max_threads> srch; srch.set_cardinality_randomization(min_probe_cardinality,max_probe_cardinality); srch.set_stopping_condition(0/*max trials*/,max_search_time/*seconds*/,1/*time check frequency*/); // one or both values must have positive value // set-up tracker to gather data for eventual DAF rank computation boost::shared_ptr<TRACKERSTATS> trackerstats(new TRACKERSTATS); srch.enable_result_tracking(trackerstats); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wsvm << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wsvm,std::cout)) throw FST::fst_error("Search not finished."); // compute DAF ranking trackerstats->compute_stats(); // (optionally) print DAF computation statistics trackerstats->print_stats(std::cout); // select user-specified number of features according to highest DAF feature rank values // + validate result by estimating classifier accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); const DIMTYPE d=1000; unsigned int DAF=1; // DAF0 is the simplest and generally best performing option; DAF1 as a normalized version of DAF0 may occasionally yield better results RETURNTYPE critval; DIMTYPE i=0, feature; sub->deselect_all(); bool found=trackerstats->getFirstDAF(critval,feature,DAF); while(i++<d && found) { sub->select(feature); std::cout << "Added feature "<<feature<<", DAF"<<DAF<<"=" << critval << std::endl; if(i%50==0) { // (optionally) validate result by estimating classifier accuracy on selected feature sub-space on independent test data csvm->train(da,sub); csvm->test(critval_test,da); std::cout << *sub << std::endl << "Validated SVM accuracy=" << critval_test << std::endl << std::endl; } found=trackerstats->getNextDAF(critval,feature,DAF); } } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Optimal search (mind the search complexity)
Example 40: Exhaustive (optimal) feature selection.
Selects features exhaustively, i.e., evaluates all possible feature combinations. This approach is guaranteed to find the optimum with respect to the chosen criterion, but its exponential time complexity renders it prohibitive even for moderately dimensioned tasks. Here it is demonstrated on 15-dimensional data with 3-NN (based on L1.5 distance) wrapper classification accuracy as the FS criterion - note how time consuming the computation is even for this relatively low-dimensional case. Classification accuracy (i.e., the FS wrapper criterion value) is estimated on the first 50% of data samples by means of 3-fold cross-validation. The final classification performance on the selected subspace is eventually validated on the second 50% of data. Exhaustive search is called here in the d-optimizing setting, invoked by parameter 0 in search(0,...), which is otherwise used to specify the required subset size. Optional result tracking is employed here to reveal duplicate solutions yielding the same maximum criterion value (see also Example 60: Detecting alternative feature selection results with equal criterion value).
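Result tracking is an add-on to any search object: the tracker logs candidate subsets within a user-set margin of the best known criterion value and can be queried afterwards, e.g., for equally-valued alternatives. A condensed sketch (types as in the listing below):

boost::shared_ptr<TRACKER> tracker(new TRACKER(50000)); // keep at most the 50000 highest-valued records
tracker->set_inclusion_margin(0.05); // log only solutions within 0.05 of the best known value
srch.enable_result_tracking(tracker);
// ... run srch.search(...) as usual ...
TRACKER::PResultRec result;
if(tracker->get_first(result)) {
RETURNTYPE best=result->value;
while(tracker->get_next(result) && result->value==best) std::cout << *(result->sub) << std::endl; // ties
}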
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Lp<DATATYPE,REALTYPE,DIMTYPE,SUBSET,3,2> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Result_Tracker_Dupless<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKER; std::cout << "Starting Example 40: Exhaustive (optimal) feature selection..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050()); // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up 3-Nearest Neighbor classifier based on L1.5 distance boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) 
boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // set-up Exhaustive Search procedure FST::Search_Exhaustive<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> srch; // set-up result tracker to enable logging of candidate solutions, ordered descending by value // (optionally limit the number of kept records to 50000 highest valued to prevent memory exhaustion due to possibly excessive number of candidates) boost::shared_ptr<TRACKER> tracker(new TRACKER(50000)); // let the tracker register only solution no worse than "the best known criterion value minus 0.05" tracker->set_inclusion_margin(0.05); // register the result tracker with the used search object srch.enable_result_tracking(tracker); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; // report tracker contents std::cout << "Result tracker records " << tracker->size(0.0) << " solutions with criterion value equal to " << critval_train << "." << std::endl << std::endl; for(unsigned int i=1;i<10;i++) std::cout << "Result tracker records " << tracker->size((double)i*0.005) << " solutions with criterion value greater or equal to " << critval_train-(double)i*0.005 << "." << std::endl << std::endl; TRACKER::PResultRec result; if(tracker->get_first(result) && tracker->size(0.0)>1) { RETURNTYPE firstvalue=result->value; std::cout << "All recorded feature subsets yielding the same best known criterion value " << firstvalue << ":" << std::endl; while(tracker->get_next(result) && result->value==firstvalue) std::cout << *(result->sub) << std::endl; } } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 40t: Threaded Exhaustive (optimal) feature selection.
Exhaustive search yields optimal solutions but, due to exponential search complexity, tends to slow down dramatically with increasing dimensionality. A threaded implementation is thus desirable to improve exhaustive search applicability. The actual speed gain can be compromised on some platforms in small-sample, low-dimensional settings where criterion evaluation is extremely fast and thread management overhead thus comes more into play. This is generally no issue, because threaded exhaustive search matters most in the opposite setting - computationally demanding problems. The threaded exhaustive search implementation is demonstrated here on 15-dimensional data with 3-NN (based on L1.5 distance) wrapper classification accuracy as the FS criterion - note how time consuming the computation is even for such a relatively low-dimensional case. Classification accuracy (i.e., the FS wrapper criterion value) is estimated on the first 50% of data samples by means of 3-fold cross-validation. The final classification performance on the selected subspace is eventually validated on the second 50% of data. Exhaustive search is called here in the d-optimizing setting, invoked by parameter 0 in search(0,...), which is otherwise used to specify the required subset size. Optional result tracking is employed here to reveal duplicate solutions yielding the same maximum criterion value (see also Example 60: Detecting alternative feature selection results with equal criterion value).
Note:
The maximum permitted number of threads to run at once must be specified by the user according to hardware capabilities.
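Relative to Example 40 only the search class changes; a sketch of the single differing declaration:

const unsigned int max_threads=16;
FST::Search_Exhaustive_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,max_threads> srch;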
try{ const unsigned int max_threads=16; typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Lp<DATATYPE,REALTYPE,DIMTYPE,SUBSET,3,2> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Result_Tracker_Dupless<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKER; std::cout << "Starting Example 40t: Threaded Exhaustive (optimal) feature selection..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050()); // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up 3-Nearest Neighbor classifier based on L1.5 distance boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) 
boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // set-up Exhaustive Search procedure (+ set the number of worker threads) FST::Search_Exhaustive_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,max_threads> srch; // set-up result tracker to enable logging of candidate solutions, ordered descending by value // (optionally limit the number of kept records to 50000 highest valued to prevent memory exhaustion due to possibly excessive number of candidates) boost::shared_ptr<TRACKER> tracker(new TRACKER(50000)); // let the tracker register only solution no worse than "the best known criterion value minus 0.05" tracker->set_inclusion_margin(0.05); // register the result tracker with the used search object srch.enable_result_tracking(tracker); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; // report tracker contents std::cout << "Result tracker records " << tracker->size(0.0) << " solutions with criterion value equal to " << critval_train << "." << std::endl << std::endl; for(unsigned int i=1;i<10;i++) std::cout << "Result tracker records " << tracker->size((double)i*0.005) << " solutions with criterion value greater or equal to " << critval_train-(double)i*0.005 << "." << std::endl << std::endl; TRACKER::PResultRec result; if(tracker->get_first(result) && tracker->size(0.0)>1) { RETURNTYPE firstvalue=result->value; std::cout << "All recorded feature subsets yielding the same best known criterion value " << firstvalue << ":" << std::endl; while(tracker->get_next(result) && result->value==firstvalue) std::cout << *(result->sub) << std::endl; } } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 41: Improved Branch and Bound (IBB) optimal feature selection.
Branch & Bound algorithms yield optimal results in shorter time than exhaustive search, but cannot be used with non-monotonic criteria. All Branch & Bound feature selection algorithms require the used Criterion to be monotonic with respect to cardinality. More precisely, removing a feature from a set MUST NOT increase the criterion value; otherwise there is no guarantee as to the optimality of the obtained results with respect to the used criterion. IBB is the well-known B&B variant that orders tree nodes in each constructed tree level so as to increase the chance that the bound rises early. This improves the chance that larger sub-trees can be cut and thus more time saved, but at the cost of additional criterion evaluations in each tree level. In time-consuming tasks this mechanism proves capable of considerably outperforming the Basic Branch & Bound algorithm.
Note:
B&B algorithms are d-parametrized procedures.
Due to the necessary monotonicity condition, B&B algorithms cannot be used with Criterion_Wrapper criteria.
Although B&B algorithms are much faster than exhaustive search, they are exponential in nature. As such they are generally unusable with problems of dimensionality of (roughly) 50 or above.
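The monotonicity condition can be stated as J(S \ {f}) <= J(S) for every feature f in S. The following standalone sketch (not FST3 API; the criterion functor J is a placeholder for any set function) illustrates how the condition could be checked empirically for one particular subset:

#include <set>
// returns false if removing any single feature from S increases the criterion value,
// i.e., if the monotonicity condition required by B&B is violated at S
template<class CRITERION>
bool monotonic_at(const CRITERION &J, const std::set<int> &S)
{
	const double full=J(S); // criterion value on the whole subset
	for(std::set<int>::const_iterator it=S.begin(); it!=S.end(); ++it)
	{
		std::set<int> reduced(S); reduced.erase(*it);
		if(J(reduced)>full) return false; // removing a feature increased the value
	}
	return true;
}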
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Lp<DATATYPE,REALTYPE,DIMTYPE,SUBSET,3,2> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Criterion_Normal_Bhattacharyya<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> BHATTCRIT; typedef FST::Result_Tracker_Dupless<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKER; std::cout << "Starting Example 41: Improved Branch and Bound (IBB) optimal feature selection..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/wdbc_30.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up normal Bhattacharyya criterion boost::shared_ptr<BHATTCRIT> cb(new BHATTCRIT); cb->initialize(da); // initialization = normal model parameter estimation on training data on the current split // set-up 3-Nearest Neighbor classifier based on L1.5 distance boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (here the wrapper is used for final result validation only) 
boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // set-up Branch and Bound procedure FST::Search_Branch_And_Bound_Improved<RETURNTYPE,DIMTYPE,SUBSET,BHATTCRIT> srch; // set-up result tracker to enable logging of candidate solutions, ordered descending by value // (optionally limit the number of kept records to 50000 highest valued to prevent memory exhaustion due to possibly excessive number of candidates) boost::shared_ptr<TRACKER> tracker(new TRACKER(50000)); // let the tracker register only solution no worse than "the best known criterion value minus 0.05" tracker->set_inclusion_margin(0.05); // register the result tracker with the used search object srch.enable_result_tracking(tracker); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) const DIMTYPE target_d=10; // target subset size to be specified by the user if(!srch.search(target_d,critval_train,sub,cb,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; // report tracker contents std::cout << "Result tracker records " << tracker->size(0.0) << " solutions with criterion value equal to " << critval_train << "." << std::endl << std::endl; for(unsigned int i=1;i<5;i++) std::cout << "Result tracker records " << tracker->size((double)i*0.01) << " solutions with criterion value greater or equal to " << critval_train-(double)i*0.01 << "." << std::endl << std::endl; TRACKER::PResultRec result; if(tracker->get_first(result) && tracker->size(0.0)>1) { RETURNTYPE firstvalue=result->value; std::cout << "All recorded feature subsets yielding the same best known criterion value " << firstvalue << ":" << std::endl; while(tracker->get_next(result) && result->value==firstvalue) std::cout << *(result->sub) << std::endl; } } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 42: Branch and Bound with Partial Prediction (BBPP) optimal feature selection.
Branch & Bound algorithms yield optimal results in shorter time than exhaustive search, but cannot be used with non-monotonic criteria. All Branch & Bound feature selection algorithms require the used Criterion to be monotonic with respect to cardinality. More precisely, removing a feature from a set MUST NOT increase the criterion value; otherwise there is no guarantee as to the optimality of the obtained results with respect to the used criterion. BBPP is generally slower than FBB but faster than IBB. Its principle is identical to that of the Improved Branch & Bound (IBB), but it replaces the large number of criterion evaluations needed for ordering nodes in tree levels by quickly predicted values. The optimality of the final result is not jeopardized, yet a considerable amount of time is saved. See FBB for a more radical but still optimality-preserving use of value prediction.
Note:
B&B algorithms are d-parametrized procedures.
Due to the necessary monotonicity condition, B&B algorithms cannot be used with Criterion_Wrapper criteria.
Although B&B algorithms are much faster than exhaustive search, they are exponential in nature. As such they are generally unusable with problems of dimensionality of (roughly) 50 or above.
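The partial-prediction idea can be illustrated by the following simplified sketch; this is an assumption-laden illustration only, not the actual FST::Branch_And_Bound_Predictor_Averaging implementation. For each feature it keeps the running average of the criterion decrease observed whenever that feature was removed, and uses it to predict node values cheaply:

#include <cstddef>
#include <vector>
struct AveragingPredictor
{
	std::vector<double> drop_sum, drop_cnt;
	AveragingPredictor(std::size_t n) : drop_sum(n,0.0), drop_cnt(n,0.0) {}
	// record the criterion decrease observed when feature f was removed
	void learn(std::size_t f, double observed_drop) {drop_sum[f]+=observed_drop; drop_cnt[f]+=1.0;}
	// predict the value of a node created by removing feature f from a parent node
	double predict(std::size_t f, double parent_value) const
	{return (drop_cnt[f]>0.0) ? parent_value-drop_sum[f]/drop_cnt[f] : parent_value;}
};

Because such predictions are used only to order nodes (the bound itself is still verified with true criterion values), a poor prediction can cost time but cannot cost optimality.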
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Lp<DATATYPE,REALTYPE,DIMTYPE,SUBSET,3,2> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Criterion_Normal_Bhattacharyya<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> BHATTCRIT; typedef FST::Result_Tracker_Dupless<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKER; typedef FST::Branch_And_Bound_Predictor_Averaging<RETURNTYPE,DIMTYPE> PREDICTOR; std::cout << "Starting Example 42: Branch and Bound with Partial Prediction (BBPP) optimal feature selection..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/wdbc_30.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up normal Bhattacharyya criterion boost::shared_ptr<BHATTCRIT> cb(new BHATTCRIT); cb->initialize(da); // initialization = normal model parameter estimation on training data on the current split // set-up 3-Nearest Neighbor classifier based on L1.5 distance boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (here the wrapper is used for final result validation only) 
boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // set-up Branch and Bound procedure FST::Search_Branch_And_Bound_Partial_Prediction<RETURNTYPE,DIMTYPE,SUBSET,BHATTCRIT,PREDICTOR> srch; // set-up result tracker to enable logging of candidate solutions, ordered descending by value // (optionally limit the number of kept records to 50000 highest valued to prevent memory exhaustion due to possibly excessive number of candidates) boost::shared_ptr<TRACKER> tracker(new TRACKER(50000)); // let the tracker register only solution no worse than "the best known criterion value minus 0.05" tracker->set_inclusion_margin(0.05); // register the result tracker with the used search object srch.enable_result_tracking(tracker); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) const DIMTYPE target_d=10; // target subset size to be specified by the user if(!srch.search(target_d,critval_train,sub,cb,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; // report tracker contents std::cout << "Result tracker records " << tracker->size(0.0) << " solutions with criterion value equal to " << critval_train << "." << std::endl << std::endl; for(unsigned int i=1;i<5;i++) std::cout << "Result tracker records " << tracker->size((double)i*0.01) << " solutions with criterion value greater or equal to " << critval_train-(double)i*0.01 << "." << std::endl << std::endl; TRACKER::PResultRec result; if(tracker->get_first(result) && tracker->size(0.0)>1) { RETURNTYPE firstvalue=result->value; std::cout << "All recorded feature subsets yielding the same best known criterion value " << firstvalue << ":" << std::endl; while(tracker->get_next(result) && result->value==firstvalue) std::cout << *(result->sub) << std::endl; } // print out BBPP predictor statistics std::cout<<srch<<std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 43: Fast Branch and Bound (FBB) optimal feature selection.
Branch & Bound algorithms yield optimal results in shorter time than exhaustive search, but cannot be used with non-monotonic criteria. All Branch & Bound feature selection algorithms require the used Criterion to be monotonic with respect to cardinality. More precisely, removing a feature from a set MUST NOT increase the criterion value; otherwise there is no guarantee as to the optimality of the obtained results with respect to the used criterion. In most feature selection tasks FBB is the fastest of all Branch & Bound algorithms, and as such it should be the method of first choice whenever optimal feature selection is required and possible (see the warning below). Nevertheless, FBB's prediction mechanism can theoretically fail and slow the search down (an analogy is perhaps Quicksort, which is known as the best sorting algorithm for the general case although no guarantee is given about its actual speed). If you prefer a more conservative option, try BBPP, or the even slower but more predictable IBB or BBB.
Note:
B&B algorithms are d-parametrized procedures.
Due to the necessary monotonicity condition, B&B algorithms cannot be used with Criterion_Wrapper criteria.
Although B&B algorithms are much faster than exhaustive search, they are exponential in nature. As such they are generally unusable with problems of dimensionality of (roughly) 50 or above.
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Lp<DATATYPE,REALTYPE,DIMTYPE,SUBSET,3,2> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Criterion_Normal_Bhattacharyya<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> BHATTCRIT; typedef FST::Result_Tracker_Dupless<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKER; typedef FST::Branch_And_Bound_Predictor_Averaging<RETURNTYPE,DIMTYPE> PREDICTOR; std::cout << "Starting Example 43: Fast Branch and Bound (FBB) optimal feature selection..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/wdbc_30.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up normal Bhattacharyya criterion boost::shared_ptr<BHATTCRIT> cb(new BHATTCRIT); cb->initialize(da); // initialization = normal model parameter estimation on training data on the current split // set-up 3-Nearest Neighbor classifier based on L1.5 distance boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (here the wrapper is used for final result validation only) 
boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // set-up Branch and Bound procedure FST::Search_Branch_And_Bound_Fast<RETURNTYPE,DIMTYPE,SUBSET,BHATTCRIT,PREDICTOR> srch; // set-up result tracker to enable logging of candidate solutions, ordered descending by value // (optionally limit the number of kept records to 50000 highest valued to prevent memory exhaustion due to possibly excessive number of candidates) boost::shared_ptr<TRACKER> tracker(new TRACKER(50000)); // let the tracker register only solution no worse than "the best known criterion value minus 0.05" tracker->set_inclusion_margin(0.05); // register the result tracker with the used search object srch.enable_result_tracking(tracker); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) const DIMTYPE target_d=10; // target subset size to be specified by the user if(!srch.search(target_d,critval_train,sub,cb,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; // report tracker contents std::cout << "Result tracker records " << tracker->size(0.0) << " solutions with criterion value equal to " << critval_train << "." << std::endl << std::endl; for(unsigned int i=1;i<5;i++) std::cout << "Result tracker records " << tracker->size((double)i*0.01) << " solutions with criterion value greater or equal to " << critval_train-(double)i*0.01 << "." << std::endl << std::endl; TRACKER::PResultRec result; if(tracker->get_first(result) && tracker->size(0.0)>1) { RETURNTYPE firstvalue=result->value; std::cout << "All recorded feature subsets yielding the same best known criterion value " << firstvalue << ":" << std::endl; while(tracker->get_next(result) && result->value==firstvalue) std::cout << *(result->sub) << std::endl; } // print out FBB predictor statistics std::cout<<srch<<std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Tools to prevent or detect over-fitting
Example 50: Voting ensemble of criteria.
In this example features are selected by means of voting of three criteria: 1-NN, 5-NN and 9-NN classifier wrappers (estimated classification accuracy). The various criteria vote about feature candidates in each sequential search algorithm step. To enable voting, each criterion builds a list of feature candidates, ordered descending by criterion value. Each candidate's positions in all lists are then joined (averaged), and the candidate with the best average position (i.e., the most universal preference) is selected for addition to/removal from the current working subset. The three k's in k-NN should cover feature preferences with various (over)sensitivity to detail, possibly resulting in a more robust subset (less coupled with a particular classifier). The example setup requires one primary criterion to be used in the standard FST3 way to guide sequential search algorithm steps; here the 3-NN wrapper is chosen for this purpose. The first 50% of the data is used for training; the second 50% is used at the end for validating the classification performance on the finally selected subspace. Classification accuracy (the wrapper value) is evaluated on the training data by means of 3-fold cross-validation. SFS is called in the d-optimizing setting, invoked by parameter 0 in search(0,...), which is otherwise used to specify the required subset size.
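A simplified illustration of the voting mechanism follows (not the FST3 internals; values[c][i] is assumed to hold the value assigned by criterion c to feature candidate i). Each criterion ranks the candidates descending by its own value, and the candidate with the lowest average rank wins:

#include <algorithm>
#include <cstddef>
#include <vector>
// comparator ordering candidate indexes descending by a given criterion's values
struct ByValueDesc
{
	const std::vector<double> *values;
	bool operator()(std::size_t a, std::size_t b) const {return (*values)[a]>(*values)[b];}
};
// values[c][i] = value assigned by criterion c to feature candidate i
std::size_t vote(const std::vector<std::vector<double> > &values)
{
	const std::size_t n=values[0].size();
	std::vector<double> avg_pos(n,0.0);
	for(std::size_t c=0; c<values.size(); c++)
	{
		std::vector<std::size_t> order(n);
		for(std::size_t i=0; i<n; i++) order[i]=i;
		ByValueDesc cmp; cmp.values=&values[c];
		std::sort(order.begin(), order.end(), cmp); // each criterion builds its ordered candidate list
		for(std::size_t pos=0; pos<n; pos++) avg_pos[order[pos]]+=(double)pos/(double)values.size();
	}
	// the candidate with the best (lowest) average position is selected
	return (std::size_t)(std::min_element(avg_pos.begin(), avg_pos.end())-avg_pos.begin());
}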
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef boost::shared_ptr<FST::Criterion<RETURNTYPE,SUBSET> > PCRITERION; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Sequential_Step_Ensemble<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> ENSEMBLEEVALUATOR; std::cout << "Starting Example 50: Voting ensemble of criteria..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050()); // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/spambase_57.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up 3-Nearest Neighbor classifier based on Euclidean distances as the global algorithm step guiding criterion boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) 
boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // set-up 1,5,9-Nearest Neighbor classifiers based on Euclidean distance to form // a voting ensemble of wrappers to make decisions about single feature preferences boost::shared_ptr<CLASSIFIERKNN> cknn1(new CLASSIFIERKNN); cknn1->set_k(1); boost::shared_ptr<CLASSIFIERKNN> cknn2(new CLASSIFIERKNN); cknn2->set_k(5); boost::shared_ptr<CLASSIFIERKNN> cknn3(new CLASSIFIERKNN); cknn3->set_k(9); boost::shared_ptr<WRAPPERKNN> wknn1(new WRAPPERKNN); wknn1->initialize(cknn1,da); boost::shared_ptr<WRAPPERKNN> wknn2(new WRAPPERKNN); wknn2->initialize(cknn2,da); boost::shared_ptr<WRAPPERKNN> wknn3(new WRAPPERKNN); wknn3->initialize(cknn3,da); boost::shared_ptr<std::vector<PCRITERION> > ensemble(new std::vector<PCRITERION>); ensemble->push_back(wknn1); ensemble->push_back(wknn2); ensemble->push_back(wknn3); // set-up the standard sequential search step object (option: hybrid, straight, etc.) boost::shared_ptr<ENSEMBLEEVALUATOR> eval(new ENSEMBLEEVALUATOR(ensemble)); // set-up Sequential Forward Selection search procedure FST::Search_SFS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,ENSEMBLEEVALUATOR> srch(eval); srch.set_search_direction(FST::FORWARD); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating classifier accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 51: (DOS) Result regularization using secondary criterion.
It is known that feature selection may over-fit. As in the case of over-trained classifiers, over-selected feature subsets may generalize poorly. This unwanted effect can lead to serious degradation of generalization ability, i.e., of model or decision-rule behavior on previously unknown data. It has been suggested [9, 19] that preferring a subset with a slightly-worse-than-maximal criterion value can actually improve generalization. FST3 makes this possible through result tracking and subsequent selection of an alternative solution by means of secondary criterion maximization. In this example we show a 3-Nearest Neighbor wrapper based feature selection process, where the final result is eventually chosen among a group of solutions close enough to the achieved maximum, so as to optimize the secondary criterion. The group of solutions to select from is defined by means of a user-selected margin value (the permitted primary criterion value difference from the known maximum). In this case we show that even the simplest secondary criterion (a mere preference for smaller subsets) can improve classification accuracy on previously unknown data.
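The selection rule can be illustrated by the following simplified sketch (hypothetical types for illustration, not the FST::Result_Tracker_Regularizer API): among all tracked solutions whose primary criterion value lies within the margin of the best known value, the one maximizing the secondary criterion is returned; here the secondary criterion is negative subset size, i.e., smaller subsets are preferred:

#include <cstddef>
#include <vector>
struct Record {double value; std::vector<int> features;}; // one tracked solution
const Record* pick_within_margin(const std::vector<Record> &records, double best_value, double margin)
{
	const Record *chosen=0;
	for(std::size_t i=0; i<records.size(); i++)
		if(records[i].value>=best_value-margin) // close enough to the known maximum
			if(chosen==0 || records[i].features.size()<chosen->features.size()) // secondary criterion: prefer smaller subsets
				chosen=&records[i];
	return chosen;
}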
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Criterion_Subset_Size<RETURNTYPE,SUBSET> CRITSUBSIZE; typedef FST::Criterion_Negative<CRITSUBSIZE,RETURNTYPE,SUBSET> NEGATIVECRIT; typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> EVALUATOR; typedef FST::Result_Tracker_Regularizer<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,NEGATIVECRIT> TRACKER; std::cout << "Starting Example 51: (DOS) Result regularization using secondary criterion..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050()); // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/waveform_40.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up 3-Nearest Neighbor classifier based on Euclidean distances boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) 
boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // set-up the standard sequential search step object (option: hybrid, ensemble) boost::shared_ptr<EVALUATOR> eval(new EVALUATOR); // set-up Dynamic Oscillating Search procedure FST::Search_DOS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval); srch.set_delta(3); // set-up the regularizing result tracker boost::shared_ptr<TRACKER> tracker(new TRACKER); // register the result tracker with the used search step object eval->enable_result_tracking(tracker); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << *tracker << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; // set-up the secondary criterion (regularization criterion); in this case to minimize subset size boost::shared_ptr<CRITSUBSIZE> critsubsiz(new CRITSUBSIZE); //Criterion_Subset_Size does not need to be initialized boost::shared_ptr<NEGATIVECRIT> regulcrit(new NEGATIVECRIT(critsubsiz)); //Criterion_Negative does not need to be initialized // select final solution among those recorded by tracker (show more alternatives for various margins) tracker->set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) for(unsigned int i=1; i<10; i++) { RETURNTYPE margin=(double)i*0.001; da->setSplittingDepth(1); // necessary with criteria that need access to training data if(!tracker->optimize_within_margin(margin,critval_train,critval_test,sub,regulcrit)) throw FST::fst_error("tracker->optimize_within_margin() failed."); std::cout << std::endl << "Regularized (margin="<<margin<<") result: " << std::endl << *sub << "Criterion value=" << critval_train << std::endl; // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; } } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 52t: (Threaded SFRS) Result regularization using secondary criterion.
It is known that feature selection may over-fit. As in the case of over-trained classifiers, over-selected feature subsets may generalize poorly. This unwanted effect can lead to serious degradation of generalization ability, i.e., of model or decision-rule behavior on previously unknown data. It has been suggested [9, 19] that preferring a subset with a slightly-worse-than-maximal criterion value can actually improve generalization. FST3 makes this possible through result tracking and subsequent selection of an alternative solution by means of secondary criterion maximization. In this example we show a 3-Nearest Neighbor wrapper based feature selection process, where the final result is eventually chosen among a group of solutions close enough to the achieved maximum, so as to optimize the secondary criterion. The group of solutions to select from is defined by means of a user-selected margin value (the permitted primary criterion value difference from the known maximum). In this case we show that even the simplest secondary criterion (a mere preference for smaller subsets) can improve classification accuracy on previously unknown data. Unlike Example 51, the search here is performed by the SFRS procedure with a threaded sequential step evaluator (the number of worker threads is user-specified).
try{ const unsigned int max_threads=2; typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Criterion_Subset_Size<RETURNTYPE,SUBSET> CRITSUBSIZE; typedef FST::Criterion_Negative<CRITSUBSIZE,RETURNTYPE,SUBSET> NEGATIVECRIT; typedef FST::Sequential_Step_Straight_Threaded<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,max_threads> EVALUATOR; typedef FST::Result_Tracker_Regularizer<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,NEGATIVECRIT> TRACKER; std::cout << "Starting Example 52t: (Threaded SFRS) Result regularization using secondary criterion..." << std::endl; // keep second half of data for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTER5050()); // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/waveform_40.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up 3-Nearest Neighbor classifier based on Euclidean distances boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) 
boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // set-up the standard sequential search step object (option: hybrid, ensemble) boost::shared_ptr<EVALUATOR> eval(new EVALUATOR); // set-up SFRS search procedure FST::Search_SFRS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval); srch.set_search_direction(FST::FORWARD); // set-up the regularizing result tracker boost::shared_ptr<TRACKER> tracker(new TRACKER); // register the result tracker with the used search step object eval->enable_result_tracking(tracker); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << *tracker << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; // set-up the secondary criterion (regularization criterion); in this case to minimize subset size boost::shared_ptr<CRITSUBSIZE> critsubsiz(new CRITSUBSIZE); //Criterion_Subset_Size does not need to be initialized boost::shared_ptr<NEGATIVECRIT> regulcrit(new NEGATIVECRIT(critsubsiz)); //Criterion_Negative does not need to be initialized // select final solution among those recorded by tracker (show more alternatives for various margins) tracker->set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) for(unsigned int i=1; i<10; i++) { RETURNTYPE margin=(double)i*0.001; da->setSplittingDepth(1); // necessary with criteria that need access to training data if(!tracker->optimize_within_margin(margin,critval_train,critval_test,sub,regulcrit)) throw FST::fst_error("tracker->optimize_within_margin() failed."); std::cout << std::endl << "Regularized (margin="<<margin<<") result: " << std::endl << *sub << "Criterion value=" << critval_train << std::endl; // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); cknn->train(da,sub); cknn->test(critval_test,da); std::cout << "Validated "<<cknn->get_k()<<"-NN accuracy=" << critval_test << std::endl << std::endl; } } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 53: Hybrid feature selection.
Features are selected here using the hybridized SBS algorithm, where the primary criterion is SVM wrapper estimated classification accuracy and the pre-filtering criterion is the normal Bhattacharyya distance. In each algorithm step all current feature subset candidates are first evaluated by means of Bhattacharyya; the worst 50% are abandoned and only the remaining 50% are evaluated by means of the slow SVM wrapper to select intermediate solutions. 50% of the data is randomly chosen to form the training dataset (and remains the same throughout), and a disjoint 40% of the data is randomly chosen to be used at the end for validating the classification performance on the finally selected subspace (the selected training and test data parts are disjoint and together cover 90% of the original data). SBS is called in the d-optimizing setting, invoked by parameter 0 in search(0,...), which is otherwise used to specify the required subset size.
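The pre-filtering step can be sketched as follows (a simplified illustration, not the FST::Sequential_Step_Hybrid internals; filter_values[i] is assumed to hold the fast filter criterion value of the i-th candidate). Only the best keep_perc percent of candidates survive to be evaluated by the expensive wrapper:

#include <algorithm>
#include <cstddef>
#include <vector>
// comparator ordering candidate indexes descending by filter criterion value
struct ByFilterDesc
{
	const std::vector<double> *values;
	bool operator()(std::size_t a, std::size_t b) const {return (*values)[a]>(*values)[b];}
};
std::vector<std::size_t> prefilter(const std::vector<double> &filter_values, unsigned int keep_perc)
{
	std::vector<std::size_t> order(filter_values.size());
	for(std::size_t i=0; i<order.size(); i++) order[i]=i;
	ByFilterDesc cmp; cmp.values=&filter_values;
	std::sort(order.begin(), order.end(), cmp); // best filter values first
	std::size_t keep=(order.size()*keep_perc)/100; if(keep<1) keep=1;
	order.resize(keep); // only these candidates are passed to the slow wrapper criterion
	return order;
}

The design trade-off is evident: the cheaper the filter and the smaller keep_perc, the faster each step, at the risk of pre-filtering away a candidate the wrapper would have preferred.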
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRR; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Criterion_Normal_Bhattacharyya<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> BHATTCRIT; typedef FST::Classifier_LIBSVM<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERSVM; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERSVM,DATAACCESSOR> WRAPPERSVM; typedef FST::Sequential_Step_Hybrid<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,BHATTCRIT> EVALUATOR; std::cout << "Starting Example 53: Hybrid feature selection..." << std::endl; // randomly sample 50% of data for training and randomly sample (disjunct) 40% for independent testing of final classification performance PSPLITTER dsp_outer(new SPLITTERRR(1, 50, 40)); // (there will be one outer randomized split only) // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation PSPLITTER dsp_inner(new SPLITTERCV(3)); // scale data to [0,1] by simple histogram shrinking boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_to01<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc)); da->initialize(); // initiate access to outer split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/40 random data split failed."); // set-up normal Bhattacharyya criterion to serve as pre-filtering criterion in hybrid search // (note that it is to be initialized in outer splitter context (i.e., in splitting depth 0, meaning on all training data)) boost::shared_ptr<BHATTCRIT> cb(new BHATTCRIT); cb->initialize(da); // initiate access to inner split data parts // (note that wrapper evaluation in the course of search works in inner splitting loop (for 3-fold cross-validated SVM accuracy estimation)) da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); // set-up SVM (interface to external library LibSVM) boost::shared_ptr<CLASSIFIERSVM> csvm(new CLASSIFIERSVM); csvm->initialize(da); // wrap the SVM classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) boost::shared_ptr<WRAPPERSVM> wsvm(new WRAPPERSVM); wsvm->initialize(csvm,da); // set-up the hybrid search step object so that 50% of subset candidates is pre-filtered in each step (option: straight, ensemble, etc.) 
boost::shared_ptr<EVALUATOR> eval(new EVALUATOR(cb,50/*_keep_perc*/)); // set-up Sequential Backward Selection search procedure FST::Search_SFS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,EVALUATOR> srch(eval); srch.set_search_direction(FST::BACKWARD); // run the search std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wsvm << std::endl << std::endl; RETURNTYPE critval_train, critval_test; srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default) if(!srch.search(0,critval_train,sub,wsvm,std::cout)) throw FST::fst_error("Search not finished."); // (optionally) validate result by estimating SVM accuracy on selected feature sub-space on independent test data da->setSplittingDepth(0); csvm->train(da,sub); csvm->test(critval_test,da); std::cout << "Validated SVM accuracy=" << critval_test << std::endl << std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 54: Feature Selection Stability Evaluation.
Feature selection stability is defined as the robustness of feature preferences with respect to different samplings of the same training data. Various stability measures have been defined to evaluate the feature selection process [17, 16]. FST3 provides a selection of stability measures capable of evaluating feature selection processes that yield subsets of varying size. In this example ten feature selection trials are performed, each on a differently randomly sampled 80% of the data. In each trial the resulting subset is obtained through the SFS procedure, optimizing the 3-Nearest Neighbor accuracy evaluated by means of 3-fold cross-validation. Various stability measures are eventually evaluated. Note that each of the measures yields values from [0,1], where 0 marks the least stable and 1 marks the most stable feature selection process characteristic.
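As an illustration of what such measures compute, the following standalone sketch evaluates the average Tanimoto index over all pairs of subsets selected in the trials; this is an assumed formalization consistent with the literature cited above, not the FST3 tracker code:

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>
// average Tanimoto index: mean of (intersection size / union size) over all subset pairs
double average_tanimoto(const std::vector<std::set<int> > &subsets)
{
	double sum=0.0; std::size_t pairs=0;
	for(std::size_t i=0; i<subsets.size(); i++)
		for(std::size_t j=i+1; j<subsets.size(); j++, pairs++)
		{
			std::vector<int> isect, uni;
			std::set_intersection(subsets[i].begin(),subsets[i].end(),subsets[j].begin(),subsets[j].end(),std::back_inserter(isect));
			std::set_union(subsets[i].begin(),subsets[i].end(),subsets[j].begin(),subsets[j].end(),std::back_inserter(uni));
			if(!uni.empty()) sum+=(double)isect.size()/(double)uni.size();
		}
	return (pairs>0) ? sum/(double)pairs : 0.0;
}

Identical subsets in every trial yield 1, disjoint subsets yield 0, matching the [0,1] convention stated above.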
try{ typedef double RETURNTYPE; typedef double DATATYPE; typedef double REALTYPE; typedef unsigned int IDXTYPE; typedef unsigned int DIMTYPE; typedef short BINTYPE; typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET; typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER; typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER; typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV; typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRANDRAND; typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for TRN data format //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE; typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN; typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN; typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> EVALUATOR; typedef FST::Result_Tracker_Stability_Evaluator<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKER; std::cout << "Starting Example 54: Feature Selection Stability Evaluation..." << std::endl; // set-up ten trials where in each 80% of data is randomly sampled PSPLITTER dsp_outer(new SPLITTERRANDRAND(10/*splits=trials*/,80,20)); // in the course of feature subset search (in one trial) use 3-fold cross-validation PSPLITTER dsp_inner(new SPLITTERCV(3)); // do not scale data boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>()); // set-up data access boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>); splitters->push_back(dsp_outer); splitters->push_back(dsp_inner); boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc)); da->initialize(); // initiate access to split data parts da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("RandRand data split failed."); da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure."); // initiate the storage for subset to-be-selected boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures())); sub->deselect_all(); // set-up 3-Nearest Neighbor classifier based on Euclidean distances boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN); cknn->set_k(3); // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.) boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN); wknn->initialize(cknn,da); // set-up the standard sequential search step object (option: hybrid, ensemble, threaded, etc.) 
boost::shared_ptr<EVALUATOR> eval(new EVALUATOR); // set-up Sequential Forward Selection search procedure FST::Search_SFS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval); srch.set_search_direction(FST::FORWARD); // set-up result tracker to collect results of each trial boost::shared_ptr<TRACKER> tracker(new TRACKER); // run the trials std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << *tracker << std::endl << std::endl; RETURNTYPE critval_train; da->setSplittingDepth(0); unsigned int trial=0; bool run=da->getFirstSplit(); if(!run) throw FST::fst_error("RandRand data split failed."); while(run) { trial++; std::cout << std::endl<<"TRIAL "<<trial<< " ---------------------------------------------------------------------"<<std::endl; da->setSplittingDepth(1); if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished."); tracker->add(critval_train,sub); std::cout << std::endl << "(TRIAL "<<trial<<") Search result: " << std::endl << *sub << "Criterion value=" << critval_train << std::endl; da->setSplittingDepth(0); run=da->getNextSplit(); } // evaluate stability using results collected by tracker std::cout<<std::endl; std::cout << "---------------------------------------------------------------------" << std::endl; std::cout << "Resulting criterion values' mean: " << tracker->value_mean() << ", standard deviation: " << tracker->value_stddev() << std::endl; std::cout << "Resulting subset sizes' mean: " << tracker->size_mean() << ", standard deviation: " << tracker->size_stddev() << std::endl; std::cout << std::endl; std::cout << "stability measure CWrel("<<da->getNoOfFeatures()<<")=" << tracker->stability_CWrel(da->getNoOfFeatures()) << std::endl; std::cout << "stability measure ATI()=" << tracker->stability_ATI() << std::endl; std::cout << "stability measure CW()=" << tracker->stability_CW() << std::endl; std::cout << "stability measure ANHI("<<da->getNoOfFeatures()<<")=" << tracker->stability_ANHI(da->getNoOfFeatures()) << std::endl; std::cout << "stability measure SH()=" << tracker->stability_SH() << std::endl; std::cout << "stability measure PH("<<da->getNoOfFeatures()<<")=" << tracker->stability_PH(da->getNoOfFeatures()) << std::endl; std::cout << "stability measure C()=" << tracker->stability_C() << std::endl; } catch(FST::fst_error &e) {std::cerr<<"FST ERROR: "<< e.what() << ", code=" << e.code() << std::endl;}
Example 55: Evaluating Similarity of Two Feature Selection Processes.
To study the difference in feature preferences among principally different feature selection methods, or among differently parametrized instances of the same method, FST3 provides measures capable of evaluating the level of similarity between two sets of trials [16]. In analogy to stability evaluation (see Example 54: Feature Selection Stability Evaluation), for each of the two feature selection scenarios a series of trials is conducted on various samplings of the same data. In this example ten feature selection trials are performed per scenario, each on a randomly sampled 95% of the data. In the first scenario the resulting subset of each trial is obtained using the DOS procedure, optimizing the 3-Nearest Neighbor accuracy estimated by means of 3-fold cross-validation. In the second scenario the resulting subset of each trial is obtained using the SFFS procedure, maximizing the Bhattacharyya distance based on the normal model. A selection of standard stability measures is evaluated separately for each of the two scenarios. Eventually the similarity of the two scenarios is evaluated using analogously founded similarity measures. All measures yield values from [0,1], where values close to 0 denote low stability/similarity and values close to 1 denote high stability/similarity. Note that in this experiment the inter-measures (IATI, ICW, IANHI) yield markedly lower values than the corresponding stability measures (ATI, CW, ANHI). This illustrates well that considerably different results can be expected from differently founded feature selection methods.
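The inter-measures are constructed analogously to their stability counterparts, except that subsets are compared only across the two scenarios. A plausible sketch of an IATI-like measure follows (an assumed formalization for illustration only; consult [16] for the exact definition), reusing the pairwise Tanimoto index from the previous sketch but over cross-scenario pairs:

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>
// mean Tanimoto index over cross-scenario pairs only (one subset from each scenario)
double inter_tanimoto(const std::vector<std::set<int> > &trialsA, const std::vector<std::set<int> > &trialsB)
{
	double sum=0.0; std::size_t pairs=0;
	for(std::size_t i=0; i<trialsA.size(); i++)
		for(std::size_t j=0; j<trialsB.size(); j++, pairs++)
		{
			std::vector<int> isect, uni;
			std::set_intersection(trialsA[i].begin(),trialsA[i].end(),trialsB[j].begin(),trialsB[j].end(),std::back_inserter(isect));
			std::set_union(trialsA[i].begin(),trialsA[i].end(),trialsB[j].begin(),trialsB[j].end(),std::back_inserter(uni));
			if(!uni.empty()) sum+=(double)isect.size()/(double)uni.size();
		}
	return (pairs>0) ? sum/(double)pairs : 0.0;
}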
try{
    typedef double RETURNTYPE;
    typedef double DATATYPE;
    typedef double REALTYPE;
    typedef unsigned int IDXTYPE;
    typedef unsigned int DIMTYPE;
    typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRANDRAND;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Distance_L1<DATATYPE,DIMTYPE,SUBSET> DISTANCEL1;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCEL1> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPER;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPER> EVALUATOR1;
    typedef FST::Criterion_Normal_Bhattacharyya<RETURNTYPE,DATATYPE,REALTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> BHATTCRIT;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,BHATTCRIT> EVALUATOR2;
    typedef FST::Result_Tracker_Stability_Evaluator<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKER;

    std::cout << "Starting Example 55: Evaluating Similarity of Two Feature Selection Processes..." << std::endl;
    // set-up ten trials where in each 95% of data is randomly sampled
    PSPLITTER dsp_outer(new SPLITTERRANDRAND(10/*splits=trials*/,95,5));
    // in the course of wrapper based feature subset search (in one trial) use 3-fold cross-validation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("RandRand data split failed.");
    da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up result trackers to collect results of each trial in both scenarios
    boost::shared_ptr<TRACKER> tracker1(new TRACKER);
    boost::shared_ptr<TRACKER> tracker2(new TRACKER);

    // FEATURE SELECTION SCENARIO A (wrapper)
    // set-up 3-Nearest Neighbor classifier based on L1 distances
    boost::shared_ptr<CLASSIFIERKNN> cknn1(new CLASSIFIERKNN);
    cknn1->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPER> wknn1(new WRAPPER);
    wknn1->initialize(cknn1,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble, threaded)
    boost::shared_ptr<EVALUATOR1> eval1(new EVALUATOR1);
    // set-up Dynamic Oscillating Search procedure
    FST::Search_DOS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPER,EVALUATOR1> srch1(eval1);
    srch1.set_delta(10);
    sub->deselect_all();
    // Technical remark: should threaded evaluator be used in this case, it would be necessary to move both the evaluator and search procedure set-up
    // inside the trial loop. The reason is technical: threaded evaluator caches criterion clones, including data accessor state.
    // Therefore no outside changes in splitting level nor current split change can be reflected in criterion evaluation. Renewed
    // evaluator set-up resets the cache and thus ensures correct threaded criterion evaluation behavior after split change.
    // run the trials
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << *wknn1 << std::endl << *tracker1 << std::endl << std::endl;
    RETURNTYPE critval_train;
    da->setSplittingDepth(0);
    unsigned int trial=0;
    bool run=da->getFirstSplit(); if(!run) throw FST::fst_error("RandRand data split failed.");
    while(run)
    {
        trial++;
        std::cout << std::endl << "TRIAL A" << trial << " ---------------------------------------------------------------------" << std::endl;
        da->setSplittingDepth(1);
        if(!srch1.search(0,critval_train,sub,wknn1,std::cout)) throw FST::fst_error("Search not finished.");
        tracker1->add(critval_train,sub);
        std::cout << std::endl << "(TRIAL A" << trial << ") Search result: " << std::endl << *sub << "Criterion value=" << critval_train << std::endl;
        da->setSplittingDepth(0);
        run=da->getNextSplit();
    }

    // FEATURE SELECTION SCENARIO B (filter)
    // set-up normal Bhattacharyya distance criterion
    boost::shared_ptr<BHATTCRIT> cb(new BHATTCRIT);
    // set-up the standard sequential search step object (option: hybrid, ensemble, threaded)
    boost::shared_ptr<EVALUATOR2> eval2(new EVALUATOR2);
    // set-up Sequential Forward Floating Selection search procedure
    FST::Search_SFFS<RETURNTYPE,DIMTYPE,SUBSET,BHATTCRIT,EVALUATOR2> srch2(eval2);
    srch2.set_search_direction(FST::FORWARD);
    // target subset size must be set because Bhattacharyya is monotonic with respect to subset size (i.e., it evaluates the full set as the best)
    const DIMTYPE target_size=7;
    // run the trials
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch2 << std::endl << *cb << std::endl << *tracker2 << std::endl << std::endl;
    trial=0;
    da->setSplittingDepth(0);
    run=da->getFirstSplit(); if(!run) throw FST::fst_error("RandRand data split failed.");
    while(run)
    {
        trial++;
        std::cout << std::endl << "TRIAL B" << trial << " ---------------------------------------------------------------------" << std::endl;
        cb->initialize(da); // (note that cb initialization = normal model parameter estimation on training data, therefore it must be repeated for each split)
        da->setSplittingDepth(1);
        if(!srch2.search(target_size,critval_train,sub,cb,std::cout)) throw FST::fst_error("Search not finished.");
        tracker2->add(critval_train,sub);
        std::cout << std::endl << "(TRIAL B" << trial << ") Search result: " << std::endl << *sub << "Criterion value=" << critval_train << std::endl;
        da->setSplittingDepth(0);
        run=da->getNextSplit();
    }

    // evaluate stability of each scenario and similarity of the two scenarios using results collected by trackers
    std::cout << std::endl;
    std::cout << "---------------------------------------------------------------------" << std::endl;
    std::cout << "Scenario A resulting criterion values' mean: " << tracker1->value_mean() << ", std. dev.: " << tracker1->value_stddev() << std::endl;
    std::cout << "Scenario A subset sizes' mean: " << tracker1->size_mean() << ", std. dev.: " << tracker1->size_stddev() << std::endl;
    std::cout << std::endl;
    std::cout << "Scenario A stability_ATI()=" << tracker1->stability_ATI() << std::endl;
    std::cout << "Scenario A stability_CW()=" << tracker1->stability_CW() << std::endl;
    std::cout << "Scenario A stability_ANHI(" << da->getNoOfFeatures() << ")=" << tracker1->stability_ANHI(da->getNoOfFeatures()) << std::endl;
    std::cout << std::endl;
    std::cout << "Scenario B resulting criterion values' mean: " << tracker2->value_mean() << ", std. dev.: " << tracker2->value_stddev() << std::endl;
    std::cout << "Scenario B subset sizes' mean: " << tracker2->size_mean() << ", std. dev.: " << tracker2->size_stddev() << std::endl;
    std::cout << std::endl;
    std::cout << "Scenario B stability_ATI()=" << tracker2->stability_ATI() << std::endl;
    std::cout << "Scenario B stability_CW()=" << tracker2->stability_CW() << std::endl;
    std::cout << "Scenario B stability_ANHI(" << da->getNoOfFeatures() << ")=" << tracker2->stability_ANHI(da->getNoOfFeatures()) << std::endl;
    std::cout << std::endl;
    std::cout << "Evaluating similarity between scenario A and scenario B:" << std::endl;
    std::cout << "similarity measure IATI()=" << tracker1->similarity_IATI(*tracker2) << std::endl;
    std::cout << "similarity measure ICW()=" << tracker1->similarity_ICW(*tracker2) << std::endl;
    std::cout << "similarity measure IANHI(" << da->getNoOfFeatures() << ")=" << tracker1->similarity_IANHI(da->getNoOfFeatures(), *tracker2) << std::endl;
}
catch(FST::fst_error &e) {std::cerr << "FST ERROR: " << e.what() << ", code=" << e.code() << std::endl;}
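For quick reference, the similarity evaluation itself reduces to a handful of calls on the two filled trackers. A minimal sketch, assuming the tracker1, tracker2 and da objects from the listing above (all calls appear in the listing; the variable names stabA, simAB etc. are introduced here only for illustration):

// pairwise stability within each scenario separately (values from [0,1])
RETURNTYPE stabA = tracker1->stability_ATI();
RETURNTYPE stabB = tracker2->stability_ATI();
// inter-measure counterpart: similarity between the two scenarios
RETURNTYPE simAB = tracker1->similarity_IATI(*tracker2);
// ANHI and IANHI additionally require the total number of features
RETURNTYPE anhiA   = tracker1->stability_ANHI(da->getNoOfFeatures());
RETURNTYPE ianhiAB = tracker1->similarity_IANHI(da->getNoOfFeatures(), *tracker2);

An IATI value markedly lower than both ATI values indicates that the two scenarios prefer substantially different feature subsets.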
Example 56: Revealing feature selection results' bias.
FST enables classifier bias estimation, i.e., estimation of the difference between classification accuracy on dependent (training) and on independent (test) data. Values close to 0 are preferable. A high estimated bias may indicate severe over-fitting on the evaluated feature subspace; detecting it may thus help to avoid poor performance on previously unseen data. In this example features are selected using the DOS algorithm with 3-NN wrapper classification accuracy as the FS criterion. The classification accuracy (i.e., the FS wrapper criterion value) is estimated on the first 50% of data samples by means of 3-fold cross-validation. The final classification performance on the selected subspace is eventually validated on the second 50% of the data.
try{
    typedef double RETURNTYPE;
    typedef double DATATYPE;
    typedef double REALTYPE;
    typedef unsigned int IDXTYPE;
    typedef unsigned int DIMTYPE;
    typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;
    typedef FST::Criterion_Wrapper_Bias_Estimate<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERBIAS;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> EVALUATOR;

    std::cout << "Starting Example 56: Revealing feature selection results' bias..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/sonar_60.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Dynamic Oscillating Search procedure
    FST::Search_DOS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval);
    srch.set_delta(1);
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test, biasval;
    srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default)
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // estimate classifier bias on training data with respect to the selected feature subset
    boost::shared_ptr<WRAPPERBIAS> wbias(new WRAPPERBIAS);
    wbias->initialize(cknn,da);
    //da->setSplittingDepth(0);
    if(!wbias->evaluate(biasval,sub)) throw FST::fst_error("Bias estimation failed.");
    std::cout << "Estimated bias=" << biasval << std::endl << std::endl;
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    std::cout << "Validated " << cknn->get_k() << "-NN accuracy=" << critval_test << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr << "FST ERROR: " << e.what() << ", code=" << e.code() << std::endl;}
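The bias estimate itself is produced by a second wrapper around the same classifier; everything else in the listing is the usual wrapper-based search set-up. A condensed fragment of the listing above (assuming the trained cknn, the data accessor da and the selected subset sub):

// wrap the classifier once more, this time as a bias estimator
boost::shared_ptr<WRAPPERBIAS> wbias(new WRAPPERBIAS);
wbias->initialize(cknn,da);
RETURNTYPE biasval;
if(!wbias->evaluate(biasval,sub)) throw FST::fst_error("Bias estimation failed.");
std::cout << "Estimated bias=" << biasval << std::endl; // values close to 0 are preferable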
Complementary scenarios
Example 60: Detecting alternative feature selection results with equal criterion value.
In many feature selection tasks multiple solutions may exist that yield the same maximal criterion value. Nevertheless, standard feature selection procedures are defined to yield one solution only. To reveal possible alternatives that would otherwise remain neglected, FST3 implements result tracking, i.e., the search procedure can be instructed to store all candidate solutions evaluated in the course of the search. Eventually, all solutions yielding the same maximum criterion value can be retrieved. Here we demonstrate a case in which many equivalent solutions exist. In an optical digit recognition task with features representing pixels we use Dynamic Oscillating Search with a 3-fold cross-validated 3-Nearest Neighbor wrapper, trained on 50% of the data, to select the feature subset.
try{
    typedef double RETURNTYPE;
    typedef double DATATYPE;
    typedef double REALTYPE;
    typedef unsigned int IDXTYPE;
    typedef unsigned int DIMTYPE;
    typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> EVALUATOR;
    typedef FST::Result_Tracker_Dupless<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET> TRACKER;

    std::cout << "Starting Example 60: Detecting alternative feature selection results with equal criterion value..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/optdigits_64.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Dynamic Oscillating Search procedure
    FST::Search_DOS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval);
    srch.set_delta(3);
    // set-up result tracker to enable logging of all candidate solutions
    boost::shared_ptr<TRACKER> tracker(new TRACKER);
    // let the tracker register only solutions no worse than "the best known criterion value minus 0.05"
    tracker->set_inclusion_margin(0.05);
    // register the result tracker with the used search step object
    eval->enable_result_tracking(tracker);
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << *tracker << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default)
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    std::cout << "Validated " << cknn->get_k() << "-NN accuracy=" << critval_test << std::endl << std::endl;
    // report tracker contents
    std::cout << "Result tracker records " << tracker->size(0.0) << " solutions with criterion value equal to " << critval_train << "." << std::endl << std::endl;
    for(unsigned int i=1;i<10;i++)
        std::cout << "Result tracker records " << tracker->size((double)i*0.005) << " solutions with criterion value greater than or equal to " << critval_train-(double)i*0.005 << "." << std::endl << std::endl;
    TRACKER::PResultRec result;
    if(tracker->get_first(result) && tracker->size(0.0)>1)
    {
        RETURNTYPE firstvalue=result->value;
        std::cout << "All recorded feature subsets yielding the same best known criterion value " << firstvalue << ":" << std::endl;
        while(tracker->get_next(result) && result->value==firstvalue) std::cout << *(result->sub) << std::endl;
    }
}
catch(FST::fst_error &e) {std::cerr << "FST ERROR: " << e.what() << ", code=" << e.code() << std::endl;}
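The retrieval interface is the part worth remembering: tracker->size(margin) returns the number of recorded subsets whose criterion value lies within the given margin of the best known value, while get_first()/get_next() iterate the records in order of decreasing criterion value. A condensed sketch (assuming the filled tracker and critval_train from the listing above; the variable names r and best are introduced here for illustration):

// count the solutions tying with the best known criterion value
std::cout << tracker->size(0.0) << " solutions with value " << critval_train << std::endl;
// enumerate them, stopping as soon as the value drops below the best one
TRACKER::PResultRec r;
if(tracker->get_first(r)) {
    const RETURNTYPE best=r->value;
    while(tracker->get_next(r) && r->value==best) std::cout << *(r->sub) << std::endl;
}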
Example 61: Feature selection that respects pre-specified feature weights.
In many applications it is desirable to optimize feature subsets not only with respect to the primary objective (e.g., decision rule accuracy), but also with respect to additional factors such as known feature acquisition cost. There may be only a negligible difference in discriminatory ability among several features, while the cost of measuring their values differs considerably; in such a case it is certainly better to select the cheaper feature. In other scenarios it may even be advantageous to trade a minor degradation of classification accuracy for a substantial saving in measurement acquisition cost. For such cases FST3 implements a mechanism that makes it possible to control the feature accuracy vs. feature cost trade-off. It relies on result tracking and the subsequent selection of an alternative solution that minimizes the sum of pre-specified feature weights. The lower-weight solution is selected from the pool of all known solutions that differ from the best one by less than a user-specified margin (the permitted primary criterion value difference from the known maximum value). In this example we illustrate how to add this mechanism to standard wrapper-based feature selection. Features are first selected so as to maximize 3-Nearest Neighbor accuracy; then several lower-weight solutions are identified and validated for various margin values.
try{
    typedef double RETURNTYPE;
    typedef double DATATYPE;
    typedef double REALTYPE;
    typedef unsigned int IDXTYPE;
    typedef unsigned int DIMTYPE;
    typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;
    typedef FST::Criterion_Sum_Of_Weights<RETURNTYPE,DIMTYPE,SUBSET> WEIGHCRIT;
    typedef FST::Criterion_Negative<WEIGHCRIT,RETURNTYPE,SUBSET> NEGATIVECRIT;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> EVALUATOR;
    typedef FST::Result_Tracker_Regularizer<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,NEGATIVECRIT> TRACKER;

    std::cout << "Starting Example 61: Feature selection that respects pre-specified feature weights..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble, threaded, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Sequential Forward Floating Selection search procedure
    FST::Search_SFFS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval);
    srch.set_search_direction(FST::FORWARD);
    // set-up tracker of intermediate results
    boost::shared_ptr<TRACKER> tracker(new TRACKER);
    // register the result tracker with the used search step object
    eval->enable_result_tracking(tracker);
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << *tracker << std::endl << std::endl;
    RETURNTYPE critval_train, critval_test;
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    cknn->test(critval_test,da);
    if(!wknn->evaluate(critval_train,sub)) throw FST::fst_error("crit call failure.");
    std::cout << "Validated " << cknn->get_k() << "-NN accuracy=" << critval_test << ", crit value=" << critval_train << std::endl << std::endl;
    // set-up the secondary criterion to minimize the sum of feature weights
    // (note that the concrete weight values shown here are sample values only)
    RETURNTYPE feature_cost[]={1, 1.2, 1, 1.3, 1.02, 2.4, 3.9, 1.2, 7.1, 22, 9.52, 1.08, 3.27, 1.44, 1.04};
    assert(sizeof(feature_cost)/sizeof(RETURNTYPE)==da->getNoOfFeatures());
    boost::shared_ptr<WEIGHCRIT> weightsumcrit(new WEIGHCRIT);
    weightsumcrit->initialize(da->getNoOfFeatures(),feature_cost);
    boost::shared_ptr<NEGATIVECRIT> critminw(new NEGATIVECRIT(weightsumcrit));
    // select final solution among those recorded by tracker (show more alternatives for various margins)
    for(unsigned int i=0; i<10; i++)
    {
        const RETURNTYPE margin=(double)i*0.005;
        if(!tracker->optimize_within_margin(margin,critval_train,critval_test,sub,critminw)) throw FST::fst_error("tracker->optimize_within_margin() failed.");
        std::cout << std::endl << "Weight-optimized result (primary criterion margin=" << margin << "): " << std::endl << *sub << "Criterion value=" << critval_train << std::endl << "Sum of weights=" << -critval_test << std::endl;
        // (optionally) validate result by estimating kNN accuracy on selected feature sub-space on independent test data
        da->setSplittingDepth(0);
        cknn->train(da,sub);
        cknn->test(critval_test,da);
        std::cout << "Validated " << cknn->get_k() << "-NN accuracy=" << critval_test << std::endl << std::endl;
    }
}
catch(FST::fst_error &e) {std::cerr << "FST ERROR: " << e.what() << ", code=" << e.code() << std::endl;}
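The weight-respecting mechanism is concentrated in the secondary criterion and the margin-constrained optimization over tracked results. A condensed fragment (assuming the filled tracker, the feature_cost array, da and sub from the listing above; the margin value 0.01 and the names primaryval/negweightsum are arbitrary choices for illustration):

// secondary criterion: sum of feature weights, negated so that minimization becomes maximization
boost::shared_ptr<WEIGHCRIT> weightsumcrit(new WEIGHCRIT);
weightsumcrit->initialize(da->getNoOfFeatures(),feature_cost);
boost::shared_ptr<NEGATIVECRIT> critminw(new NEGATIVECRIT(weightsumcrit));
// among all tracked subsets within the margin of the best primary criterion value,
// select the one with the lowest weight sum
RETURNTYPE primaryval, negweightsum;
if(!tracker->optimize_within_margin(0.01,primaryval,negweightsum,sub,critminw))
    throw FST::fst_error("optimize_within_margin() failed.");
std::cout << "Sum of weights=" << -negweightsum << std::endl;

Larger margins admit cheaper subsets at the cost of a larger permitted drop in the primary criterion value.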
Example 62: (Missing data substitution) Combined feature subset contents, size and SVM parameters optimization.
In many tasks some of the values in the data set are missing. Provided such values are marked by a unique value, FST can handle such data -- in the pre-processing phase each missing value is substituted by the average of all valid values of the respective feature. This example processes incomplete data, but otherwise follows Example 23: Combined feature subset contents, size and SVM parameters optimization -- a repeated sequence of two consecutive operations, feature subset search followed by SVM parameter optimization for the current subset. Features are selected using the DOS algorithm and SVM (with sigmoid kernel) wrapper classification accuracy as the FS criterion. 50% of the data is randomly chosen to form the training dataset (which remains fixed throughout), and 40% of the data is randomly chosen for eventual validation of classification performance on the finally selected subspace (the training and test parts are disjoint and together cover 90% of the original data). The training data part is accessed by means of 3-fold cross-validation in the course of search. The optimization process consists of repeated consecutive calls of two procedures: SVM parameter optimization followed by DOS feature subset optimization (SVM parameters are optimized on the currently best known feature subset, which is then used to initialize the next DOS search). The calls are repeated as long as better SVM performance on the training data is achieved.
try{
    typedef double RETURNTYPE;
    typedef double DATATYPE;
    typedef double REALTYPE;
    typedef unsigned int IDXTYPE;
    typedef unsigned int DIMTYPE;
    typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_RandomRandom<INTERVALLER,IDXTYPE,BINTYPE> SPLITTERRR;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Classifier_LIBSVM<RETURNTYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR> CLASSIFIERSVM;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERSVM,DATAACCESSOR> WRAPPERSVM;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM> EVALUATOR;

    std::cout << "Starting Example 62: (Missing data substitution) Combined feature subset contents, size and SVM parameters optimization..." << std::endl;
    // randomly sample 50% of data for training and randomly sample (disjoint) 40% for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTERRR(1, 50, 40)); // (there will be one outer randomized split only)
    // in the course of search use the training data part by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data; only substitute missing values (marked by missing_value_code) by the per-feature mean of valid values
    const DATATYPE missing_value_code=5;
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>(missing_value_code,1/*to choose correct constructor*/));
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/sonar_60_missing_data.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/40 random data split failed.");
    da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected + another one as temporary storage
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    boost::shared_ptr<SUBSET> sub_temp(new SUBSET(da->getNoOfFeatures()));
    // set-up SVM (interface to external library LibSVM)
    boost::shared_ptr<CLASSIFIERSVM> csvm(new CLASSIFIERSVM);
    csvm->set_kernel_type(SIGMOID); // (option: LINEAR, RBF, POLY)
    csvm->initialize(da);
    // wrap the SVM classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERSVM> wsvm(new WRAPPERSVM);
    wsvm->initialize(csvm,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Dynamic Oscillating Search procedure
    FST::Search_DOS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERSVM,EVALUATOR> srch(eval);
    srch.set_delta(3);
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wsvm << std::endl << std::endl;
    RETURNTYPE bestcritval_train, critval_train, critval_test;
    sub->select_all();
    csvm->optimize_parameters(da,sub);
    double best_svm_param_C=csvm->get_parameter_C();
    double best_svm_param_gamma=csvm->get_parameter_gamma();
    double best_svm_param_coef0=csvm->get_parameter_coef0();
    bool stop=false;
    sub->deselect_all();
    if(!srch.search(0,bestcritval_train,sub,wsvm,std::cout)) throw FST::fst_error("Search not finished.");
    sub_temp->stateless_copy(*sub);
    while(!stop)
    {
        csvm->optimize_parameters(da,sub);
        if(!srch.search(0,critval_train,sub_temp,wsvm,std::cout)) throw FST::fst_error("Search not finished.");
        if(critval_train>bestcritval_train)
        {
            bestcritval_train=critval_train;
            sub->stateless_copy(*sub_temp);
            best_svm_param_C=csvm->get_parameter_C();
            best_svm_param_gamma=csvm->get_parameter_gamma();
            best_svm_param_coef0=csvm->get_parameter_coef0();
        }
        else stop=true;
    }
    std::cout << std::endl << "Search result: " << std::endl << *sub << std::endl << "Criterion value=" << bestcritval_train << std::endl << std::endl;
    // (optionally) validate result by estimating SVM accuracy on selected feature sub-space on independent test data
    da->setSplittingDepth(0);
    csvm->set_parameter_C(best_svm_param_C);
    csvm->set_parameter_gamma(best_svm_param_gamma);
    csvm->set_parameter_coef0(best_svm_param_coef0);
    csvm->train(da,sub);
    csvm->test(critval_test,da);
    std::cout << "Validated SVM accuracy=" << critval_test << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr << "FST ERROR: " << e.what() << ", code=" << e.code() << std::endl;}
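Note that the missing-data handling is confined to a single step of the whole set-up: the Data_Scaler_void object is constructed with the code that marks missing values, and the substitution by per-feature means then happens transparently during data accessor initialization. The relevant fragment from the listing above (the value 5 is specific to the example data file):

const DATATYPE missing_value_code=5; // the unique value marking missing entries
// the extra argument selects the missing-value-substituting constructor
boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(
    new FST::Data_Scaler_void<DATATYPE>(missing_value_code,1/*to choose correct constructor*/));

The remainder of the pipeline (splitting, search, validation) is identical to the complete-data case.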
Example 63: Classification of new data samples on the selected subspace.
This is merely a technical demo showing how to call the Classifier::classify() method to classify new data samples on the selected feature subspace.
Note:
This is an extension of Example 11: Wrapper-based feature selection with Floating Search.
try{
    typedef double RETURNTYPE;
    typedef double DATATYPE;
    typedef double REALTYPE;
    typedef unsigned int IDXTYPE;
    typedef unsigned int DIMTYPE;
    typedef short BINTYPE;
    typedef FST::Subset<BINTYPE, DIMTYPE> SUBSET;
    typedef FST::Data_Intervaller<std::vector<FST::Data_Interval<IDXTYPE> >,IDXTYPE> INTERVALLER;
    typedef boost::shared_ptr<FST::Data_Splitter<INTERVALLER,IDXTYPE> > PSPLITTER;
    typedef FST::Data_Splitter_CV<INTERVALLER,IDXTYPE> SPLITTERCV;
    typedef FST::Data_Splitter_5050<INTERVALLER,IDXTYPE> SPLITTER5050;
    typedef FST::Data_Accessor_Splitting_MemTRN<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // for TRN data format
    //typedef FST::Data_Accessor_Splitting_MemARFF<DATATYPE,IDXTYPE,INTERVALLER> DATAACCESSOR; // uncomment for ARFF data format
    typedef FST::Distance_Euclid<DATATYPE,DIMTYPE,SUBSET> DISTANCE;
    typedef FST::Classifier_kNN<RETURNTYPE,DATATYPE,IDXTYPE,DIMTYPE,SUBSET,DATAACCESSOR,DISTANCE> CLASSIFIERKNN;
    typedef FST::Criterion_Wrapper<RETURNTYPE,SUBSET,CLASSIFIERKNN,DATAACCESSOR> WRAPPERKNN;
    typedef FST::Sequential_Step_Straight<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN> EVALUATOR;

    std::cout << "Starting Example 63: Classification of new data samples on the selected subspace..." << std::endl;
    // keep second half of data for independent testing of final classification performance
    PSPLITTER dsp_outer(new SPLITTER5050());
    // in the course of search use the first half of data by 3-fold cross-validation in wrapper FS criterion evaluation
    PSPLITTER dsp_inner(new SPLITTERCV(3));
    // do not scale data
    boost::shared_ptr<FST::Data_Scaler<DATATYPE> > dsc(new FST::Data_Scaler_void<DATATYPE>());
    // set-up data access
    boost::shared_ptr<std::vector<PSPLITTER> > splitters(new std::vector<PSPLITTER>);
    splitters->push_back(dsp_outer);
    splitters->push_back(dsp_inner);
    boost::shared_ptr<DATAACCESSOR> da(new DATAACCESSOR("data/speech_15.trn",splitters,dsc));
    da->initialize();
    // initiate access to split data parts
    da->setSplittingDepth(0); if(!da->getFirstSplit()) throw FST::fst_error("50/50 data split failed.");
    da->setSplittingDepth(1); if(!da->getFirstSplit()) throw FST::fst_error("3-fold cross-validation failure.");
    // initiate the storage for subset to-be-selected
    boost::shared_ptr<SUBSET> sub(new SUBSET(da->getNoOfFeatures()));
    sub->deselect_all();
    // set-up 3-Nearest Neighbor classifier based on Euclidean distances
    boost::shared_ptr<CLASSIFIERKNN> cknn(new CLASSIFIERKNN);
    cknn->set_k(3);
    // wrap the 3-NN classifier to enable its usage as FS criterion (criterion value will be estimated by 3-fold cross-val.)
    boost::shared_ptr<WRAPPERKNN> wknn(new WRAPPERKNN);
    wknn->initialize(cknn,da);
    // set-up the standard sequential search step object (option: hybrid, ensemble, etc.)
    boost::shared_ptr<EVALUATOR> eval(new EVALUATOR);
    // set-up Sequential Forward Floating Selection search procedure
    FST::Search_SFFS<RETURNTYPE,DIMTYPE,SUBSET,WRAPPERKNN,EVALUATOR> srch(eval);
    srch.set_search_direction(FST::FORWARD); // try FST::BACKWARD
    // run the search
    std::cout << "Feature selection setup:" << std::endl << *da << std::endl << srch << std::endl << *wknn << std::endl << std::endl;
    RETURNTYPE critval_train;
    srch.set_output_detail(FST::NORMAL); // set FST::SILENT to disable all text output in the course of search (FST::NORMAL is default)
    if(!srch.search(0,critval_train,sub,wknn,std::cout)) throw FST::fst_error("Search not finished.");
    // the following demonstrates how to classify new data samples on the selected feature subspace
    DATATYPE dummy1[]={1.066, 0.33, 0.24, 1.15, 0.019, 0.295, 1.455, 0.269, 0.127, 1.55, 0.341, 0.075, 1.38, 0.36, 0.159};
    assert(sizeof(dummy1)/sizeof(DATATYPE)==da->getNoOfFeatures());
    DATATYPE dummy2[]={0.13, 0.562, 0.908, -0.21, 1.03, 0.23, 0.425, 0.782, -0.01, 0.29, 0.198, 0.66, 0.76, -0.34, 0.122};
    assert(sizeof(dummy2)/sizeof(DATATYPE)==da->getNoOfFeatures());
    da->setSplittingDepth(0);
    cknn->train(da,sub);
    DIMTYPE cls;
    if(!cknn->classify(cls,dummy1)) throw FST::fst_error("Classification failed.");
    std::cout << "The sample 'dummy1' has been classified to class " << cls << "." << std::endl << std::endl;
    if(!cknn->classify(cls,dummy2)) throw FST::fst_error("Classification failed.");
    std::cout << "The sample 'dummy2' has been classified to class " << cls << "." << std::endl << std::endl;
}
catch(FST::fst_error &e) {std::cerr << "FST ERROR: " << e.what() << ", code=" << e.code() << std::endl;}
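Once the classifier has been trained on the selected subset, classifying a new sample is a single call. Note that the sample buffer must contain values for all original features (see the asserts in the listing above); the classifier itself restricts attention to the features selected in sub during train(). Condensed fragment, assuming the trained cknn and the dummy1 buffer from the listing above:

DIMTYPE cls;
if(!cknn->classify(cls,dummy1)) throw FST::fst_error("Classification failed.");
std::cout << "Classified to class " << cls << "." << std::endl;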