Table 5 Characteristic quantification values and performance assessment of algorithms applied to the 12 research datasets

From: Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications

| Dataset | Sample size | Number of attributes | Number of classes | Cor1 | Cor2 | Class entropy | Balance | Well-performed algorithm rank |
|---|---|---|---|---|---|---|---|---|
| Iris | 150 | 4 | 3 | 0.9565 | 0.9629 | 0.4771 | 1 | Ensemble, single classifier |
| Adult | 30,162 | 13 | 2 | 0.3353 | −0.5849 | 0.2437 | 3.017 | Ensemble, C4.5 |
| Wine | 178 | 13 | 3 | −0.8475 | 0.8646 | 0.4717 | 1.479 | Ensemble, LR, SVM, other |
| Car evaluation | 1728 | 6 | 4 | 0.4393 | 0 | 0.3630 | 18.62 | Ensemble, C4.5, kNN^a |
| Breast cancer Wisconsin | 683 | 9 | 2 | 0.8227 | 0.9072 | 0.2812 | 1.858 | Ensemble, kNN, C4.5, SVM |
| Wdbc | 569 | 30 | 2 | 0.7936 | 0.9979 | 0.2868 | 1.684 | Ensemble, LR, C4.5, kNN, SVM |
| Wpbc | 194 | 31 | 2 | −0.3460 | 0.9959 | 0.2379 | 3.217 | Ensemble, C4.5, kNN^a |
| Abalone | 4177 | 8 | 28 | 0.6276 | 0.9868 | 1.084 | 689 | RF, kNN, C4.5 |
| Wine quality_red | 1599 | 11 | 6 | 0.4762 | −0.6830 | 0.5145 | 68.1 | RF, C4.5, kNN |
| Wine quality_white | 4898 | 11 | 7 | 0.4356 | 0.8390 | 0.5604 | 439.6 | RF^b, C4.5, kNN |
| Heart disease | 297 | 13 | 5 | 0.5212 | 0.5790 | 0.5577 | 12.31 | RF, kNN, AB, C4.5 |
| Poker hand | 25,010 | 10 | 10 | 0.0102 | −0.0303 | 0.4277 | 2498.6 | kNN, C4.5 |
  1. ‘Other’ in the last column denotes the remaining algorithms besides those already listed.
  2. ^a kNN shows higher sensitivity for a particular class, i.e., kNN achieves higher accuracy when predicting that class.
  3. ^b RF required too much memory, so 2000 instances were randomly sampled as the training set; on that sample, RF showed high classification accuracy and acceptable running speed.
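The class entropy and balance columns appear consistent with a base-10 Shannon entropy of the class-label distribution and a majority-to-minority class-size ratio: a balanced three-class set such as Iris gives log10(3) ≈ 0.4771 and a balance of 1, matching the table. Cor1 and Cor2 are the correlation measures defined in the paper and are not reproduced here. The following is a minimal sketch under that assumption; the helper names class_entropy and balance are illustrative and not taken from the paper.

```python
from collections import Counter
from math import log10


def class_entropy(labels):
    """Shannon entropy (base 10) of the class-label distribution.

    With this convention, 3 perfectly balanced classes yield
    log10(3) ≈ 0.4771, matching the Iris row of Table 5.
    """
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log10(c / n) for c in counts.values())


def balance(labels):
    """Ratio of the largest to the smallest class size (1 = perfectly balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    # Toy check with an Iris-like label vector: 3 classes, 50 samples each.
    labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
    print(round(class_entropy(labels), 4))  # 0.4771
    print(balance(labels))                  # 1.0
```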