Table 5 Characteristic quantification values and performance assessment of algorithms applied to the 12 research datasets

From: Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications

| Dataset | Sample size | Number of attributes | Number of classes | Cor1 | Cor2 | Class entropy | Balance | Well-performed algorithm rank |
|---|---|---|---|---|---|---|---|---|
| Iris | 150 | 4 | 3 | 0.9565 | 0.9629 | 0.4771 | 1 | Ensemble, single classifier |
| Adult | 30,162 | 13 | 2 | 0.3353 | −0.5849 | 0.2437 | 3.017 | Ensemble, C4.5 |
| Wine | 178 | 13 | 3 | −0.8475 | 0.8646 | 0.4717 | 1.479 | Ensemble, LR, SVM, other |
| Car evaluation | 1728 | 6 | 4 | 0.4393 | 0 | 0.3630 | 18.62 | Ensemble, C4.5, kNN^a |
| Breast cancer Wisconsin | 683 | 9 | 2 | 0.8227 | 0.9072 | 0.2812 | 1.858 | Ensemble, kNN, C4.5, SVM |
| Wdbc | 569 | 30 | 2 | 0.7936 | 0.9979 | 0.2868 | 1.684 | Ensemble, LR, C4.5, kNN, SVM |
| Wpbc | 194 | 31 | 2 | −0.3460 | 0.9959 | 0.2379 | 3.217 | Ensemble, C4.5, kNN^a |
| Abalone | 4177 | 8 | 28 | 0.6276 | 0.9868 | 1.084 | 689 | RF, kNN, C4.5 |
| Wine quality_red | 1599 | 11 | 6 | 0.4762 | −0.6830 | 0.5145 | 68.1 | RF, C4.5, kNN |
| Wine quality_white | 4898 | 11 | 7 | 0.4356 | 0.8390 | 0.5604 | 439.6 | RF^b, C4.5, kNN |
| Heart disease | 297 | 13 | 5 | 0.5212 | 0.5790 | 0.5577 | 12.31 | RF, kNN, AB, C4.5 |
| Poker hand | 25,010 | 10 | 10 | 0.0102 | −0.0303 | 0.4277 | 2498.6 | kNN, C4.5 |

  1. ‘Other’ in the last column denotes the remaining algorithms besides those already listed
  2. ^a kNN has higher sensitivity on a particular class, i.e., kNN achieves higher accuracy when predicting that class
  3. ^b RF required more memory, so 2000 instances were randomly sampled as the training set; RF then showed high classification accuracy and acceptable running speed
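
The class-level characteristics in the table (class entropy and balance) can be reproduced from the class label counts alone. Below is a minimal sketch, assuming class entropy is the Shannon entropy of the class distribution with a base-10 logarithm and balance is the majority-to-minority class-count ratio; both assumptions are consistent with the tabulated values (e.g., Iris yields 0.4771 and 1).

```python
# Sketch (not from the paper) of how the class-based characteristics in
# Table 5 appear to be defined, assuming:
#   class entropy = Shannon entropy of the class distribution, log base 10
#   balance       = (size of largest class) / (size of smallest class)
from collections import Counter
from math import log10

def class_entropy(labels):
    """Entropy of the class label distribution, using a base-10 logarithm."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log10(c / n) for c in counts.values())

def balance(labels):
    """Ratio of the majority-class count to the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

if __name__ == "__main__":
    # Iris has 50 samples in each of its 3 classes.
    iris_labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
    print(round(class_entropy(iris_labels), 4))  # 0.4771
    print(balance(iris_labels))                  # 1.0
```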