Table 5 Characteristic quantification values and performance assessment of algorithms applied to the 12 research datasets

From: Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications

| Dataset | Sample size | Number of attributes | Number of classes | Cor1 | Cor2 | Class entropy | Balance | Well-performed algorithm rank |
|---|---|---|---|---|---|---|---|---|
| Iris | 150 | 4 | 3 | 0.9565 | 0.9629 | 0.4771 | 1 | Ensemble, single classifier |
| Adult | 30,162 | 13 | 2 | 0.3353 | −0.5849 | 0.2437 | 3.017 | Ensemble, C4.5 |
| Wine | 178 | 13 | 3 | −0.8475 | 0.8646 | 0.4717 | 1.479 | Ensemble, LR, SVM, other |
| Car evaluation | 1728 | 6 | 4 | 0.4393 | 0 | 0.3630 | 18.62 | Ensemble, C4.5, kNN^a |
| Breast cancer Wisconsin | 683 | 9 | 2 | 0.8227 | 0.9072 | 0.2812 | 1.858 | Ensemble, kNN, C4.5, SVM |
| Wdbc | 569 | 30 | 2 | 0.7936 | 0.9979 | 0.2868 | 1.684 | Ensemble, LR, C4.5, kNN, SVM |
| Wpbc | 194 | 31 | 2 | −0.3460 | 0.9959 | 0.2379 | 3.217 | Ensemble, C4.5, kNN^a |
| Abalone | 4177 | 8 | 28 | 0.6276 | 0.9868 | 1.084 | 689 | RF, kNN, C4.5 |
| Wine quality_red | 1599 | 11 | 6 | 0.4762 | −0.6830 | 0.5145 | 68.1 | RF, C4.5, kNN |
| Wine quality_white | 4898 | 11 | 7 | 0.4356 | 0.8390 | 0.5604 | 439.6 | RF^b, C4.5, kNN |
| Heart disease | 297 | 13 | 5 | 0.5212 | 0.5790 | 0.5577 | 12.31 | RF, kNN, AB, C4.5 |
| Poker hand | 25,010 | 10 | 10 | 0.0102 | −0.0303 | 0.4277 | 2498.6 | kNN, C4.5 |

  1. ‘Other’ in the last column denotes the remaining algorithms besides those already listed
  2. ^a kNN has higher sensitivity on a particular class, i.e., kNN achieves higher accuracy when predicting that class
  3. ^b RF required more memory, so 2000 instances were randomly sampled as the training set; RF then showed high classification accuracy and acceptable running speed
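
The class-level characteristics in the table (class entropy and balance) can be reproduced from the class label counts alone. Below is a minimal sketch, assuming class entropy is the Shannon entropy of the class distribution with a base-10 logarithm and balance is the majority-to-minority class-count ratio; both assumptions are consistent with the tabulated values (e.g., Iris yields 0.4771 and 1).

```python
# Sketch (not from the paper) of how the class-based characteristics in
# Table 5 appear to be defined, assuming:
#   class entropy = Shannon entropy of the class distribution, log base 10
#   balance       = (size of largest class) / (size of smallest class)
from collections import Counter
from math import log10

def class_entropy(labels):
    """Entropy of the class label distribution, using a base-10 logarithm."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log10(c / n) for c in counts.values())

def balance(labels):
    """Ratio of the majority-class count to the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

if __name__ == "__main__":
    # Iris has 50 samples in each of its 3 classes.
    iris_labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
    print(round(class_entropy(iris_labels), 4))  # 0.4771
    print(balance(iris_labels))                  # 1.0
```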