Skip to content

An implementation of simplified C4.5 algorithm with postpruning.

Notifications You must be signed in to change notification settings

usualwitch/decision-tree

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Decision Tree

An implementation of the simplified C4.5 algorithm.

Use milksets package for loading UCI datasets.

Installation

git clone /~https://github.com/usualwitch/decision-tree.git
cd decision-tree
pip install .

Usage

from milksets import seeds
from dt import DecisionTree
X, y = seeds.load()
tr = DecisionTree(X, y, prune_method='Comp', loss='entropy')
tr.postprune()

Algorithms

For details of the tree building and pruning algorithms, see this post.

The general cost-complexity pruning method we employ is to first build a tree without pruning, and calculate alphas from this tree. Then we use cross validation to select a best alpha to prune the tree.

Experiments

Comparison of C4.5 and CART

Both algorithms use cost-complexity pruning strategy. We applied a 5-fold stratified cross-validation to find the best alpha. The results are listed in the following table.

name  tree  alpha  acc 
german  C4.5  0.0  0.700 
CART  0.0  0.675 
0.5  0.704 
iris  C4.5  0.0  0.430 
CART  0.0  0.933 
page_blocks  C4.5  0.0  0.721 
CART  0.0  0.963 
seeds  C4.5  0.0  0.774 
CART  0.0  0.936 
wine  C4.5  0.0  0.636 
CART  0.0  0.936 

In general CART works better than C4.5. This simplified version of C4.5’s test error is quite unstable with regard to the random choice of train and test sets, which should be resolved to make it applicable.

Different prune methods and loss functions

The experiment result is displayed in the picture below.

Prune methods correspond to the three rows, including

  • Reduce error pruning (error can be substituted by another loss function)
  • Pessimistic pruning
  • Cost-complexity pruning

Loss functions correspond to the three columns, including

  • empirical entropy
  • gini impurity
  • 0-1 loss

From the figure,

  • Reduce error pruning and pessimistic pruning is insensitive to loss functions.
  • In C4.5, where we use entropy to build the tree, cost-complexity pruning works pretty bad with 0-1 loss. Using entropy as loss in postpruning is a more coherent choice.

The average accuracy of the prune methods and losses are listed in the table below.

prune method  loss  acc 
cost-complexity  0-1  0.517 
entropy  0.671 
gini  0.601 
pessimistic  0-1  0.489 
entropy  0.469 
gini  0.489 
reduce error  0-1  0.654 
entropy  0.617 
gini  0.620 
The best choice among these is the cost-complexity pruning with entropy loss.

Reference

  1. J. Ross Quinlan. C4.5: programs for machine learning (1993). https://dl.acm.org/doi/book/10.5555/152181

  2. Mingers, J. An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning, 4, 227–243 (1989). https://doi.org/10.1023/A:1022604100933

  3. Hastie, T.; Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning, Springer New York Inc. , New York, NY, USA (2001).

License

MIT

About

An implementation of simplified C4.5 algorithm with postpruning.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published