Pessimistic Error Pruning (PEP) illustrated with a C4.5 Python implementation

------------------get the datasets-----------------------------------
We use the following datasets:

The task is to predict the age of an abalone:
age = number of rings + 1.5

Of course, regression methods could be used to reach this target.
But to test our PEP pruning algorithm, we decided to use C4.5 to predict the ring count (the final column of the dataset above).

There are 28 values of "ring" in total; they are

I have made sure that no data item has the value "28", which is why it is missing from the line above.

1st step: sort the dataset above by rings.
2nd step: select the first 200 records as the final training set.
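
The two steps above can be sketched as follows. This is a minimal sketch, assuming comma-separated records whose last field is the ring count (as stated above) and assuming ties keep their original file order; the helper name select_training_rows and the tiny sample are hypothetical, not the author's script or real abalone records.

```python
import csv
import io


def select_training_rows(rows, n=200):
    """Sort rows by ring count (last column) and keep the first n rows."""
    # sorted() is stable, so rows with equal ring counts keep their file order.
    ordered = sorted(rows, key=lambda row: int(row[-1]))
    return ordered[:n]


# Made-up sample in a "sex, measurement, rings" layout (assumption, not real data).
sample = io.StringIO("M,0.455,15\nI,0.330,7\nF,0.530,9\nI,0.210,4\n")
rows = list(csv.reader(sample))

train = select_training_rows(rows, n=2)
print(train)
```

Because the rows are sorted by rings before truncation, the 200 selected records only contain the smallest ring values, which is consistent with the class labels that appear in the tree further down.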

------------------get the model-----------------------------------
Use the simplified tree (after EBP pruning) obtained from running

The running instructions are at:

Then transform the following model:

unpruned Decision Tree C4.5-Release8

Viscera <= 0.0145 :
| Shucked > 0.007 : 4 (66.0/31.0)
| Shucked <= 0.007 :
| | Shucked <= 0.0045 :
| | | Height <= 0.025 : 1 (2.0/1.0)
| | | Height > 0.025 : 3 (2.0)
| | Shucked > 0.0045 :
| | | Shucked <= 0.005 : 4 (3.0)
| | | Shucked > 0.005 :
| | | | Height <= 0.02 : 4 (2.0)
| | | | Height > 0.02 : 3 (4.0)
Viscera > 0.0145 :
| Shell <= 0.0345 :
| | Viscera <= 0.0285 : 5 (50.0/9.0)
| | Viscera > 0.0285 : 4 (3.0)
| Shell > 0.0345 :
| | Sex = M: 6 (6.0/3.0)
| | Sex = F: 5 (3.0)
| | Sex = I: 5 (59.0/12.0)


{'Viscera': {'>0.0145': {'Shell': {'<=0.0345': {'Viscera': {'<=0.0285': ' 5 (50.0/9.0)', '>0.0285': ' 4 (3.0)'}}, '>0.0345': {'Sex': {'=M': ' 6 (6.0/3.0)', '=F': ' 5 (3.0)', '=I': ' 5 (59.0/12.0)'}}}}, '<=0.0145': {'Shucked': {'>0.007': ' 4 (66.0/31.0)', '<=0.007': {'Shucked': {'>0.0045': {'Shucked': {'>0.005': {'Height': {'<=0.02': ' 4 (2.0)', '>0.02': ' 3 (4.0)'}}, '<=0.005': ' 4 (3.0)'}}, '<=0.0045': {'Height': {'<=0.025': ' 1 (2.0/1.0)', '>0.025': ' 3 (2.0)'}}}}}}}}
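
The script that performs this transformation is not shown here, so the following is only a sketch of how the C4.5 text output could be turned into the nested dictionary above. The function name parse_c45_tree and the regular expression are assumptions made for illustration; the sketch relies on each "|" marking one level of depth and on leaf labels following the colon.

```python
import re

# "| | Attr op value : leaf" where the leaf part is empty for inner branches.
LINE = re.compile(r"^((?:\|\s*)*)(\w+)\s*(<=|>=|<|>|=)\s*([^\s:]+)\s*:\s*(.*)$")

C45_OUTPUT = """
Viscera <= 0.0145 :
| Shucked > 0.007 : 4 (66.0/31.0)
| Shucked <= 0.007 :
| | Shucked <= 0.0045 :
| | | Height <= 0.025 : 1 (2.0/1.0)
| | | Height > 0.025 : 3 (2.0)
| | Shucked > 0.0045 :
| | | Shucked <= 0.005 : 4 (3.0)
| | | Shucked > 0.005 :
| | | | Height <= 0.02 : 4 (2.0)
| | | | Height > 0.02 : 3 (4.0)
Viscera > 0.0145 :
| Shell <= 0.0345 :
| | Viscera <= 0.0285 : 5 (50.0/9.0)
| | Viscera > 0.0285 : 4 (3.0)
| Shell > 0.0345 :
| | Sex = M: 6 (6.0/3.0)
| | Sex = F: 5 (3.0)
| | Sex = I: 5 (59.0/12.0)
"""


def parse_c45_tree(text):
    """Convert indented C4.5 output into {attribute: {branch: subtree_or_leaf}}."""
    root = {}
    containers = [root]  # containers[d] is the dict that receives nodes at depth d
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # blank lines and header lines are skipped
        bars, attribute, op, value, leaf = match.groups()
        depth = bars.count("|")
        node = containers[depth].setdefault(attribute, {})
        key = op + value  # e.g. "<=0.0145" or "=M"
        if leaf:
            node[key] = " " + leaf  # keep the leading space used in the model above
        else:
            child = {}
            node[key] = child
            containers[depth + 1:] = [child]  # deeper lines now attach to this branch
    return root


model = parse_c45_tree(C45_OUTPUT)
print(model)
```

Dictionary equality in Python ignores key order, so the result compares equal to the model above even though the branches are listed in a different order.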

Replace ' with "
and then view the resulting model at the link:

{
    "Viscera": {
        ">0.0145": {
            "Shell": {
                "<=0.0345": {
                    "Viscera": {
                        "<=0.0285": " 5 (50.0/9.0)",
                        ">0.0285": " 4 (3.0)"
                    }
                },
                ">0.0345": {
                    "Sex": {
                        "=M": " 6 (6.0/3.0)",
                        "=F": " 5 (3.0)",
                        "=I": " 5 (59.0/12.0)"
                    }
                }
            }
        },
        "<=0.0145": {
            "Shucked": {
                ">0.007": " 4 (66.0/31.0)",
                "<=0.007": {
                    "Shucked": {
                        ">0.0045": {
                            "Shucked": {
                                ">0.005": {
                                    "Height": {
                                        "<=0.02": " 4 (2.0)",
                                        ">0.02": " 3 (4.0)"
                                    }
                                },
                                "<=0.005": " 4 (3.0)"
                            }
                        },
                        "<=0.0045": {
                            "Height": {
                                "<=0.025": " 1 (2.0/1.0)",
                                ">0.025": " 3 (2.0)"
                            }
                        }
                    }
                }
            }
        }
    }
}

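The quote replacement described above can also be done without editing the text by hand. As a hedged alternative (not necessarily what was done for this README), the dictionary literal can be parsed with ast.literal_eval and written back out with json.dumps, which produces double-quoted, indented JSON. The fragment used here is the "Sex" subtree copied from the model above.

```python
import ast
import json

# A fragment of the model, written as a Python dict literal with single quotes.
model_text = "{'Sex': {'=M': ' 6 (6.0/3.0)', '=F': ' 5 (3.0)', '=I': ' 5 (59.0/12.0)'}}"

# literal_eval only accepts Python literals, so it is safe for plain data like this.
model = ast.literal_eval(model_text)

# json.dumps emits valid JSON: double quotes and nested indentation.
json_text = json.dumps(model, indent=4)
print(json_text)
```

Going through a real parser avoids surprises that a plain search-and-replace could cause if a label ever contained a quote character.
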
------------------Start to Prune------------------------------------
Now let's prune it with the PEP algorithm. Before pruning, the C4.5 decision tree is:

After pruning, the C4.5 tree is:
The subtrees under the orange "X" marks in the first picture are replaced (pruned) with a leaf labeled with the class that covers the most items,
which gives the second picture.
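
The decision to replace a subtree with a leaf can be sketched as below. This follows PEP as it is commonly described: each leaf's error count gets a continuity correction of 0.5, and the subtree is pruned when the corrected error of the single leaf does not exceed the corrected subtree error plus one standard error. Whether the linked implementation uses exactly this form is an assumption; the function name pep_should_prune is hypothetical. The input numbers are read off the leaf labels of the unpruned tree in this document.

```python
import math


def pep_should_prune(leaf_errors, subtree_leaf_errors, n_items):
    """Return True if the subtree should be replaced by one majority-class leaf.

    leaf_errors         -- errors made if the node becomes a single leaf
    subtree_leaf_errors -- one error count per leaf of the current subtree
    n_items             -- number of training items reaching the node
    """
    corrected_leaf = leaf_errors + 0.5
    corrected_subtree = sum(subtree_leaf_errors) + 0.5 * len(subtree_leaf_errors)
    # Standard error of the corrected subtree error (binomial approximation);
    # max() guards against a negative value for very small nodes.
    variance = max(corrected_subtree * (n_items - corrected_subtree) / n_items, 0.0)
    std_err = math.sqrt(variance)
    return corrected_leaf <= corrected_subtree + std_err


# "Viscera > 0.0145": 121 items, leaf errors 9, 0, 3, 0, 12; 28 errors as one leaf.
prune_viscera = pep_should_prune(28, [9, 0, 3, 0, 12], 121)

# Inner "Shucked <= 0.0045": 4 items, leaf errors 1 and 0; 2 errors as one leaf.
prune_shucked = pep_should_prune(2, [1, 0], 4)

print(prune_viscera, prune_shucked)
```

Under this rule both subtrees are pruned, which matches the two replaced leaves 5(121/28) and 3(4/2) in the pruned model.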

Here are two models before and after being pruned with PEP:

unpruned_model= {'Viscera': {'<=0.0145': {'Shucked': {'>0.007': ' 4 (66.0/31.0)', '<=0.007': {'Shucked': {'<=0.0045': {'Height': {'<=0.025': ' 1 (2.0/1.0)', '>0.025': ' 3 (2.0)'}}, '>0.0045': {'Shucked': {'>0.005': {'Height': {'<=0.02': ' 4 (2.0)', '>0.02': ' 3 (4.0)'}}, '<=0.005': ' 4 (3.0)'}}}}}}, '>0.0145': {'Shell': {'<=0.0345': {'Viscera': {'<=0.0285': ' 5 (50.0/9.0)', '>0.0285': ' 4 (3.0)'}}, '>0.0345': {'Sex': {'=M': ' 6 (6.0/3.0)', '=F': ' 5 (3.0)', '=I': ' 5 (59.0/12.0)'}}}}}}

pruned_model= {'Viscera': {'>0.0145': '5(121/28)', '<=0.0145': {'Shucked': {'>0.007': ' 4 (66.0/31.0)', '<=0.007': {'Shucked': {'>0.0045': {'Shucked': {'>0.005': {'Height': {'<=0.02': ' 4 (2.0)', '>0.02': ' 3 (4.0)'}}, '<=0.005': ' 4 (3.0)'}}, '<=0.0045': '3(4/2)'}}}}}}

unpruned_accuracy,pruned_accuracy=(0.72, 0.695)
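
These accuracies can be re-derived from the leaf labels alone, reading each label "class (N/E)" in the usual C4.5 way: N training items reach the leaf and E of them are misclassified. The sketch below walks both models and returns the node count, item count, and error count; the helper name summarize is hypothetical and not taken from the linked implementation.

```python
import re

# Matches ' 5 (50.0/9.0)', ' 4 (3.0)' and '5(121/28)'.
LEAF = re.compile(r"^\s*(\S+?)\s*\(([\d.]+)(?:/([\d.]+))?\)\s*$")


def summarize(tree):
    """Return (node_count, items, errors) for a nested-dict C4.5 tree."""
    if isinstance(tree, str):  # a leaf label
        match = LEAF.match(tree)
        items = float(match.group(2))
        errors = float(match.group(3) or 0.0)  # no "/E" part means zero errors
        return 1, items, errors
    (attribute, branches), = tree.items()  # each inner node tests one attribute
    size, items, errors = 1, 0.0, 0.0
    for child in branches.values():
        s, n, e = summarize(child)
        size, items, errors = size + s, items + n, errors + e
    return size, items, errors


unpruned_model = {'Viscera': {
    '<=0.0145': {'Shucked': {
        '>0.007': ' 4 (66.0/31.0)',
        '<=0.007': {'Shucked': {
            '<=0.0045': {'Height': {'<=0.025': ' 1 (2.0/1.0)', '>0.025': ' 3 (2.0)'}},
            '>0.0045': {'Shucked': {
                '>0.005': {'Height': {'<=0.02': ' 4 (2.0)', '>0.02': ' 3 (4.0)'}},
                '<=0.005': ' 4 (3.0)'}}}}}},
    '>0.0145': {'Shell': {
        '<=0.0345': {'Viscera': {'<=0.0285': ' 5 (50.0/9.0)', '>0.0285': ' 4 (3.0)'}},
        '>0.0345': {'Sex': {'=M': ' 6 (6.0/3.0)', '=F': ' 5 (3.0)',
                            '=I': ' 5 (59.0/12.0)'}}}}}}

pruned_model = {'Viscera': {
    '>0.0145': '5(121/28)',
    '<=0.0145': {'Shucked': {
        '>0.007': ' 4 (66.0/31.0)',
        '<=0.007': {'Shucked': {
            '>0.0045': {'Shucked': {
                '>0.005': {'Height': {'<=0.02': ' 4 (2.0)', '>0.02': ' 3 (4.0)'}},
                '<=0.005': ' 4 (3.0)'}},
            '<=0.0045': '3(4/2)'}}}}}}

for name, model in (("unpruned", unpruned_model), ("pruned", pruned_model)):
    size, items, errors = summarize(model)
    print(name, "size:", size, "errors:", errors, "accuracy:", (items - errors) / items)
```

The unpruned model has 20 nodes and 56 errors on the 200 training items, and the PEP-pruned model has 11 nodes and 61 errors, which gives the accuracies 0.72 and 0.695 quoted above.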

For comparison, running EBP (Error-Based Pruning) on the same 200 abalone items
(using Quinlan's implementation),
we get:
Evaluation on training data (200 items):

 Before Pruning           After Pruning
----------------   ---------------------------
Size      Errors   Size      Errors   Estimate

  20   56(28.0%)     17   57(28.5%)    (36.1%)   <<

Please note that EBP (Error-Based Pruning) and PEP (Pessimistic Error Pruning) both aim to simplify C4.5 trees without losing too much accuracy, rather than only to improve accuracy,
because a simplified tree makes it much easier for users to extract classification rules (knowledge) from huge datasets.

The Python implementation of PEP is available at

Note that EBP is an evolution of PEP; both were invented by Ross Quinlan.

You may also want to learn the principles of PEP in detail, with examples:

