Machine Learning Techniques Homework 3 Solutions

Decision Tree

Impurity functions play an important role in decision tree branching. For binary classification problems, let $\mu_{+}$ be the fraction of positive examples in a data subset, and \(\mu_{-}=1-\mu_+\) be the fraction of negative examples in the data subset.


1. The Gini index is \(1-\mu_+^2-\mu_-^2\). What is the maximum value of the Gini index among all \(\mu_+\in[0,1]\)? Prove your answer.

Solution:

    When \(\mu_+=\mu_-=\frac{1}{2}\), the Gini index attains its maximum value \(1-\mu_+^2-\mu_-^2=\frac{1}{2}\).

    Proof: substituting \(\mu_-=1-\mu_+\) gives \(1-\mu_+^2-(1-\mu_+)^2=2\mu_+(1-\mu_+)=\frac{1}{2}-2\left(\mu_+-\frac{1}{2}\right)^2\le\frac{1}{2}\), with equality if and only if \(\mu_+=\frac{1}{2}\).
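As a quick numerical check (a minimal Python sketch, not part of the original solution; the function name `gini` is just for illustration), one can scan \(\mu_+\) over a grid and confirm where the maximum occurs:

```python
# Gini index for binary classification: 1 - mu_plus^2 - mu_minus^2
def gini(mu_plus):
    mu_minus = 1.0 - mu_plus
    return 1.0 - mu_plus**2 - mu_minus**2

# Scan mu_plus over a fine grid and locate the maximizer.
grid = [i / 1000.0 for i in range(1001)]
best_mu = max(grid, key=gini)

print(best_mu, gini(best_mu))  # maximum 0.5 is attained at mu_plus = 0.5
```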

2. Following Question 1, we can normalize each impurity function by dividing it by its maximum value among all \(\mu_+\in[0,1]\). For instance, the normalized classification error is \(2\min(\mu_+,\mu_-)\). After normalization, prove or disprove that the normalized Gini index is equivalent to the normalized squared regression error (used for branching in classification data sets), where the squared error is by definition \(\mu_+(1-(\mu_+-\mu_-))^2+\mu_-(-1-(\mu_+-\mu_-))^2\).

Solution:

    Since the maximum of the Gini index is \(\frac{1}{2}\), the normalized Gini index is \(2(1-\mu_+^2-\mu_-^2)=2\left((\mu_++\mu_-)^2-\mu_+^2-\mu_-^2\right)\)

    $$=4\mu_+\mu_-$$
    The squared regression error is \(\mu_+(1-(\mu_+-\mu_-))^2+\mu_-(-1-(\mu_+-\mu_-))^2\)
    $$=\mu_+((\mu_+-\mu_-)-1)^2+\mu_-((\mu_+-\mu_-)+1)^2$$
    $$=(\mu_++\mu_-)(\mu_+-\mu_-)^2-2(\mu_+-\mu_-)^2+(\mu_++\mu_-)$$
    Since \(\mu_++\mu_-=1\), this becomes:
    $$=(\mu_+-\mu_-)^2-2(\mu_+-\mu_-)^2+1$$
    $$=1-(\mu_+-\mu_-)^2$$
    $$=(\mu_++\mu_-)^2-(\mu_+-\mu_-)^2$$
    $$=4\mu_+\mu_-$$

    The maximum of the squared error over \(\mu_+\in[0,1]\) is \(4\cdot\frac{1}{2}\cdot\frac{1}{2}\cdot\frac{1}{1}=1\), attained at \(\mu_+=\mu_-=\frac{1}{2}\), so the normalized squared error is also \(4\mu_+\mu_-\). It therefore coincides with the normalized Gini index, proving the equivalence.
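The algebraic identity above can also be verified numerically. The sketch below (a minimal Python illustration; the helper names `normalized_gini` and `squared_error` are my own) evaluates both impurity measures on several values of \(\mu_+\):

```python
def normalized_gini(mu_plus):
    # Gini index divided by its maximum value 1/2.
    mu_minus = 1.0 - mu_plus
    return 2.0 * (1.0 - mu_plus**2 - mu_minus**2)

def squared_error(mu_plus):
    # Squared regression error with the mean label s = mu_plus - mu_minus.
    mu_minus = 1.0 - mu_plus
    s = mu_plus - mu_minus
    return mu_plus * (1.0 - s)**2 + mu_minus * (-1.0 - s)**2

# Both reduce to 4 * mu_plus * mu_minus for any mu_plus in [0, 1].
for mu in [0.0, 0.1, 0.25, 0.5, 0.73, 1.0]:
    assert abs(normalized_gini(mu) - squared_error(mu)) < 1e-12
    assert abs(squared_error(mu) - 4.0 * mu * (1.0 - mu)) < 1e-12
```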



Reposted from blog.csdn.net/ma412410029/article/details/80571072