Decision Tree
Impurity functions play an important role in decision tree branching. For binary classification problems, let \(\mu_{+}\) be the fraction of positive examples in a data subset, and \(\mu_{-}=1-\mu_+\) be the fraction of negative examples in the subset.
1. The Gini index is \(1-\mu_+^2-\mu_-^2\). What is the maximum value of the Gini index among all \(\mu_+\in[0,1]\)? Prove your answer.
Solution:
Since \(\mu_-=1-\mu_+\), the Gini index can be written as \(1-\mu_+^2-(1-\mu_+)^2=2\mu_+(1-\mu_+)\). Its derivative with respect to \(\mu_+\) is \(2-4\mu_+\), which vanishes at \(\mu_+=\frac{1}{2}\), and the second derivative is \(-4<0\), so this critical point is a maximum. Therefore the Gini index attains its maximum value \(1-\mu_+^2-\mu_-^2=\frac{1}{2}\) at \(\mu_+=\mu_-=\frac{1}{2}\).
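The calculus argument above can be sanity-checked numerically. This is a minimal sketch in plain Python (the 1001-point grid is an arbitrary choice for illustration): it scans \(\mu_+\in[0,1]\) and confirms the Gini index peaks at \(\mu_+=\frac{1}{2}\) with value \(\frac{1}{2}\).

```python
def gini(mu_pos):
    """Gini index 1 - mu_+^2 - mu_-^2 with mu_- = 1 - mu_+."""
    mu_neg = 1.0 - mu_pos
    return 1.0 - mu_pos**2 - mu_neg**2

# Evaluate on a fine grid over [0, 1] and locate the maximizer.
grid = [i / 1000 for i in range(1001)]
best_mu = max(grid, key=gini)
print(best_mu, gini(best_mu))  # -> 0.5 0.5
```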
2. Following Question 1, we can normalize each impurity function by dividing it by its maximum value among all \(\mu_+\in[0,1]\). For instance, the normalized classification error is \(2\min(\mu_+,\mu_-)\). After normalization, prove or disprove that the normalized Gini index is equivalent to the normalized squared regression error (used for branching in classification data sets), where the squared error is by definition \(\mu_+(1-(\mu_+-\mu_-))^2+\mu_-(-1-(\mu_+-\mu_-))^2\).
Solution:
By Question 1, the maximum of the Gini index is \(\frac{1}{2}\), so the normalized Gini index is \(2(1-\mu_+^2-\mu_-^2)=2((\mu_++\mu_-)^2-\mu_+^2-\mu_-^2)\)
$$=4\mu_+\mu_-$$
The squared regression error attains its maximum value \(1\) at \(\mu_+=\mu_-=\frac{1}{2}\) (the derivation below shows it equals \(1-(\mu_+-\mu_-)^2\)), so the normalized squared regression error equals the squared error itself:
$$\mu_+(1-(\mu_+-\mu_-))^2+\mu_-(-1-(\mu_+-\mu_-))^2$$
$$=\mu_+((\mu_+-\mu_-)-1)^2+\mu_-((\mu_+-\mu_-)+1)^2$$
$$=(\mu_++\mu_-)(\mu_+-\mu_-)^2-2(\mu_+-\mu_-)^2+(\mu_++\mu_-)$$
Since \(\mu_++\mu_-=1\), we have:
$$=(\mu_+-\mu_-)^2-2(\mu_+-\mu_-)^2+1$$
$$=1-(\mu_+-\mu_-)^2$$
$$=(\mu_++\mu_-)^2-(\mu_+-\mu_-)^2$$
$$=4\mu_+\mu_-$$
Both normalized impurity functions equal \(4\mu_+\mu_-\), so the normalized Gini index is equivalent to the normalized squared regression error.
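The algebraic equivalence can also be verified numerically. A minimal sketch in plain Python (the 101-point grid is an arbitrary choice): both the normalized Gini index and the squared regression error are checked against the closed form \(4\mu_+\mu_-\) across \(\mu_+\in[0,1]\).

```python
def normalized_gini(mu):
    """Normalized Gini index: 2 * (1 - mu_+^2 - mu_-^2)."""
    return 2.0 * (1.0 - mu**2 - (1.0 - mu)**2)

def squared_error(mu):
    """Squared regression error mu_+(1-d)^2 + mu_-(-1-d)^2 with d = mu_+ - mu_-."""
    d = mu - (1.0 - mu)
    return mu * (1.0 - d)**2 + (1.0 - mu) * (-1.0 - d)**2

# Both impurities should coincide with 4 * mu_+ * mu_- everywhere on [0, 1].
for i in range(101):
    mu = i / 100
    target = 4.0 * mu * (1.0 - mu)
    assert abs(normalized_gini(mu) - target) < 1e-9
    assert abs(squared_error(mu) - target) < 1e-9
```

The assertions pass for every grid point, matching the derivation: both curves are the same parabola \(4\mu_+\mu_-\), which equals \(1\) at \(\mu_+=\frac{1}{2}\) and \(0\) at the endpoints.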