Exercise
A classification tree is being constructed to predict if an insurance policy will lapse. A random sample of 100 policies contains 30 that lapsed. You are considering two splits:
Split 1: One node has 20 observations with 12 lapses; the other node has 80 observations with 18 lapses.
Split 2: One node has 10 observations with 8 lapses; the other node has 90 observations with 22 lapses.
The total Gini index after a split is the weighted average of the Gini index at each node, with the weights proportional to the number of observations in each node. The total entropy after a split is the weighted average of the entropy at each node, with the weights proportional to the number of observations in each node.
Determine which of the following statements is/are true.
- I. Split 1 is preferred based on the total Gini index.
- II. Split 1 is preferred based on the total entropy.
- III. Split 1 is preferred based on having fewer classification errors.
- (A) I only
- (B) II only
- (C) III only
- (D) I, II, and III
- (E) The correct answer is not given by (A), (B), (C), or (D).
Key: E
The total Gini index for Split 1 is
2[20(12/20)(8/20) + 80(18/80)(62/80)]/100 = 0.375
and for Split 2 is
2[10(8/10)(2/10) + 90(22/90)(68/90)]/100 = 0.3644.
Smaller is better, so Split 2 is preferred and statement I is false. The leading factor of 2 appears because the two-class Gini index sums two identical terms: p(1 - p) + (1 - p)p = 2p(1 - p).
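These totals are easy to verify with a short Python sketch (total_gini and the (n, lapses) node pairs are our own notation, not part of the exam):

```python
def total_gini(nodes):
    """Weighted-average Gini index; nodes is a list of (n, lapses) pairs."""
    total = sum(n for n, _ in nodes)
    gini = 0.0
    for n, lapses in nodes:
        p = lapses / n                         # proportion lapsed in the node
        gini += (n / total) * 2 * p * (1 - p)  # two-class Gini: 2p(1 - p)
    return gini

print(total_gini([(20, 12), (80, 18)]))  # Split 1: 0.375
print(total_gini([(10, 8), (90, 22)]))   # Split 2: 0.3644...
```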
The total entropy for Split 1 is
–[20(12/20)ln(12/20) + 20(8/20)ln(8/20) + 80(18/80)ln(18/80) + 80(62/80)ln(62/80)]/100 = 0.5611
and for Split 2 is
–[10(8/10)ln(8/10) + 10(2/10)ln(2/10) + 90(22/90)ln(22/90) + 90(68/90)ln(68/90)]/100 = 0.5506.
Smaller is better, so Split 2 is again preferred and statement II is false.
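The same check works for entropy (again a sketch; total_entropy is our own helper, using natural logarithms to match the solution above):

```python
import math

def total_entropy(nodes):
    """Weighted-average entropy; nodes is a list of (n, lapses) pairs."""
    total = sum(n for n, _ in nodes)
    ent = 0.0
    for n, lapses in nodes:
        for count in (lapses, n - lapses):   # the two classes in the node
            p = count / n
            if p > 0:                        # treat 0*ln(0) as 0
                ent -= (n / total) * p * math.log(p)
    return ent

print(total_entropy([(20, 12), (80, 18)]))  # Split 1: 0.5611...
print(total_entropy([(10, 8), (90, 22)]))   # Split 2: 0.5506...
```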
Classifying each node by its majority class, Split 1 misclassifies 8 + 18 = 26 observations and Split 2 misclassifies 2 + 22 = 24. With fewer errors, Split 2 is preferred, so statement III is also false and the answer is (E).
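And a one-line check of the error counts (errors is our own helper; since each node predicts its majority class, the minority count in each node is its number of errors):

```python
def errors(nodes):
    """Misclassifications when each node predicts its majority class."""
    return sum(min(lapses, n - lapses) for n, lapses in nodes)

print(errors([(20, 12), (80, 18)]))  # Split 1: 8 + 18 = 26
print(errors([(10, 8), (90, 22)]))   # Split 2: 2 + 22 = 24
```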