Before looking at those, you're going to need to know:
- Decision trees are recursive diversity reduction algorithms
- Find the split that most reduces diversity
- Recurse on each split.
- Stop when (e.g.) there are too few examples in a split (see the sketch below).
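To make the recursion concrete, here is a minimal sketch in Python. Everything in it is illustrative rather than from any particular library: `diversity` is whatever function scores a set of rows (entropy or standard deviation, as in the examples below), rows are plain dicts, and `min_size` is an arbitrary stopping threshold.

```python
def tree(rows, attributes, diversity, min_size=4):
    """Recursive diversity reduction: find the best split, recurse on each part."""
    if len(rows) <= min_size or not attributes:
        return rows                                   # leaf: stop splitting
    best = None
    for attr in attributes:
        # group the rows by their value for this attribute
        splits = {}
        for row in rows:
            splits.setdefault(row[attr], []).append(row)
        # expected diversity after the split = size-weighted sum over the parts
        after = sum(len(part) / len(rows) * diversity(part)
                    for part in splits.values())
        if best is None or after < best[0]:
            best = (after, attr, splits)
    _, attr, splits = best
    rest = [a for a in attributes if a != attr]       # don't reuse this attribute
    return {attr: {value: tree(part, rest, diversity, min_size)
                   for value, part in splits.items()}}
```

Note the key move: a candidate split is scored by the size-weighted sum of the diversities of its parts, which is exactly the "expected value" calculation worked through in Example 3 below.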
Random forests say “if one tree is good, why not build 100?”
- each time, grab (say) log(N) of the N attributes and some percent of the rows
- build (say) 100 such trees
- make a conclusion by voting across the forest (sketched below)
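A sketch of that loop, reusing `tree` from above. The specific sampling choices (log2 of the attributes, 60% of the rows, sampling with replacement) and the `classify` helper (which would walk one tree down to a leaf) are assumptions for illustration, not fixed by the algorithm.

```python
import math
import random
from collections import Counter

def forest(rows, attributes, diversity, n_trees=100, row_fraction=0.6):
    trees = []
    for _ in range(n_trees):
        # grab log(N) of the attributes...
        n_attrs = max(1, int(math.log2(len(attributes))))
        attrs = random.sample(attributes, n_attrs)
        # ...and some percent of the rows (here: sampled with replacement)
        sample = random.choices(rows, k=int(row_fraction * len(rows)))
        trees.append(tree(sample, attrs, diversity))
    return trees

def conclude(trees, example, classify):
    # vote across the forest; `classify(tree, example)` is a hypothetical
    # helper that walks one tree to a leaf and returns its majority class
    votes = Counter(classify(t, example) for t in trees)
    return votes.most_common(1)[0][0]
```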
Example 1:
- let us measure diversity using standard deviation
- σ = sqrt( Σ (x - μ)² / (n-1) )
- e.g. the numbers 9, 2, 5, 4, 12, 7 have μ = 6.5 and σ = 3.619
- Learners like CART, M5prime, and Random Forest regressors use standard deviation.
- Why? Because these learners predict numeric class variables (quick check below).
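A quick check of those numbers, using only Python builtins:

```python
nums = [9, 2, 5, 4, 12, 7]
mu = sum(nums) / len(nums)                                       # 6.5
sd = (sum((x - mu) ** 2 for x in nums) / (len(nums) - 1)) ** 0.5
print(mu, round(sd, 3))                                          # 6.5 3.619
```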
Example 2:
- let us measure diversity using entropy; i.e. -1 * ∑ p * log2(p)
- e.g. 1 orange, 1 apple, 2 bananas, and 4 grapes
- occur at probability 1/8, 1/8, 1/4, and 1/2
- 8 = 2^3, so log2(1/8) = -3
- 4 = 2^2, so log2(1/4) = -2
- 2 = 2^1, so log2(1/2) = -1
- what is the entropy of (o,a,b,b,g,g,g,g)?
- -1 * (1/8 * -3 + 1/8 * -3 + 1/4 * -2 + 1/2 * -1)
- = -1 * (1/8 * -3 + 1/8 * -3 + 2/8 * -2 + 4/8 * -1)
- = -1/8 * (-3 + -3 + -4 + -4)
- = 14/8
- = 1.75 (checked in the snippet below)
- Learners like decision trees and random forests use entropy.
- Why? Because these learners predict symbolic class variables.
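To double-check the arithmetic, here is the same calculation as a small Python function (the name `entropy` and the list encoding of the fruit are just illustrative):

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """-1 * sum of p * log2(p) over the distinct symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

print(entropy(["o", "a", "b", "b", "g", "g", "g", "g"]))   # 1.75
```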
Example 3:
What is the best split for this data?
Here are the options (note: these are four different splits, one per attribute):
Consider the outlook tree. It has three sub-branches, so the expected value of the diversity after the split is:
- 5/14 * entropy of sunny split
- 4/14 * entropy of overcast split
- 5/14 * entropy of rainy split
The overcast split is easy:
- we only have yes, so p(yes) = 1 and log2(1) = 0
- so the 4/14 * 0 term is zero
The sunny and rainy splits are symmetric:
- sunny: 2 yes and 3 no, so entropy = -1 * (2/5 * log2(2/5) + 3/5 * log2(3/5)) = 0.97
- rainy: 3 yes and 2 no = same entropy as sunny
So the expected value after the outlook split is
- (5/14 * 0.97) + (4/14 * 0) + (5/14 * 0.97) = 0.69
(BTW, this is an improvement, since before the split we had 9 yes and 5 no; i.e. entropy was 0.94; i.e. more diversity.)
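Both numbers are quick to verify, using the symmetry noted above (sunny and rainy contribute 5/14 * 0.97 each; overcast contributes zero):

```python
from math import log2

e = lambda p, q: -(p * log2(p) + q * log2(q))   # two-class entropy, p + q == 1

before = e(9/14, 5/14)         # 9 yes, 5 no before the split
after = 10/14 * e(2/5, 3/5)    # sunny + rainy terms; overcast term is 4/14 * 0
print(round(before, 2), round(after, 2))        # 0.94 0.69
```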
If we repeat this calc over all splits, we get
- outlook split: 0.69
- temperature split: 0.91
- humidity split: 0.79
- windy split: 0.89
So we would split on outlook.
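To reproduce all four numbers, we need the per-branch (yes, no) counts, which are not listed above. The counts below are from Quinlan's classic "play tennis" weather data, which every figure in this example matches, so treat them as an assumption rather than something stated in the text:

```python
from math import log2

def e(yes, no):
    """Entropy of a branch with `yes` and `no` class counts."""
    n = yes + no
    return -sum(c / n * log2(c / n) for c in (yes, no) if c > 0)

# (yes, no) counts per branch; assumed from Quinlan's weather data
splits = {
    "outlook":     [(2, 3), (4, 0), (3, 2)],  # sunny, overcast, rainy
    "temperature": [(2, 2), (4, 2), (3, 1)],  # hot, mild, cool
    "humidity":    [(3, 4), (6, 1)],          # high, normal
    "windy":       [(6, 2), (3, 3)],          # false, true
}
n = 14  # total rows
expected = {attr: sum((y + no) / n * e(y, no) for y, no in branches)
            for attr, branches in splits.items()}
for attr, val in expected.items():
    print(attr, round(val, 2))   # outlook 0.69, temperature 0.91, humidity 0.79, windy 0.89
print("best split:", min(expected, key=expected.get))   # outlook
```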