
Decision Trees 101

Foundations of Software Science, CSC, NC State, Fall 2017

Before looking at those, you're going to need to know:

  • Decision trees are recursive diversity reduction algorithms
  • Find the split that most reduces diversity
  • Recurse on each split.
  • Stop when (e.g.) too few examples in each split.
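The four bullets above can be sketched in a few lines of Python. This is a minimal illustration, not any particular learner's implementation; the helper names (`split_on`, `expected_diversity`) are mine, and diversity here is entropy over the class labels:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Diversity of a list of class labels: -1 * sum(p * log2(p))."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def split_on(rows, labels, attr):
    """Group the (row, label) pairs by their value for one attribute."""
    groups = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        groups[row[attr]][0].append(row)
        groups[row[attr]][1].append(label)
    return groups

def expected_diversity(rows, labels, attr):
    """Expected entropy after splitting on attr, weighted by split size."""
    n = len(labels)
    return sum(len(sub) / n * entropy(sub)
               for _, sub in split_on(rows, labels, attr).values())

def tree(rows, labels, min_size=2):
    """Find the split that most reduces diversity, recurse on each split,
    and stop when a split is pure or has too few examples."""
    if len(labels) <= min_size or entropy(labels) == 0:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    attr = min(range(len(rows[0])),
               key=lambda a: expected_diversity(rows, labels, a))
    groups = split_on(rows, labels, attr)
    if len(groups) == 1:                              # split changed nothing
        return Counter(labels).most_common(1)[0][0]
    return {(attr, value): tree(sub_rows, sub_labels, min_size)
            for value, (sub_rows, sub_labels) in groups.items()}
```

Note the stopping rule does double duty: a branch ends either when it is pure (entropy zero) or when it holds too few examples to be worth splitting.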

Random Forests say "if one tree is good, why not build 100?"

  • each time, grab (say) log(N) of the N attributes and some percent of the rows
  • build a tree from each sample, 100 times over
  • make a conclusion by voting across the forest

Example 1

  • let us measure diversity using standard deviation
    • sqrt((∑ square(x - μ))/(n-1))
  • e.g. 9,2,5,4,12,7 has μ = 6.5 and standard deviation σ = 3.619
  • Learners like CART, M5prime, and Random Forest regressors use standard deviation.
    • Why? Because these learners predict numeric class variables.
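That formula, written out (a few lines; the example numbers are the ones above):

```python
import math

def stdev(nums):
    """Sample standard deviation: sqrt(sum(square(x - mu)) / (n - 1))."""
    mu = sum(nums) / len(nums)
    return math.sqrt(sum((x - mu) ** 2 for x in nums) / (len(nums) - 1))

stdev([9, 2, 5, 4, 12, 7])   # -> 3.619...
```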

Example 2:

  • let us measure diversity using entropy; i.e. -1* ∑ p*log2(p)
  • e.g. 1 orange, 1 apple, 2 bananas, and 4 grapes occur at probability 1/8, 1/8, 1/4, and 1/2
    • 8 = 2*2*2 so log2(1/8) = -3
    • 4 = 2*2 so log2(1/4) = -2
    • 2 = 2 so log2(1/2) = -1
  • what is the entropy of (o,a,b,b,g,g,g,g)?
    • -1 * (1/8 * -3 + 1/8 * -3 + 1/4 * -2 + 1/2 * -1)
    • = -1 * (1/8 * -3 + 1/8 * -3 + 2/8 * -2 + 4/8 * -1)
    • = -1/8 * (-3 + -3 + -4 + -4)
    • = 14/8 = 1.75
  • Learners like decision trees and Random Forests use entropy
    • Why? Because these learners predict symbolic class variables.

Example 3:

What is the best split for this data (14 examples: 9 yes, 5 no)?

Here are the options (note that this is four different splits):

Consider the outlook split. It has three sub-branches, so the expected value of the diversity after the split is:

  • 5/14 * entropy of sunny split
  • 4/14 * entropy of overcast split
  • 5/14 * entropy of rainy split

The overcast split is easy:

  • it contains only yes examples, so p(yes) = 1 and log2(p(yes)) = 0
  • hence the 4/14 * entropy term is zero

The sunny and rainy splits are symmetric:

  • sunny: 2 yes and 3 no = -1 * (2/5 * log2(2/5) + 3/5 * log2(3/5)) = 0.97
  • rainy: 3 yes and 2 no = same entropy as sunny

So the expected value after the outlook split is

  • (5/14 * 0.97) + (4/14 * 0) + (5/14 * 0.97) = 0.69

(BTW, this is an improvement: before the split we had 9 yes and 5 no, i.e. entropy was 0.94, i.e. more diversity.)
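A quick way to check these numbers, using only the counts already given in the text:

```python
import math

def entropy(counts):
    """-1 * sum(p * log2(p)), computed from a list of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

sunny    = entropy([2, 3])   # 2 yes, 3 no -> 0.97...
overcast = entropy([4])      # all yes     -> 0.0
rainy    = entropy([3, 2])   # same counts as sunny

after  = 5/14 * sunny + 4/14 * overcast + 5/14 * rainy   # -> 0.69...
before = entropy([9, 5])                                  # -> 0.94...
```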

If we repeat this calc over all splits, we get

  • outlook split: 0.69
  • temperature split: 0.91
  • humidity split: 0.79
  • windy split: 0.89

So we would split on outlook.
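Repeating the calculation over all four splits needs the data table itself, which is not reproduced here. Every count in this section matches the classic 14-row "play golf" weather data from Quinlan's decision-tree work, so treating that as the missing table, the whole comparison is:

```python
import math
from collections import Counter

# The classic 14-row "play golf" weather data (Quinlan). It reproduces
# every count used above: 9 yes / 5 no overall; sunny = 2 yes, 3 no; etc.
#        outlook      temp    humidity  windy   play
data = [("sunny",    "hot",  "high",   False,  "no"),
        ("sunny",    "hot",  "high",   True,   "no"),
        ("overcast", "hot",  "high",   False,  "yes"),
        ("rainy",    "mild", "high",   False,  "yes"),
        ("rainy",    "cool", "normal", False,  "yes"),
        ("rainy",    "cool", "normal", True,   "no"),
        ("overcast", "cool", "normal", True,   "yes"),
        ("sunny",    "mild", "high",   False,  "no"),
        ("sunny",    "cool", "normal", False,  "yes"),
        ("rainy",    "mild", "normal", False,  "yes"),
        ("sunny",    "mild", "normal", True,   "yes"),
        ("overcast", "mild", "high",   True,   "yes"),
        ("overcast", "hot",  "normal", False,  "yes"),
        ("rainy",    "mild", "high",   True,   "no")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def expected_entropy(attr):
    """Weighted entropy of the class after splitting on column attr."""
    n = len(data)
    values = set(row[attr] for row in data)
    return sum(len([r for r in data if r[attr] == v]) / n
               * entropy([r[-1] for r in data if r[attr] == v])
               for v in values)

for name, attr in [("outlook", 0), ("temperature", 1),
                   ("humidity", 2), ("windy", 3)]:
    print(name, round(expected_entropy(attr), 2))
# outlook 0.69, temperature 0.91, humidity 0.79, windy 0.89
```

The smallest expected entropy wins, so outlook is the split with the biggest diversity reduction, exactly as above.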