Evaluation

Sometimes you can succeed on one goal and fail on another.

Which means that the next lecture after this has to be about dancing over multiple goals. Enter multi-objective optimization.

But before that...

Numeric Goals

Some argue that such numeric scores need to be baselined as a ratio against "do the dumbest thing you can":
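By way of illustration, here is a minimal sketch of one such baseline (the function names and the choice of "always predict the median" as the dumbest-thing model are assumptions for illustration, not the course's code): a learner's mean absolute error, divided by the error of the dumb baseline, so that scores well below 1 mean we beat the baseline.

  -- Sketch: baseline a numeric error against "always predict the median".
  -- Assumption: the "dumbest thing you can do" is guessing the median of
  -- the observed values; other dumb baselines are possible.
  local function median(t)
    local s = {}
    for i, x in ipairs(t) do s[i] = x end
    table.sort(s)
    local n, mid = #s, math.floor(#s / 2)
    if n % 2 == 1 then return s[mid + 1] else return (s[mid] + s[mid + 1]) / 2 end
  end

  local function mae(actuals, predictions)       -- mean absolute error
    local sum = 0
    for i, want in ipairs(actuals) do sum = sum + math.abs(want - predictions[i]) end
    return sum / #actuals
  end

  local function baselined(actuals, predictions) -- ratio: learner vs. dumb
    local guess, dumb = median(actuals), {}
    for i = 1, #actuals do dumb[i] = guess end
    return mae(actuals, predictions) / mae(actuals, dumb)
  end

  print(baselined({10, 20, 30, 40}, {12, 18, 33, 41}))  --> 0.2 (much better than dumb)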

Discrete Goals

(For code implementing the following, see abcd.lua.)

Consider the output stream:

%are you a dog?
 predicted truth
 --------- -----
 yes       no
 yes       yes
 no        no
 yes       yes
 no        no
 ...       ...

Note that sometimes dogs are correctly predicted and sometimes they ain't (see line one).

How do we convert these pairs into evaluation measures?

Discrete detectors can be assessed using the cells of the following confusion matrix:

+-------------+------------+
|           truth          |
|     no            yes    |
+-------------+------------+
|      a      |      b     | classifier predicts = no
+-------------+------------+
|      c      |      d     | classifier predicts = yes
+-------------+------------+

(We'll use pos = b+d and neg = a+c later on; see below.)

For example, here is one two-class result:

 no,  yes,   <-- classified as
 120,  20,   no   
  20,  20,   yes
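Here is a minimal sketch of that arithmetic (in the spirit of, though not copied from, abcd.lua): given the four cells, report accuracy, pd (a.k.a. recall), pf, and precision. The 120/20/20/20 numbers are the two-class table above, read with the usual convention that the columns are the predicted class and the rows are the actual class.

  -- Sketch: turn the four confusion-matrix cells into measures.
  --   a = predicted no , truth no        b = predicted no , truth yes
  --   c = predicted yes, truth no        d = predicted yes, truth yes
  local function measures(a, b, c, d)
    return { accuracy = (a + d) / (a + b + c + d),
             pd       = d / (b + d),   -- recall: fraction of real "yes" found
             pf       = c / (a + c),   -- false alarms: real "no" flagged "yes"
             prec     = d / (c + d) }  -- precision: how often a "yes" is right
  end

  local m = measures(120, 20, 20, 20)  -- the two-class example above
  print(string.format("accuracy=%.2f pd=%.2f pf=%.2f prec=%.2f",
                      m.accuracy, m.pd, m.pf, m.prec))
  --> accuracy=0.78 pd=0.50 pf=0.14 prec=0.50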

For more than two classes, we need to build one table per class, e.g. (a, not-a), (b, not-b), (c, not-c), then report separately for each (a sketch of that bookkeeping follows the table below).

 a,   b,    c,  <-- classified as
50,  10,    5,  a
 5,  80,   10,  b
20,  30,  100,  c
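Here is the sketch promised above (the function name per_class and the matrix layout, rows = actual class and columns = predicted class, are assumptions for illustration): it collapses an n-class matrix into one (a,b,c,d) table for a chosen class, treating that class as "yes" and everything else as "no".

  -- Sketch: one-vs-rest reduction of an n-class confusion matrix.
  -- m[actual][predicted] holds counts; klass is the class treated as "yes".
  local function per_class(m, klass)
    local a, b, c, d = 0, 0, 0, 0
    for actual, row in ipairs(m) do
      for predicted, n in ipairs(row) do
        if     actual ~= klass and predicted ~= klass then a = a + n
        elseif actual == klass and predicted ~= klass then b = b + n
        elseif actual ~= klass and predicted == klass then c = c + n
        else                                               d = d + n end
      end
    end
    return a, b, c, d
  end

  local m = { { 50, 10,   5},    -- the three-class example above
              {  5, 80,  10},    -- (rows = actual a,b,c; columns = predicted)
              { 20, 30, 100} }
  print(per_class(m, 1))  --> 220  15  25  50  (class "a" vs "not a")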

Ideally...

Ideally, detectors have high PDs, low PFs, and low effort. This ideal state rarely happens: usually, the harder a detector tries to catch every defect (raising PD), the more false alarms it triggers (raising PF).

These links can be seen in a standard receiver operating characteristic (ROC) curve. Suppose, for example, module size (LOC) is used as the detector (i.e. we assume large modules have more errors). Each threshold x on LOC yields a different detector, so sweeping x gives a whole family of detectors:

        pd
      1 |           x  x  x     KEY:
        |        x     .         "."  denotes the line PD=PF
        |     x      .           "x"  denotes the roc curve 
        |   x      .                  for a set of detectors
        |  x     .
        | x    . 
        | x  .
        |x .
        |x
        x------------------ pf    
       0                   1
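To make that concrete, here is a small sketch that sweeps a size threshold x and prints one (pf, pd) point per threshold; joining those points draws the "x" curve above. The toy data, and the choice to flag a module as defective when its LOC is at or above the threshold, are assumptions for illustration.

  -- Sketch: the family of detectors "flag a module if LOC >= x".
  -- Each threshold x yields one (pf, pd) point on the ROC curve.
  local data = {                          -- lines of code, really defective?
    {loc = 900, bug = true }, {loc = 700, bug = true },
    {loc = 650, bug = false}, {loc = 400, bug = true },
    {loc = 300, bug = false}, {loc = 120, bug = false},
    {loc =  80, bug = true }, {loc =  40, bug = false} }

  for _, x in ipairs({1000, 800, 600, 350, 100, 0}) do
    local a, b, c, d = 0, 0, 0, 0
    for _, row in ipairs(data) do
      local saysBug = row.loc >= x
      if     not saysBug and not row.bug then a = a + 1
      elseif not saysBug and     row.bug then b = b + 1
      elseif     saysBug and not row.bug then c = c + 1
      else                                    d = d + 1 end
    end
    print(string.format("x=%4d  pf=%.2f  pd=%.2f", x, c / (a + c), d / (b + d)))
  end

Lower thresholds catch more defects but also raise more false alarms, which is exactly the PD-versus-PF trade-off plotted above.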

Note that the useful detectors are the ones whose curve rises well above the PD=PF diagonal; a detector sitting on that diagonal is no better than guessing.

For more on the connections between pd, pf, precision, etc., see Problems with Precision, where it is derived that

pf = pos / neg * (1-prec)/prec * recall

(Quick check: recall = d/(b+d) = d/pos and prec = d/(c+d), so (1-prec)/prec = c/d; hence pos/neg * c/d * d/pos = c/neg = c/(a+c) = pf.)

Special Measures for SE

In software engineering, we might have some other measures.

IFA

In 2011, Parnin and Orso noted that developers do not keep using debugging tools since they grow impatient when those tools generate false alarms.

So one measure of success for a defect predictor is to minimize IFA (see section V of "Supervised vs Unsupervised Models: A Holistic Look at Effort-Aware Just-in-Time Defect Prediction"); i.e. the number of initial false alarms encountered before we find the first defect.
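A minimal sketch of that count (the function name ifa and the input format, modules listed in the order the predictor asks a developer to inspect them, are assumptions for illustration):

  -- Sketch: IFA = number of initial false alarms met before the first
  -- real defect, inspecting modules in the order the predictor ranks them.
  -- ranked: list of booleans, true = the module really is defective.
  local function ifa(ranked)
    local alarms = 0
    for _, isDefective in ipairs(ranked) do
      if isDefective then return alarms end
      alarms = alarms + 1
    end
    return alarms  -- no defects found at all
  end

  print(ifa({false, false, true, false, true}))  --> 2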

Popt(20)

Another measure of success is "how little do you need to read to find most of the bugs?". The usual rule of thumb is that you want to read just 20% of the code to find 80% of the defects.

For example, if we know how many lines of code are in the methods falling into the above cells a, b, c, d, then we can compute how many lines of code we need to read before finding the bugs (and our goal is to read less and find more bugs):
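Here is a minimal sketch of that bookkeeping (the input format, modules with a loc count, a bugs count, and the predictor's score, plus the strict 20% budget, are assumptions for illustration): read modules in the order the predictor recommends, stop once 20% of the total lines of code is reached, and report what fraction of all the bugs were found by then.

  -- Sketch: effort-aware evaluation. Read modules in the order the
  -- predictor recommends, stop at a budget of 20% of the total LOC,
  -- and report the fraction of all bugs found (read less, find more).
  local function bugsAt20(modules)
    local totalLoc, totalBugs = 0, 0
    for _, m in ipairs(modules) do
      totalLoc, totalBugs = totalLoc + m.loc, totalBugs + m.bugs
    end
    table.sort(modules, function(x, y) return x.score > y.score end)
    local budget, readLoc, found = 0.2 * totalLoc, 0, 0
    for _, m in ipairs(modules) do
      if readLoc + m.loc > budget then break end  -- strict budget assumption
      readLoc, found = readLoc + m.loc, found + m.bugs
    end
    return found / totalBugs
  end

  print(bugsAt20({ {loc = 100, bugs = 3, score = 0.9},
                   {loc = 900, bugs = 1, score = 0.8},
                   {loc = 200, bugs = 2, score = 0.4},
                   {loc = 800, bugs = 0, score = 0.1} }))  --> 0.5

(The full Popt measure also normalizes this number against an optimal ordering of the modules; that normalization step is omitted from this sketch.)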

And Many More Besides

Different business contexts need different goals.

So what we need are goal-aware learners.

End data mining.

Begin multi-objective optimization.