Evaluation
Sometimes you can succeed on one goal and fail on another. So
- not everything wins on every goal
- the goals we set determine the conclusions we condone.
Which means that the next lecture after this has to be about dancing over multiple goals. Enter multi-objective optimization.
But before that...
Numeric Goals
- residual = RE = predicted - actual
- magnitude of RE = MRE= abs(predicted - actual)
- PRED(N) = how many estimates are within N% of the actual
- e.g. PRED(30) is standard (how many estimates within 30% of actual)
- medianRE = median of lots of RE
- meanRE = mean of lots of RE (not recommended)
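As a minimal sketch (in Lua, to match the course code, though not taken from it), these measures might be computed over two parallel lists `predicteds` and `actuals`:

```lua
-- Illustrative sketch only (not from the course code).
local function re(predicted, actual)  return predicted - actual end
local function mre(predicted, actual) return math.abs(predicted - actual) end

-- PRED(n): what percentage of estimates fall within n% of the actual value?
local function pred(n, predicteds, actuals)
  local hits = 0
  for i = 1, #predicteds do
    if mre(predicteds[i], actuals[i]) / actuals[i] <= n / 100 then hits = hits + 1 end
  end
  return 100 * hits / #predicteds
end

-- medianRE: the (lower) middle value of many residuals (preferred over meanRE).
local function medianRE(predicteds, actuals)
  local res = {}
  for i = 1, #predicteds do res[i] = re(predicteds[i], actuals[i]) end
  table.sort(res)
  return res[math.floor((#res + 1) / 2)]
end
```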
Some argue that these numbers need to be baselined as a ratio against "do the dumbest thing you can":
- let y = the error of the dumbest estimate (e.g. always guess the mean of the training set)
- let x = the error of your numeric estimates from the model
- Standardised accuracy = SA = (1 - x/y) * 100
- Larger values are better.
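A one-line sketch of that baseline ratio (the names `x`, `y`, and `sa` are just the ones used above, not any library's API):

```lua
-- Sketch: standardised accuracy, where y = error of the dumbest guess
-- (always predict the training-set mean) and x = error of the model.
local function sa(x, y) return (1 - x / y) * 100 end   -- larger is better

print(sa(10, 20))  --> 50 (a model whose error is half the baseline's)
```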
Discrete Goals
(For code to implement the following, see abcd.lua.)
Consider the output stream,
%are you a dog?
predicted   truth
---------   -----
yes         no
yes         yes
no          no
yes         yes
no          no
...         ...
Note that sometimes dogs are correctly predicted (e.g. line two) and sometimes they ain't (line one).
How do we convert these pairs into evaluation measures?
Discrete detectors can be assessed according to the following measures:
+------------+------------+
|           truth         |
|     no     |     yes    |
+------------+------------+
|     a      |      b     |   classifier predicts = no
+------------+------------+
|     c      |      d     |   classifier predicts = yes
+------------+------------+
- accuracy = acc = (a+d)/(a+b+c+d)
- probability of detection = pd = recall = d/(b+d)
- probability of false alarm = pf = c/(a+c)
- precision = prec = d/(c+d)
- distance2heaven = d2h = sqrt( (1-pd)^2 + pf^2 )
- pos/neg = (b+d) / (a+c)
(We'll use pos/neg later on, see below.)
For example, here is one such two-class matrix:

 no, yes,  <-- classified as
120,  20,  no
 20,  20,  yes
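The real implementation is in abcd.lua; here is just a minimal sketch of the same arithmetic, applied to the two-class table above:

```lua
-- Sketch only (not the actual abcd.lua code): the measures above,
-- computed from the four cells a, b, c, d.
local function abcd(a, b, c, d)
  local acc  = (a + d) / (a + b + c + d)
  local pd   = d / (b + d)                     -- probability of detection (recall)
  local pf   = c / (a + c)                     -- probability of false alarm
  local prec = d / (c + d)                     -- precision
  local d2h  = math.sqrt((1 - pd)^2 + pf^2)    -- distance to heaven
  return acc, pd, pf, prec, d2h
end

-- The two-class matrix above gives a=120, b=20, c=20, d=20:
print(abcd(120, 20, 20, 20))  --> 0.78  0.5  0.14  0.5  0.52 (approximately)
```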
For more than two classes, we need to build one table per class, e.g. (a, not-a), (b, not-b), (c, not-c), then report each measure separately for each class:

  a,   b,   c,  <-- classified as
 50,  10,   5,  a
  5,  80,  10,  b
 20,  30, 100,  c
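A sketch of that one-vs-rest reduction (the function name and table layout are illustrative, not from abcd.lua):

```lua
-- Sketch: reduce a k-class confusion matrix to one (a,b,c,d) table per class.
-- matrix[truth][predicted] = count; rows = truth, columns = "classified as".
local function oneVsRest(matrix, klass)
  local a, b, c, d = 0, 0, 0, 0
  for truth, row in pairs(matrix) do
    for predicted, n in pairs(row) do
      if     truth ~= klass and predicted ~= klass then a = a + n   -- true negative
      elseif truth == klass and predicted ~= klass then b = b + n   -- missed
      elseif truth ~= klass and predicted == klass then c = c + n   -- false alarm
      else                                              d = d + n   -- hit
      end
    end
  end
  return a, b, c, d
end

-- For class "a" in the three-class table above:
local m = { a = {a=50, b=10, c=5}, b = {a=5, b=80, c=10}, c = {a=20, b=30, c=100} }
print(oneVsRest(m, "a"))  --> 220  15  25  50
```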
Ideally...
Ideally, detectors have high PDs, low PFs, and low effort. This ideal state rarely happens:
- PD and effort are linked. The more modules that trigger the detector, the higher the PD. However, the effort also increases.
- High PD or low PF comes at the cost of high PF or low PD (respectively).
These links can be seen in a standard receiver operating characteristic (ROC) curve. Suppose, for example, LOC >= x is used as the detector (i.e. we assume large modules have more errors), with LOC normalised to 0..1. LOC >= x represents a family of detectors:
- At x=0, EVERY module is predicted to have errors. This detector has a high PD but also a high false alarm rate.
- At x=1, NO module is predicted to have errors. This detector has a low false alarm rate but won't detect anything at all.
As x varies between 0 and 1, a set of detectors is generated as shown below:
pd
1 |      x  x  x         KEY:
  |    x          .      "." denotes the line PD=PF
  |   x         .        "x" denotes the ROC curve
  |  x        .              for a set of detectors
  |  x      .
  | x     .
  | x   .
  |x  .
  |x
  x------------------ pf
  0                  1
Note that:
- The only way to make no mistakes (PF=0) is to do nothing (PD=0)
- The only way to catch more defects is to make more mistakes (increasing PD means increasing PF).
- Our detector bends towards the "sweet spot" of PD=1, PF=0 but does not reach it.
- The line PF=PD on the above graph represents the "no information" line. If PF=PD then the detector is pretty useless. The better the detector, the more it rises above PF=PD towards the "sweet spot".
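As a sketch of how that family of detectors might be generated (the module table format here is hypothetical, not from the course code):

```lua
-- Sketch: sweep the threshold x over normalised LOC and record one (pf, pd)
-- point per x; plotted together, those points trace the ROC curve above.
local function rocPoints(modules)        -- each module = {loc=0..1, defective=true/false}
  local points = {}
  for i = 0, 10 do
    local x = i / 10
    local a, b, c, d = 0, 0, 0, 0
    for _, m in ipairs(modules) do
      local flagged = m.loc >= x         -- "large modules have more errors"
      if     flagged and m.defective     then d = d + 1
      elseif flagged                     then c = c + 1
      elseif m.defective                 then b = b + 1
      else                                    a = a + 1 end
    end
    points[#points + 1] = { x  = x,
                            pf = (a + c) > 0 and c / (a + c) or 0,
                            pd = (b + d) > 0 and d / (b + d) or 0 }
  end
  return points
end
```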
For more on the connection of pd, pf, precision, etc., see Problems with Precision, where it is derived that
pf = pos / neg * (1-prec)/prec * recall
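As a quick sanity check, plug in the two-class table above (a=120, b=20, c=20, d=20): pos/neg = 40/140, prec = 0.5, recall = 0.5, so pos/neg * (1-prec)/prec * recall = (40/140) * 1 * 0.5 = 20/140, which is indeed c/(a+c) = pf.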
Special Measures for SE
In software engineering, we might have some other measures.
IFA
In 2011, Parnin and Orso noted that developers stop using debugging tools when they grow impatient with the tools' false alarms.
So one measure of success of a defect predictor is to minimize IFA (see section V of "Supervised vs Unsupervised Models: A Holistic Look at Effort-Aware Just-in-Time Defect Prediction"); i.e. the number of initial false alarms encountered before we find the first defect.
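A minimal sketch of that count, assuming the predictor has already sorted the flagged modules into the order developers would inspect them (the table layout is illustrative):

```lua
-- Sketch: IFA = number of initial false alarms seen before the first real defect.
local function ifa(inspectionOrder)      -- each entry = {defective=true/false}
  local falseAlarms = 0
  for _, m in ipairs(inspectionOrder) do
    if m.defective then return falseAlarms end
    falseAlarms = falseAlarms + 1
  end
  return falseAlarms                     -- no real defect was ever found
end

print(ifa{ {defective=false}, {defective=false}, {defective=true} })  --> 2
```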
Popt(20)
Another measure of success is "how little do you need to read to find most bugs". The usual rule is that you want to read 20% of the code to find 80% of the defects.
For example, if we know how many lines of code are in the modules falling into the above cells a, b, c, d, then we can measure how many lines of code we need to read to find the bugs (and our goal is to read less and find more bugs):
- build a defect predictor
- divide the code into the set where the prediction = yes and the set where the prediction = no
- Build S(m) as follows:
  - tmp1 = sort the yes ascending on lines of code
  - S(m) = sort the no ascending on lines of code, append to tmp1
- Build S(worst) as follows:
  - tmp2 = sort the yes descending on lines of code
  - S(worst) = sort the no descending on lines of code, append to tmp2
- Build S(optimal) as follows:
  - sort all code descending on number of defects
- Walk the curves left to right, recording what defects we find

Popt = 1 - ( S(optimal) - S(m) ) / ( S(optimal) - S(worst) )

- Usually, we report the recall seen once 20% of the code has been read (hence "Popt(20)").
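One reading of that recipe, as a sketch (the `area` approximation and table layout are assumptions, not the course's exact code): walk each ordering, accumulate the fraction of defects found against the fraction of LOC read, and compare the three areas:

```lua
-- Sketch: area under the "% defects found vs % LOC read" curve for one ordering.
local function area(ordering)            -- each module = {loc=..., defects=...}
  local totalLoc, totalDefects = 0, 0
  for _, m in ipairs(ordering) do
    totalLoc     = totalLoc + m.loc
    totalDefects = totalDefects + m.defects
  end
  local found, auc = 0, 0
  for _, m in ipairs(ordering) do
    found = found + m.defects
    auc   = auc + (m.loc / totalLoc) * (found / totalDefects)   -- step-curve area
  end
  return auc
end

local function popt(sM, sOptimal, sWorst)  -- the three orderings built above
  return 1 - (area(sOptimal) - area(sM)) / (area(sOptimal) - area(sWorst))
end
```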
And Many More Besides
Different business contexts need different goals.
So what we need are goal-aware learners.
End data mining.
Begin multi-objective optimization.