csc 591-024, (8290)
csc 791-024, (8291)
fall 2024, special topics in computer science
Tim Menzies, timm@ieee.org, com sci, nc state
Find stats.py
Run python3.13 stats.py
Collect your Nobel prize.
If we watched 100 women and men walk past us and their mean walking speed was 42 and 44 cm/second (for women and men, respectively), is it true that men walk faster than women?
This is an example of the problem of comparing samples, which can get tricky. This problem has two parts:
SE example:
Note that the above needed precise definitions for statistically significant effect and small effect size. How to find those?
Now we increase the standard deviation.
We sample under different treatments (e.g. we put weights on our people, then ask them to walk around)
Sometimes we assume samples come from different distributions (e.g. normal, binomial, etc).
We want to know how to separate samples that are significantly distinguishable by more than a small effect size.
If we assume that our data comes from a certain distribution, then we could write a formula to compute the overlap; or, if we throw darts at both distributions, we could ask what the odds are that we hit numbers from one distribution but not the other.
This is called parametric statistics. You assume a formula (e.g. Gaussian), then look to fill in the parameters of that distribution (for a Gaussian, the mean \(\mu\) and the standard deviation \(\sigma\)).
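For instance, here is a minimal sketch of that dart-throwing idea under a Gaussian assumption (all helper names here are hypothetical, not from stats.py):

import random

def fit(sample):
  "Fit the two Gaussian parameters: the mean (mu) and the standard deviation (sigma)."
  mu = sum(sample) / len(sample)
  sigma = (sum((x - mu)**2 for x in sample) / (len(sample) - 1)) ** 0.5
  return mu, sigma

def overlap_odds(xs, ys, darts=10_000):
  "Throw darts at both fitted Gaussians; how often does a dart from xs beat one from ys?"
  mu1, sd1 = fit(xs)
  mu2, sd2 = fit(ys)
  hits = sum(random.gauss(mu1, sd1) > random.gauss(mu2, sd2) for _ in range(darts))
  return hits / darts   # near 0.5 means the two distributions overlap heavily

xs = [random.gauss(42, 5) for _ in range(100)]   # synthetic walking speeds, sample one
ys = [random.gauss(44, 5) for _ in range(100)]   # synthetic walking speeds, sample two
print(overlap_odds(xs, ys))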
But there are so many distributions that it is not clear which formula we should use.
Also, real-world data may not conform to a single simple distribution. For example, here are the query times for 50 SQL statements in one program.
If you want a justification for parametric tests:
Warning: I don’t buy the above, except for making approximate arguments about the value of X vs Y. So I might use “d = 0.35” to dispense with tiny differences in results.
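As a sketch of that kind of approximate argument (a hypothetical helper, not from stats.py), Cohen’s d is just the mean difference measured in units of the pooled standard deviation:

def cohens_d(xs, ys):
  "Effect size: mean difference, in units of the pooled standard deviation."
  def stats(l):
    mu = sum(l) / len(l)
    return mu, (sum((x - mu)**2 for x in l) / (len(l) - 1)) ** 0.5
  mu1, sd1 = stats(xs)
  mu2, sd2 = stats(ys)
  pooled = (((len(xs)-1)*sd1**2 + (len(ys)-1)*sd2**2) / (len(xs)+len(ys)-2)) ** 0.5
  return abs(mu1 - mu2) / pooled

# dispense with tiny differences:
# if cohens_d(xs, ys) < 0.35, treat the two samples as the same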
Many statistical methods (e.g. the t-test, Tukey, Duncan, and Newman-Keuls procedures) suffer from overlapping problems.
- By overlapping we mean the possibility of one or more samples being classified in more than one group.
- In fact, as the number of samples increases, so too does the number of overlaps, which makes it hard to distinguish the real groups to which the means should belong.
- The Scott-Knott method³ does not have this problem, which is often cited as a very good quality of this procedure.
Scott-Knott is a recursive clustering procedure that
- sorts the samples
- divides them on the largest expected difference in the mean before and after division
- then recurses on each half, but only if the two halves are statistically different.
The halves are picked to maximize:
\[ E(\Delta) = \frac{|l_1|}{|l|}abs(E({l_1}) - E({l}))^2 + \frac{|l_2|}{|l|}abs(E({l_2}) - E({l}))^2\]
(here \(|l_1|\) means the size of list \(l_1\))
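As a sanity check, here is a direct stand-alone transcription of that formula (a hypothetical helper; the real stats.py folds this computation into sk, shown below):

def e_delta(l, cut):
  "Score a split of list l at position cut, using the expected-difference formula above."
  mean = lambda xs: sum(xs) / len(xs)
  l1, l2 = l[:cut], l[cut:]
  return (len(l1)/len(l) * abs(mean(l1) - mean(l))**2 +
          len(l2)/len(l) * abs(mean(l2) - mean(l))**2)

# pick the cut that maximizes the score
l = sorted([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
best = max(range(1, len(l)), key=lambda cut: e_delta(l, cut))
print(best, e_delta(l, best))   # cuts between the low and high clusters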
For example, suppose I had five samples labelled x1, x2, etc.
def some1(n=5):
  report([ SOME([0.34, 0.49, 0.51, 0.6 ]*n, txt="x1"),
           SOME([0.6,  0.7,  0.8,  0.89]*n, txt="x2"),
           SOME([0.09, 0.22, 0.28, 0.5 ]*n, txt="x3"),
           SOME([0.6,  0.7,  0.8,  0.9 ]*n, txt="x4"),
           SOME([0.1,  0.2,  0.3,  0.4 ]*n, txt="x5")])

some1()
I would sort them by their median value, then draw a little box plot of their 10th-to-30th percentile values, their median, and their 70th-to-90th percentile values:
#
0, x3, 0.28, 0.06, ------ *----------|
0, x5, 0.30, 0.10, ----- *---- |
#
1, x1, 0.51, 0.02, ------- *----
#
2, x2, 0.80, 0.10, | ----- *---
2, x4, 0.80, 0.10, | ----- *---
Note the left-hand-side sk rank column. This reports what happens after Scott-Knott sorts the samples and decides which ones are different.
But how does it do it? The Scott & Knott method makes use of a top-down clustering algorithm where, starting from the whole group of observed mean effects, it divides, and keeps dividing, the sub-groups in such a way that the intersection of any two groups formed in that manner is empty.
This means that \(N\) samples might get ranked using only \(\log_2(N)\) statistical comparisons
- and even fewer, if sub-trees high up in the process are found to be not statistically different.
- Also, Scott-Knott converts the problem of ranking samples into a clustering problem (which I do understand) rather than a stats problem (which, in all fairness, I understand only weakly).
def sk(somes, epsilon=0.01):
  "Sort nums on mid. give adjacent nums the same rank if they are statistically the same"
  def sk1(somes, rank, cut=None):
    most, b4 = -1, SOME(somes)
    for j in range(1, len(somes)):
      lhs = SOME(somes[:j])
      rhs = SOME(somes[j:])
      tmp = (lhs.n*abs(lhs.mid() - b4.mid()) + rhs.n*abs(rhs.mid() - b4.mid())) / b4.n
      if tmp > most and (somes[j].mid() - somes[j-1].mid()) > epsilon:
        most, cut = tmp, j
    if cut:
      some1, some2 = SOME(somes[:cut]), SOME(somes[cut:])
      if True: #not some1.cohen(some2): # and abs(some1.div() - some2.div()) > 0.0001:
        if some1 != some2:
          rank = sk1(somes[:cut], rank) + 1
          rank = sk1(somes[cut:], rank)
          return rank
    for some in somes: some.rank = rank
    return rank
  somes = sorted(somes, key=lambda some: some.mid())
  sk1(somes, 0)
  return somes
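To connect that back to the earlier example, usage might look like the following (a sketch, assuming the SOME class from stats.py with the mid(), rank, and txt attributes used above):

somes = [SOME([0.34, 0.49, 0.51, 0.6 ]*5, txt="x1"),
         SOME([0.6,  0.7,  0.8,  0.89]*5, txt="x2"),
         SOME([0.09, 0.22, 0.28, 0.5 ]*5, txt="x3"),
         SOME([0.6,  0.7,  0.8,  0.9 ]*5, txt="x4"),
         SOME([0.1,  0.2,  0.3,  0.4 ]*5, txt="x5")]

for some in sk(somes):                  # sorted by median, annotated with ranks
  print(some.rank, some.txt, some.mid())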
Note the commented-out call to some1.cohen(some2).
def cohen(i,j):
  return abs( i.mid() - j.mid() ) < the.stats.cohen * i.pooledSd(j)

def pooledSd(i,j):
  "Return a measure of the combined standard deviation."
  sd1, sd2 = i.div(), j.div()
  return (((i.n - 1)*sd1 * sd1 + (j.n-1)*sd2 * sd2) / (i.n + j.n-2))**.5
But how to code up !=? Recall that this needs two functions:
def __eq__(i:SOME, j:SOME) -> bool:
  "True if all of cohen/cliffs/bootstrap say you are the same."
  return i.cliffs(j) and i.bootstrap(j) ## ordered slowest to fastest
Note that this is a conjunction; i.e. to prove “different”, I have to prove both things.
This code is simple. For everything \(x\) in one sample, look in the other sample:
- Count how often there are bigger and smaller numbers in the other sample.
- If \(x\) has many numbers less and greater than me, then I tend to be the same as the other sample,
- since I tend to fall in the middle of the other sample.
def cliffs(i:SOME, j:SOME, dull=None):
  """non-parametric effect size. threshold is border between small=.11 and medium=.28
  from Table1 of https://doi.org/10.3102/10769986025002101"""
  n,lt,gt = 0,0,0
  for x1 in i.has():
    for y1 in j.has():
      n += 1
      if x1 > y1: gt += 1
      if x1 < y1: lt += 1
  return abs(lt - gt)/n < (dull or the.stats.cliffs or 0.197)
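To see the intuition on raw lists, here is a stand-alone toy version (not the SOME method above):

def cliffs_same(xs, ys, dull=0.197):
  "True if the two lists differ by less than a small effect."
  n = lt = gt = 0
  for x in xs:
    for y in ys:
      n += 1
      if x > y: gt += 1
      if x < y: lt += 1
  return abs(lt - gt) / n < dull

print(cliffs_same([1, 2, 3], [1, 2, 3]))   # True:  each x falls mid-way into ys
print(cliffs_same([1, 2, 3], [6, 7, 8]))   # False: every y beats every x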
Bootstrap sampling compares the observed difference between the two samples (the observation; delta0 in the code below) against the differences seen in many random resamples. Here’s the code for that. yhat and zhat are transforms recommended by Efron and Tibshirani to level the playing field (they ensure that both distributions are scored around a mean value that is common to both distributions).
def bootstrap(i:SOME, j:SOME, confidence=None, bootstraps=None):
  """non-parametric significance test From Introduction to Bootstrap,
  Efron and Tibshirani, 1993, chapter 20. https://doi.org/10.1201/9780429246593"""
  y0,z0   = i.has(), j.has()
  x,y,z   = SOME(inits=y0+z0), SOME(inits=y0), SOME(inits=z0)
  delta0  = y.delta(z)
  yhat    = [y1 - y.mid() + x.mid() for y1 in y0]
  zhat    = [z1 - z.mid() + x.mid() for z1 in z0]
  pull    = lambda l: SOME(random.choices(l, k=len(l)))
  samples = bootstraps or the.stats.bootstraps or 512
  n       = sum(pull(yhat).delta(pull(zhat)) > delta0 for _ in range(samples))
  return n / samples >= (confidence or the.stats.confidence or 0.05)
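Putting it all together (a sketch, assuming the SOME class above): since __eq__ is defined, comparing two samples is just == or !=, and each comparison runs both cliffs and bootstrap.

import random

a = SOME([random.gauss(10, 1) for _ in range(32)], txt="a")
b = SOME([random.gauss(10, 1) for _ in range(32)], txt="b")
c = SOME([random.gauss(20, 1) for _ in range(32)], txt="c")

print(a == b)   # usually True: cliffs and bootstrap both say "same"
print(a != c)   # usually True: the samples are far apart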
“The point was that you have to look at the world as it is, not as some elegant theory says it ought to be.” — M. Mitchell Waldrop
When dealing with many treatments with large variance, many treatments can end up statistically indistinguishable. For example, at left, 51 of the 55 treatments all receive the same rank. When such blurring occurs, the rankings tell us very little about which treatment is best.
For another example, consider knn results that score nearest-neighbor regression using \(100*(predicted-actual)/predicted\):
\[ \mathit{prediction}= \frac{\sum_i n_i/d_i}{\sum_i 1/d_i} \]
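In that formula, \(n_i\) are the values of the nearest neighbors and \(d_i\) their distances, so nearer neighbors get more weight. A minimal sketch (hypothetical helper, not from the course code):

def knn_predict(neighbors):
  "Inverse-distance-weighted mean: neighbors is a list of (value, distance) pairs, distances > 0."
  top    = sum(n / d for n, d in neighbors)
  bottom = sum(1 / d for _, d in neighbors)
  return top / bottom

print(knn_predict([(10, 1), (20, 2), (30, 4)]))   # ~15.7: nearer neighbors dominate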
Please consider \((k=4, p=4, N=32, f=\mathit{triangle})\). Notice anything interesting?

### Runtimes and Storage
Parametric stats are very fast and consume little memory (just the memory required for the params).
Non-parametric stats are slower (see all that sampling inside bootstrap and that \(O(N^2)\) traversal inside cliffs). So don’t run non-parametric tests inside the inner-most loop of your reasoning.
If you need a quick and dirty check for differences, just check if the mean difference is larger than a third of the standard deviation of the sample. No, this test is not well-founded. But it is useful as a heuristic.
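In code, that heuristic is tiny (a hypothetical stand-alone helper):

def quick_diff(xs, ys):
  "Heuristic: different if the means differ by more than a third of the first sample's sd."
  mean = lambda l: sum(l) / len(l)
  mu = mean(xs)
  sd = (sum((x - mu)**2 for x in xs) / (len(xs) - 1)) ** 0.5
  return abs(mu - mean(ys)) > sd / 3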
Then, once you have collected results from (say) 20 repeated runs, run these non-parametric tests as part of your final report generation.
So much discussion of “which stats are best”. Very little experimentation on data.
So here, we ask cliffsDelta (cd), bootstrapping (boot), the conjunction of both (c+b), and the sd/3 heuristic whether two samples are different, where the second sample is the first multiplied by the increment inc shown in the left-hand column.
Note the large areas of agreement, with a small dispute zone in the middle.
inc | cd | boot | c+b | sd/3 | dispute? |
---|---|---|---|---|---|
1 | False | False | False | False | |
1.02 | False | False | False | False | |
1.04 | False | False | False | False | |
1.061 | False | False | False | False | |
1.082 | False | False | False | False | |
1.104 | True | False | False | False | n |
1.126 | False | False | False | True | n |
1.149 | True | False | False | True | n |
1.172 | True | False | False | True | n |
1.195 | True | False | False | True | n |
1.219 | True | True | True | True | |
1.243 | True | False | False | True | n |
1.268 | True | True | True | True | |
1.294 | True | True | True | True | |
1.319 | True | True | True | True | |
1.346 | True | True | True | True | |
1.373 | True | True | True | True | |
1.4 | True | True | True | True | |
1.428 | True | True | True | True | |
1.457 | True | True | True | True | |
1.486 | True | True | True | True |
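For reference, here is a sketch of how such a table could be generated (assuming the SOME class and tests above; the loop details are my guess at the experimental setup):

import random

base = [random.gauss(10, 1) for _ in range(32)]
inc  = 1.0
for _ in range(22):
  a = SOME(base, txt="base")
  b = SOME([x * inc for x in base], txt="scaled")
  cd   = not a.cliffs(b)                        # True = "different" by effect size
  boot = not a.bootstrap(b)                     # True = "different" by significance
  sd3  = abs(a.mid() - b.mid()) > a.div() / 3   # the sd/3 heuristic
  print(round(inc, 3), cd, boot, cd and boot, sd3)
  inc *= 1.02                                   # roughly the inc column above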
C. Bird, N. Nagappan, P. T. Devanbu, H. Gall, and B. Murphy. Does distributed development affect software quality? An empirical case study of Windows Vista. In ICSE, pages 518–528, 2009.↩︎
E. Kocaguneli, T. Zimmermann, C. Bird, N. Nagappan, and T. Menzies. Distributed development considered harmful? In ICSE, pages 882–890, 2013.↩︎
A. J. Scott and M. Knott. A cluster analysis method for grouping means in the analysis of variance. Biometrics, 30(3):507–512, 1974.↩︎