Baseline for an "Adequate" AI
Software engineering is about engineering, and engineering is about generating a product of adequate quality, given the available constraints. What does that mean for AI-enhanced software?
Within any optimizer or data mining toolkit we can find hundreds of classifiers, regression tools, neural nets, support vector machines, evolutionary algorithms, ant-colony optimizers, etc. These primitives can be combined in millions of ways, then tuned in quadrillions of ways (see the very active research literature on all these methods). So given a new problem, which learner/optimizer should we apply?
This is a very hard problem. Wolpert reports in his famous No Free Lunch Theorems that if some optimizer/learner works best for some data, then some other optimizer/learner will work best for other data. This means that when new data arrives, you need commissioning experiments; i.e. try a variety of techniques before you can find what works best for the local data.
(Aside: it turns out that the NFL has some good news for us: the greater the performance gain desired, the fewer the learners that produce at least such a performance gain. See the Hyperband optimizer for an adaptive approach to pruning away less-than-great methods. Also, for many learners/optimizers, their performance is indistinguishable for anything less than some ε value. So if we divide the output space into bins of width ε, we can stop looking once we find a few methods that fall into the best ε bins.)
When conducting such commissioning experiments, it is methodologically useful to have a baseline method; i.e. an algorithm which can generate floor performance values. Such baselines let a developer quickly rule out any method that falls “below the floor”. With this, researchers and industrial practitioners can achieve fast early results, while also gaining some guidance in all their subsequent experimentation (specifically: “try to beat the baseline”).
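A baseline can act as a filter during commissioning: anything below the floor is ruled out immediately. A minimal sketch, with hypothetical method names and scores:

```python
def above_floor(candidates, floor):
    """Rule out any method whose score falls below the baseline 'floor'
    (scores: higher = better)."""
    return {m: s for m, s in candidates.items() if s >= floor}

# Hypothetical commissioning experiment: the baseline (say, OneR) scores 0.75.
results = {"fancy_deep_net": 0.71, "rules": 0.83, "knn": 0.78}
print(above_floor(results, 0.75))  # 'fancy_deep_net' falls below the floor
```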
Using baselines for analyzing algorithms has been endorsed by several experienced researchers:
- In his textbook on empirical methods for artificial intelligence, Cohen strongly recommends comparing supposedly sophisticated systems against simpler alternatives.
- In the machine learning community, Holte uses the OneR baseline algorithm as a scout that runs ahead of more complicated learners, as a way to judge the complexity of up-coming tasks.
- In the software engineering community, Sarro et al. recently proposed baseline methods for effort estimation.
- Shepperd and MacDonell argue convincingly that measurements are best viewed as ratios compared to measurements taken from some minimal baseline system.
- Work on cross- versus within-company cost estimation has also recommended the use of some very simple baselines.
- I've offered several good baseline AI tools for SE tasks. In each case, my graduate students were able to replace widely used, very complex solutions with much simpler alternatives.
So for this subject, we propose replacing the question of "what AI tool is best" with two other questions that make more sense to engineers racing to deliver products with limited time and resources:
- How can we quickly commission an initial, adequate, AI baseline system?
- What can we do to test and improve on that baseline?
But what is a "good" baseline?
Here's one list of what a "baseline" means. Items 1..10 are adapted from Sarro, TOSEM'18 (which we extend with our own notes). The other items come from my experience.
Note also that the following list offers a road map for future research in SE+AI. Find cases where some of these points do not matter. Find ways to enhance existing systems such that they perform better on the following criteria. Etc. The key thing to note about the following is that one system may not satisfy all these criteria (in fact, no known system satisfies all of them). That said, each of the following points is important. And by reflecting on the value of each point for a particular AI application, we naturally consider and review (and possibly discard) important design alternatives.
The Checklist
Now stay calm, citizens of FSS'18. The following is not as complex as it looks. While there are many complex ways to support the following criteria, there are also very simple ways that can work very well for each. And for your project, you only have to understand one of the following.
1. SIMPLE: Be simple to describe, implement, and interpret (i.e. interpret the output for business uses).
- My current "simplest" method is Fast-and-Frugal decision trees (or see also here), which are available in a nice R package.
2. REASONABLE: Offer comparable performance to standard methods.
- So note the great paradox of simplicity research: it is very hard to be simple, because the simplest thing has to be compared against other, more complex, things.
- Curious fact: evaluation can get so hard that we usually try to milk it for everything we can:
  - "cross val" experiments lead us to ensemble learning, then to boosting;
  - "round robin" leads to transfer learning;
  - "jiggle a little" leads to evolutionary methods.
3. STABLE: Be stable in its outcomes.
- Here I've replaced Sarro's "deterministic" with "stable", since I think stability is more important.
- Instability is very unsettling for software project managers struggling to find general policies. Project managers lose faith in the results of software analytics if those results keep changing. Also, such instability prevents project managers from offering clear guidelines on what to change, or what to avoid, in a project.
- And instability plagues SE data. For example:
- Fig. 1 shows the coefficients learned by regression on twenty 67% samples of some training data. Note their WILD instability.
- One trick for increasing stability is not to focus on all the smaller details
- e.g. when learning regression equations, do not use all the variables;
- e.g. when learning rules, avoid wide conditions
- e.g. when learning trees, do not learn deep trees.
- Of course, if you optimize for simplicity, you may pay a performance penalty.
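The instability of regression coefficients under resampling is easy to reproduce. This sketch (synthetic data, not the Fig. 1 study) learns a slope on twenty 67% samples of the same noisy data and watches the coefficient wander:

```python
import random
import statistics

random.seed(1)

def slope(pairs):
    """Least-squares slope for (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in pairs)
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic, noisy SE-like data: "effort" grows with "size", plus noise.
data = [(x, 2 * x + random.gauss(0, 30)) for x in range(1, 41)]

# Learn the slope on twenty 67% samples of the training data.
slopes = [slope(random.sample(data, int(0.67 * len(data)))) for _ in range(20)]
print(min(slopes), max(slopes))  # the learned coefficient wanders
```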
4a. INTUITIVE: Be applicable to mixed qualitative and quantitative data.
- It is good to use numeric and symbolic data.
- It is useful to be able to initialize a system with qualitative intuitions.
- Then, at least, you can compare the output to what folks already believe.
- But be warned, in SE, the beliefs of many developers are... dubious.
- If for no other reason than that humans have numerous cognitive biases.
- For a really long list of those biases, see Wikipedia or this chart
- It is useful to be able to guide model construction via high-level qualitative goals.
- One useful technology here is Bayes nets, which can be initially drawn by people, then revised by data miners, or vice versa.
- Another trick is to use some incremental rule learning algorithm that generates many possible new rules, then scores them by their distance to old rules (the best new rules are those that are closest to the old rules and score highest). In that rig, user background beliefs become the first-generation rules.
- Yet another method is to use multi-objective optimizers that fit rule learners to human biases.
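As a toy illustration of the incremental-rule idea above, suppose rules are encoded as sets of (feature, value) conditions (a hypothetical encoding, not from the source); then new rules can be ranked by blending performance with closeness to the old, believed rules:

```python
def jaccard(a, b):
    """Similarity between two rules, each a frozenset of conditions."""
    return len(a & b) / len(a | b) if a | b else 1.0

def score(new_rule, perf, old_rules, w=0.5):
    """Blend a rule's performance with its closeness to prior beliefs:
    the best new rules score high AND stay close to some old rule."""
    closeness = max(jaccard(new_rule, old) for old in old_rules)
    return w * perf + (1 - w) * closeness

# First-generation "rules" encode user background beliefs (hypothetical).
old = [frozenset({("tests", "many"), ("reviews", "yes")})]
candidates = [
    (frozenset({("tests", "many"), ("reviews", "yes"), ("size", "small")}), 0.80),
    (frozenset({("moon", "full")}), 0.85),  # performs well, but alien to users
]
ranked = sorted(candidates, key=lambda rp: -score(rp[0], rp[1], old))
print(ranked[0][0])  # the familiar rule wins despite slightly lower perf
```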
4b. COMPREHENSIBLE:
This is connected to 4a.
- Essential for communities critiquing ideas.
- If the only reader of a model is a carburetor, then we can expect little push-back. But if your models are about policies that humans have to implement, then I take it as axiomatic that humans will want to read and critique those models.
5. GENERAL: Offer some explanatory information regarding the prediction by representing generalized properties of the underlying data.
- Many systems offer only "point" solutions; i.e. examples of what might be useful.
- Given N attributes, a point solution offers exact values for all attributes.
- E.g. the output of most evolutionary programs
- E.g. all instance-based (nearest neighbor) methods
- E.g. a happy author might be editing this particular file at this particular time and place.
- Some systems offer solutions that hold over a volume;
- I.e. they ignore some values while saying things like x > 10 for others.
- E.g. happy authors might be editing html files on many computers (and exactly when they do so does not matter).
- One way to generalize an instance-based method is to cluster the solutions, then only report ranges that are different in different clusters.
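For instance, given two clusters of examples, we can report only the attribute ranges that actually separate them. A minimal sketch with made-up "happy author" data:

```python
def ranges(rows, cols):
    """Per-column (min, max) ranges for a cluster of rows."""
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in cols}

def differing_ranges(cluster_a, cluster_b, cols):
    """Report only the attribute ranges that do not overlap between
    clusters -- these generalize 'point' solutions into volumes."""
    ra, rb = ranges(cluster_a, cols), ranges(cluster_b, cols)
    return {c: (ra[c], rb[c]) for c in cols
            if ra[c][1] < rb[c][0] or rb[c][1] < ra[c][0]}

# Two hypothetical clusters of "happy author" sessions: (hour, file size).
a = [{"hour": 9, "kb": 5}, {"hour": 11, "kb": 7}]
b = [{"hour": 21, "kb": 4}, {"hour": 23, "kb": 8}]
print(differing_ranges(a, b, ["hour", "kb"]))  # only 'hour' separates them
```

Attributes whose ranges overlap across clusters ("kb" above) can be ignored when describing solutions, which is exactly the "volume, not point" generalization.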
6. NO MAGIC: Have no magic parameters within the modeling process that require tuning.
- E.g. for random forests, engineers have to decide how many trees are included in the forest.
- Alternatively, if such tunings exist, then there must be some automatic method for selecting the best tunings for particular data sets.
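One simple automatic method is to pick the tuning that minimizes error on held-out data. A sketch, using a toy k-nearest-neighbor regressor whose "magic" parameter is k (all data invented):

```python
import random

random.seed(0)

def knn_predict(train, x, k):
    """Predict y as the mean of the k nearest training points."""
    near = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in near) / k

def auto_tune(train, held_out, ks):
    """Pick the tuning (here, k) that minimizes held-out error --
    one simple way to remove a 'magic' parameter."""
    def err(k):
        return sum(abs(knn_predict(train, x, k) - y) for x, y in held_out)
    return min(ks, key=err)

# Noisy quadratic data; split into train and held-out sets.
data = [(x, x * x + random.gauss(0, 5)) for x in range(40)]
random.shuffle(data)
train, held_out = data[:30], data[30:]
best_k = auto_tune(train, held_out, ks=[1, 3, 5, 9])
print("selected k =", best_k)
```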
7. AVAILABLE: Be publicly available via a reference implementation and associated environment for execution.
- In this day and age of Docker images and package managers and Github-like environments where everyone can load up each other's code at the drop of a hat, it makes no sense for some baseline tool to be inaccessible.
8. USEFUL: Generally be more accurate than a random guess (e.g. an estimate based purely on the distribution of the response variable).
- E.g. evaluate the output via "standardized error"; i.e. compare the prediction to some prediction generated from (say) the median value of the response variable.
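A minimal sketch of that standardized error: model error divided by the error of always guessing the median (values well below 1.0 mean the learner beats the dumb baseline; the sample numbers are invented):

```python
import statistics

def standardized_error(actual, predicted):
    """Compare a learner's error to the error of a trivial guess:
    always predicting the median of the response variable."""
    med = statistics.median(actual)
    model_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    guess_err = sum(abs(a - med) for a in actual)
    return model_err / guess_err

actual = [10, 12, 20, 33, 51]
predicted = [11, 13, 18, 30, 48]
print(round(standardized_error(actual, predicted), 2))  # well below 1.0
```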
9. CHEAP: Do not be expensive to apply.
- Here we mean that the CPU, RAM, and disk space required to make something work are not excessively high.
- CHEAP is important since improving an old idea means first being able to reproduce that old result. Also, certifying new ideas often means multiple runs over many sub-samples of the data. Such reproduction and certification is impractical when each run is impractically slow.
10. ROBUST: I.e. does not change much over different data splits and validation methods.
- And if it does vary wildly, can it find regions in the data where the conclusions are stable?
11. GOAL-AWARE: Different goals mean different models. AND multiple goals = no problem!
- This is important since most data miners build models that optimize for a single goal (e.g. minimize least-square error), yet business users often want their data miners to achieve many goals.
- For example, if we want to ask "what to do" rather than "what is", then we need a planner, not a classifier. Of course, in that case, the classifier can be used as a what-if guide to assess different plans.
12. CONTEXT-AWARE:
Easy path to context awareness: first cluster the data, then build different models for different clusters; see NbTrees.
- Knows that local parts of data generate different models.
- E.g. hierarchically clusters the data and builds one model per cluster.
- While general principles are good, so too is knowing how to handle particular contexts. For example, in general, exercise is good for maintaining health. However, in the particular context of patients who have just had cardiac surgery, that general principle has to be carefully tailored to particular patients; general ideas need to be updated for particular contexts.
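An easy-path sketch of context awareness: group rows by a context key, then build one (tiny) model per group. Here the "model" is just a median, and the data is invented:

```python
import statistics

def fit(rows):
    """A tiny 'model': just the median effort of the rows."""
    return statistics.median(r["effort"] for r in rows)

def context_models(rows, key):
    """Cluster by a context key, then build one model per cluster."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    return {ctx: fit(grp) for ctx, grp in groups.items()}

# Hypothetical project data: 'embedded' vs 'web' contexts behave differently.
rows = [{"ctx": "web", "effort": 3}, {"ctx": "web", "effort": 4},
        {"ctx": "embedded", "effort": 30}, {"ctx": "embedded", "effort": 40}]
print(fit(rows))                    # one global model blurs both contexts
print(context_models(rows, "ctx"))  # per-context models differ by 10x
```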
13. HUMBLE:
Easy path to certification envelopes: cluster data, report k items per cluster.
- Can publish a succinct certification envelope that reports when new data is out-of-scope of what was seen before (so we know when not to trust the model).
- This is important since the delivered data-mined models should be able to recognize when new data is out-of-scope of anything they've seen before. This means, at runtime, having access to the data used to build the model.
- Note the phrase succinct here: certification envelopes cannot include all the data relating to a model; otherwise every hard drive in the world would soon fill up.
- Another form of humility is knowing when the baseline should be replaced with something else. Holte uses the OneR baseline algorithm as a scout that runs ahead of more complicated learners, as a way to judge the complexity of up-coming tasks.
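A minimal sketch of a certification envelope: per cluster, keep only a few exemplars plus a center and radius, then distrust predictions for new data that falls outside every radius (clusters and points invented):

```python
def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def make_envelope(clusters, k=2):
    """A succinct 'certification envelope': keep only k exemplars per
    cluster, plus each cluster's center and radius (max distance seen)."""
    env = []
    for rows in clusters:
        center = tuple(sum(c) / len(rows) for c in zip(*rows))
        radius = max(dist(center, r) for r in rows)
        env.append((center, radius, rows[:k]))
    return env

def in_scope(env, x, slack=1.0):
    """Trust a prediction only if x falls inside some cluster's radius."""
    return any(dist(center, x) <= radius * slack for center, radius, _ in env)

clusters = [[(1, 1), (2, 2), (1, 2)], [(9, 9), (8, 9)]]
env = make_envelope(clusters)
print(in_scope(env, (1.5, 1.5)))  # True: near old data
print(in_scope(env, (50, 50)))    # False: out-of-scope, do not trust
```

The envelope stores only centers, radii, and k exemplars per cluster, which keeps it succinct.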
14. STREAMING:
- Can run over an infinite stream of data, updating itself (or knowing when to go back to old versions of itself).
- Easy path to anomaly detection: cluster data, report items that fall far from each cluster. Can detect anomalies (when new inputs differ from old training data). This is the trigger for re-learning.
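A sketch of that re-learning trigger: watch a stream, flag items far from all old cluster centers, and signal re-learning when most of a recent window is anomalous (all numbers invented):

```python
def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def stream_monitor(stream, centers, radius, window=5, trigger=0.5):
    """Flag items far from all old cluster centers; when at least
    'trigger' of the recent window is anomalous, signal re-learning."""
    recent, signals = [], []
    for x in stream:
        anomalous = all(dist(x, c) > radius for c in centers)
        recent = (recent + [anomalous])[-window:]
        signals.append(sum(recent) / len(recent) >= trigger)
    return signals

# Old model knew two clusters; mid-stream, the data drifts far away.
centers = [(0, 0), (10, 10)]
stream = [(1, 0), (9, 10), (50, 50), (51, 49), (52, 50), (0, 1)]
print(stream_monitor(stream, centers, radius=3))
```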
15. SHARABLE: Knows how to transfer models and data between contexts.
Easy path to lightweight sharing: just share reduced data from context awareness.
- Need some way to keep the volume of shared data down (otherwise "sharing" would clog the Internet).
- Such transfer may require some transformation of the source data to the target data.
16. PRIVACY-AWARE:
Easy path to privacy: within the clusters of the certification envelope, just share k items per cluster, each slightly mutated. See LACE2.
- Can hide an individual's data
- This is essential when sharing a certification envelope
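A minimal sketch of that LACE2-style easy path (not the actual LACE2 algorithm): share only the k kept exemplars per cluster, each jittered by up to ±10% so no exact individual row is revealed:

```python
import random

random.seed(2)

def privatize(prototypes, jitter=0.1):
    """Share a few exemplars per cluster, each slightly mutated so
    no individual's exact row leaves the organization."""
    out = []
    for row in prototypes:
        out.append(tuple(v + random.uniform(-jitter, jitter) * v for v in row))
    return out

kept = [(100, 7), (220, 12)]  # k items kept per cluster (invented values)
shared = privatize(kept)
print(shared)  # close to, but not exactly, the originals
```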
Project
The project of this class is to apply the above to AI tools applied to SE problems. Even trying to apply the above and not getting anywhere would also be fine (just as long as you document your comprehension of the ideas of baselines along the way). So go seek, or build, good baselines:
- Take any SE problem and ask: are the current methods "baselines"? Would simpler alternatives suffice?
- Can you make the method simpler to use?
- e.g. replace it with something much simpler to implement and explain
- e.g. apply an optimizer to a data miner to find better settings for that data miner?
- If you replace the complex with the simpler, what (if any) is the performance penalty?
- Can you make the method use less RAM, or be faster to use?
  - e.g. see what happens if you learn on just X% of the data (randomly selected) for X ∈ {50,25,10,5,1}%.
  - If you apply a prototype generator, can you select/build a very small subset of the data from which learning is faster and just as effective?
    - Finding prototypes can be as easy as "cluster and take just a few from each cluster".
    - But there are many other methods.
  - e.g. apply a data miner to an optimizer to divide up the data to make the whole process much faster? See 500+ faster than deep learning.
- Does that method need additional support to enable explanation of its output?
- Do its models fail the stability test?
  - How does that method respond if you run it N times on 90% of the data?
  - And if its models are unstable, can you find regions of the data where the performance is stable?
- How can you reduce the CPU, RAM, and runtime requirements of that method by large amounts? E.g. see 500+ faster than deep learning.
- If we stream over the data, how soon does this model stabilize?
- If we inject mutations into the data, can this method be used to recognize that strange data? Once the weird data arrives, how long (if ever) before the model recovers?
- If a model is updated, can it be done minimally; i.e. with the least change to the existing model?
- etc.
All Connected
The more we compress, the smaller the memory, the faster we learn, and the less we need to share (so, more privacy).
The more we understand the data's prototypes, the more we know what is usual or unusual; so the more we know what is anomalous, and the easier it is to offer a certification envelope.
Note that if our compression method is somehow hierarchical, and if we track the errors seen by our learners in different subtrees, then we know which parts of the model need revising (and which can stay the same). This means we only revise the parts that matter, leaving the rest stable.
Other Requirements
No eval tools
The above checklist includes no evaluation tools; e.g. nothing that tests conclusion stability
- across multiple data sets (if available),
- or across multiple subsets of known scenarios.
See Evaluation for many examples of that kind of evaluation.
- Note that such tests can significantly increase the computational cost of using learners.
- Hence the need for faster, lighter AI algorithms.
No support for stats tests
The checklist also omits statistical tests that check if this treatment has the same effect as that treatment.
- Need at least two tests: significance and effect size.
- I also think you need a third test:
  - something that clusters the treatments before the other tests are applied,
  - which reduces the number of other statistical tests required;
  - e.g. the Scott-Knott test.
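A minimal sketch of the two core tests: a bootstrap significance test plus the Cliff's delta effect size (treatment scores invented; the Scott-Knott clustering step is omitted):

```python
import random
import statistics

random.seed(3)

def cliffs_delta(a, b):
    """Non-parametric effect size: how often values in a beat values in b,
    scaled to [-1, 1]; |delta| near 0 means a negligible difference."""
    gt = sum(1 for x in a for y in b if x > y)
    lt = sum(1 for x in a for y in b if x < y)
    return (gt - lt) / (len(a) * len(b))

def bootstrap_sig(a, b, n=1000):
    """Bootstrap significance test: p = chance the observed mean
    difference arises when both samples come from the pooled data."""
    obs = abs(statistics.mean(a) - statistics.mean(b))
    pool = a + b
    hits = 0
    for _ in range(n):
        x = [random.choice(pool) for _ in a]
        y = [random.choice(pool) for _ in b]
        if abs(statistics.mean(x) - statistics.mean(y)) >= obs:
            hits += 1
    return hits / n

t1 = [0.70, 0.72, 0.71, 0.69, 0.73]  # treatment 1 scores (invented)
t2 = [0.80, 0.82, 0.81, 0.79, 0.83]  # treatment 2 scores (invented)
print(bootstrap_sig(t1, t2) < 0.05, abs(cliffs_delta(t1, t2)) > 0.147)
```

Only when both tests agree (significant AND a non-negligible effect) should we say one treatment beats the other.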