Baseline for an "Adequate" AI

Software engineering is about engineering, and engineering is about generating a product of adequate quality within the available constraints. What does that mean for AI-enhanced software?

Within any optimizer or data mining toolkit we can find hundreds of classifiers, regression tools, neural nets, support vector machines, evolutionary algorithms, ant-colony optimizers, and so on. These primitives can be combined in millions of ways, then tuned in quadrillions of ways (see the very active research literature on all these methods). So, given a new problem, which learner/optimizer should we apply?

This is a very hard problem. Wolpert's famous No Free Lunch theorems report that if some optimizer/learner works best for some data, then some other optimizer/learner will work best for other data. This means that when new data arrives, you need commissioning experiments; i.e., try a variety of techniques before you can find what works best for the local data.

(Aside: it turns out that the NFL theorems hold some good news for us: the greater the performance gain desired, the fewer the learners that can deliver it. See the Hyperband optimizer for an adaptive approach to pruning away less-than-great methods. Also, for many learners/optimizers, performance is indistinguishable below some ε value. So if we divide the output space into bins of width ε, we can stop looking once we find a few methods that fall into the best ε bins.)
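
For instance, here is a minimal sketch of that ε-bin idea (the method names, scores, and ε value below are invented for illustration):

```python
# Sketch: keep only the methods whose score lands within epsilon of the best;
# everything else is "indistinguishable or worse" and can be pruned.

def best_bin(scores, epsilon=0.05):
    """Return the methods falling in the best epsilon-wide bin."""
    top = max(scores.values())
    return {name: s for name, s in scores.items() if top - s < epsilon}

scores = {"knn": 0.71, "nb": 0.70, "svm": 0.83, "rf": 0.84, "rule0": 0.62}
print(best_bin(scores))  # {'svm': 0.83, 'rf': 0.84}; prune the rest
```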

When conducting such commissioning experiments, it is methodologically useful to have a baseline method; i.e., an algorithm that generates floor performance values. Such baselines let a developer quickly rule out any method that falls "below the floor". With this, researchers and industrial practitioners can achieve fast early results, while also gaining some guidance for all their subsequent experimentation (specifically: "try to beat the baseline").
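
As a sketch of that workflow (placeholder names throughout; `evaluate` stands in for whatever cross-validation and scoring you use on your local data):

```python
# Sketch: run the baseline first, then rule out candidates below its floor.

def commission(baseline, candidates, data, evaluate):
    """Return the baseline's floor score and the candidates that beat it."""
    floor = evaluate(baseline, data)
    survivors = {}
    for name, method in candidates.items():
        score = evaluate(method, data)
        if score > floor:              # below the floor? ruled out
            survivors[name] = score
    return floor, survivors

# Toy demo with hard-coded scores standing in for real evaluations.
scores = {"baseline": 0.60, "knn": 0.72, "weak": 0.55}
evaluate = lambda method, data: scores[method]
print(commission("baseline", {"knn": "knn", "weak": "weak"}, None, evaluate))
# (0.6, {'knn': 0.72})
```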

Using baselines for analyzing algorithms has been endorsed by several experienced researchers.

So for this subject, we propose replacing the question "what AI tool is best?" with two other questions that make more sense to engineers racing to deliver products with limited time and resources: (a) what is a good baseline for this problem? and (b) can the proposed tool beat that baseline?

But what is a "good" baseline?

Here's one list of what a "baseline" means. Items 1..10 are adapted from Sarro (TOSEM'18), which we extend with our own notes. The other items come from my experience.

Note also that the following list offers a road map for future research in SE+AI: find cases where some of these points do not matter; find ways to enhance existing systems such that they perform better on the following criteria; etc. The key thing to note is that one system may not satisfy all these criteria (in fact, no known system satisfies all of them). That said, each of the following points is important. And by reflecting on the value of each point for a particular AI application, we naturally consider and review (and possibly discard) important design alternatives.


The Checklist

Now stay calm, citizens of FSS'18. The following is not as complex as it looks. While there are many complex ways to support the following criteria, there are also very simple ways that can work very well for each. And for your project, you only have to understand one of the following.

1. SIMPLE: Be simple to describe, implement, and interpret (i.e. interpret the output for business uses).

2. REASONABLE: Offer comparable performance to standard methods.

3. STABLE: Be deterministic and stable in its outcomes.

4a. INTUITIVE: Be applicable to mixed qualitative and quantitative data.

4b. COMPREHENSIBLE: Generate models that humans can read and critique. This is connected to 4a: models built over mixed qualitative and quantitative data are often easier for humans to read than purely numeric ones.

5. GENERAL: Offer some explanatory information regarding the prediction by representing generalized properties of the underlying data.

6. NO MAGIC: Have no magic parameters within the modeling process that require tuning.

7. AVAILABLE: Be publicly available via a reference implementation and associated environment for execution.

8. USEFUL: Generally be more accurate than a random guess (e.g., an estimate based purely on the distribution of the response variable).
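
To make that concrete, here is a toy sketch of such a guess (our own illustration, not code from Sarro et al.): a "classifier" that just samples labels according to their frequency in the training data.

```python
import random
from collections import Counter

class GuessByDistribution:
    """Baseline: predict by sampling the training-label distribution."""
    def fit(self, labels):
        counts = Counter(labels)
        self.labels = list(counts)
        self.weights = [counts[label] for label in self.labels]
        return self

    def predict(self, n=1):
        return random.choices(self.labels, weights=self.weights, k=n)

baseline = GuessByDistribution().fit(["bug", "ok", "ok", "ok", "bug"])
print(baseline.predict(5))   # e.g. ['ok', 'ok', 'bug', 'ok', 'ok']
```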

9. CHEAP: Not be expensive to apply.

10. ROBUST: Produce conclusions that do not change much over different data splits and validation methods.

11. GOAL-AWARE: Different goals mean different models. And multiple goals should be no problem!

12. CONTEXT-AWARE: Recognize that different parts of the data may need different models.

Easy path to context awareness: first cluster the data, then build different models for different clusters (see NbTrees).
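
A toy version of that easy path (our own sketch; a crude median split stands in for a real clusterer such as NbTrees):

```python
from statistics import median, mean

# Sketch: split rows into two "clusters" on the median of one feature,
# then fit a trivially simple model (the mean of y) per cluster.

rows = [(1, 10), (2, 12), (3, 11), (10, 50), (11, 55), (12, 52)]  # (x, y)
cut = median(x for x, _ in rows)
clusters = {"lo": [r for r in rows if r[0] <= cut],
            "hi": [r for r in rows if r[0] > cut]}
models = {name: mean(y for _, y in rs) for name, rs in clusters.items()}

def predict(x):
    return models["lo"] if x <= cut else models["hi"]

print(predict(2), predict(11))  # 11 (low cluster), 52.33... (high cluster)
```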

13. HUMBLE: Know the limits of the model; i.e., recognize inputs unlike anything seen during training, and decline to answer on them.

Easy path to certification envelopes: cluster the data, then report k items per cluster; new inputs far from all of those items fall outside the envelope.
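
For example (a toy sketch under our own assumptions about the distance measure and threshold):

```python
def dist(a, b):
    """Euclidean distance between two numeric rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Prototypes: k items kept per cluster (hard-coded here for illustration).
envelope = [(1.0, 1.1), (0.9, 1.0),    # cluster 1
            (5.0, 5.2), (5.1, 4.9)]    # cluster 2

def predict_or_decline(row, max_dist=1.0):
    if min(dist(row, p) for p in envelope) > max_dist:
        return "I don't know (outside certification envelope)"
    return "prediction goes here"

print(predict_or_decline((1.0, 1.0)))   # inside the envelope
print(predict_or_decline((9.0, 9.0)))   # humble: declines to answer
```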

14. STREAMING: Be able to update the model incrementally as new data arrives, without starting over.


15. SHARABLE: Know how to transfer models and data between contexts.

Easy path to lightweight sharing: just share the reduced data from context awareness.


16. PRIVACY-AWARE: Share the insights in the data without revealing sensitive details of individual rows.

Easy path to privacy: within the clusters of the certification envelope, just share k items per cluster, each slightly mutated. See LACE2.
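
A toy sketch of that idea (our own illustration, only loosely in the spirit of LACE2, not its actual algorithm):

```python
import random

def privatize(cluster, k=2, noise=0.05):
    """Share only k rows per cluster, each numeric value slightly mutated."""
    rows = random.sample(cluster, k)
    return [tuple(x * (1 + random.uniform(-noise, noise)) for x in row)
            for row in rows]

cluster = [(3.0, 120.0), (3.2, 118.0), (2.9, 125.0), (3.1, 121.0)]
print(privatize(cluster))   # e.g. [(3.08, 116.4), (2.87, 126.1)]
```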


Project

The project of this class is to apply the above to AI tools applied to SE problems. Even trying to apply the above and not getting anywhere would also be fine (just as long as you document your understanding of the ideas of baselines along the way). So go seek, or build, good baselines.

All Connected

The more we compress, the smaller the memory, the faster we learn, and the less we need to share (so, more privacy).

The more we understand the data's prototypes, the more we know what is usual or unusual; hence the more we know what is anomalous; hence the easier it is to offer a certification envelope.

Note that if our compression method is somehow hierarchical, and if we track the errors seen by our learners in different subtrees, then we know which parts of the model need revising (and which can stay the same). This means we only make revisions to the parts that matter, leaving the rest stable.
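
One toy way to picture that last point (our own sketch, with invented node names and thresholds): track errors per subtree, and flag only the drifting subtrees for revision.

```python
# Sketch: per-subtree error histories; revise only where error has jumped.
errors = {"root/left":  [0.10, 0.11, 0.09],     # stable: leave alone
          "root/right": [0.10, 0.25, 0.40]}     # drifting: revise this part

def needs_revision(errs, window=2, jump=0.1):
    """Flag a subtree whose recent mean error jumped above its older mean."""
    old, recent = errs[:-window], errs[-window:]
    return sum(recent) / len(recent) - sum(old) / len(old) > jump

for node, errs in errors.items():
    print(node, "-> revise" if needs_revision(errs) else "-> keep")
```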

Other Requirements

Note that your baseline tool needs no evaluation machinery and no support for statistical tests; what it does need is a test of conclusion stability. See Evaluation for many examples of that kind of evaluation.