csc 591-024, (8290)
csc 791-024, (8291)
fall 2024, special topics in computer science
Tim Menzies, com sci, nc state

How Much Data?

How Much Data Do We Need for Learning?

Before We Start….

Review questions?

  1. What is the standard line on “how much data is enough?”
    • From Peter Norvig
    • From regression theory
    • From semi-supervised learning
  2. Describe each of the following. What are their implications for human decision making?
    • Streaming over zero-diversity data
    • STM, LTM
    • Shrikanth’s early bird effect
    • two results (Valerdi’s work; repertory grids) commenting on the rate at which we can extract considered opinions from humans
  3. In English, describe the following math results and their implications for data mining
    • chessboard model
    • probable correctness theory.
  4. Few-shot learning:
    1. describe it
    2. In what sense does FSL mean we can look at fewer examples?
    3. In what sense does FSL require many, many examples?

A more informed position: The question is wrong


Another question: How much data can you handle?

For very fast decision making, there is a cognitive science case that we work from less than a dozen examples:

While first proposed in 1981, this STM/LTM theory remains relevant [10]. It can be used to explain both expert competency and incompetency in software engineering tasks such as understanding code [11].

Another question: How much data can you get?

How fast can we gather expert opinion?

Evidence from “cost estimation”

Evidence from “Repertory Grids”

Advice on how long to fill in a rep grid?

Overall, we get, for reflective labels on data:

Advice from Mathematics

One commonly cited rule of thumb [^call] is to have at least 10 times as many training data instances as attributes [16] [17].

Historically, how much data was enough?


Chess board model

Data is spread out across a \(d\)-dimensional chessboard where each dimension is divided into \(b\) bins [21].
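To get a feel for how quickly such a board grows, here is a minimal sketch. The function names are hypothetical (not from the lecture), and the even spread of rows across cells is an idealizing assumption.

```python
def chessboard_cells(d: int, b: int) -> int:
    """Number of cells on a d-dimensional board with b bins per dimension."""
    return b ** d


def rows_per_cell(n_rows: int, d: int, b: int) -> float:
    """Average rows per cell, ASSUMING rows are spread evenly (an idealization)."""
    return n_rows / chessboard_cells(d, b)


# Illustrative settings only: 10,000 rows, 5 bins per dimension.
for d in (2, 5, 10):
    print(d, chessboard_cells(d, 5), rows_per_cell(10_000, d, 5))
```

Under these assumptions, the cell count grows exponentially with the number of dimensions, so a fixed data set quickly leaves most cells empty.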

The target is some subset of the data that falls into some of the chessboard cells:

Probable Correctness Theory

Richard Hamlet, Probable correctness theory, 1987 [22].

Some what-ifs:

  • If we apply Cohen’s rule (things are indistinguishable if they are less than \(d{\times}\sigma\) apart),
  • and if variables are Gaussian, ranging over \(-3 \le x \le 3\),
  • then that space divides into regions of size \(p=\frac{d}{6}\).

| scenario | d | p | C | n(c,p) | \(\log_2(n(c,p))\) |
|---|---|---|---|---|---|
| medium effect, non-safety critical | 0.35 | 0.06 | 0.95 | 50 | 6 |
| small effect, safety critical | 0.2 | 0.03 | 0.9999 | 272 | 8 |
| tiny effects, ultra-safety critical | n/a | one in a million | six sigma | 13,815,504 | 24 |
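The n(c,p) column can be reproduced with a short sketch. It assumes the usual probable-correctness relation \(C = 1-(1-p)^n\), solved for \(n\); that formula is not spelled out in these notes, but it matches the table's values. The function name `n_tests` is hypothetical. Two further assumptions: \(p\) is taken as the unrounded \(d/6\) (the table prints it rounded), and "six sigma" is read as \(C = 0.999999\).

```python
import math


def n_tests(C: float, p: float) -> int:
    """Smallest n such that 1 - (1 - p)**n >= C (assumed relation)."""
    return math.ceil(math.log(1 - C) / math.log1p(-p))


# (scenario, confidence C, region size p); p = d/6 for the first two rows.
scenarios = [
    ("medium effect, non-safety critical", 0.95, 0.35 / 6),
    ("small effect, safety critical", 0.9999, 0.2 / 6),
    ("tiny effects, ultra-safety critical", 0.999999, 1e-6),
]

for name, C, p in scenarios:
    n = n_tests(C, p)
    # The last column of the table is log2(n), rounded to the nearest integer.
    print(f"{name}: n={n:,} log2(n)={round(math.log2(n))}")
```

The last column matters because, if candidates can be sorted, a binary chop needs only about \(\log_2(n)\) evaluations rather than \(n\).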

Note the above table makes some very optimistic assumptions about the problem:

But it also tells us that the only way we can reason about safety critical systems is via some sorting heuristic (so we can get the \(\log_2\) effect).

[^call]: Application of machine learning techniques in small sample clinical studies, from

Few shot Learning

In the following, the author says LLMs are not learners but, given the results of this subject, I think an edit is in order:

Need another name

Generalize to new tasks via a sequence of prompts composed of natural language instructions.

Few-shot learning is a subfield of machine learning and deep learning that aims to teach AI models how to learn from only a small number of labeled training examples.
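In the LLM setting, one common reading of this is prompt-based: a handful of labeled examples is placed in the prompt ahead of the unlabeled query. The sketch below only assembles such a prompt. The function name, the "Input:/Label:" layout, and the issue-report examples are all hypothetical illustrations, and no model is called.

```python
def few_shot_prompt(instruction, examples, query):
    """Build a prompt: instruction, k labeled examples, then the unlabeled query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Label: {label}", ""]
    # The query is left unlabeled for the model to complete.
    lines += [f"Input: {query}", "Label:"]
    return "\n".join(lines)


# Hypothetical labeled examples (k = 3 "shots"), invented for illustration.
examples = [
    ("App crashes when I click save", "bug"),
    ("Please add a dark mode", "feature"),
    ("How do I configure the proxy?", "question"),
]

prompt = few_shot_prompt(
    "Classify each issue report as bug, feature, or question.",
    examples,
    "Login page returns a 500 error",
)
print(prompt)
```

The number of "shots" here is simply `len(examples)`.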

More generally, “n-shot learning” is a category of artificial intelligence that also includes:



Few Shot Learning in SE

March 2024: Google query: “few-shot learning and ‘software engineering’”

In the first 100 returns, after paper 70, there were no more published few-shot learning papers in SE.

In the remaining 70 papers:

| year | citations | venue | type (j=journal; c=conf; w=workshop) | title | pdf | data |
|---|---|---|---|---|---|---|
| 2023 | 1 | ICSE NLBSE | w | Few-Shot Learning for Issue Report Classification | pdf | 200 + 200 |
| 2023 | 2 | SSBSE | c | Search-based Optimisation of LLM Learning Shots for Story Point Estimation | pdf | 6 to 10 |
| 2023 | 2 | ICSE | c | Log Parsing with Prompt-based Few-shot Learning | pdf | 4 to 128; most improvement before 16 |
| 2023 | 3 | AST | c | FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning | pdf | 400+ |
| 2023 | 5 | ICSE | c | Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning | pdf | 6-7 (for code generation); 40 to 50 (for code repair) |
| 2022 | 7 | Soft.Lang.Eng | c | Neural Language Models and Few Shot Learning for Systematic Requirements Processing in MDSE | pdf | 8 to 11 |
| 2023 | 12 | ICSE | c | Towards using Few-Shot Prompt Learning for Automating Model Completion | pdf | 212 classes |
| 2020 | 15 | IEEE Access | j | Few-Shot Learning Based Balanced Distribution Adaptation for Heterogeneous Defect Prediction | pdf | 100s - 1000s |
| 2019 | 21 | Big Data | j | Exploring the applicability of low-shot learning in mining software repositories | pdf | 100 => 70% accuracy; 100s => 90% accuracy |
| 2021 | 27 | ESEM | c | An Empirical Examination of the Impact of Bias on Just-in-time Defect Prediction | | \(10^3\) samples of defects |
| 2020 | 29 | ICSE | c | Unsuccessful Story about Few Shot Malware Family Classification and Siamese Network to the Rescue | pdf | 10,000s ? |
| 2022 | 65 | ASE | c | Few-shot training LLMs for project-specific code-summarization | pdf | 10 samples |
| 2022 | 101 | FSE | c | Less Training, More Repairing Please: Revisiting Automated Program Repair via Zero-Shot Learning | pdf | ? |

