Potential Ideas For Class Projects
NOTE: For full credit, you MUST demonstrate your methods' effectiveness in one or more of the fifteen categories from this checklist
Statistical evaluations
- Review the last ten years of highly cited papers in software analytics
- List the statistical methods they use
- Apply them all to the same results
- See if we can do early stopping in hyperparameter optimization (what are the odds that, after N statistically significant large changes, we will see one more?)
- Data 95 methods
- Across all treatments, are rankings stable?
- Cluster top-down, bottom-up, best-first, worst-first.
- Parametric, non-parametric
- effect size test, yes/no
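As a concrete starting point for the effect-size test, here is a minimal sketch of Cliff's delta, a non-parametric effect size that such a literature review is likely to turn up. The interpretation thresholds are the ones commonly attributed to Romano et al.; treat them, and all names here, as my assumptions rather than a prescribed method:

```python
def cliffs_delta(xs, ys):
    """Non-parametric effect size: fraction of pairs where x > y,
    minus the fraction where x < y. Ranges from -1 to +1."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def effect_size_label(d):
    """Thresholds commonly attributed to Romano et al. (an assumption here)."""
    d = abs(d)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

A "yes/no" effect-size test then reduces to checking whether the label is "negligible".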
Anomaly Detection
- If the performance is not what you'd expect, then that is an anomaly.
- How would you detect anomalies in SE data?
- What is the current state-of-the-art?
- How about measures like Hoeffding Bounds?
- How about something like kNN on the leaves of a decision tree?
- Other ideas encouraged ...
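One minimal sketch of the kNN idea, here applied directly to one-dimensional performance numbers rather than to decision-tree leaves (function name and parameters are my own):

```python
def knn_anomaly_scores(values, k=2):
    """Score each point by its mean distance to its k nearest neighbours.
    Unusually large scores suggest anomalies."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores
```

The same scoring could be run per-leaf of a decision tree, so "anomalous" means "far from its neighbours within the same leaf".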
Explanation vs. Performance of Decision Trees
- Decision trees can be limited to any depth.
- How does restricting the depth of the tree affect performance (see Holte's 1R) for binary classes? For n-ary classes?
- If we had a model in an incomprehensible format (say Naive Bayes, deep learning, a neural net) and we ran the training data through decision trees, what is the performance vs. explanation tradeoff?
- How do we quantify explanations?
- https://arxiv.org/pdf/1803.05067.pdf
- https://link.springer.com/article/10.1007/s10664-018-9638-1
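Since Holte's 1R is mentioned above, here is a minimal stdlib sketch of it: effectively a depth-1 decision tree that maps each value of the single best feature to a majority class (variable names are my own):

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Holte's 1R: for each feature, map each value to its majority class;
    keep the feature whose rule makes the fewest training errors."""
    best = None  # (errors, feature_index, rule)
    for f in range(len(rows[0])):
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[f]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(label != rule[row[f]] for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, f, rule)
    return best
```

Comparing this one-level learner against an unrestricted tree is the cheapest version of the depth-vs-performance experiment.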
FLASH for tuning Defect Prediction
- Look at Vivek Nair's TSE article on FLASH
- Can you apply this to construct better defect predictors?
- Can you tune defect predictors for better precision, false alarm, recall?
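Whatever tuner you use, you will need the target measures. A minimal sketch of precision, recall, and false alarm (pf) from binary predictions; returning 0.0 when a denominator is zero is my convention, not a standard:

```python
def defect_metrics(actual, predicted):
    """actual/predicted: iterables of 0/1 labels (1 = defective)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_alarm = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, false_alarm
```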
FLASH for effort estimation?
- Same as above, but this time for effort estimation.
Can you beat FLASH with Bayesian Parameter Optimization?
- What are the tradeoffs of using BPO?
Reasoning over instance-based evolutionary algorithms
- Pick a task to optimize over (tuning defect predictors for example)
- Cache generation zero of an evolutionary algorithm (say DE)
- Run the algorithm
- Cache generation N of the same algorithm
- Run a rule learner over the generations to learn rules about the most informative features.
- Questions
- How good is that learner at selecting for best (last gen) instances?
- How much is that goodness affected by how much of the non-best instances we sample?
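The caching steps above can be sketched with a toy DE (roughly DE/rand/1/bin) on the sphere function; the rule-learning step over the cached generations is the project's core and is left out. Function names, parameter defaults, and the toy objective are all my own assumptions:

```python
import random

def de_with_cache(f, dim=3, pop_size=10, gens=20, F=0.5, cr=0.9, seed=1):
    """Run a toy differential evolution, caching generation 0 and the final
    generation so a rule learner can contrast 'early' vs 'late' instances."""
    rnd = random.Random(seed)
    pop = [[rnd.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    gen0 = [p[:] for p in pop]                      # cache generation zero
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rnd.sample([p for j, p in enumerate(pop) if j != i], 3)
            trial = [a[d] + F * (b[d] - c[d]) if rnd.random() < cr else pop[i][d]
                     for d in range(dim)]
            if f(trial) < f(pop[i]):                # greedy selection
                pop[i] = trial
    return gen0, pop                                # generation 0 and generation N

sphere = lambda xs: sum(x * x for x in xs)
```

Labeling gen0 rows as "non-best" and generation-N rows as "best" gives the training set for the rule learner.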
Develop better evolutionary algorithms
- Experiment with mutation strategy
- Parallelize the algorithm
- Other ideas?
Scalable planning with XTREE
- Look at Rahul Krishna's paper on XTREE
- This currently works on smaller datasets; can you create a scalable implementation that works on streaming data?
- Use ideas from Very Fast Decision Trees: https://homes.cs.washington.edu/~pedrod/papers/kdd00.pdf
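The Hoeffding bound that Very Fast Decision Trees rely on is small enough to sketch directly: after n independent observations of a variable with range R, the true mean is within epsilon of the observed mean with probability 1 - delta:

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)); shrinks as n grows, so a
    split decision can be frozen once the observed margin exceeds epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2 * n))
```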
Apply a novel data mining trick to SE data
- Here's a great starter book for data mining in Python. Do not fret: if you choose to do this, we'll give you a digital copy of the book.
- The code can be found here
- Pick two novel techniques from any two chapters of this book and implement them on any software engineering data.
- Remember, for full credit you will have to demonstrate its effectiveness in one or more of the fifteen categories from this checklist
How to discover the Bellwether?
- Look at Rahul Krishna's paper on Bellwethers. See section 5.2 and Figure 3.
- Briefly: given N projects, find the one project that can best be used to train a supervised learner. Then, using that one dataset, predict the class variable in the other projects.
- As of now, we find this one dataset (aka the Bellwether) by running a for-loop over all available pairs of projects. Unfortunately, this is an O(n^2) algorithm.
- Can you find a better way to discover this faster? It must be faster than O(n^2).
- Meet Rahul (TA) for datasets and potential ideas on how this can be done.
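For reference, the O(n^2) baseline to beat looks roughly like this; score is a placeholder for "train on one project, test on another", and the toy test data stands in for real datasets:

```python
def find_bellwether(projects, score):
    """projects: dict name -> dataset. score(train, test) -> higher is better.
    Return the project whose worst-case transfer score is best,
    using O(n^2) score() calls over all ordered project pairs."""
    best_name, best_worst = None, None
    for name, train in projects.items():
        worst = min(score(train, test)
                    for other, test in projects.items() if other != name)
        if best_worst is None or worst > best_worst:
            best_name, best_worst = name, worst
    return best_name
```

Anything faster has to avoid scoring every pair, e.g. by pruning candidates early.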
Predict if a commit is a bug-fix.
- Commit messages are usually descriptive natural language text that describe the changes the developers made.
- Example: Try to fix the problem with Tbuildsys#40 (/bin/sh: configure: command not found)
- Can you use NLP techniques (word2vec, sentiment analysis, etc.) to automatically classify a commit message as a bug-fix or not?
- Again, talk to Rahul (TA) for datasets and other potential ideas.
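Before reaching for word2vec, a keyword baseline is worth having as the strawman to beat (the keyword list below is my own guess, not from any dataset):

```python
import re

# Assumed fix-ish vocabulary; tune against real labeled commits.
FIX_KEYWORDS = {"fix", "fixed", "fixes", "bug", "defect", "fault", "patch", "repair"}

def looks_like_bugfix(message):
    """Crude baseline: flag a commit message if it contains any fix-ish keyword."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    return bool(tokens & FIX_KEYWORDS)
```

Any NLP model that cannot beat this on held-out commits is not earning its complexity.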