csc 510-001, (1877)
fall 2024, software engineering
Tim Menzies, timm@ieee.org, com sci, nc state



Fundamentals

ezr.py is a few hundred lines of a simple AI script that demonstrates many important fundamentals of SE and AI.

1 Before We Start….

Macro structure of ezr.py:

Terminology (watch for these words):

2 Overview

2.1 EZR is an active learner

The human condition is that we have to make it up as we go along. Active learning is a strategy for acting with partial knowledge, before all the facts are in.

To understand this, suppose we wanted to learn some model \(f\)

\[Y_1,Y_2,...\;=\;f(X_1,X_2,X_3....)\]

Any example contains zero or more \(X\) and \(Y\) values.
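For example, one row of the car data shown in section 2.4 splits into cheap \(X\) values and more costly \(Y\) values:

row  = [4, 90, 48, 78, 2, 1985, 21.1, 40]  # Clndrs,Volume,Hp,Model,origin, Lbs-,Acc+,Mpg+
x, y = row[:5], row[5:]  # x: cheap to read off a spec sheet; y: needs (e.g.) a road test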

2.2 SE Examples where finding \(X\) is cheaper than \(Y\)

2.3 Smart Labeling

 n  Task             Notes
 -  ---------------  ------------------------------------------------------------
 1  Sample a little  Get a few \(Y\) values (picked at random?)
 2  Learn a little   Build a tiny model from that sample
 3  Reflect          Compute \(b,r\) (how likely is an example to be best, or rest?)
 4  Acquire          Label an example that (e.g.) maximizes \(b/r\), then add it to the sample
 5  Repeat           Goto 2

So, an active learner tries to learn \(f\) using a lot of cheap \(X\) values, but very few \(Y\) values, since they are more expensive to access. Active learners know that they can learn better (faster, with fewer \(Y\) values) if they can select their own training data.
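Here is a minimal sketch of that loop. The label, train, and acquire functions are hypothetical stand-ins for the expensive \(Y\)-value lookup, the model builder, and the acquisition function (ezr.py's real loop appears in the Active Learner section below):

import random

def active_learn(rows, label, train, acquire, start=4, budget=30):
  "Sketch: label a few rows, then repeatedly label whichever row scores best."
  random.shuffle(rows)
  done, todo = rows[:start], rows[start:]       # (1) sample a little, at random
  for row in done: label(row)                   #     ...paying for a few Y values
  while todo and len(done) < budget:
    model = train(done)                         # (2) learn a little
    todo.sort(key=lambda r: acquire(model, r),  # (3) reflect: score unlabeled rows
              reverse=True)
    best, *todo = todo                          # (4) acquire: label the best guess
    label(best)
    done += [best]                              # (5) repeat
  return done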

2.4 Training Data

Active learners spend much time reasoning about the \(X\) values (which are cheap to collect) before deciding which dependent variables to collect next. A repeated result is that this tactic can produce good models, with minimal information about the dependent variables.

For training purposes, we explore all this using csv files where "?" denotes missing values. Row one lists the column names, and those names define the roles of the columns: names starting with an uppercase letter denote numeric columns (the rest are symbolic); names ending in "+" or "-" denote goals to maximize or minimize; a name ending in "!" denotes the class column; and a name ending in "X" denotes a column to ignore.

For example, here is data where the goals are Lbs-,Acc+,Mpg+; i.e. we want to minimize car weight and maximize acceleration and miles per gallon.

 Clndrs   Volume  HpX  Model  origin  Lbs-  Acc+  Mpg+
 -------  ------  ---  -----  ------  ----  ----  ----
  4       90      48   78     2       1985  21.1   40
  4       98      79   76     1       2255  17.7   30
  4       98      68   77     3       2045  18.4   30
  4       79      67   74     2       2000  16     30
  ...
  4      151      85   78     1       2855  17.6   20
  6      168      132  80     3       2910  11.4   30
  8      350      165  72     1       4274  12     10
  8      304      150  73     1       3672  11.1   10
  ------------------------------      ----------------
    independent features (x)          dependent goals (y)

Note that the top rows are better than the bottom ones (lighter, faster cars that are more economical).

For testing purposes, all the examples explored here come with all their \(Y\) values.

3 AI Notes

3.1 Aggregation Functions

To sort the data, all the goals have to be aggregated into one function. Inspired by the MOEA/D algorithm [1], EZR uses the Chebyshev function, which returns the max difference between the goal values and the best possible values. The rows shown above are sorted top to bottom, from least to most Chebyshev value (so the best rows, with the smallest Chebyshev values, are shown at top).

Chebyshev is very simple to code. We assume a list of goal columns self.cols.y, each of which knows its best possible value (col.goal: 0 if minimizing, 1 if maximizing) and how to normalize values to the range 0..1:

@of("Compute Chebyshev distance of one row to the best `y` values.")
def chebyshev(self:DATA,row:row) -> number:
  return  max(abs(col.goal - col.norm(row[col.at])) for col in self.cols.y)

@of("Returns 0..1 for min..max.")
def norm(self:NUM, x) -> number:
  return x if x=="?" else  ((x - self.lo) / (self.hi - self.lo + 1E-32))

When we say “rank DATA”, we mean sort all the rows by their Chebyshev distance:

@of("Sort rows by the Euclidean distance of the goals to heaven.")
def chebyshevs(self:DATA) -> DATA:
  self.rows = sorted(self.rows, key=lambda r: self.chebyshev(r))
  return self

3.2 Active Learning Buzzwords

For example, suppose ten examples have known \(X\) values, but only five have been labeled with a \(Y\) value (y = good, n = bad).

Before:

      y y y         n n   : Y (dependent, many missing values)
    1 2 3 4 5 6 7 8 9 10  : X (independent, most fully known)

Different acquisition tactics would then sample different points:

             x2 x2        : AQ1: explore  (usual initial tactic)
       x1                 : AQ2: exploit (once we know where the good stuff is)
      x3 x3        x3 x3  : AQ3: representativeness
    x4     x4    x4   x4  : AQ4: diversity
    x5     x5 x5 x5       : AQ5: perversity (to boldly go where no one has gone before)
word                   note
---------------------  ------------------------------------------------------------------
explore                look for places that can change our mind; in the above, try x=6
exploit                go to where things look best; in the above, try x=(2,3,4), since
                       that is where things look best. A.k.a. greedy search
adaptive               move from explore to exploit as the reasoning continues
acquisition function   the thing that tells us to explore/exploit/whatever
informativeness        a.k.a. exploit; go where it looks good
representativeness     equal parts in all regions; e.g. x=(2,3,4) half the time and
                       x=(9,10) half the time
diversity              ignore the Y values and sample randomly and widely
perversity             (my term) go where you ain't gone before
cold-start             no prior knowledge
warm-start             some prior knowledge (stream-based is usually warm-start)
population-based       objective function known; we (e.g.) surf its gradients looking
                       for cool places to study next
surrogate-based        objective function unknown, so we build an approximation from
                       the available data
pool-based             surrogate-based; we have knowledge of lots of independent variables
stream-based           surrogate-based; we have a (small) window of next examples, after
                       which we will see another, and another. Usually a warm-start tactic
model-query synthesis  surrogate- or population-based; look at the model so far and infer
                       where to try next. Kind of an extreme version of explore. May
                       include e.g. feature weighting to decide what to ignore

In most of my current experiments:

3.3 Configuration

Other people define their command-line options separately from their settings. That is, they have to define all those settings twice.

This code parses the settings from the doc string (see the SETTINGS class). So the help text and the definitions of the options can never go out of sync.
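Here is a minimal sketch of the idea. The option format below is an assumption for illustration (ezr.py's own help string differs in detail):

import re

HELP = """
demo.py: settings live in the help text
  -s --seed    random number seed     = 1234567891
  -b --buffer  todo-list buffer size  = 100
"""

def coerce(s):
  "Strings to int, float, or (failing that) string."
  for cast in (int, float):
    try: return cast(s)
    except ValueError: pass
  return s

def settings(txt):
  "Pull `--name ... = default` pairs out of the help text."
  return {k: coerce(v) for k, v in re.findall(r"--(\w+)[^=\n]*=\s*(\S+)", txt)}

the = settings(HELP)   # ==> {'seed': 1234567891, 'buffer': 100}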

3.4 Classes

This code has only a few main classes: SETTINGS, DATA, COLS, NUM, SYM

@dataclass
class COLS:
  names: list[str]   # column names
  all  : list[COL] = LIST()  # all NUMS and SYMS
  x    : list[COL] = LIST()  # independent COLumns
  y    : list[COL] = LIST()  # dependent COLumns
  klass: COL = None

To build the columns, COLS looks at each name’s a,z (first and last letter).

  def __post_init__(self:COLS) -> None:
    for at,txt in enumerate(self.names):
      a,z = txt[0],txt[-1]
      col = (NUM if a.isupper() else SYM)(at=at, txt=txt)
      self.all.append(col)
      if z != "X":
        (self.y if z in "!+-" else self.x).append(col)
        if z=="!": self.klass = col
        if z=="-": col.goal = 0

3.5 Smarts

3.5.1 Bayes classifier

When you have labels, a simple and fast technique is a Bayes classifier: use the labeled examples to compute how much each class "likes" a new example, then pick the class that likes it most.

The probability of x belonging to a column is pretty simple:

@of("How much a SYM likes a value `x`.")
def like(self:SYM, x:any, prior:float) -> float:
  return (self.has.get(x,0) + the.m*prior) / (self.n + the.m)

@of("How much a NUM likes a value `x`.")
def like(self:NUM, x:number, _) -> float:
  v     = self.sd**2 + 1E-30
  nom   = exp(-1*(x - self.mu)**2/(2*v)) + 1E-30
  denom = (2*pi*v) **0.5
  return min(1, nom/(denom + 1E-30))
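In math terms (ignoring the small constants added for numerical safety, and the cap at one), these two are the m-estimate for symbols and the Gaussian probability density for numbers:

\[P(x\mid SYM)=\frac{count(x)+m\cdot prior}{n+m} \qquad\qquad P(x\mid NUM)=\frac{1}{\sqrt{2\pi\sigma^2}}\;e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]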

The likelihood of a row belonging to a label, given new evidence, is the prior probability of the label times the probability of the evidence. For example, if we have three oranges and six apples, then the prior on oranges is 33%.

For numerical methods reasons, we add tiny counts to the attribute and class frequencies (\(k=1,m=2\)) and treat all the values as logarithms (since these values can get real small, real fast).

@of("How much DATA likes a `row`.")
def loglike(self:DATA, r:row, nall:int, nh:int) -> float:
  prior = (len(self.rows) + the.k) / (nall + the.k*nh)
  likes = [c.like(r[c.at], prior) for c in self.cols.x if r[c.at] != "?"]
  return sum(log(x) for x in likes + [prior] if x>0)
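For example, with the three oranges and six apples above, the default \(k=1\), and \(nh=2\) classes over \(nall=9\) examples, the smoothed prior on oranges is \((3+1)/(9+1\cdot 2)=4/11\approx 0.36\), close to the raw 33%.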

For more notes on Bayes, see this example

3.5.2 Active Learner

The active learner uses a Bayes classifier to guess the likelihood that an unlabeled example should be labeled next.

  1. All the unlabeled data is split into a tiny done set and a much larger todo set.
  2. All the dones are labeled, then ranked, then divided into the \(\sqrt{N}\) best and the remaining rest.
  3. Some sample of the todos are then sorted by their probabilities of being best (B), not rest (R).
    • The following code uses \(B-R\).
    • But these values are logs, so this is really \(B/R\).
  4. The top item in that sort is then labeled and moved to done.
    • And the cycle repeats.
@of("active learning")
def activeLearning(self:DATA, score=lambda B,R: B-R, generate=None, faster=True):
  def ranked(rows): return self.clone(rows).chebyshevs().rows

  def todos(todo):
    if faster: # Apply our sorting heuristics to just a small buffer at start of "todo"
       # Rotate back half of the buffer to end of list. Shift left to fill in the gap.
       n = the.buffer//2
       return todo[:n] + todo[2*n: 3*n],  todo[3*n:] + todo[n:2*n]
    else: # Apply our sorting heuristics to all of todo.
      return todo,[]

  def guess(todo:rows, done:rows) -> rows:
    cut  = int(.5 + len(done) ** the.cut)
    best = self.clone(done[:cut])  # --------------------------------------------------- [2]
    rest = self.clone(done[cut:])
    a,b  = todos(todo) 
    if generate: # don't worry about this bit
      return self.neighbors(generate(best,rest), a) + b  # ----------------------------- [3]
    else:
      key  = lambda r: score(best.loglike(r, len(done), 2), rest.loglike(r, len(done), 2))
      return  sorted(a, key=key, reverse=True) + b # ----------------------------------- [3]

  def loop(todo:rows, done:rows) -> rows:
    for k in range(the.Last - the.label):
      if len(todo) < 3 : break
      top,*todo = guess(todo, done)
      done     += [top]   # ------------------------------------------------------------ [3]
      done      = ranked(done)
    return done

  return loop(self.rows[the.label:], ranked(self.rows[:the.label])) #------------------- [1]

The default config here is the.label=4 and the.Last=30; i.e. four initial evaluations, then 26 more after that.
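For example, here is a hypothetical end-to-end run (the import list follows extend.py, shown later; the file path is illustrative):

from ezr import the, DATA, csv

d    = DATA().adds(csv("../moot/optimize/misc/auto93.csv"))  # load rows, summarize columns
done = d.activeLearning()     # labels the.label=4 rows, then the.Last-the.label=26 more
print(d.chebyshev(done[0]))   # done comes back ranked, so done[0] is the best row found

Note that done holds only the rows we paid to label; everything left in todo was never evaluated.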

TL;DR: to explore better methods for active learning:

4 SE Notes

Programming idioms are low-level patterns specific to a particular programming language. For example, see decorators, which are a Python construct.

Idioms are small things. Bigger than idioms are patterns: elegant solutions to recurring problems. Some folks have proposed extensive catalogs of patterns. These are worth reading. As for me, patterns are things I reuse whenever I do development in any language. This code uses many patterns (see below).

Even bigger than patterns are architectural styles: a high-level conceptual view of how the system will be created, organized, and/or operated.

4.1 Architectural Styles

This code is pipe-and-filter: it can accept data from some prior process, or it can read a file directly. The following two calls are equivalent (since "-" denotes standard input). This pipe-and-filter style is important since it lets small programs be coupled into bigger ones:

python3.13 -B ezr.py -t ../moot/optimize/misc/auto93.csv -e _mqs
cat ../moot/optimize/misc/auto93.csv | python3.13 -B ezr.py -t -  -e _mqs

(Aside: to see how to read from standard input or a file, see def csv in the source code.)

Pipe-and-filters are a very famous architectural style:

Doug McIlroy, Bell Labs, 1986: “We should have some ways of coupling programs like garden hose…. Let programmers screw in another segment when it becomes necessary to massage data in another way…. Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information.”

Pipes changed the whole idea of UNIX:

For example, my build files have help text after a ## symbol. The following script prints a little help text describing that build script. It is a pipe between grep, sort, and awk. Note the separation of concerns (which means that our task now divides into tiny tasks, each of which can be optimized separately):

help: ## print help
    printf "\n#readme\nmake [OPTIONS]\n\nOPTIONS:\n"
    grep -E '^[a-zA-Z_\.-]+:.*?## .*$$' $(MAKEFILE_LIST) \
        | sort \
        | awk 'BEGIN {FS = ":.*?## "}\
                   {printf "  \033[36m%-10s\033[0m %s\n", $$1, $$2}'

This produces:

% make help

#readme
make [OPTIONS]

OPTIONS:
  README.md  update README.md, publish
  help       print help
  push       commit to Git. 

Pipes are seen in scripting environments and are used a lot in modern languages, e.g. in "R". Here we compute how many gallons of gas each 4-cylinder car would need for a 75 mile trip:

library(dplyr) # load dplyr for the pipe and other tidy functions
data(mtcars) # load the mtcars dataset

df <- mtcars %>%               # take mtcars. AND THEN...
    filter(cyl == 4) %>%       # filter it to four-cylinder cars, AND THEN...
    select(mpg) %>%            # select only the mpg column, AND THEN...
    mutate(car = row.names(.),  # add a column for the car name, and a column for
           gallons = 75/mpg)    # the gallons used on a 75 mile trip (miles / mpg)

4.1.1 Social patterns: Coding for Teams

This code is poorly structured for team work:

4.1.2 Pattern: All code needs doco

Code has much auto-documentation: functions have type hints and doc strings; there is a help string at front (from which we parse out the config); and there are worked examples (at back).

For examples of methods for adding that doco, see the make help command above.

4.1.3 Pattern: All code needs tests

Maurice Wilkes recalled the exact moment he realized the importance of debugging: “By June 1949, people had begun to realize that it was not so easy to get a program right as had at one time appeared. It was on one of my journeys between the EDSAC room and the punching equipment that the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.”

Half the time of any system is spent in testing. Lesson: don’t code it the night before.

Testing is more than just “finding bugs”. Test suites are a great way to communicate code and to offer continuous quality assurances.

ezr.py has tests (worked examples at back); these are about a quarter of the code base.

There is much more to say about tests. That is another story and will be told another time.

4.1.4 Pattern: Configuration

4.1.5 Pattern: Function vs Object-Oriented

Object-oriented code is grouped by class. But some folks doubt that approach:

My code is function-oriented: methods are grouped via method name (see the of decorator). This makes it easier to teach related concepts (since the concepts are together in the code).

Doing it this way was inspired by some words of Donald Knuth, who pointed out that the order in which we want to explain code may not be the same as the order needed by the compiler. So he wrote a "tangle" system where code and comments, ordered for explanation, were rejigged at load time into what the compiler needs. I found I could do a small part of Knuth's tangle with a five-line decorator.

4.1.6 Pattern: Short functions

4.1.7 Pattern: DRY, not WET

4.1.8 Pattern: Little Languages

4.1.8.1 Little Languages: Make

Make files let us store all our little command-line tricks in one convenient location. Make development was started in 1977 by Stuart Feldman, then a Bell Labs summer intern (go interns!). It is worthy of study since it is widely available on many environments.

There are now many build tools available, for example Apache ANT, doit, and nmake for Windows. Which is best for you depends on your requirements, intended usage, and operating system. However, they all share the same fundamental concepts as Make.

Make has “rules” and the rules have three parts: target, dependents (which can be empty), and code to build the target from the dependents. For example, the following two rules have code that simplifies our interaction with git.

pull    : ## download
   git pull

push    : ## save
  echo -en "\033[33mWhy this push? \033[0m"; read x; git commit -am "$$x"; git push; git status

For rules with dependents, the target is not changed unless there are newer dependents. For example, here is the rule that made this file. Note that this process needs a bunch of scripts, a css file, etc. Make will update docs/%.html if ever any of those dependents change.

docs/%.html : %.py etc/py2html.awk etc/b4.html docs/ezr.css Makefile ## make doco: md -> html
    echo "$< ... "
    gawk -f etc/py2html.awk $< \
    | pandoc -s  -f markdown --number-sections --toc --toc-depth=5 \
                    -B etc/b4.html --mathjax \
             --css ezr.css --highlight-style tango \
                     --metadata title="$<" \
                 -o $@ 

This means that make docs/*.html will update all the html files at this site. And if we call this command twice, the second call will do nothing at all since docs/%.html is already up to date. This can save a lot of time during build procedures.

The makefile for any particular project can get very big. Hence, it is good practice to add an auto-document rule (see the make help command, above). Note that this is an example of the all code needs doco pattern (also described above).

4.1.8.2 Little Languages: Regular Expressions
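For one example, the csv function of the iterators section (below) cleans each line of a file with a single regular expression that deletes whitespace and #-style comments:

import re

line = "4, 90, 48, 78, 2, 1985, 21.1, 40   # one car from auto93.csv"
print(re.sub(r'([\n\t\r ]|#.*)', '', line))   # ==> 4,90,48,78,2,1985,21.1,40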

4.2 Validation

Validation procedures are designed to counter two anti-patterns in learning: order effects, and testing on the training data. For example, the xval function (described below) randomly shuffles the order of the training data (to handle order effects), then splits the data into bins (where one bin is for testing and the others are for training):

Another validation scheme, suitable for streaming data, is to train on past data, then test on some limited buffer into the future:
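Here is a minimal sketch of that streaming scheme, assuming rows arrive in time order (the window sizes are illustrative, not ezr.py defaults):

def temporal_splits(rows, train_size=100, test_size=20):
  "Train on a sliding window of past rows; test on the next few rows."
  for end in range(train_size, len(rows) - test_size + 1, test_size):
    yield rows[end - train_size : end], rows[end : end + test_size]

# usage: for train, test in temporal_splits(rows): ...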

4.3 Python Idioms

4.3.1 Magic Methods

Dunder = double underscore = “__”

Dunders define certain core functions; e.g.

For example, here's a quick and dirty hold-all class that can print its own contents:

class o:
  __init__ = lambda i,**d : i.__dict__.update(d)
  __repr__ = lambda i     : i.__class__.__name__+str(i.__dict__)

the = o(
  seed  = 1234567891,
  round = 2,
  stats = o(cohen=0.35,
            cliffs=0.195, #border between small=.11 and medium=.28
            bootstraps=512,
            confidence=0.05))

With the above, now our code can say things like the.stats.cohen.

If you want, you can even redefine + (__add__) or == (__eq__):

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

p1 = Point(1, 2)
p2 = Point(1, 2)
p3 = Point(3, 4)

print(p1 == p2)  # True
print(p1 == p3)  # False
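And, sketching the other half of that claim, here is the same class with + redefined as well:

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __add__(self, other):   # "+" now builds a new Point from two others
        return Point(self.x + other.x, self.y + other.y)

p4 = Point(1, 2) + Point(3, 4)
print(p4.x, p4.y)   # ==> 4 6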

There are so many other dunders:

Magic Method    Description
------------    -------------------------------------------------------
__init__        Initializes a new instance of a class.
__del__         Called before an object is destroyed.
__repr__        Returns an official string representation of an object.
__str__         Returns a string representation of an object.
__len__         Returns the length of an object.
__getitem__     Gets an item from an object using indexing.
__setitem__     Sets an item in an object using indexing.
__delitem__     Deletes an item from an object using indexing.
__iter__        Returns an iterator object.
__next__        Returns the next item from an iterator.
__contains__    Checks if an object contains a value.
__call__        Allows an object to be called like a function.
__eq__          Checks if two objects are equal.
__ne__          Checks if two objects are not equal.
__lt__          Checks if one object is less than another.
__le__          Checks if one object is less than or equal to another.
__gt__          Checks if one object is greater than another.
__ge__          Checks if one object is greater than or equal to another.
__add__         Adds two objects.
__sub__         Subtracts one object from another.
__mul__         Multiplies two objects.
__truediv__     Divides one object by another.
__floordiv__    Performs floor division on two objects.
__mod__         Returns the remainder of division.
__pow__         Raises one object to the power of another.
__and__         Performs bitwise AND on two objects.
__or__          Performs bitwise OR on two objects.
__xor__         Performs bitwise XOR on two objects.
__lshift__      Performs left bitwise shift on an object.
__rshift__      Performs right bitwise shift on an object.
__neg__         Returns the negation of an object.
__pos__         Returns the positive value of an object.
__abs__         Returns the absolute value of an object.
__invert__      Returns the bitwise inversion of an object.
__round__       Rounds an object to a given number of digits.
__bool__        Returns the boolean value of an object.
__hash__        Returns the hash value of an object.
__enter__       Called when entering a context (with statement).
__exit__        Called when exiting a context (with statement).

4.3.2 Data classes

This code uses dataclasses. These are a great shorthand for defining classes, since all dataclasses supply their own init and pretty-print methods. For example, here is a class written without dataclasses:

class Person():
    def __init__(self, name='Joe', age=30, height=1.85, email='joe@dataquest.io'):
        self.name = name
        self.age = age
        self.height = height
        self.email = email

But with dataclasses:

from dataclasses import dataclass
@dataclass
class Person():
    name: str = 'Joe'
    age: int = 30
    height: float = 1.85
    email: str = 'joe@dataquest.io'

print(Person(name='Tim', age=1000))
==> Person(name='Tim', age=1000, height=1.85, email='joe@dataquest.io')

4.3.3 Type hints

In other languages, types are taken very seriously and are the basis for computation.

The Python type system was a bolt-on to later versions of the language. Hence, it is not so well-defined.

But type hints are a great documentation tool, since they let the programmer tell the reader what goes in and out of their functions.

Firstly, you can define your own types. For example, the classes type (defined below) stores rows of data about (e.g.) dogs and cats in a dictionary whose keys are "dogs" and "cats":

data = dict(dogs=[['ralph', 'poodle',   2021],
                  ['benhi', 'labrador', 2022]],
            cats=[['miss meow', 'ginger', 2020]])  # etc.

We can define those types as follows:

from __future__ import annotations
from typing import Any as any
from typing import List, Dict, Type, Callable, Generator

number  = float  | int   #
atom    = number | bool | str # and sometimes "?"
row     = list[atom]
rows    = list[row]
classes = dict[str,rows] # `str` is the class name

Then we can define a classifier as something that accepts classes and a new row and returns a guess as to what class it belongs to:

def classifier(data: classes, example: row) -> str:
  ...

Or, for a nearest-neighbor classifier, we can define a function that sorts all the rows by their distance to some new row called row1 (and here, the nearest neighbor to row1 is the first item in the returned list):

def neighbors(self:DATA, row1:row, rows:rows=None) -> rows:
  return sorted(rows, key=lambda row2: self.dist(row1, row2))

4.3.4 Abstraction

(Note that the following abstractions are available in many languages. So are they a pattern? Or an idiom? I place them here since the examples are Python-specific.)

4.3.4.1 Exception Handling

Exception handling in Python allows you to handle errors gracefully without crashing your program. The basic structure involves try, except, else, and finally blocks.
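As a sketch, here are all four blocks together (the file name is illustrative):

try:
    f = open("data.csv")           # code that might fail
except FileNotFoundError as err:
    print("no such file:", err)    # runs only if that error was raised
else:
    print(f.readline()); f.close() # runs only if no exception was raised
finally:
    print("done")                  # always runs, error or not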

E.g. suppose you want a string to become a float, int, or bool and, if all that fails, to stay a string. The ast module has a method literal_eval that can handle most of that. So if it fails, we just return the string:

import ast

def coerce(s:str) -> atom:
  "Coerces strings to atoms."
  try: return ast.literal_eval(s)
  except Exception:  return s

Fun fact: I once spent an afternoon trying to debug something where everything had turned to strings. Turns out, I’d forgotten to import ast so the try part of the above was always failing.

4.3.4.2 Iterators

Iterators are things that do some set up, yield one thing, then wait till asked, then yield another thing, then wait till asked, and so on. They offer a simple interface to some underlying complex process.

For example, my code’s csv function opens a file, removes spaces from each line, skips empty lines, splits lines on a comma, then coerces each item in the row to some Python type. Note that this function does not return, but it yields.

def csv(file) -> Generator[row]:
  infile = sys.stdin if file=="-" else open(file)
  with infile as src:
    for line in src:
      line = re.sub(r'([\n\t\r ]|#.*)', '', line)
      if line: yield [coerce(s.strip()) for s in line.split(",")]

# def coerce(s:str) -> atom: (defined above)

We can call it this way (note the simplicity of the interface)

for row in csv(fileName): 
   # row is now something like [4, 86, 65, 80, 3, 2110, 17.9, 50]
   doSomething(row)

Here's another that implements a cross-validation test rig where learners train on some data, then test on some hold-out.

  1. To avoid learning things due to trivial orderings in the file, we shuffle the whole list.
  2. The shuffled list is then split into n bins.
  3. For each bin, yield it as the test set, and all the other bins as the training set.
  4. Optionally, use only some random sample of the training data.

The following is an m-by-n cross-val; that is, from m shufflings, yield n train,test pairs. For the default values (m=n=5), this yields 25 train,test pairs.

def xval(lst:list, m:int=5, n:int=5, some:int=10**6) -> Generator[rows,rows]:
  for _ in range(m):
    random.shuffle(lst)        # -------------------------------- [1]
    for n1 in range(n):
      lo = len(lst)/n * n1      # ------------------------------- [2]  
      hi = len(lst)/n * (n1+1)
      train, test = [],[]
      for i,x in enumerate(lst):
        (test if i >= lo and i < hi else train).append(x) 
      train = random.choices(train, k=min(len(train),some)) # --- [4]
      yield train,test  #---------------------------------------- [3]
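Usage mirrors the csv example above (the file path is illustrative; note that we drop the header row of column names):

rows = list(csv("../moot/optimize/misc/auto93.csv"))[1:]  # drop the names row
for train, test in xval(rows):
  print(len(train), len(test))    # 25 train,test pairs for the default m=n=5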

4.3.5 Comprehensions

This code makes extensive use of comprehensions. E.g. to find the middle of a cluster, ask each column for its middle point:

@of("Return central tendency of a DATA.")
def mid(self:DATA) -> row:
  return [col.mid() for col in self.cols.all]

@of("Return central tendency of NUMs.")
def mid(self:NUM) -> number: return self.mu

@of("Return central tendency of SYMs.")
def mid(self:SYM) -> number: return self.mode

Comprehensions can be used to filter data:

>>> [i for i in range(10) if i % 2 == 0]
[0, 2, 4, 6, 8]

Here’s one for loading tab-separated files with optional comment lines starting with a hash mark:

data = [line.strip().split("\t") for line in open("my_file.tab") \
        if not line.startswith('#')]

E.g. here are two examples of an implicit iterator (a generator expression) in the argument to sum:

@of("Entropy = measure of disorder.")
def ent(self:SYM) -> number:
  return - sum(n/self.n * log(n/self.n,2) for n in self.has.values())

@of("Euclidean distance between two rows.")
def dist(self:DATA, r1:row, r2:row) -> float:
  n = sum(c.dist(r1[c.at], r2[c.at])**the.p for c in self.cols.x)
  return (n / len(self.cols.x))**(1/the.p)

E.g. here we:

  1. Use a dictionary comprehension to make a dictionary with one empty list per key;
  2. Use a list comprehension to add items into those lists;
  3. Finally, use a dictionary comprehension to return a dictionary with one prediction per column.
@of("Return predictions for `cols` (defaults to klass column).")
def predict(self:DATA, row1:row, rows:rows, cols=None, k=2):
  cols = cols or self.cols.y
  got = {col.at : [] for col in cols}                           -- [1]
  for row2 in self.neighbors(row1, rows)[:k]:
    d =  1E-32 + self.dist(row1,row2)
    [got[col.at].append( (d, row2[col.at]) )  for col in cols]  -- [2]
  return {col.at : col.predict( got[col.at] ) for col in cols}  -- [3]

4.3.6 Decorators

Decorators are functions that wrap, or modify, other functions. For example, the of decorator below (mentioned in the function-vs-object-oriented pattern above) attaches the decorated function, as a method, to the class named in its self type hint:

import inspect

def of(doc):
  "Attach the decorated function, plus this doc string, to its `self` class."
  def doit(fun):
    fun.__doc__ = doc
    self = inspect.getfullargspec(fun).annotations['self']  # the class name, e.g. "DATA"
    setattr(globals()[self], fun.__name__, fun)             # add fun as a method of that class
  return doit

@of("Return central tendency of a DATA.")
def mid(self:DATA) -> row:
  return [col.mid() for col in self.cols.all]

@of("Return central tendency of NUMs.")
def mid(self:NUM) -> number: return self.mu

@of("Return central tendency of SYMs.")
def mid(self:SYM) -> number: return self.mode

@of("Return diversity of a NUM.")
def div(self:NUM) -> number: return self.sd

@of("Return diversity of a SYM.")
def div(self:SYM) -> number: return self.ent()

5 Try it for yourself

5.1 Fork a repo

5.2 Get Python3.13

Make sure you are running Python 3.13. On Linux and GitHub Codespaces, the command is:

sudo apt update -y; sudo  apt upgrade -y; sudo apt install software-properties-common -y; sudo add-apt-repository ppa:deadsnakes/ppa -y ; sudo apt update -y ; sudo apt install python3.13 -y

Now check that you have Python 3.13:

python3.13 -B --version

5.3 Try one run

cd /workspaces/ezr
git checkout 24Aug14
python3.13 -B ezr.py -t data/misc/auto93.csv -e _mqs

5.4 Try a longer run

This takes a few minutes and writes its output to ~/tmp/mqs:

cd /workspaces/ezr
Here=$PWD
cd data/optimize/process
mkdir -p ~/tmp/mqs
for i in *.csv ;  do $Here/ezr.py  -D -e _mqs -t $i | tee  ~/tmp/mqs/$i  ; done

5.5 Write your own extensions

Here’s a file extend.py in the same directory as ezr.py

import sys,random
from ezr import the, DATA, csv, dot

def show(lst):
  return print(*[f"{word:6}" for word in lst], sep="\t")

def myfun(train):
  d    = DATA().adds(csv(train))
  x    = len(d.cols.x)
  size = len(d.rows)
  dim  = "small" if x <= 5 else ("med" if x < 12 else "hi")
  size = "small" if size< 500 else ("med" if size<5000 else "hi")
  return [dim, size, x,len(d.cols.y), len(d.rows), train[17:]]

random.seed(the.seed) #  not needed here, but good practice to always take care of seeds
show(["dim", "size","xcols","ycols","rows","file"])
show(["------"] * 6)
[show(myfun(arg)) for arg in sys.argv if arg[-4:] == ".csv"]

On my machine, when I run …

python3.13 -B extend.py data/optimize/[chmp]*/*.csv > ~/tmp/tmp
sort -r -k 1,2 ~/tmp/tmp

… this prints some stats on the data files:

dim     size    xcols   ycols   rows    file
------  ------  ------  ------  ------  ------
small   small        4       3     398  misc/auto93.csv
small   small        4       2     259  config/SS-H.csv
small   small        3       2     206  config/SS-B.csv
small   small        3       2     196  config/SS-G.csv
small   small        3       2     196  config/SS-F.csv
small   small        3       2     196  config/SS-D.csv
small   small        3       1     196  config/wc+wc-3d-c4-obj1.csv
small   small        3       1     196  config/wc+sol-3d-c4-obj1.csv
small   small        3       1     196  config/wc+rs-3d-c4-obj1.csv
small   med          5       2    1080  config/SS-I.csv
small   med          3       2    1512  config/SS-C.csv
small   med          3       2    1343  config/SS-A.csv
small   med          3       2     756  config/SS-E.csv
small   hi           5       3   10000  hpo/healthCloseIsses12mths0011-easy.csv
small   hi           5       3   10000  hpo/healthCloseIsses12mths0001-hard.csv
med     small        9       1     192  config/Apache_AllMeasurements.csv
med     med         11       2    1023  config/SS-P.csv
med     med         11       2    1023  config/SS-L.csv
med     med         11       2     972  config/SS-O.csv
med     med         10       2    1599  misc/Wine_quality.csv
med     med          9       3     500  process/pom3d.csv
med     med          6       2    3840  config/SS-S.csv
med     med          6       2    3840  config/SS-J.csv
med     med          6       2    2880  config/SS-K.csv
med     med          6       1    3840  config/rs-6d-c3_obj2.csv
med     med          6       1    3840  config/rs-6d-c3_obj1.csv
med     med          6       1    2880  config/wc-6d-c1-obj1.csv
med     med          6       1    2866  config/sol-6d-c2-obj1.csv
med     hi          11       2   86058  config/SS-X.csv
med     hi           9       3   20000  process/pom3c.csv
med     hi           9       3   20000  process/pom3b.csv
med     hi           9       3   20000  process/pom3a.csv
hi      small       22       4      93  process/nasa93dem.csv
hi      med         38       1    4653  config/SQL_AllMeasurements.csv
hi      med         21       2    4608  config/SS-U.csv
hi      med         17       5    1000  process/coc1000.csv
hi      med         17       3     864  config/SS-M.csv
hi      med         16       1    1152  config/X264_AllMeasurements.csv
hi      med         14       2    3008  config/SS-R.csv
hi      med         14       1    3456  config/HSMGP_num.csv
hi      med         13       3    2736  config/SS-Q.csv
hi      hi          23       4   10000  process/xomo_osp2.csv
hi      hi          23       4   10000  process/xomo_osp.csv
hi      hi          23       4   10000  process/xomo_ground.csv
hi      hi          23       4   10000  process/xomo_flight.csv
hi      hi          17       2   53662  config/SS-N.csv
hi      hi          16       2   65536  config/SS-W.csv
hi      hi          16       2    6840  config/SS-V.csv
hi      hi          12       2    5184  config/SS-T.csv

Try modifying the output to add columns to report counts of the number of symbolic and numeric columns.


  1. Q. Zhang and H. Li, “MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition,” in IEEE Transactions on Evolutionary Computation, vol. 11, no. 6, pp. 712-731, Dec. 2007, doi: 10.1109/TEVC.2007.892759.↩︎

  2. T. Xu et al., "Hey, You Have Given Me Too Many Knobs! Understanding and Dealing with Over-Designed Configuration in System Software", ESEC/FSE 2015.↩︎