CSC 510-001 (1877), Fall 2024, Software Engineering
Tim Menzies, timm@ieee.org, Com Sci, NC State
ezr.py is a few hundred lines of simple AI scripting that demonstrates many important fundamentals of SE and AI.
Macro structure of ezr.py: demos live in the `egs` class and can be called from the command line; e.g. `-e mqs` calls `egs.mqs()`.
Terminology (watch for these words):
The human condition is that we have to make it up as we go along. Active learning is a strategy for acting with partial knowledge, before all the facts are in.
To understand this, suppose we wanted to learn some model \(f\)
\[Y_1,Y_2,...\;=\;f(X_1,X_2,X_3....)\]
Any example contains zero or more \(X\) and \(Y\) values
n | Task | Notes |
---|---|---|
1 | Sample a little | Get a few \(Y\) values (picked at random?) |
2 | Learn a little | Build a tiny model from that sample |
3 | Reflect | Compute \(b,r\) |
4 | Acquire | Label an example that (e.g.) maximizes \(b/r\), then add it to the sample. |
5 | Repeat | Goto 2 |
So, an active learner tries to learn \(f\) using a lot of cheap \(X\) values, but very few \(Y\) values– since they are more expensive to access. Active learners know that they can learn better (faster, with fewer \(Y\) values) if they can select their own training data:
Active learners spend much time reasoning about the \(X\) values (which are cheap to collect) before deciding which dependent variables to collect next. A repeated result is that this tactic can produce good models, with minimal information about the dependent variables.
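To make that loop concrete, here is a minimal sketch in Python. The helper names `label`, `build_model`, and `score` are assumptions for illustration; they are not functions from ezr.py.

```python
import random

def active_learn(pool, label, build_model, score, init=4, budget=30):
    # 1. Sample a little: label a few examples picked at random.
    random.shuffle(pool)
    done, todo = [label(x) for x in pool[:init]], pool[init:]
    while len(done) < budget and todo:
        model = build_model(done)                               # 2. Learn a little
        todo.sort(key=lambda x: score(model, x), reverse=True)  # 3. Reflect (e.g. rank by b/r)
        done.append(label(todo.pop(0)))                         # 4. Acquire the most promising example
    return build_model(done)                                    # 5. Repeat until the labeling budget is spent
```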
For training purposes we explore all this using csv files where “?” denotes missing values. Row one lists the column names, defining the roles of the columns:
For example, here is data where the goals are Lbs-, Acc+, Mpg+; i.e. we want to minimize car weight and maximize acceleration and miles per gallon (fuel efficiency).
Clndrs Volume HpX Model origin Lbs- Acc+ Mpg+
------- ------ --- ----- ------ ---- ---- ----
4 90 48 78 2 1985 21.1 40
4 98 79 76 1 2255 17.7 30
4 98 68 77 3 2045 18.4 30
4 79 67 74 2 2000 16 30
...
4 151 85 78 1 2855 17.6 20
6 168 132 80 3 2910 11.4 30
8 350 165 72 1 4274 12 10
8 304 150 73 1 3672 11.1 10
------------------------------ ----------------
independent features (x) dependent goals (y)
Note that the top rows are better than the bottom ones (lighter, faster cars that are more economical).
Usually it is cheap to collect the \(X\) values but very expensive to get the \(Y\) values. For testing purposes, all the examples explored here come with all their \(Y\) values.
To sort the data, all the goals have to be aggregated into one function. Inspired by the MOEA/D algorithm [1], EZR uses the Chebyshev function that returns the max difference between the goal values and the best possible values. The rows shown above are sorted top to bottom, least to most Chebyshev values (so the best rows, with smallest Chebyshev, are shown at top).
Chebyshev is very simple to code. We assume a list of goal columns `self.cols.y`, each of which knows:
- its column position, `col.at`;
- how to normalize values 0..1, min..max, using `col.norm(x)`;
- its goal, `col.goal`: goal=0 for columns to minimize, goal=1 for columns to maximize.
@of("Compute Chebyshev distance of one row to the best `y` values.")
def chebyshev(self:DATA,row:row) -> number:
  return max(abs(col.goal - col.norm(row[col.at])) for col in self.cols.y)

@of("Returns 0..1 for min..max.")
def norm(self:NUM, x) -> number:
  return x if x=="?" else ((x - self.lo) / (self.hi - self.lo + 1E-32))
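For example, suppose a row's goal values normalize to Lbs-=0.1, Acc+=0.7, Mpg+=0.8. The best value is 0 for Lbs- (a minimize goal) and 1 for the other two (maximize goals), so the differences are |0-0.1|=0.1, |1-0.7|=0.3, and |1-0.8|=0.2, and the Chebyshev distance is max(0.1, 0.3, 0.2)=0.3 (smaller is better).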
When we say “rank DATA”, we mean sort all the rows by their Chebyshev distance:
@of("Sort rows by the Euclidean distance of the goals to heaven.")
def chebyshevs(self:DATA) -> DATA:
self.rows = sorted(self.rows, key=lambda r: self.chebyshev(r))
return self
e.g.

Before:
y y y    n  n              : Y (dependent, many missing values)
1 2 3 4 5 6 7 8 9 10       : X (independent, most fully known)

So:
x2 x2                      : AQ1: explore (usual initial tactic)
x1                         : AQ2: exploit (once we know where the good stuff is)
x3 x3 x3 x3                : AQ3: representativeness
x4 x4 x4 x4                : AQ4: diversity
x5 x5 x5 x5                : AQ5: perversity (to boldly go where no one has gone before)
words | note |
---|---|
explore | looking for places that can change our mind. In the above, try x=6 |
exploit | go to where things look best. In the above, try x=(2,3,4) (since that is where things look best). A.k.a. greedy search |
adaptive | moving from explore to exploit as the reasoning continues |
acquisition functions | the thing that tells us to explore/exploit/whatever |
informativeness | a.k.a. exploit. Go where it looks good. |
representativeness | equal parts in all regions; e.g. x=2,3,4 half the time and x=9,10 half the time |
diversity | ignore the Y values and sample randomly and widely |
perversity (my term) | go where you ain’t gone before. |
cold-start | no prior knowledge |
warm-start | some prior knowledge (stream-based is usually warm-start) |
population-based | objective function known; we (e.g.) surf its gradients looking for cool places to study next |
surrogate-based | objective function unknown, so we build an approximation from the available data |
pool-based | surrogate-based; have knowledge of lots of independent variables |
stream-based | surrogate-based; have a (small) window of next examples, after which we will see another, and another. This is usually a warm-start tactic |
model-query synthesis | surrogate- or population-based; look at the model so far, infer where to try next. Kind of an extreme version of explore. May include e.g. feature weighting to decide what to ignore |
In most of my current experiments:
Other people define their command line options separately from the settings; that is, they have to define all those settings twice.
This code parses the settings from the doc string (see the SETTINGS class), so the help text and the definitions of the options can never go out of sync.
This code has only a few main classes: SETTINGS, DATA, COLS, NUM, SYM
Settings are stored in the `the` variable (so `the = SETTINGS()`). DATA stores `rows`, summarized in `cols` (columns). COLS holds `all` the columns (and some are also stored in `x` and `y`).

@dataclass
class COLS:
  names: list[str]            # column names
  all  : list[COL] = LIST()   # all NUMS and SYMS
  x    : list[COL] = LIST()   # independent COLumns
  y    : list[COL] = LIST()   # dependent COLumns
  klass: COL = None
To build the columns, COLS looks at each name’s a,z
(first and last letter).
['Clndrs', 'Volume', 'HpX', 'Model','origin', 'Lbs-', 'Acc+', 'Mpg+']
def __post_init__(self:COLS) -> None:
  for at,txt in enumerate(self.names):
    a,z = txt[0],txt[-1]
    col = (NUM if a.isupper() else SYM)(at=at, txt=txt)
    self.all.append(col)
    if z != "X":
      (self.y if z in "!+-" else self.x).append(col)
      if z=="!": self.klass = col
      if z=="-": col.goal = 0
When you have labels, a simple and fast technique is a Bayes classifier. The probability of `x` belonging to a column is pretty simple:
@of("How much a SYM likes a value `x`.")
def like(self:SYM, x:any, prior:float) -> float:
return (self.has.get(x,0) + the.m*prior) / (self.n + the.m)
@of("How much a NUM likes a value `x`.")
def like(self:NUM, x:number, _) -> float:
= self.sd**2 + 1E-30
v = exp(-1*(x - self.mu)**2/(2*v)) + 1E-30
nom = (2*pi*v) **0.5
denom return min(1, nom/(denom + 1E-30))
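For example (a made-up SYM, not one of the columns above): if a SYM has counts has={'a':3, 'b':1} so n=4, and the prior is 0.5 with the.m=2, then like('a') = (3 + 2*0.5)/(4 + 2) = 4/6 ≈ 0.67, while like('c') = (0 + 2*0.5)/(4 + 2) ≈ 0.17; the m term stops unseen values getting a probability of zero.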
The likelihood of a row belonging to a label, given new evidence, is the prior probability of the label times the probability of the evidence. For example, if we have three oranges and six apples, then the prior on oranges is 33%.
For numerical methods reasons, we add tiny counts to the attribute and class frequencies (\(k=1,m=2\)) and treat all the values as logarithms (since these values can get real small, real fast)
@of("How much DATA likes a `row`.")
def loglike(self:DATA, r:row, nall:int, nh:int) -> float:
= (len(self.rows) + the.k) / (nall + the.k*nh)
prior = [c.like(r[c.at], prior) for c in self.cols.x if r[c.at] != "?"]
likes return sum(log(x) for x in likes + [prior] if x>0)
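As a sketch of how loglike gets used for classification (the dictionary name `datas` is mine, not from ezr.py): if each label has its own DATA, then the most likely label for a row is the one whose DATA likes it most.

```python
def classify(datas: dict, r: row) -> str:
    "Return the label whose DATA gives row `r` the highest loglike."
    nall = sum(len(d.rows) for d in datas.values())  # total examples across all labels
    nh   = len(datas)                                # number of hypotheses (labels)
    return max(datas, key=lambda label: datas[label].loglike(r, nall, nh))
```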
For more notes on Bayes, see this example
The active learner uses a Bayes classifier to guess the likelihood that an unlabeled example should be labeled next. The rows are split into a small `done` set and a much larger `todo` set. The `done`s are labeled, then ranked, then divided into \(\sqrt{N}\) best and the remaining \(N-\sqrt{N}\) rest. The `todo`s are then sorted by their probabilities of being best (B), not rest (R):
@of("active learning")
def activeLearning(self:DATA, score=lambda B,R: B-R, generate=None, faster=True ):
def ranked(rows): return self.clone(rows).chebyshevs().rows
def todos(todo):
if faster: # Apply our sorting heuristics to just a small buffer at start of "todo"
# Rotate back half of the buffer to end of list. Shift left to fill in the gap.
= the.buffer//2
n return todo[:n] + todo[2*n: 3*n], todo[3*n:] + todo[n:2*n]
else: # Apply our sorting heustics to all of todo.
return todo,[]
def guess(todo:rows, done:rows) -> rows:
= int(.5 + len(done) ** the.cut)
cut = self.clone(done[:cut]) # --------------------------------------------------- [2]
best = self.clone(done[cut:])
rest = todos(todo)
a,b if generate: # don't worry about this bit
return self.neighbors(generate(best,rest), a) + b # ----------------------------- [3]
else:
= lambda r: score(best.loglike(r, len(done), 2), rest.loglike(r, len(done), 2))
key return sorted(a, key=key, reverse=True) + b # ----------------------------------- [3]
def loop(todo:rows, done:rows) -> rows:
for k in range(the.Last - the.label):
if len(todo) < 3 : break
*todo = guess(todo, done)
top,+= [top] # ------------------------------------------------------------ [3]
done = ranked(done)
done return done
return loop(self.rows[the.label:], ranked(self.rows[:the.label])) #------------------- [1]
The default config here is `the.label=4` and `the.Last=30`; i.e. four initial evaluations, then 26 evals after that.
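The `score` argument is the acquisition function, so swapping tactics is a one-line change. A sketch (the explore formula here is just one common way to encode “go where best and rest are hardest to tell apart”; it is not claimed to be ezr.py’s default):

```python
d = DATA().adds(csv("../moot/optimize/misc/auto93.csv"))

exploit = d.activeLearning(score=lambda B,R: B - R)        # greedy: go where things look best
explore = d.activeLearning(score=lambda B,R: -abs(B - R))  # probe where best and rest disagree least

# activeLearning returns the labeled rows, ranked; smaller Chebyshev means a better row was found.
print(d.chebyshev(exploit[0]), d.chebyshev(explore[0]))
```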
TL;DR: to explore better methods for active learning, change the `guess()` function; i.e. change how we sort the `todo` items (looking only at the x values, not the y values).

Programming idioms are low-level patterns specific to a particular programming language. For example, see decorators which are a Python construct.
Idioms are small things. Bigger than idioms are patterns: elegant solutions to a recurring problem. Some folks have proposed extensive catalogs of patterns. These are worth reading. As for me, patterns are things I reuse whenever I do development in any language. This code uses many patterns (see below).
Even bigger than patterns are architectural styles: an architectural style is a high-level conceptual view of how the system will be created, organized and/or operated.
This code is pipe-and-filter. It can accept input from some prior process, or it can read a file directly. This pipe-and-filter style is important since it lets ezr.py be composed with other tools. These two calls are equivalent (since “-” denotes standard input):

python3.13 -B ezr.py -t ../moot/optimize/misc/auto93.csv -e _mqs
cat ../moot/optimize/misc/auto93.csv | python3.13 -B ezr.py -t - -e _mqs
(Aside: to see how to read from standard input or a file, see `def csv` in the source code.)
Pipe-and-filters are a very famous architectural style:
Doug McIlroy, Bell Labs, 1986: “We should have some ways of coupling programs like garden hose…. Let programmers screw in another segment when it becomes necessary to massage data in another way…. Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information.”
Pipes changed the whole idea of UNIX:
For example, my build files have help text after a `##` symbol. The following script prints a little help text describing that build script. It is a pipe between grep, sort, and awk. Note the separation of concerns (which means that now our task divides into tiny tasks, each of which can be optimized separately):
- `grep` handles feature extraction from the build file;
- `sort` rearranges the contents alphabetically;
- `gawk` handles some formatting trivia.

help: ## print help
	printf "\n#readme\nmake [OPTIONS]\n\nOPTIONS:\n"
	grep -E '^[a-zA-Z_\.-]+:.*?## .*$$' $(MAKEFILE_LIST) \
	| sort \
	| awk 'BEGIN {FS = ":.*?## "}\
	       {printf " \033[36m%-10s\033[0m %s\n", $$1, $$2}'
This produces:
% make help
#readme
make [OPTIONS]
OPTIONS:
README.md update README.md, publish
help print help
push commit to Git.
Pipes are seen in scripting environments and are used a lot in modern languages, e.g. in “R”. Compare how many gallons of gas I would need for a 75 mile trip among 4-cylinder cars:
library(dplyr)                   # load dplyr for the pipe and other tidy functions
data(mtcars)                     # load the mtcars dataset
df <- mtcars %>%                 # take mtcars. AND THEN...
      filter(cyl == 4) %>%       # filter it to four-cylinder cars, AND THEN...
      select(mpg) %>%            # select only the mpg column, AND THEN...
      mutate(car = row.names(.), # add a column for car name, and
             gallons = 75/mpg)   # gallons used on a 75 mile trip (distance divided by miles-per-gallon)
Code without documentation is poorly structured for team work. Hence, this code has much auto-documentation:
- functions have type hints and doc strings
- a help string at front (from which we parse out the config)
- worked examples (at back)
For examples of methods for adding that doco, see the `make help` command above.
Maurice Wilkes recalled the exact moment he realized the importance of debugging: “By June 1949, people had begun to realize that it was not so easy to get a program right as had at one time appeared. It was on one of my journeys between the EDSAC room and the punching equipment that the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.”
Half the time of any system is spent in testing. Lesson: don’t code it the night before.
Testing is more than just “finding bugs”. Test suites are a great way to communicate code and to offer continuous quality assurances.
Ezr.py has tests (worked examples at back); these make up about a quarter of the code base.
There is much more to say about tests. That is another story and will be told another time.
Object-oriented code is grouped by class. But some folks doubt that approach.
My code is function-oriented: methods are grouped via method name (see the `of` decorator). This makes it easier to teach related concepts (since the concepts are together in the code).
Me doing it this way was inspired by some words of Donald Knuth, who pointed out that the order in which we want to explain code may not be the same as the order needed by the compiler. So he wrote a “tangle” system where code and comments, ordered for explaining, were rejigged at load time into what the compiler needs. I found I could do a small part of Knuth’s tangle with a 5 line decorator.
Make files let us store all our little command line tricks in one convenient location. Make development was started by Stuart Feldman in 1977, while he was a Bell Labs summer intern (go interns!). It is worthy of study since it is widely available in many environments.
There are now many build tools available, for example Apache ANT, doit, and nmake for Windows. Which is best for you depends on your requirements, intended usage, and operating system. However, they all share the same fundamental concepts as Make.
Make has “rules” and the rules have three parts: target, dependents (which can be empty), and code to build the target from the dependents. For example, the following two rules have code that simplifies our interaction with git.
- `make pull` updates the local files (no big win here);
- `make push` uses `read` to collect a string explaining what a git commit is for, then stages the commit, then makes the commit, then checks for things that are not committed.

pull : ## download
	git pull

push : ## save
	echo -en "\033[33mWhy this push? \033[0m"; read x; git commit -am "$$x"; git push; git status
For rules with dependents, the target is not changed unless there are newer dependents. For example, here is the rule that made this file. Note that this process needs a bunch of scripts, a css file etc. Make will update docs/%.html if ever any of those dependents change.
docs/%.html : %.py etc/py2html.awk etc/b4.html docs/ezr.css Makefile ## make doco: md -> html
	echo "$< ... "
	gawk -f etc/py2html.awk $< \
	| pandoc -s -f markdown --number-sections --toc --toc-depth=5 \
	         -B etc/b4.html --mathjax \
	         --css ezr.css --highlight-style tango \
	         --metadata title="$<" \
	         -o $@
This means that make docs/*.html
will update all the
html files at this site. And if we call this command twice, the second
call will do nothing at all since docs/%.html
is already up
to date. This can save a lot of time during build procedures.
The makefile for any particular project can get very big. Hence, it is good practice to add an auto-document rule (see the
make help
command, above). Note that this is an example of
the all code needs doco pattern (also described above).
This code uses regular expressions in several places. For example:
- to pull the settings out of the `__doc__` string, it uses r"\n\s*-\w+\s*--(\w+).*=\s*(\S+)" (a leading “r” tells Python to define a regular expression);
- ^[ \t\n]* matches leading white space (square brackets match any one of the listed characters; star means zero or more);
- [ \t\n]*$ matches trailing white space (dollar sign means end of line);
- ^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$ matches numbers (round brackets group expressions; the vertical bar denotes “or”; “?” means zero or one).

Validation procedures are designed to counter two anti-patterns in learning. For example, the xval function (described below) randomly shuffles the order of the training data (to handle order effects) then splits the data into bins (where one bin is for testing and the others are for training):
Another validation scheme, suitable for streaming data, is to train on past data then test on some limited buffer into the future:
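A minimal sketch of that idea (this generator is mine, not a function from ezr.py): slide over the data in time order, training on everything seen so far and testing on the next few rows.

```python
def stream_val(lst: list, window: int = 20) -> Generator[rows, rows]:
    "Train on all rows seen so far; test on the next `window` rows."
    for i in range(window, len(lst), window):
        yield lst[:i], lst[i : i + window]
```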
Dunder = double underscore = “__”
Dunders define certain core functions; e.g.
- `__init__` (for initialization);
- `__repr__`, to return a printed representation of an object;
- `__dict__`, which is the magic internal variable most Python classes use to hold instance variables.

For example, here’s a quick and dirty hold-all class that can print its own contents.
class o:
  __init__ = lambda i,**d : i.__dict__.update(d)
  __repr__ = lambda i : i.__class__.__name__+str(i.__dict__)

the = o(
  seed  = 1234567891,
  round = 2,
  stats = o(cohen=0.35,
            cliffs=0.195,    # border between small=.11 and medium=.28
            bootstraps=512,
            confidence=0.05))
With the above, now our code can say things like `the.stats.cohen`.
If you want, you can even redefine `+` (`__add__`) or `==` (`__eq__`):
class Point:
  def __init__(self, x, y):
    self.x = x
    self.y = y

  def __eq__(self, other):
    return self.x == other.x and self.y == other.y

p1 = Point(1, 2)
p2 = Point(1, 2)
p3 = Point(3, 4)

print(p1 == p2)  # True
print(p1 == p3)  # False
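In the same spirit, here is a small sketch of redefining `+` via `__add__` (not code from ezr.py):

```python
class Point:
  def __init__(self, x, y):
    self.x, self.y = x, y

  def __add__(self, other):
    return Point(self.x + other.x, self.y + other.y)  # "+" now means vector addition

p = Point(1, 2) + Point(3, 4)
print(p.x, p.y)  # ==> 4 6
```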
There are so many other dunders:
Magic Method | Description |
---|---|
__init__ | Initializes a new instance of a class. |
__del__ | Called before an object is destroyed. |
__repr__ | Returns an official string representation of an object. |
__str__ | Returns a string representation of an object. |
__len__ | Returns the length of an object. |
__getitem__ | Gets an item from an object using indexing. |
__setitem__ | Sets an item in an object using indexing. |
__delitem__ | Deletes an item from an object using indexing. |
__iter__ | Returns an iterator object. |
__next__ | Returns the next item from an iterator. |
__contains__ | Checks if an object contains a value. |
__call__ | Allows an object to be called like a function. |
__eq__ | Checks if two objects are equal. |
__ne__ | Checks if two objects are not equal. |
__lt__ | Checks if one object is less than another. |
__le__ | Checks if one object is less than or equal to another. |
__gt__ | Checks if one object is greater than another. |
__ge__ | Checks if one object is greater than or equal to another. |
__add__ | Adds two objects. |
__sub__ | Subtracts one object from another. |
__mul__ | Multiplies two objects. |
__truediv__ | Divides one object by another. |
__floordiv__ | Performs floor division on two objects. |
__mod__ | Returns the remainder of division. |
__pow__ | Raises one object to the power of another. |
__and__ | Performs bitwise AND on two objects. |
__or__ | Performs bitwise OR on two objects. |
__xor__ | Performs bitwise XOR on two objects. |
__lshift__ | Performs left bitwise shift on an object. |
__rshift__ | Performs right bitwise shift on an object. |
__neg__ | Returns the negation of an object. |
__pos__ | Returns the positive value of an object. |
__abs__ | Returns the absolute value of an object. |
__invert__ | Returns the bitwise inversion of an object. |
__round__ | Rounds an object to a given number of digits. |
__bool__ | Returns the boolean value of an object. |
__hash__ | Returns the hash value of an object. |
__enter__ | Called when entering a context (with statement). |
__exit__ | Called when exiting a context (with statement). |
This code uses dataclasses. These are a great shorthand method for defining classes. All dataclasses supply their own init and pretty-print methods. For example, here is a class defined without dataclasses:
class Person():
  def __init__(self, name='Joe', age=30, height=1.85, email='joe@dataquest.io'):
    self.name = name
    self.age = age
    self.height = height
    self.email = email
But with data classes:
from dataclasses import dataclass
@dataclass
class Person():
  name   : str   = 'Joe'
  age    : int   = 30
  height : float = 1.85
  email  : str   = 'joe@dataquest.io'
print(Person(name='Tim', age=1000))
==> Person(name='Tim', age=1000, height=1.85, email='joe@dataquest.io')
In other languages, types are taken very seriously and are the basis for computation.
The Python type system was a bolt-on to later versions of the language. Hence, it is not so well-defined.
But it is a great documentation tool since it lets the programmer tell the reader what goes in and out of their functions.
Firstly, you can define your own types. For example, the type `classes` stores rows of data about (e.g.) dogs and cats in a dictionary whose keys are “dogs” and “cats”:

data = dict(dogs=[['ralph','poodle',2021],['benhi','labrador',2022]],
            cats=[['miss meow', 'ginger', 2020], etc])
We can define these `classes` as follows:

from __future__ import annotations
from typing import Any as any
from typing import List, Dict, Type, Callable, Generator

number  = float  | int
atom    = number | bool | str   # and sometimes "?"
row     = list[atom]
rows    = list[row]
classes = dict[str,rows]        # `str` is the class name
Then we can define a classifier as something that accepts `classes` and a new row and returns a guess as to what class it belongs to:
def classifier(data: classes, example: row) -> str:
...
Or, for a nearest neighbor classifier, we can define a function that sorts all the rows by their distance to some new row called `row1` as follows (and here, the nearest neighbor to `row1` is the first item in the returned rows):
def neighbors(self:DATA, row1:row, rows:rows=None) -> rows:
  return sorted(rows, key=lambda row2: self.dist(row1, row2))
(Note that the following abstractions are available in many languages. So are they a pattern? Or an idiom? I place them here since the examples are Python-specific.)
Exception handling in Python allows you to handle errors gracefully without crashing your program. The basic structure involves `try`, `except`, `else`, and `finally` blocks.
E.g. suppose you want a string to be a float, int, bool and if that
all fails, be a string. The ast
module has a method
literal_eval
that can handle most of that. So if it fails,
we just return a string
import ast

def coerce(s:str) -> atom:
  "Coerces strings to atoms."
  try: return ast.literal_eval(s)
  except Exception: return s
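For example, coerce("2.5") returns the float 2.5, coerce("True") returns the boolean True, and coerce("?") is not a Python literal, so literal_eval raises an exception and the string "?" comes back unchanged.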
Fun fact: I once spent an afternoon trying to debug something where
everything had turned to strings. Turns out, I’d forgotten to
import ast
so the try
part of the above was
always failing.
Iterators are things that do some set up, yield one thing, then wait till asked, then yield another thing, then wait till asked, then yield yet another thing, etc. They offer a simple interface to some underlying complex process.
For example, my code’s csv
function opens a file,
removes spaces from each line, skips empty lines, splits lines on a
comma, then coerces each item in the row to some Python type. Note that
this function does not return
, but it
yields
.
def csv(file) -> Generator[row]:
  infile = sys.stdin if file=="-" else open(file)
  with infile as src:
    for line in src:
      line = re.sub(r'([\n\t\r ]|#.*)', '', line)
      if line: yield [coerce(s.strip()) for s in line.split(",")]

# def coerce(s:str) -> atom:   (defined above)
We can call it this way (note the simplicity of the interface):

for row in csv(fileName):
  # row is now something like [4,86,65,80,3,2110,17.9,50]
  doSomething(row)
Here’s another that implements a cross validation test rig where learners train on some data, then test on some hold-out. The data is divided into `n` bins; each bin takes a turn as `test` and all the other bins as the rest (used for training). The following is an m-by-n cross val. That is, from `m` shufflings, yield `n` train,test set pairs. For the default values (`m=n=5`) this yields 25 train,test set pairs.
def xval(lst:list, m:int=5, n:int=5, some:int=10**6) -> Generator[rows,rows]:
  for _ in range(m):
    random.shuffle(lst)       # -------------------------------- [1]
    for n1 in range (n):
      lo = len(lst)/n * n1    # ------------------------------- [2]
      hi = len(lst)/n * (n1+1)
      train, test = [],[]
      for i,x in enumerate(lst):
        (test if i >= lo and i < hi else train).append(x)
      train = random.choices(train, k=min(len(train),some))  # --- [4]
      yield train,test        # ---------------------------------- [3]
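A sketch of driving that generator (the Chebyshev scoring here is illustrative, not a claim about how ezr.py evaluates models):

```python
d = DATA().adds(csv("../moot/optimize/misc/auto93.csv"))
results = []
for train, test in xval(d.rows, m=5, n=5):
    model = d.clone(train)                          # a new DATA with the same columns, holding the train rows
    results += [model.chebyshev(r) for r in test]   # score each hold-out row somehow
print(sum(results) / len(results))                  # pooled over the 25 train/test pairs
```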
This code makes extensive use of comprehensions. E.g. to find the middle of a cluster, ask each column for its middle point.
@of("Return central tendency of a DATA.")
def mid(self:DATA) -> row:
return [col.mid() for col in self.cols.all]
@of("Return central tendency of NUMs.")
def mid(self:NUM) -> number: return self.mu
@of("Return central tendency of SYMs.")
def mid(self:SYM) -> number: return self.mode
Comprehensions can be used to filter data:

>>> [i for i in range(10) if i % 2 == 0]
[0, 2, 4, 6, 8]
Here’s one for loading tab-separated files with optional comment lines starting with a hash mark:
data = [line.strip().split("\t") for line in open("my_file.tab") \
if not line.startswith('#')]
e.g. here are two examples of an implicit iterator in the argument to `sum`:
@of("Entropy = measure of disorder.")
def ent(self:SYM) -> number:
return - sum(n/self.n * log(n/self.n,2) for n in self.has.values())
@of("Euclidean distance between two rows.")
def dist(self:DATA, r1:row, r2:row) -> float:
= sum(c.dist(r1[c.at], r2[c.at])**the.p for c in self.cols.x)
n return (n / len(self.cols.x))**(1/the.p)
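For example, a SYM that has seen four symbols split as {a:2, b:2} has entropy -(0.5·log2(0.5) + 0.5·log2(0.5)) = 1 bit of disorder; if all four symbols were "a", the entropy drops to 0.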
E.g. here we use comprehensions to [1] build a dictionary with one entry per goal column, [2] collect (distance, value) pairs from the k nearest neighbors, then [3] ask each column to make a prediction from those pairs:
@of("Return predictions for `cols` (defaults to klass column).")
def predict(self:DATA, row1:row, rows:rows, cols=None, k=2):
= cols or self.cols.y
cols = {col.at : [] for col in cols} -- [1]
got for row2 in self.neighbors(row1, rows)[:k]:
= 1E-32 + self.dist(row1,row2)
d for col in cols] -- [2]
[got[col.at].append( (d, row2[col.at]) ) return {col.at : col.predict( got[col.at] ) for col in cols} -- [3]
The `of` decorator lets you define methods outside of a class. Here it is used to group together the `mid` (middle) and `div` (diversity) functions:
- The `mid`(ddle) of a NUMeric and a SYMbolic column are their means and modes.
- As for DATA, these hold rows which are summarized in `cols`. The `mid` of those rows is the `mid` of the summary for each column.

def of(doc):
  def doit(fun):
    fun.__doc__ = doc
    self = inspect.getfullargspec(fun).annotations['self']
    setattr(globals()[self], fun.__name__, fun)
  return doit
@of("Return central tendency of a DATA.")
def mid(self:DATA) -> row:
return [col.mid() for col in self.cols.all]
@of("Return central tendency of NUMs.")
def mid(self:NUM) -> number: return self.mu
@of("Return central tendency of SYMs.")
def mid(self:SYM) -> number: return self.mode
@of("Return diversity of a NUM.")
def div(self:NUM) -> number: return self.sd
@of("Return diversity of a SYM.")
def div(self:SYM) -> number: return self.ent()
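A quick, illustrative check of what the decorator buys us (assuming a DATA built from the auto93 file used earlier):

```python
d = DATA().adds(csv("../moot/optimize/misc/auto93.csv"))
print(d.mid())           # mid() was attached to DATA by the `of` decorator
print(DATA.mid.__doc__)  # ==> Return central tendency of a DATA.
```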
Make sure you are running Python3.13. On Linux and Github code spaces, that command is
sudo apt update -y; sudo apt upgrade -y; sudo apt install software-properties-common -y; sudo add-apt-repository ppa:deadsnakes/ppa -y ; sudo apt update -y ; sudo apt install python3.13 -y
Now check you have python3.13
python3.13 -B --version
cd /workspaces/ezr
git checkout 24Aug14
python3.13 -B ezr.py -t data/misc/auto93.csv -e _mqs
This takes a few minutes and writes its output to …
cd /workspaces/ezr
Here=$PWD
cd data/optimize/process
mkdir -p ~/tmp/mqs
for i in *.csv ; do $Here/ezr.py -D -e _mqs -t $i | tee ~/tmp/mqs/$i ; done
Here’s a file extend.py in the same directory as ezr.py:

import sys,random
from ezr import the, DATA, csv, dot

def show(lst):
  return print(*[f"{word:6}" for word in lst], sep="\t")

def myfun(train):
  d    = DATA().adds(csv(train))
  x    = len(d.cols.x)
  size = len(d.rows)
  dim  = "small" if x <= 5 else ("med" if x < 12 else "hi")
  size = "small" if size < 500 else ("med" if size < 5000 else "hi")
  return [dim, size, x, len(d.cols.y), len(d.rows), train[17:]]

random.seed(the.seed) # not needed here, but good practice to always take care of seeds

show(["dim", "size","xcols","ycols","rows","file"])
show(["------"] * 6)
[show(myfun(arg)) for arg in sys.argv if arg[-4:] == ".csv"]
On my machine, when I run …
python3.13 -B extend.py data/optimize/[chmp]*/*.csv > ~/tmp/tmp
sort -r -k 1,2 ~/tmp/tmp
… this prints some stats on the data files:
dim size xcols ycols rows file
------ ------ ------ ------ ------ ------
small small 4 3 398 misc/auto93.csv
small small 4 2 259 config/SS-H.csv
small small 3 2 206 config/SS-B.csv
small small 3 2 196 config/SS-G.csv
small small 3 2 196 config/SS-F.csv
small small 3 2 196 config/SS-D.csv
small small 3 1 196 config/wc+wc-3d-c4-obj1.csv
small small 3 1 196 config/wc+sol-3d-c4-obj1.csv
small small 3 1 196 config/wc+rs-3d-c4-obj1.csv
small med 5 2 1080 config/SS-I.csv
small med 3 2 1512 config/SS-C.csv
small med 3 2 1343 config/SS-A.csv
small med 3 2 756 config/SS-E.csv
small hi 5 3 10000 hpo/healthCloseIsses12mths0011-easy.csv
small hi 5 3 10000 hpo/healthCloseIsses12mths0001-hard.csv
med small 9 1 192 config/Apache_AllMeasurements.csv
med med 11 2 1023 config/SS-P.csv
med med 11 2 1023 config/SS-L.csv
med med 11 2 972 config/SS-O.csv
med med 10 2 1599 misc/Wine_quality.csv
med med 9 3 500 process/pom3d.csv
med med 6 2 3840 config/SS-S.csv
med med 6 2 3840 config/SS-J.csv
med med 6 2 2880 config/SS-K.csv
med med 6 1 3840 config/rs-6d-c3_obj2.csv
med med 6 1 3840 config/rs-6d-c3_obj1.csv
med med 6 1 2880 config/wc-6d-c1-obj1.csv
med med 6 1 2866 config/sol-6d-c2-obj1.csv
med hi 11 2 86058 config/SS-X.csv
med hi 9 3 20000 process/pom3c.csv
med hi 9 3 20000 process/pom3b.csv
med hi 9 3 20000 process/pom3a.csv
hi small 22 4 93 process/nasa93dem.csv
hi med 38 1 4653 config/SQL_AllMeasurements.csv
hi med 21 2 4608 config/SS-U.csv
hi med 17 5 1000 process/coc1000.csv
hi med 17 3 864 config/SS-M.csv
hi med 16 1 1152 config/X264_AllMeasurements.csv
hi med 14 2 3008 config/SS-R.csv
hi med 14 1 3456 config/HSMGP_num.csv
hi med 13 3 2736 config/SS-Q.csv
hi hi 23 4 10000 process/xomo_osp2.csv
hi hi 23 4 10000 process/xomo_osp.csv
hi hi 23 4 10000 process/xomo_ground.csv
hi hi 23 4 10000 process/xomo_flight.csv
hi hi 17 2 53662 config/SS-N.csv
hi hi 16 2 65536 config/SS-W.csv
hi hi 16 2 6840 config/SS-V.csv
hi hi 12 2 5184 config/SS-T.csv
Try modifying the output to add columns to report counts of the number of symbolic and numeric columns.
[1] Q. Zhang and H. Li, “MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition,” IEEE Transactions on Evolutionary Computation, vol. 11, no. 6, pp. 712-731, Dec. 2007, doi: 10.1109/TEVC.2007.892759.