csc 591-024, (8290)
csc 791-024, (8291)
fall 2024, special topics in computer science
Tim Menzies, timm@ieee.org, com sci, nc state
home
::
timetable
::
syllabus
::
groups
::
moodle
::
license
Install Python 31.3
- for linux
- for mac
- for windows: when you work it out, tell the class
Get the extension extension
going
- remember to use the 24Aug14 branch
- see examples here
- note that data divides into small,medium, high dimensionality
(number of x columns) and size (number of rows)
Here is a command line that runs all the current built-in examples.
Please run it
python3 -B ezr.py -D -e all -t data/optimize/misc/auto93.csv > ~/tmp/out
Run this code and answer the following questions. Write short answers
for each. Submit one set of answers per team.
- heavens d2h is short for “distance to heaven”. How
is it calculated?
- chebys : how is the cheyshev distance different to
d2h?
- likings
- in english, explain how loglike is calculated? and how is that
calculation different for numeric and symbolic columns?
- Diversity sampling means that the next thing we look out should be
different to everything seen before. So explain: “selecting for min
loglike is a synonym for diversity sampling”
- mean-vs-median
- This code recursively divides data by (a) slitting data according to
everyone’s distance to two far points; then (b) recursing into each
half.
- What is the difference between
half_median
and
half_mean
?
- Referring to this
output from
python3 ezr.py -e mean_vs_median -t data/optimize/misc/auto93.csv
does mean or median splits make a difference?
- python3 ezr.py -e clusters -t
data/optimize/misc/auto93.csv generates a tree generated via
mean splits
- reproduce the same output
- How would this tree be different if we used median splits?
- python3 ezr.py -e clusters2 -t
data/optimize/misc/auto93.csv shows the results of prediction
by (a) cluster the data (see clusters) then for each
test (b) find its nearest leaf cluster; then (c) using either the median
value of that leaf or the 1,2,3,5 nearest neighbor.
- generate that output and show it in the homeworks response. t
- Based on these results, what approach would you recommend?