csc 591-024, (8290)
csc 791-024, (8291)
fall 2024, special topics in computer science
Tim Menzies, timm@ieee.org, com sci, nc state



Homework 2


What to hand in

Output from Part 8, plus answers to Questions 9, 10, 11, 12, 13, 14, 15, 16. All in one PDF file.

Note

The following works for Linux and Mac. Windows users, please join the discussion at windows. Or skip Windows altogether and use Codespaces on GitHub.

(If using Codespaces on GitHub, carefully monitor your monthly costs using [this link](https://docs.github.com/en/billing/managing-billing-for-github-codespaces/viewing-your-github-codespaces-usage).)


  1. Update your code from http://github.com/timm/ezr. Remember to use the 24Aug14 branch. Do not edit my files. Instead, write an extension (see last week's notes on how to write an extension).
git clone https://github.com/timm/ezr/
cd ezr
git checkout 24Aug14
  2. Install the htop monitor so you can track executions (Windows users may have an equivalent tool).

  3. Generate a run file for the “clusters2” action.

mkdir -p ~/tmp/clusters2
make Act=clusters2 actb4 > ~/tmp/clusters2.sh
  4. Edit that file. Using the “#” character, comment out every line EXCEPT those that mention SS-*.csv. Also comment out the lines that mention SS-W, SS-X, and SS-N (since these are slow to run).

  5. Run that file. You will not see any output for a minute or two.

bash ~/tmp/clusters2.sh
  6. Monitor the execution with “htop -F Python” or some equivalent.

  7. If, after 30 minutes of elapsed time, there are still jobs running, kill them.
    • If you kill jobs, delete any output file corrupted by that kill (i.e. check all the ~/tmp/clusters2/*.csv files, looking for strange last lines).
  8. Generate a report of all the runs: go to the output directory and type the following commands (grep selects the rows tagged k1, k2, k3, k5, and mid; cut extracts the third comma-separated field; sort orders each column numerically; paste displays the five columns side by side):
% cd ~/tmp/clusters2
% grep k1 *.csv  | cut -d, -f 3 | sort -n > /tmp/k1
% grep k2 *.csv  | cut -d, -f 3 | sort -n > /tmp/k2
% grep k3 *.csv  | cut -d, -f 3 | sort -n > /tmp/k3
% grep k5 *.csv  | cut -d, -f 3 | sort -n > /tmp/k5
% grep mid *.csv | cut -d, -f 3 | sort -n > /tmp/mid
% paste /tmp/k1 /tmp/k2 /tmp/k3 /tmp/k5 /tmp/mid

The results should look like this. Here, we guess a goal value by clustering the data, then, for each test instance, (a) finding its relevant cluster and (b) using either the k=1,2,3,5 closest neighbors in that cluster, or the mid-point of that cluster.

  k=1     k=2     k=3     k=5     mid
 =====   =====   =====   =====   =====
 -0.42   -0.21   -0.34   -0.28   -0.92
 -0.07   -0.10   -0.16   -0.24   -0.62
 -0.03   -0.03   -0.09   -0.19   -0.36
 -0.02   -0.02   -0.05   -0.19   -0.22
 -0.01   -0.01   -0.03   -0.18   -0.18
  0.00    0.00   -0.02   -0.06   -0.17
  0.00    0.00   -0.01   -0.05   -0.13
  0.00    0.00   -0.01   -0.02   -0.12
  0.00    0.00   -0.01   -0.01   -0.12
  0.00    0.00    0.00    0.00   -0.11
  0.00   -0.00    0.00    0.00   -0.11
  0.00   -0.00    0.00   -0.00   -0.08
 -0.00   -0.00    0.00   -0.00   -0.07
 -0.00   -0.00    0.00   -0.00   -0.02
 -0.00   -0.00    0.00   -0.00   -0.02
 -0.00    0.01    0.00    0.01   -0.02
 -0.00    0.01   -0.00    0.01   -0.01
  0.01    0.01   -0.00    0.03    0.03
  0.01    0.03    0.01    0.04    0.03
  0.01    0.05    0.01    0.05    0.07
  0.02    0.05    0.01    0.05    0.20

These are all “z” scores; i.e. (x - mid)/sd, where mid is the central tendency and sd is the standard deviation.
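
To ground the numbers above, here is a minimal sketch of the two prediction styles being compared, plus the z-score used for reporting. This is an illustration only, not ezr.py's actual predict code; all function names are hypothetical, rows are assumed to be lists whose last column is the single goal, and a plain Euclidean distance is assumed.

import math

def dist(row1, row2):
    # assumed: plain Euclidean distance over the independent columns
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))

def knn_predict(cluster, row, k):
    # (a) given the test row's relevant cluster, (b) average the goal
    # of the k rows closest to that test row
    nearest = sorted(cluster, key=lambda r: dist(r[:-1], row[:-1]))[:k]
    return sum(r[-1] for r in nearest) / len(nearest)

def mid_predict(cluster):
    # ... or just use the goal value at the cluster's mid-point
    # (here approximated as the cluster's mean goal)
    return sum(r[-1] for r in cluster) / len(cluster)

def zscore(x, mid, sd):
    # how the table above reports results: (x - mid) / sd
    return (x - mid) / sd

Since the table's entries are already in standard-deviation units, Cohen's “small effect” test (see the questions below) reduces to asking whether a difference is less than 0.35.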

Questions (to hand in)

  9. clustering:
    • How does the predict function in ezr.py make a prediction (predict is called inside the clusters2 function)?
    • What might be faster to run? k-th nearest neighbors in all the data? k-th nearest neighbors in a leaf cluster? Or just using the value of the mid-point of each cluster? Justify your answer.
    • What might be more accurate (for prediction)? k-th nearest neighbors in all the data? k-th nearest neighbors in a leaf cluster? Or just using the value of the mid-point of each cluster? Justify your answer.
    • Check your last answer as follows.
      According to Cohen, differences less than 35% of a standard deviation are a “small effect” (i.e. trivial). After discounting small effects, what do you observe in Part 8's output about the relative accuracy of prediction using k=1,2,3,5 neighbors (or just the mid-point)?
  10. regression, classification:
  11. multi-objective:
  12. normalization: look at the code for “def chebyshev”. Why do we normalize the goal values? (See the distance sketch after the data table below.)
  13. entropy:
  14. standard deviation:
  15. distance: define and distinguish the Euclidean distance from the Chebyshev distance (again, see the sketch after the data table below).
  16. Bayes Classifier: Ignoring any low-frequency factors (like m or k), compute the following. Show all working (a sketch of the general mechanics follows the data table):
    • prior(no)
    • prior(yes)
    • Looking at data from just the relevant column and the play! column, do you predict yes or no for the following? Show all working.
      • outlook=overcast
      • humidity=normal
outlook   ,temperature ,humidity   ,windy   ,play!
=======    ===========  ========    =====    =====
sunny     ,hot         ,high       ,FALSE   ,no
sunny     ,hot         ,high       ,TRUE    ,no
rainy     ,cool        ,normal     ,TRUE    ,no
sunny     ,mild        ,high       ,FALSE   ,no
rainy     ,mild        ,high       ,TRUE    ,no
overcast  ,hot         ,high       ,FALSE   ,yes
rainy     ,mild        ,high       ,FALSE   ,yes
rainy     ,cool        ,normal     ,FALSE   ,yes
overcast  ,cool        ,normal     ,TRUE    ,yes
sunny     ,cool        ,normal     ,FALSE   ,yes
rainy     ,mild        ,normal     ,FALSE   ,yes
sunny     ,mild        ,normal     ,TRUE    ,yes
overcast  ,mild        ,high       ,TRUE    ,yes
overcast  ,hot         ,normal     ,FALSE   ,yes
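
For the normalization and distance questions (12 and 15), the following sketch contrasts the two distance functions. These helpers are hypothetical (they are not ezr.py's code); note how min-max normalization to 0..1 stops any one column, measured on a large raw scale, from dominating either distance.

import math

def norm(x, lo, hi):
    # min-max normalize to 0..1 (lo, hi = the column's min and max);
    # the tiny constant avoids division by zero
    return (x - lo) / (hi - lo + 1e-32)

def euclidean(xs, ys):
    # square root of the sum of the squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

def chebyshev(xs, ys):
    # the single largest difference along any one dimension
    return max(abs(x - y) for x, y in zip(xs, ys))

Since Chebyshev only ever reports the worst single dimension, unnormalized goals would let whichever goal happens to have the largest raw range decide the whole distance.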
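
For the Bayes question (16), this sketch shows the mechanics of the computation with no low-frequency (m or k) corrections. The names are hypothetical and the arithmetic on the table above is left for you to show by hand.

def prior(rows, klass):
    # fraction of all rows with that class, e.g. P(play!=yes)
    return sum(1 for r in rows if r["play!"] == klass) / len(rows)

def like(rows, col, value, klass):
    # P(col=value | play!=klass), counted within rows of that class
    klass_rows = [r for r in rows if r["play!"] == klass]
    return sum(1 for r in klass_rows if r[col] == value) / len(klass_rows)

def predict(rows, col, value):
    # pick the class that maximizes prior * likelihood
    # (the evidence term is the same for both classes, so it cancels)
    return max(("yes", "no"),
               key=lambda k: prior(rows, k) * like(rows, col, value, k))

Here rows would be the table above read as a list of dicts, e.g. {"outlook": "sunny", ..., "play!": "no"}; predict(rows, "outlook", "overcast") then compares P(yes)*P(overcast|yes) against P(no)*P(overcast|no).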