Different Learners for Different Data

Let us start at the very beginning (a very good place to start). When you read you begin with A-B-C. When you mine, you begin with data.

Different kinds of data miners work best on different kinds of data. Such data may be viewed as tables of examples:

For example, in text mining, where there is one column per word and one row per document, the columns contain many missing values (since not all words appear in all documents) and there may be hundreds of thousands of columns.

On the other hand,

If a table has no goal columns, then this is an unsupervised learning problem that might be addressed by finding clusters of similar rows using, say, K-means or expectation maximization.
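As a quick sketch of that idea, here is K-means clustering over some unlabeled rows (scikit-learn, the made-up numbers, and k=3 are illustration choices, not part of the discussion above):

```python
# Minimal sketch: unsupervised clustering of rows with no goal column.
# Assumes scikit-learn; the data and k=3 are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

rows = np.array([          # each row is one example, each column one feature
    [85, 85], [80, 90], [83, 86], [70, 96],
    [68, 80], [65, 70], [64, 65], [72, 95],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(rows)
print(kmeans.labels_)           # cluster id assigned to each row
print(kmeans.cluster_centers_)  # one centroid per cluster
```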

An alternative approach, taken by the Apriori association rule learner, is to treat every column as a goal and to look for which combinations of values predict for combinations of values in the other columns.
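The following toy sketch shows just the frequent-itemset counting at the heart of that idea (pure Python; the transactions and the min_support threshold of 2 are made up, and it brute-forces the counting rather than using Apriori's candidate-pruning trick; turning frequent itemsets into "X predicts Y" rules is a further step, omitted here):

```python
# Toy sketch of the Apriori "frequent itemset" step (not the full rule generator).
# Transactions and min_support=2 are made up for illustration.
from itertools import combinations

transactions = [
    {"sunny", "hot", "no"},
    {"sunny", "hot", "no"},
    {"overcast", "hot", "yes"},
    {"rainy", "mild", "yes"},
]

def frequent_itemsets(transactions, min_support=2, max_size=3):
    frequent = {}
    for size in range(1, max_size + 1):
        counts = {}
        for t in transactions:                      # count every itemset of this size
            for items in combinations(sorted(t), size):
                counts[items] = counts.get(items, 0) + 1
        level = {i: n for i, n in counts.items() if n >= min_support}
        if not level:                               # no frequent sets left: stop growing
            break
        frequent.update(level)
    return frequent

print(frequent_itemsets(transactions))
```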

If a table has one goal, then this is a supervised learning problem where the task is to find combinations of values from the other columns that predict for the goal values. Note that for datasets with one discrete goal feature, it is common to call that goal the class of the dataset.

For example, here is a table of data for a simple data mining problem:

outlook,  temp, humid, wind,  play
----------------------------------
sunny,      85,    85, FALSE, no
sunny,      80,    90, TRUE,  no
overcast,   83,    86, FALSE, yes
rainy,      70,    96, FALSE, yes
rainy,      68,    80, FALSE, yes
rainy,      65,    70, TRUE,  no
overcast,   64,    65, TRUE,  yes
sunny,      72,    95, FALSE, no
sunny,      69,    70, FALSE, yes
rainy,      75,    80, FALSE, yes
sunny,      75,    70, TRUE,  yes
overcast,   72,    90, TRUE,  yes
overcast,   81,    75, FALSE, yes
rainy,      71,    91, TRUE,  no

In this table, we are trying to predict for the goal play, given a record of the weather.

Each row is one example where we did or did not play golf (and the goal of data mining is to find what weather predicts for playing golf).

Note that temp and humid are numeric columns and there are no missing values.

Such simple tables are characterized by just a few columns and not many rows (say, dozens to thousands).

Traditionally, such simple data mining problems have been explored with decision tree learners such as C4.5 and CART.

However, with some clever sampling of the data, it is possible to scale these traditional learners to Big Data problems.
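For example, here is a sketch of running a CART-style tree learner over the little weather table above (scikit-learn's DecisionTreeClassifier stands in for C4.5/CART; the one-hot encoding of the symbolic columns is just one reasonable choice, not something prescribed above):

```python
# Sketch: a CART-style decision tree learned from the weather table above.
# scikit-learn's DecisionTreeClassifier stands in for C4.5/CART.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "outlook": ["sunny","sunny","overcast","rainy","rainy","rainy","overcast",
                "sunny","sunny","rainy","sunny","overcast","overcast","rainy"],
    "temp":    [85,80,83,70,68,65,64,72,69,75,75,72,81,71],
    "humid":   [85,90,86,96,80,70,65,95,70,80,70,90,75,91],
    "wind":    [False,True,False,False,False,True,True,False,False,False,True,True,False,True],
    "play":    ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"],
})

X = pd.get_dummies(data.drop(columns="play"))   # symbolic columns become 0/1 columns
y = data["play"]
tree = DecisionTreeClassifier(random_state=1).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # the learned rules
```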

Y = F(X)

One way to look at a table of data is as an example of some function that computes goal columns "Y" from input columns "X".

Splits

Another way to look at a table of data is as a source of Splits.

Sym columns:

Num columns:
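The notes above leave the details implicit, but one common reading (an assumption here, not something stated above) is that a split tries to produce sub-ranges that are less diverse than the whole column: the diversity of a sym column can be measured with entropy, and of a num column with standard deviation. A tiny sketch:

```python
# Sketch (an assumption, not spelled out above): score sym columns by entropy
# and num columns by standard deviation; a good split yields sub-ranges whose
# scores are lower than the whole column's score.
import math, statistics
from collections import Counter

def entropy(symbols):
    "Diversity of a symbolic column: 0 if all values are the same."
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())

def spread(nums):
    "Diversity of a numeric column: its standard deviation."
    return statistics.pstdev(nums)

print(entropy(["yes"] * 9 + ["no"] * 5))  # the play column of the weather table
print(spread([85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]))  # temp
```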

Why Split?

Timm's rule: the best thing to do with data is to throw most of it away.

Occam's Razor

Feature Selection

The case for FSS

Repeated result: throwing out features rarely damages a theory

And, sometimes, feature removal is very useful:

Excess attributes

Why FSS?

Problem

Supervised vs Unsupervised

E.g., here is unsupervised discretization running on the horsepower column of auto.csv. Note that these numbers run 46 to 230 and [this code](http://menzies.us/lean/unsuper.html) decides that they should be divided into:

46.. 230
|.. 46.. 116
|.. |.. 46.. 82
|.. |.. |.. 46.. 65 (..65)
|.. |.. |.. 66.. 82
|.. |.. |.. |.. 66.. 72
|.. |.. |.. |.. |.. 66.. 69 (66..69)
|.. |.. |.. |.. |.. 69.. 72 (69..72)
|.. |.. |.. |.. 74.. 82 (74..82)
|.. |.. 83.. 116
|.. |.. |.. 83.. 97
|.. |.. |.. |.. 83.. 91
|.. |.. |.. |.. |.. 83.. 86 (83..86)
|.. |.. |.. |.. |.. 87.. 91
|.. |.. |.. |.. |.. |.. 87.. 89 (87..89)
|.. |.. |.. |.. |.. |.. 90.. 91 (90..91)
|.. |.. |.. |.. 92.. 97 (92..97)
|.. |.. |.. 98.. 116
|.. |.. |.. |.. 98.. 105 (98..105)
|.. |.. |.. |.. 107.. 116 (107..116)
|.. 120.. 230
|.. |.. 120.. 160
|.. |.. |.. 120.. 140 (120..140)
|.. |.. |.. 142.. 160 (142..160)
|.. |.. 165.. 230 (165..)
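The linked code is not reproduced here; as a rough sketch of the general idea, the following recursively splits a sorted column at the cut that most reduces the weighted standard deviation (the criterion, the minimum bin size, and the toy numbers are assumptions, so the cuts will not match the tree above):

```python
# Minimal sketch of unsupervised discretization of one numeric column:
# recursively split the sorted numbers at the cut that most reduces weighted
# standard deviation. The criterion and min bin size are assumptions; the real
# code at menzies.us/lean/unsuper.html may differ.
import statistics

def discretize(nums, depth=0, min_size=4):
    nums = sorted(nums)
    print("|.. " * depth + f"{nums[0]}.. {nums[-1]}")
    if len(nums) < 2 * min_size:            # too small to split further
        return
    best, cut = None, None
    for i in range(min_size, len(nums) - min_size):
        left, right = nums[:i], nums[i:]
        score = (len(left) * statistics.pstdev(left) +
                 len(right) * statistics.pstdev(right)) / len(nums)
        if best is None or score < best:    # remember the best cut so far
            best, cut = score, i
    if best is not None and best < statistics.pstdev(nums):
        discretize(nums[:cut], depth + 1, min_size)
        discretize(nums[cut:], depth + 1, min_size)

discretize([46, 65, 72, 88, 90, 95, 105, 110, 130, 150, 165, 200, 230])
```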

FSS types:

Hall and Holmes:

This paper: pre-discretize numerics using entropy.
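As a sketch of what entropy-based discretization means, the code below finds the single cut in a numeric column that minimizes the weighted class entropy of the two resulting bins (real methods such as Fayyad-Irani recurse on each bin and add an MDL stopping test, which is omitted here):

```python
# Sketch of supervised, entropy-based discretization: find the one cut in a
# numeric column that minimizes the weighted class entropy of the two bins.
import math
from collections import Counter

def entropy(classes):
    n = len(classes)
    return -sum(c / n * math.log2(c / n) for c in Counter(classes).values())

def best_cut(nums, classes):
    pairs = sorted(zip(nums, classes))      # sort the (number, class) pairs
    n = len(pairs)
    best, cut = None, None
    for i in range(1, n):
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best:
            best, cut = score, (pairs[i - 1][0] + pairs[i][0]) / 2
    return cut, best

temp = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
play = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
print(best_cut(temp, play))                 # cut point and its weighted entropy
```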

Hall & Holmes

INFO GAIN

RELIEF

CBS (consistency-based evaluation)

WRAPPER

PRINCIPAL COMPONENTS ANALYSIS (PCA)

(The traditional way to do FSS.)
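For contrast with the selectors above, here is a minimal PCA sketch (scikit-learn is assumed; note that PCA builds new synthetic columns rather than selecting original ones):

```python
# Sketch: PCA as a dimensionality reducer. scikit-learn is assumed; the data
# is just the numeric part of the weather table, for illustration only.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[85, 85], [80, 90], [83, 86], [70, 96], [68, 80], [65, 70],
              [64, 65], [72, 95], [69, 70], [75, 80], [75, 70], [72, 90],
              [81, 75], [71, 91]])          # temp, humid columns

pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)        # variance kept by one component
print(pca.transform(X)[:3])                 # first rows projected onto it
```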

CFS (correlation-based feature selection)

And the winner is:

Instance Selection

Prototype Selection with Clusters

Reduction of 800 rows by 24 attributes down to 22 rows by 5 attributes

For many data sets:

Note that, for classification by weighted scores from the 2 nearest neighbors, the reduced data is as accurate as the full data.
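A minimal sketch of that kind of check: cluster each class's rows, keep the cluster centroids as prototypes, then compare a distance-weighted 2-nearest-neighbor classifier trained on the full data against one trained only on the prototypes (scikit-learn, the iris data, and three prototypes per class are placeholders, not the 800-row data set referred to above):

```python
# Sketch: prototype selection with clusters, then a weighted 2-NN check.
# K-means centroids stand in for the selected prototypes; the data set,
# cluster count, and split are placeholders, not the 800x24 table above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Build a few prototypes per class by clustering that class's training rows.
protos, proto_labels = [], []
for label in set(y_train):
    rows = X_train[y_train == label]
    km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(rows)
    protos.extend(km.cluster_centers_)
    proto_labels.extend([label] * 3)

full  = KNeighborsClassifier(n_neighbors=2, weights="distance").fit(X_train, y_train)
small = KNeighborsClassifier(n_neighbors=2, weights="distance").fit(protos, proto_labels)
print(full.score(X_test, y_test), small.score(X_test, y_test))  # compare accuracy
```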