Basic random forest
Download and install Titus. This article was tested with Titus 0.7.1; newer versions should work with no modification. Python >= 2.6 and < 3.0 is required.
Launch a Python prompt and import the following:
Python 2.7.6
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> import csv
>>> import math
>>>
>>> import titus.prettypfa
>>> from titus.genpy import PFAEngine
And download the Iris dataset.
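The loading code below expects iris.csv to have a header row and class labels like Iris-setosa. If you need to build such a file yourself, here is one way to do it, assuming the UCI repository's copy of the dataset (the URL and this helper script are assumptions, not part of the original article):

# A minimal sketch for fetching the Iris data; assumes the UCI mirror is available.
import urllib

urllib.urlretrieve(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    "iris-raw.csv")

with open("iris.csv", "w") as out:
    out.write("sepal_length,sepal_width,petal_length,petal_width,class\n")  # header row
    for line in open("iris-raw.csv"):
        if line.strip():   # the UCI file ends with a blank line
            out.write(line)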
Scoring a random forest involves scoring every tree and reporting the most frequent result (a.mode computes the majority vote). The tree-scoring process is identical to the decision tree example. It is also common to report the fraction of trees that voted for each class; a.count and a.len can compute such a fraction (a variant that does this is sketched after the code block below).
pfaDocument = titus.prettypfa.jsonNode('''
types:
  Input = record(Input,
                 sepal_length: double,
                 sepal_width: double,
                 petal_length: double,
                 petal_width: double);
  TreeNode = record(TreeNode,
                    field: enum([sepal_length, sepal_width, petal_length, petal_width]),
                    operator: string,
                    value: double,
                    pass: union(string, TreeNode),
                    fail: union(string, TreeNode))
input: Input
output: string
cells:
  forest(array(TreeNode)) = []
action:
  var scores = a.map(forest, fcn(tree: TreeNode -> string) {
    model.tree.simpleWalk(input, tree, fcn(d: Input, t: TreeNode -> boolean) {
      model.tree.simpleTest(d, t)
    })
  });
  a.mode(scores)   // majority vote
''')
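As mentioned above, you can also report how strong the vote was. The following is a minimal sketch, not part of the original example, of a scoring engine whose output is the fraction of trees agreeing with the majority, computed with a.count and a.len; the types and tree-walking action mirror the document above.

pfaDocument2 = titus.prettypfa.jsonNode('''
types:
  Input = record(Input,
                 sepal_length: double,
                 sepal_width: double,
                 petal_length: double,
                 petal_width: double);
  TreeNode = record(TreeNode,
                    field: enum([sepal_length, sepal_width, petal_length, petal_width]),
                    operator: string,
                    value: double,
                    pass: union(string, TreeNode),
                    fail: union(string, TreeNode))
input: Input
output: double
cells:
  forest(array(TreeNode)) = []
action:
  var scores = a.map(forest, fcn(tree: TreeNode -> string) {
    model.tree.simpleWalk(input, tree, fcn(d: Input, t: TreeNode -> boolean) {
      model.tree.simpleTest(d, t)
    })
  });
  var winner = a.mode(scores);
  // fraction of trees that voted for the winner
  a.count(scores, fcn(x: string -> boolean) { x == winner }) / a.len(scores)
''')

In PFA, the / operator performs floating-point division, so dividing the integer results of a.count and a.len yields a double, matching the declared output type.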
The following code is a literal copy of the decision tree builder with tweaks to turn it into a random forest. The CART decision tree algorithm optimizes its choice of field to split on and the cut point; random forests randomize the choice of field and optimize the cut point for that field. They also stop building after a certain depth (to make an ensemble of weak classifiers).
class Datum(object):
    """One row of the Iris dataset."""
    def __init__(self, sepal_length, sepal_width, petal_length, petal_width, truth):
        self.sepal_length = float(sepal_length)
        self.sepal_width = float(sepal_width)
        self.petal_length = float(petal_length)
        self.petal_width = float(petal_width)
        self.truth = truth.replace("Iris-", "")

def attemptSplit(field, data):
    # Scan all candidate cut points on this field and keep the best one.
    bestGain = None
    bestSplit = None
    values = sorted(set(getattr(x, field) for x in data))
    # candidate cut points are midpoints between consecutive observed values
    splits = [(low + high)/2.0 for low, high in zip(values[:-1], values[1:])]
    for split in splits:
        gainPass = 0.0
        gainFail = 0.0
        for truth in "setosa", "versicolor", "virginica":
            # entropy term for the subset that passes the test (field < split)
            numer = sum(x.truth == truth for x in data if getattr(x, field) < split)
            denom = sum(getattr(x, field) < split for x in data)
            if numer > 0.0:
                p = numer / float(denom)
                gainPass -= p * math.log(p, 2)
            # entropy term for the subset that fails the test (field >= split)
            numer = sum(x.truth == truth for x in data if getattr(x, field) >= split)
            denom = sum(getattr(x, field) >= split for x in data)
            if numer > 0.0:
                p = numer / float(denom)
                gainFail -= p * math.log(p, 2)
        gain = gainPass + gainFail
        if bestGain is None or gain > bestGain:
            bestGain = gain
            bestSplit = split
    if bestGain is None:
        return 0.0, 0.0
    else:
        return bestGain, bestSplit

def predominant(data):
    # Majority class among the data that reach this leaf.
    setosa = sum(x.truth == "setosa" for x in data)
    versicolor = sum(x.truth == "versicolor" for x in data)
    virginica = sum(x.truth == "virginica" for x in data)
    if setosa > versicolor and setosa > virginica:
        return "setosa"
    elif versicolor > setosa and versicolor > virginica:
        return "versicolor"
    else:
        return "virginica"

def makeTree(data, depth):
    # Random forest twist: pick the field at random, optimize only the cut point.
    field = random.choice(["sepal_length", "sepal_width", "petal_length", "petal_width"])
    gain, split = attemptSplit(field, data)
    dataPass = [x for x in data if getattr(x, field) < split]
    dataFail = [x for x in data if getattr(x, field) >= split]
    if depth == 0:
        # at maximum depth, emit leaves labeled with the majority class
        branchPass = {"string": predominant(dataPass)}
        branchFail = {"string": predominant(dataFail)}
    else:
        branchPass = makeTree(dataPass, depth - 1)
        branchFail = makeTree(dataFail, depth - 1)
    # tagged unions, formatted the way PFA expects them
    return {"TreeNode": {"field": field,
                         "operator": "<",
                         "value": split,
                         "pass": branchPass,
                         "fail": branchFail}}
iris = csv.reader(open("iris.csv"))
iris.next()   # skip the header row
dataset = [Datum(*x) for x in iris]
forest = [makeTree(dataset, 5)["TreeNode"] for x in xrange(100)]   # 100 trees of depth 5
Just as with the decision tree example, the trees are already formatted for PFA.
pfaDocument["cells"]["forest"]["init"] = forest
Depending on how your random number generator is seeded, this forest correctly identifies about 95% of the Iris dataset. Compare with the ruleset example (95%) and the overtrained decision tree example (100%).
engine, = PFAEngine.fromJson(pfaDocument)

iris = csv.reader(open("iris.csv"))
iris.next()   # skip the header row

numer = 0
denom = 0
for sepal_length, sepal_width, petal_length, petal_width, truth in iris:
    # csv.reader yields strings, so convert to float for the double-typed fields
    result = engine.action({"sepal_length": float(sepal_length),
                            "sepal_width": float(sepal_width),
                            "petal_length": float(petal_length),
                            "petal_width": float(petal_width)})
    print result, "vs", truth.replace("Iris-", "")
    if result != truth.replace("Iris-", ""):
        numer += 1
    denom += 1
print "Fraction mislabeled:", float(numer) / float(denom)
Return to the Hadrian wiki table of contents.
Licensed under the Hadrian Personal Use and Evaluation License (PUEL).