Antinous producers
In addition to providing a general way to plug Jython code into PFA applications, Antinous produces models. Only k-means has been implemented.
Antinous producers adhere to the following suite of abstract interfaces in `com.opendatagroup.antinous.producer`.
- A `Dataset` is a source of training data, filled in Jython and used by the producer to make a model. It has at least these methods:
  - `revert(): Unit` empties the `Dataset`
- A `Model` is what a producer makes, something that can be converted into PFA. It has at least these methods:
  - `pfa: AnyRef` makes a PFA cell or pool item representing the model using `JsonObject`, `JsonArray`, and primitive types
  - `pfa(options: java.util.Map[String, AnyRef]): AnyRef` makes PFA with options (probably coming from Jython)
  - `avroType: AvroType` declares the Avro type of the PFA cell or pool item
- A `ModelRecord` extends `Model` and Scala's `Product` so that it can be a case class
- A `Producer[D <: Dataset, M <: Model]` uses a `Dataset` to produce a `Model`. It has at least these methods:
  - `dataset: D` the dataset
  - `optimize(): Unit` updates the state of the producer in-place to improve the model (possibly many times)
  - `model: M` gets the current state of the model
- A `JsonObject[X]` is a `java.util.Map[String, X]` for representing `Model` data as PFA
- A `JsonArray[X]` is a `java.util.List[X]` for representing `Model` data as PFA
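To make the pairing of `pfa` and `avroType` concrete, here is a minimal pure-Python sketch of how the two pieces might be combined into a PFA cell. It is not Antinous code. The helpers `cluster_avro_type` and `make_cell`, the record name `Cluster`, the cell name `clusters`, and the sample values are all hypothetical, and the sketch assumes that the value returned by `pfa` belongs in the cell's `init` field and the result of `avroType` in its `type` field. Plain `dict` and `list` stand in for `JsonObject` and `JsonArray`.

```python
import json

def cluster_avro_type():
    """Hypothetical Avro type for a list of clusters (center plus weight)."""
    return {
        "type": "array",
        "items": {
            "type": "record",
            "name": "Cluster",
            "fields": [
                {"name": "center", "type": {"type": "array", "items": "double"}},
                {"name": "weight", "type": "double"},
            ],
        },
    }

def make_cell(avro_type, init):
    """Combine a type and an initial value into a PFA cell definition."""
    return {"type": avro_type, "init": init}

# Stand-in for what model.pfa() might return: nested maps, lists, primitives.
clusters = [
    {"center": [0.0, 1.0], "weight": 12.0},
    {"center": [5.0, -2.0], "weight": 7.0},
]

cell = make_cell(cluster_avro_type(), clusters)
cell_json = json.dumps({"cells": {"clusters": cell}}, sort_keys=True)
print(cell_json)
```

Keeping the type and the data as two separate values mirrors the split between `avroType` and `pfa` in the `Model` interface.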
The package also has a random number seed, which is used to randomize all producer algorithms. It can be set via `setRandomSeed(x: Long)`.
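As an illustration of why a settable seed is useful, the following plain-Python sketch shows that seeding a random generator makes the random choice of initial cluster centers repeatable. It does not call Antinous; `pick_initial_centers` and the sample points are hypothetical, and Python's `random.Random` stands in for the package-level generator.

```python
import random

def pick_initial_centers(points, k, seed):
    """Choose k distinct points as starting centers, reproducibly."""
    rng = random.Random(seed)  # private generator; global state is untouched
    return rng.sample(points, k)

points = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0], [9.0, 0.0]]

first = pick_initial_centers(points, 2, seed=12345)
second = pick_initial_centers(points, 2, seed=12345)
print(first == second)  # the same seed yields the same choice
```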
The usual procedure is to create a concrete `Dataset` in the global Jython namespace and fill it in the `action` phase, then create a `Producer` from that `Dataset`, run `optimize()` to make a `Model`, and `emit` PFA in the `end` phase.
Here is an example that builds a k-means clustering model for one key in a Hadoop reducer (one segment of the whole model).

    from antinous import *
    from com.opendatagroup.antinous.producer.kmeans import VectorSet, KMeans

    input = record(key = string, value = array(double))
    output = record(segment = string,
                    clusters = array(record(center = array(double),
                                            weight = double)))

    # Global state shared between the action and end phases
    segment = None
    vectorSet = VectorSet()

    def action(input):
        # Collect every vector for this reducer key into the dataset
        global segment, vectorSet
        segment = input.key
        vectorSet.add(input.value)

    def end():
        # Fit three clusters and emit them, unless no input arrived
        if segment is not None:
            kmeans = KMeans(3, vectorSet)
            kmeans.optimize()
            emit({"segment": segment, "clusters": kmeans.model().pfa()})

In package `com.opendatagroup.antinous.producer.kmeans`,

- `VectorSet` is a `Dataset` with an `add(pos: java.lang.Iterable[Double], weight: Double)` method for adding points with optional weights.
- `ClusterSet(clusters: java.util.List[Cluster])` is a `Model`
- `Cluster(center: java.util.List[Double], weight: Double, covariance: java.util.List[java.util.List[Double]])` is a `ModelRecord` that takes options
  - `weight`: if true, show the weight
  - `covariance`: if true, show the covariance
  - `totalVariance`: if true, show the total variance
  - `determinant`: if true, show the determinant
  - `limitDimensions`: if a list of integers, only present the dimensions specified in `covariance`, `totalVariance`, and `determinant`
- `KMeans(numberOfClusters: Int, dataset: VectorSet)` is a `Producer[VectorSet, ClusterSet]` with the following methods:
  - `model: ClusterSet`
  - `metric: Metric` and `setMetric(m: Metric)`
  - `stoppingCondition: StoppingCondition` and `setStoppingCondition(s: StoppingCondition)`
  - `randomClusters()`: pick random initial clusters (done automatically by constructor)
  - `optimize()` and `optimize(subsampleSize: Int)`: perform k-means (on a random subset when `subsampleSize` is given), using the `metric` and stopping when the `stoppingCondition` is met
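For readers unfamiliar with what an `optimize()` call on a k-means producer computes, here is a minimal pure-Python sketch of Lloyd's algorithm, the standard k-means iteration. It is a stand-in for illustration and not Antinous's implementation: the function `kmeans` and its parameters (`max_iterations`, `threshold`, `subsample_size`, `seed`) are hypothetical names, and details such as the handling of empty clusters are assumptions.

```python
import random

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, metric=squared_euclidean, max_iterations=100,
           threshold=1e-15, subsample_size=None, seed=0):
    """Lloyd's algorithm; returns a list of {"center", "weight"} dicts.

    max_iterations must be at least 1.
    """
    rng = random.Random(seed)
    if subsample_size is not None:
        points = rng.sample(points, subsample_size)  # fit on a random subset
    centers = [list(p) for p in rng.sample(points, k)]  # random initial clusters

    for iteration in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: metric(p, centers[i]))
            groups[nearest].append(p)

        # Update step: move each center to the mean of its group.
        changes = []
        for i, group in enumerate(groups):
            if group:
                new = [sum(col) / len(group) for col in zip(*group)]
            else:
                new = centers[i]  # empty cluster: leave the center in place
            changes.append([n - o for n, o in zip(new, centers[i])])
            centers[i] = new

        # Stop once every coordinate of every center moved less than threshold.
        if all(abs(c) < threshold for row in changes for c in row):
            break

    return [{"center": c, "weight": float(len(g))}
            for c, g in zip(centers, groups)]

data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
model = kmeans(data, 2)
print(sorted(cluster["weight"] for cluster in model))
```

The `metric` argument and the threshold test play the roles that `setMetric` and `setStoppingCondition` play on the producer, and `subsample_size` corresponds loosely to `optimize(subsampleSize)`.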
Metrics adhere to interface `Metric` and can be constructed with:

- `Euclidean`
- `SquaredEuclidean`
- `Chebyshev`
- `Taxicab`
- `Minkowski(p: Double)`
- `M(f: PyFunction)` where `f` is any Jython function that takes two Python lists of numbers
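The named metrics correspond to well-known distance functions. The sketch below writes out the textbook definitions as plain Python functions of two lists of numbers, which is the shape of function that `M(f)` is described as accepting. Whether the built-in Antinous metrics match these definitions in every detail is an assumption, and the lower-case function names are hypothetical.

```python
import math

def euclidean(x, y):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Euclidean distance without the square root (cheaper, same ordering)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def chebyshev(x, y):
    """Largest difference along any single coordinate."""
    return max(abs(a - b) for a, b in zip(x, y))

def taxicab(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(p):
    """Return the Minkowski metric of order p as a two-argument function."""
    def metric(x, y):
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
    return metric

x, y = [0.0, 0.0], [3.0, 4.0]
print(euclidean(x, y), squared_euclidean(x, y), chebyshev(x, y), taxicab(x, y))
```

Minkowski distance generalizes two of the others: order 1 gives the taxicab distance and order 2 gives the Euclidean distance.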
Stopping conditions adhere to interface `StoppingCondition` and can be constructed with:

- `MaxIterations(max: Int)` triggers when the iteration number reaches or exceeds a given maximum
- `Moving` triggers when all changes are below a threshold of 1e-15
- `BelowThreshold(threshold: Double)` presumably triggers when all changes are below a given threshold (by analogy with `Moving` and `HalfBelowThreshold`)
- `HalfBelowThreshold(threshold: Double)` triggers when half the clusters' changes are below a given threshold
- `WhenAll(conditions: java.lang.Iterable[StoppingCondition])` triggers when all subconditions are met
- `WhenAny(conditions: java.lang.Iterable[StoppingCondition])` triggers when any subcondition is met
- `PrintValue(numberFormat: String = "%g")` does not actually stop iteration, but prints out the current values
- `PrintValue(numberFormat: String = "%g")` does not actually stop iteration, but prints out the last changes
- `S(f: PyFunction)` where `f` is a Python function that takes
  - iteration number (`int`)
  - model (`ClusterSet`)
  - changes (`list` of `list`s of numbers)
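A custom stopping condition is described as a function of the iteration number, the model, and the changes. The sketch below shows plain-Python functions with that three-argument shape, including combinators in the spirit of `WhenAll` and `WhenAny`. Several points are assumptions rather than documented behavior: that a true return value means "stop", that `changes` holds one row per cluster with one entry per coordinate, and that "half" means at least half of the clusters. All function names are hypothetical, and `None` stands in for the model.

```python
def max_iterations(limit):
    def condition(iteration, model, changes):
        return iteration >= limit
    return condition

def below_threshold(threshold):
    def condition(iteration, model, changes):
        # Every coordinate of every cluster moved less than the threshold.
        return all(abs(c) < threshold for row in changes for c in row)
    return condition

def half_below_threshold(threshold):
    def condition(iteration, model, changes):
        # Count clusters whose every coordinate moved less than the threshold.
        settled = sum(1 for row in changes
                      if all(abs(c) < threshold for c in row))
        return 2 * settled >= len(changes)
    return condition

def when_all(conditions):
    def condition(iteration, model, changes):
        return all(c(iteration, model, changes) for c in conditions)
    return condition

def when_any(conditions):
    def condition(iteration, model, changes):
        return any(c(iteration, model, changes) for c in conditions)
    return condition

# Stop after 50 iterations or once all clusters have settled.
stop = when_any([max_iterations(50), below_threshold(1e-6)])
changes = [[1e-9, -1e-8], [0.2, 0.0]]  # first cluster settled, second still moving
print(stop(3, None, changes), stop(50, None, changes))
print(half_below_threshold(1e-6)(3, None, changes))
```

Building conditions as closures keeps each one a simple three-argument function, so they compose without any shared state.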
Licensed under the Hadrian Personal Use and Evaluation License (PUEL).