
Overview

Mixins are additive likelihood functions that allow tailoring of a learning algorithm. The most familiar examples are L1 and L2 regularization. When used with a gradient-based optimizer, the L1 and L2 gradients are simply multiples of the signs of the model coefficients (L1) or of the coefficients themselves (L2). L1 gradients are discontinuous, which makes it necessary to numerically integrate along constraint surfaces to obtain an exact optimum. In a stochastic gradient framework, however, simply adding the L1 gradient and allowing the model coefficients to hunt around zero works quite well, and this is what BIDMach currently implements. Mixins must implement two methods:

    def compute(mats:Array[Mat], step:Float)
    def score(mats:Array[Mat], step:Float):FMat

The first computes a gradient matrix and adds it to the update (total gradient) for each model matrix. The second computes the score, or likelihood, for that mixin. The score is broken into k parts when there are k model matrices being updated.
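
To make this concrete, below is a minimal sketch of an L1-style mixin built around these two methods. It is illustrative only: the modelmats and updatemats arrays (the model's coefficient matrices and its accumulated gradient updates) are passed in explicitly here, whereas BIDMach's actual mixin classes obtain them from the model, and the sign and scaling conventions are simplified.

    import BIDMat.{Mat, FMat}
    import BIDMat.MatFunctions._
    import BIDMat.SciFunctions._

    // Sketch of an L1 mixin: weight plays the role of reg1weight.
    class L1Sketch(weight:Float, modelmats:Array[FMat], updatemats:Array[FMat]) {

      // Add the L1 penalty gradient, -weight * sign(coefficient), to the
      // update for each model matrix (step is unused in this sketch).
      def compute(mats:Array[Mat], step:Float) = {
        for (i <- 0 until modelmats.length) {
          updatemats(i) = updatemats(i) - sign(modelmats(i)) * weight
        }
      }

      // One score per model matrix: the L1 norm of its coefficients.
      def score(mats:Array[Mat], step:Float):FMat = {
        val s = zeros(modelmats.length, 1)
        for (i <- 0 until modelmats.length) {
          s(i) = sum(sum(abs(modelmats(i))), 2).dv.toFloat
        }
        s
      }
    }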

Regularizers

As mentioned, regularizers implement a simple additive gradient based on the current model coefficients. L1 and L2 regularizers are provided, each with its own weight option: reg1weight and reg2weight respectively. The scores for these regularizers are the L1 norm and the squared L2 norm of the coefficients, respectively.
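
The L2 regularizer has the same shape. Here is a sketch mirroring the L1 example above; again, modelmats and updatemats are assumed fields, and the weight argument corresponds to reg2weight.

    import BIDMat.{Mat, FMat}
    import BIDMat.MatFunctions._

    // Sketch of an L2 mixin: the gradient is proportional to the coefficients
    // themselves rather than their signs.
    class L2Sketch(weight:Float, modelmats:Array[FMat], updatemats:Array[FMat]) {

      def compute(mats:Array[Mat], step:Float) = {
        for (i <- 0 until modelmats.length) {
          updatemats(i) = updatemats(i) - modelmats(i) * weight
        }
      }

      // One score per model matrix: the squared L2 norm of its coefficients.
      def score(mats:Array[Mat], step:Float):FMat = {
        val s = zeros(modelmats.length, 1)
        for (i <- 0 until modelmats.length) {
          s(i) = sum(sum(modelmats(i) *@ modelmats(i)), 2).dv.toFloat
        }
        s
      }
    }

Since the squared L2 penalty is smooth, its gradient needs no special handling near zero, unlike the L1 case described above.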

Clustering/Topic Model Mixins

There are three mixins designed to help tune cluster or topic models. They are:

CosineSim: the sum of cosine similarities between all pairs of factors/clusters.
Perplexity: the sum of (log) perplexities for all of the topics. 
Top: an L1 norm applied to all but the top (strongest) features of each topic/cluster

All of these mixin scores are penalties ("bad"), so the updates are the negative gradients of these functions. Minimizing cosine similarity drives the factors apart. Minimizing perplexity forces topic probabilities to concentrate on fewer terms. Top applies a Lasso to most terms but leaves the strongest terms alone; this favors topics with a few dominant terms, which have empirically been found to be "good" and which are easy to label using those dominant terms.
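
To make the CosineSim score concrete, here is a sketch of the quantity it measures, assuming the factors/topics are the rows of a single FMat. The real mixin also contributes a gradient, which is omitted here, and the function name cosineSimScore is illustrative rather than part of BIDMach.

    import BIDMat.FMat
    import BIDMat.MatFunctions._

    // Sum of cosine similarities over all distinct pairs of rows (factors) of m.
    def cosineSimScore(m:FMat):Float = {
      var total = 0.0
      for (i <- 0 until m.nrows; j <- (i + 1) until m.nrows) {
        val ri  = m(i, ?)                        // row i of the factor matrix
        val rj  = m(j, ?)                        // row j
        val dot = (ri * rj.t).dv                 // inner product of the two rows
        val ni  = math.sqrt((ri * ri.t).dv)      // L2 norm of row i
        val nj  = math.sqrt((rj * rj.t).dv)      // L2 norm of row j
        total  += dot / (ni * nj + 1e-10)        // cosine similarity of this pair
      }
      total.toFloat
    }

Driving this sum toward zero pushes the rows toward orthogonality, which is exactly the "drive the factors apart" effect described above.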

Note that the L1 and L2 norms exert similar influences: an L2 penalty minimizes the sum of squares of the topic weights and likewise favors a few dominant terms in each topic.
