
ENH: implement Agglomerative (hierarchical) clustering #162

Open · wants to merge 2 commits into master

Conversation

@sinhrks (Contributor) commented Dec 20, 2016

This needs a discussion, because hierarchical clustering uses the same data for training and prediction, which doesn't fit well with the current SupModel trait.

Maybe we should add a train_predict method that uses the same data for training and prediction?

@AtheMathmo (Owner) commented

Awesome! Thank you so much for these PRs - sorry that it's taking me a little while to look through them properly as well. This is another algorithm I'm not super familiar with and so will need a little time to understand it.

There are definitely some issues with the current traits for learning. We've had discussions (#124) about these before, but it is hard to agree on the best approach. If you have anything to add I would love to hear your opinion.

As for this particular example, could you elaborate a little more on what exactly is lacking in the trait signature? From a brief look it seems that you train the model and return the clusters. The way I have handled this elsewhere is to use the UnSupModel trait and store the clusters within the model, providing an access function; the predict function then assigns new points to their closest cluster. This is a little messy but keeps things consistent until we can find a nicer way to link models and traits together.
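For illustration, that pattern might look something like the following minimal sketch (ToyClusterer, its fields, and its methods are illustrative stand-ins, not rusty-machine API):

pub struct ToyClusterer {
    /// Cluster centroids found during training; `None` until trained.
    centroids: Option<Vec<Vec<f64>>>,
}

impl ToyClusterer {
    /// Access function exposing the clusters stored by training.
    pub fn centroids(&self) -> Option<&Vec<Vec<f64>>> {
        self.centroids.as_ref()
    }

    /// Assign a new point to the index of its closest stored cluster.
    pub fn predict(&self, point: &[f64]) -> Option<usize> {
        self.centroids.as_ref().map(|centroids| {
            let mut best = 0;
            let mut best_dist = std::f64::INFINITY;
            for (k, c) in centroids.iter().enumerate() {
                // squared Euclidean distance to centroid k
                let d: f64 = c.iter()
                    .zip(point.iter())
                    .map(|(a, b)| (a - b) * (a - b))
                    .sum();
                if d < best_dist {
                    best_dist = d;
                    best = k;
                }
            }
            best
        })
    }
}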

@AtheMathmo (Owner) left a comment

For the most part I only have minor comments and questions. The code looks really good and I think you've organized it really well.

I can't really comment on the correctness of the algorithm, but I don't see anything obviously wrong. I'm happy to see the existing regression tests, but personally I think we should try to produce some more convincing examples. Perhaps we can construct a small dataset which exhibits an obvious hierarchy that we can check we get back (see the sketch after this comment)? I'll have to leave you to figure that one out for now...

From my brief (basic) reading I'm a little confused by your use of Metric. What you describe seems to be the linkage criterion, while the actual metric in use is the Euclidean metric, computed in the DistanceMatrix::from_mat function.

As one final comment, you may be interested in this PR: AtheMathmo/rulinalg#101. The motivation behind it is to provide a generic way to reuse metrics across different models in rusty-machine. It may come in useful here, though it's not ready yet.
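As a concrete version of the dataset suggestion above, something like this might do (a sketch assuming rulinalg's Matrix::new(rows, cols, data) constructor; the clustering call itself is omitted):

use rulinalg::matrix::Matrix;

fn toy_inputs() -> Matrix<f64> {
    // Two tight pairs placed far apart: rows (0, 1) and rows (2, 3).
    // A correct agglomerative clusterer should merge each pair first,
    // and the very last merge should join the two pairs.
    Matrix::new(4, 2, vec![ 0.0,  0.0,
                            0.1,  0.0,
                           10.0, 10.0,
                           10.1, 10.0])
}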

struct DistanceMatrix {
    // Distance is symmetric, so there is no need to hold all pairs;
    // a HashMap makes updates easier
    data: HashMap<(usize, usize), f64>,
}
@AtheMathmo (Owner) commented

This comment is more to open up a conversation - are you sure this is necessary?

We are storing three 64-bit values for half the entries of the original matrix, which takes more memory than an abstraction over the original matrix would. Of course, if we need the distances regularly then it is more efficient to store the computation once. In this case maybe we can just store the values in a Vec and implement some custom indexing?

This is all pretty minor (but interesting) stuff. I imagine it will probably end up being kept as-is.
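For reference, the Vec-with-custom-indexing idea might look like this (a minimal sketch; CondensedMatrix and its index formula are illustrative, not part of this PR):

/// Strict upper triangle of an n x n symmetric distance matrix,
/// flattened row by row into a single Vec.
struct CondensedMatrix {
    n: usize,
    data: Vec<f64>,
}

impl CondensedMatrix {
    fn new(n: usize) -> CondensedMatrix {
        CondensedMatrix { n: n, data: vec![0.0; n * (n - 1) / 2] }
    }

    /// Map a pair (i, j) with i < j onto its flat index.
    fn index(&self, i: usize, j: usize) -> usize {
        debug_assert!(i < j && j < self.n);
        // rows 0..i contribute (n - 1) + (n - 2) + ... entries
        i * (2 * self.n - i - 1) / 2 + (j - i - 1)
    }

    fn get(&self, i: usize, j: usize) -> f64 {
        if i == j {
            0.0
        } else if i < j {
            self.data[self.index(i, j)]
        } else {
            self.data[self.index(j, i)]
        }
    }

    fn set(&mut self, i: usize, j: usize, dist: f64) {
        let idx = self.index(i, j);
        self.data[idx] = dist;
    }
}

This stores exactly n(n-1)/2 values, versus roughly three 64-bit values per pair (plus hashing overhead) for the HashMap, at the cost of the fiddlier bookkeeping described in the reply below.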

@sinhrks (Contributor, Author) commented

This is for simplicity of implementation rather than memory efficiency.

Because the distance matrix should keep only the remaining nodes and merged clusters, mapping node ids to matrix cells (of a triangular matrix flattened to a vector) makes the implementation complex.

I think it is OK at the moment, as hierarchical clustering is unlikely to be used on larger data, but I'd appreciate it if anyone can suggest a better implementation.


unsafe {
    for i in 0..n {
        for j in i..inputs.rows() {
@AtheMathmo (Owner) commented

We can skip the j == i case here as we tackle it in the get function below.

/// Add distance between i-th and j-th item.
/// i must be smaller than j.
fn insert(&mut self, i: usize, j: usize, dist: f64) {
    assert!(i < j, "i must be smaller than j");
@AtheMathmo (Owner) commented

I would make this a debug_assert!. The function is only used internally, so we wouldn't expect a failure; as such, we don't want a performance hit when running in release mode.
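The suggested change, assuming the HashMap-backed struct shown earlier:

/// Add distance between i-th and j-th item; i must be smaller than j.
fn insert(&mut self, i: usize, j: usize, dist: f64) {
    // debug_assert! is checked in debug builds and compiled out of
    // release builds, so this internal invariant costs nothing in release.
    debug_assert!(i < j, "i must be smaller than j");
    self.data.insert((i, j), dist);
}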


/// Delete distance between i-th and j-th item.
fn delete(&mut self, i: usize, j: usize) {
    assert!(i != j, "DistanceMatrix doesn't store distance when i == j, because it is 0.0");
@AtheMathmo (Owner) commented

Same here, switch to debug_assert!.


/// Agglomerative clustering distances
#[derive(Debug)]
pub enum Metrics {
@AtheMathmo (Owner) commented

I'm not sure if this is relevant here, but I would consider using a trait instead of an enum. The upside is that users can define their own metrics and plug them into the model (without modifying the rusty-machine source code). The downside is an extra generic parameter on the model.
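A rough shape of the trait-based alternative (hypothetical names; merge_distance is a simplified linkage-update interface, not the PR's actual one):

/// Hypothetical trait version of the enum: users can implement their own
/// linkage criterion without touching rusty-machine source code.
pub trait LinkageCriterion {
    /// Distance from a newly merged cluster to another cluster, given the
    /// distances `d_a` and `d_b` of the two merged members to that cluster.
    fn merge_distance(&self, d_a: f64, d_b: f64) -> f64;
}

/// Single linkage: the merged cluster is as close as its closest member.
pub struct Single;

impl LinkageCriterion for Single {
    fn merge_distance(&self, d_a: f64, d_b: f64) -> f64 {
        d_a.min(d_b)
    }
}

// The model would then carry the criterion as the extra generic
// parameter mentioned above, e.g.
// pub struct AgglomerativeClustering<L: LinkageCriterion> { linkage: L, /* ... */ }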

@sinhrks (Contributor, Author) commented Dec 20, 2016

You're right. Renamed Metrics to Linkage.

@AtheMathmo (Owner) commented Dec 24, 2016

Sorry I forgot to comment on this again.

Thanks for making those changes - I think that this is probably ready to be merged now. Before I do merge I'd like to have a play with it and convince myself it is working as expected. Do you have any recommendations on how I can do this? Any data sets that lend themselves to this kind of model?

Edit: If you have no other suggestions then I'll merge #161 and try it out on the iris data set.

@sinhrks (Contributor, Author) commented Dec 25, 2016

iris is OK as test data. Here is an R example: https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html. I'll add tests once #161 has been merged.

Note: the current test data is derived from a Japanese textbook on R.

Another point is the training API. Currently, the model doesn't impl UnSupModel. What do you think about changing UnSupModel to:

    pub trait UnSupModel<T, U> {
        /// Predict output from inputs.
        fn predict(&self, inputs: &T) -> LearningResult<U>;

        /// Train the model using inputs.
        fn train(&mut self, inputs: &T) -> LearningResult<()>;

        /// Train the model using inputs, then predict outputs from the same inputs.
        fn train_predict(&mut self, inputs: &T) -> LearningResult<U> {
            self.train(inputs)?;
            self.predict(inputs)
        }
    }

AgglomerativeClustering would then raise unimplemented! in predict/train, and have a custom impl for train_predict, along the lines of the sketch below.
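Building on the trait sketched above, the impl could look roughly like this (stand-in types; Vec<Vec<f64>> inputs and Vec<usize> labels are placeholders for whatever the PR actually uses):

struct AgglomerativeClustering; // stand-in for the PR's model type

impl UnSupModel<Vec<Vec<f64>>, Vec<usize>> for AgglomerativeClustering {
    fn predict(&self, _inputs: &Vec<Vec<f64>>) -> LearningResult<Vec<usize>> {
        // training and prediction share the same data; use train_predict
        unimplemented!();
    }

    fn train(&mut self, _inputs: &Vec<Vec<f64>>) -> LearningResult<()> {
        // training and prediction share the same data; use train_predict
        unimplemented!();
    }

    /// Custom impl overriding the default: cluster `inputs` and return
    /// one label per row.
    fn train_predict(&mut self, inputs: &Vec<Vec<f64>>) -> LearningResult<Vec<usize>> {
        // ... run the agglomerative merge loop here ...
        Ok(vec![0; inputs.len()]) // placeholder so the sketch compiles
    }
}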
