ENH: implement Agglomerative (hierarchical) clustering #162
base: master
Conversation
Awesome! Thank you so much for these PRs - sorry that it's taking me a little while to look through them properly. This is another algorithm I'm not super familiar with, so I will need a little time to understand it. There are definitely some issues with the current traits for learning. We've had discussions (#124) about these before, but it is hard to agree on the best design. If you have anything to add, I would love to hear your opinion. As for this particular example, could you elaborate a little more on what exactly is lacking in the trait signature? From a brief look it seems that you train the model and return the clusters. The way I have handled this elsewhere is to use the
For the most part I only have minor comments and questions. The code looks really good and I think you've organized it really well.
I can't really comment on the correctness of the algorithm, but I don't see anything obviously wrong. I'm happy to see the existing regression tests, but personally I think we should try to produce some more convincing examples. Perhaps we can construct a small dataset which exhibits some obvious hierarchy, and check that we get it back? I'll have to leave you to figure that one out for now...
From my brief (basic) reading I'm a little confused by your use of Metric. What you are describing reads to me as the linkage criterion, while the metric actually in use is the Euclidean metric, computed in the DistanceMatrix::from_mat function.
As one final comment, you may be interested in this PR: AtheMathmo/rulinalg#101. The motivation behind it is to provide a generic way to reuse metrics across different models in rusty-machine. This may come in useful here, though it's not ready yet.
struct DistanceMatrix {
    // Distance is symmetric, no need to hold all pairs;
    // use HashMap for easier updates
    data: HashMap<(usize, usize), f64>
This comment is more to open up a conversation - are you sure this is necessary?
We are storing three 64-bit values for half the entries of our original matrix. This means we take up more memory than just using an abstraction over the original matrix. Of course, if we need the distances regularly then it is more efficient to store each computation once. In that case maybe we can just store the values in a Vec and implement some custom indexing?
This is all pretty minor (but interesting) stuff. I imagine it will probably end up being kept as-is.
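To make the Vec-with-custom-indexing idea concrete, here is a hedged sketch (the names CondensedMatrix, index, get, and set are my own, not the PR's code): the strictly upper triangle of an n×n symmetric matrix with a zero diagonal can be flattened into a Vec of length n(n-1)/2, with a small formula mapping (i, j) to a flat index.

```rust
// Hypothetical sketch: store only the strictly upper triangle of a
// symmetric distance matrix in a flat Vec, with custom (i, j) indexing.
struct CondensedMatrix {
    n: usize,
    data: Vec<f64>, // length n * (n - 1) / 2
}

impl CondensedMatrix {
    fn new(n: usize) -> Self {
        CondensedMatrix { n, data: vec![0.0; n * (n - 1) / 2] }
    }

    // Map (i, j) with i < j onto the flattened upper triangle.
    // Rows 0..i contribute (n-1) + (n-2) + ... + (n-i) = i*n - i*(i+1)/2 entries.
    fn index(&self, i: usize, j: usize) -> usize {
        debug_assert!(i < j && j < self.n);
        i * self.n - i * (i + 1) / 2 + (j - i - 1)
    }

    fn get(&self, i: usize, j: usize) -> f64 {
        if i == j { return 0.0; } // diagonal is implicit, never stored
        let (a, b) = if i < j { (i, j) } else { (j, i) };
        self.data[self.index(a, b)]
    }

    fn set(&mut self, i: usize, j: usize, dist: f64) {
        let (a, b) = if i < j { (i, j) } else { (j, i) };
        let idx = self.index(a, b);
        self.data[idx] = dist;
    }
}

fn main() {
    let mut m = CondensedMatrix::new(4); // 4 points -> 6 stored distances
    m.set(0, 3, 2.5);
    m.set(2, 1, 1.5); // symmetric: stored under (1, 2)
    assert_eq!(m.get(3, 0), 2.5);
    assert_eq!(m.get(1, 2), 1.5);
    assert_eq!(m.get(2, 2), 0.0);
    println!("condensed matrix ok");
}
```

This is one f64 per pair instead of a (usize, usize) key plus an f64, at the cost of the index arithmetic; as the reply below notes, it gets more complex once merged clusters remove rows.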
This is for simplicity of implementation rather than memory efficiency. Because the distance matrix should keep only the remaining nodes and merged clusters, mapping node ids to matrix cells (of a triangular matrix flattened to a vector) makes the implementation complex. I think it is OK for the moment, as hierarchical clustering is not likely to be used on larger data, but I'd appreciate it if anyone can let me know of a better implementation.
unsafe {
    for i in 0..n {
        for j in i..inputs.rows() {
We can skip the j == i case here, as we tackle it in the get function below.
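To illustrate the point, here is a hedged sketch (not the PR's actual code; the struct and inputs are simplified stand-ins): because get already answers 0.0 on the diagonal, the fill loop can start at j = i + 1 and never touch the j == i case.

```rust
use std::collections::HashMap;

// Hypothetical sketch: get() answers the i == j case itself,
// so the fill loop below never needs to store the diagonal.
struct DistanceMatrix {
    data: HashMap<(usize, usize), f64>,
}

impl DistanceMatrix {
    fn get(&self, i: usize, j: usize) -> f64 {
        if i == j {
            0.0 // diagonal handled here, never stored
        } else {
            let key = if i < j { (i, j) } else { (j, i) };
            self.data[&key]
        }
    }
}

fn main() {
    let points = [0.0_f64, 1.0, 3.0]; // toy 1-D inputs
    let mut data = HashMap::new();
    for i in 0..points.len() {
        for j in (i + 1)..points.len() { // j starts at i + 1: no j == i case
            data.insert((i, j), (points[i] - points[j]).abs());
        }
    }
    let m = DistanceMatrix { data };
    assert_eq!(m.get(1, 1), 0.0);
    assert_eq!(m.get(0, 2), 3.0);
    assert_eq!(m.get(2, 0), 3.0); // symmetry handled by key ordering
    println!("diagonal skipped ok");
}
```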
/// Add distance between i-th and j-th item
/// i must be smaller than j
fn insert(&mut self, i: usize, j: usize, dist: f64) {
    assert!(i < j, "i must be smaller than j");
I would make this a debug_assert!. The function is only used internally, so we wouldn't expect a failure; as such we don't want a performance hit when run in release mode.
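For reference, the Rust macro is debug_assert!: it checks the condition in debug builds but compiles to nothing when debug assertions are disabled (as in the default release profile). A minimal sketch, with a hypothetical wrapper function:

```rust
// Hypothetical sketch: the invariant is checked in debug builds only,
// so an internal hot path pays no cost in release mode.
fn insert_checked(i: usize, j: usize) -> (usize, usize) {
    debug_assert!(i < j, "i must be smaller than j");
    (i, j)
}

fn main() {
    assert_eq!(insert_checked(1, 2), (1, 2));
    println!("debug_assert ok");
}
```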
/// Delete distance between i-th and j-th item
fn delete(&mut self, i: usize, j: usize) {
    assert!(i != j, "DistanceMatrix doesn't store distance when i == j, because it is 0.0");
Same here, switch to debug_assert!.
/// Agglomerative clustering distances
#[derive(Debug)]
pub enum Metrics {
I'm not sure if this is relevant here but I would consider using a trait here instead of an enum. The upside is that users can define their own metrics and plug them into the model (without modifying rusty-machine source code). The downside is an extra generic parameter on the model.
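A hedged sketch of the trait-based alternative (all names here are my own assumptions, not rusty-machine's actual API): the model takes a generic parameter bounded by a small Metric trait, so users can supply their own distance without touching library code.

```rust
// Hypothetical sketch: a trait instead of an enum for distances.
trait Metric {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64;
}

struct Euclidean;

impl Metric for Euclidean {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
    }
}

// A user-defined metric plugs in without modifying library source.
struct Manhattan;

impl Metric for Manhattan {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
    }
}

// The cost: one extra generic parameter on the model.
struct AgglomerativeModel<M: Metric> {
    metric: M,
}

fn main() {
    let e = AgglomerativeModel { metric: Euclidean };
    let m = AgglomerativeModel { metric: Manhattan };
    assert_eq!(e.metric.distance(&[0.0, 0.0], &[3.0, 4.0]), 5.0);
    assert_eq!(m.metric.distance(&[0.0, 0.0], &[3.0, 4.0]), 7.0);
    println!("metric trait ok");
}
```

The enum keeps the model type simple; the trait trades that for extensibility, which is essentially the trade-off described above.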
You're right. Renamed.
Sorry I forgot to comment on this again. Thanks for making those changes - I think that this is probably ready to be merged now. Before I merge I'd like to have a play with it and convince myself it is working as expected. Do you have any recommendations on how I can do this? Any datasets that lend themselves to this kind of model? Edit: if you have no other suggestions then I'll merge #161 and try it out on the iris dataset.
iris is OK as test data. Here is an R example: https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html. I'm adding tests once #161 has been merged. Note: the current test data is derived from a Japanese textbook on R. Another point is the training API. Currently, it doesn't impl
This needs a discussion, because hierarchical clustering uses the same data for training and prediction. It doesn't fit very well with the current SupModel trait. Maybe we should have a train_predict method that uses the same data for training and prediction?
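To make the suggestion concrete, one hedged way such an API could look (the trait name, method signature, and stub model here are sketched assumptions, not settled rusty-machine design): a single call that consumes the data once and returns cluster assignments for those same points.

```rust
// Hypothetical sketch of a train-and-predict-in-one-call API for models
// (like hierarchical clustering) whose training data IS the predicted data.
trait TrainPredict {
    type Output;
    fn train_predict(&mut self, inputs: &[Vec<f64>]) -> Self::Output;
}

struct Agglomerative {
    n_clusters: usize,
}

impl TrainPredict for Agglomerative {
    type Output = Vec<usize>;

    // Stub: a real implementation would merge the closest clusters until
    // n_clusters remain; here we only show the call shape and return a
    // placeholder assignment of every point to cluster 0.
    fn train_predict(&mut self, inputs: &[Vec<f64>]) -> Vec<usize> {
        let _ = self.n_clusters; // would drive the merge loop
        inputs.iter().map(|_| 0).collect()
    }
}

fn main() {
    let data = vec![vec![1.0, 2.0], vec![1.1, 1.9], vec![8.0, 8.0]];
    let mut model = Agglomerative { n_clusters: 2 };
    let labels = model.train_predict(&data);
    assert_eq!(labels.len(), data.len()); // one label per training point
    println!("train_predict ok");
}
```

Compared with separate train and predict calls, this shape makes the "same data in, labels out" contract explicit and avoids a predict that silently ignores its argument.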