ENH: implement Agglomerative (hierarchical) clustering #162
base: master
Conversation
Awesome! Thank you so much for these PRs - sorry that it's taking me a little while to look through them properly. This is another algorithm I'm not super familiar with, so I will need a little time to understand it. There are definitely some issues with the current traits for learning. We've had discussions (#124) about these before, but it is hard to agree on the best design. If you have anything to add, I would love to hear your opinion. As for this particular example, could you elaborate a little more on what exactly is lacking in the trait signature? From a brief look it seems that you train the model and return the clusters. The way I have handled this elsewhere is to use the
For the most part I only have minor comments and questions. The code looks really good and I think you've organized it really well.
I can't really comment on the correctness of the algorithm, but I don't see anything obviously wrong. I'm happy to see the existing regression tests, but personally I think we should try to produce some more convincing examples. Perhaps we can construct a small dataset which exhibits some obvious hierarchy, and check that we get it back? I'll have to leave you to figure that one out for now...
From my brief (basic) reading I'm a little confused by your use of Metric. What you are describing reads to me as the linkage criterion, while the metric actually in use is the Euclidean metric, computed in the DistanceMatrix::from_mat function.
As one final comment, you may be interested in this PR: AtheMathmo/rulinalg#101. The motivation behind it is to provide a generic way to reuse metrics across different models in rusty-machine. This may come in useful here, though it's not ready yet.
struct DistanceMatrix {
    // Distance is symmetric, no need to hold all pairs;
    // use HashMap for easier updates
    data: HashMap<(usize, usize), f64>
This comment is more to open up a conversation - are you sure this is necessary?
We are storing three 64-bit values for half the entries of our original matrix. This means we take up more memory than just using an abstraction over the original matrix. Of course, if we need the distances regularly then it is more efficient to store each computation once. In that case maybe we can just store the values in a Vec and implement some custom indexing?
This is all pretty minor (but interesting) stuff. I imagine it will probably end up being kept as-is.
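To make the Vec-with-custom-indexing idea concrete, here is a hedged sketch (the names CondensedMatrix, index, get, and set are my own, not the PR's code): the strictly upper triangle of an n×n symmetric matrix with a zero diagonal can be flattened into a Vec of length n(n-1)/2, with a small formula mapping (i, j) to a flat index.

```rust
// Hypothetical sketch: store only the strictly upper triangle of a
// symmetric distance matrix in a flat Vec, with custom (i, j) indexing.
struct CondensedMatrix {
    n: usize,
    data: Vec<f64>, // length n * (n - 1) / 2
}

impl CondensedMatrix {
    fn new(n: usize) -> Self {
        CondensedMatrix { n, data: vec![0.0; n * (n - 1) / 2] }
    }

    // Map (i, j) with i < j onto the flattened upper triangle.
    // Rows 0..i contribute (n-1) + (n-2) + ... + (n-i) = i*n - i*(i+1)/2 entries.
    fn index(&self, i: usize, j: usize) -> usize {
        debug_assert!(i < j && j < self.n);
        i * self.n - i * (i + 1) / 2 + (j - i - 1)
    }

    fn get(&self, i: usize, j: usize) -> f64 {
        if i == j { return 0.0; } // diagonal is implicit, never stored
        let (a, b) = if i < j { (i, j) } else { (j, i) };
        self.data[self.index(a, b)]
    }

    fn set(&mut self, i: usize, j: usize, dist: f64) {
        let (a, b) = if i < j { (i, j) } else { (j, i) };
        let idx = self.index(a, b);
        self.data[idx] = dist;
    }
}

fn main() {
    let mut m = CondensedMatrix::new(4); // 4 points -> 6 stored distances
    m.set(0, 3, 2.5);
    m.set(2, 1, 1.5); // symmetric: stored under (1, 2)
    assert_eq!(m.get(3, 0), 2.5);
    assert_eq!(m.get(1, 2), 1.5);
    assert_eq!(m.get(2, 2), 0.0);
    println!("condensed matrix ok");
}
```

This is one f64 per pair instead of a (usize, usize) key plus an f64, at the cost of the index arithmetic; as the reply below notes, it gets more complex once merged clusters remove rows.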
This is for simplicity of implementation rather than memory efficiency. Because the distance matrix should keep only the remaining nodes and merged clusters, mapping node ids to matrix cells (of a triangular matrix flattened to a vector) makes the implementation complex. I think it is OK for the moment, as hierarchical clustering is not likely to be used on larger data, but I'd appreciate it if anyone can let me know of a better implementation.
unsafe {
    for i in 0..n {
        for j in i..inputs.rows() {
We can skip the j == i case here, as we tackle it in the get function below.
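To illustrate the point, here is a hedged sketch (not the PR's actual code; the struct and inputs are simplified stand-ins): because get already answers 0.0 on the diagonal, the fill loop can start at j = i + 1 and never touch the j == i case.

```rust
use std::collections::HashMap;

// Hypothetical sketch: get() answers the i == j case itself,
// so the fill loop below never needs to store the diagonal.
struct DistanceMatrix {
    data: HashMap<(usize, usize), f64>,
}

impl DistanceMatrix {
    fn get(&self, i: usize, j: usize) -> f64 {
        if i == j {
            0.0 // diagonal handled here, never stored
        } else {
            let key = if i < j { (i, j) } else { (j, i) };
            self.data[&key]
        }
    }
}

fn main() {
    let points = [0.0_f64, 1.0, 3.0]; // toy 1-D inputs
    let mut data = HashMap::new();
    for i in 0..points.len() {
        for j in (i + 1)..points.len() { // j starts at i + 1: no j == i case
            data.insert((i, j), (points[i] - points[j]).abs());
        }
    }
    let m = DistanceMatrix { data };
    assert_eq!(m.get(1, 1), 0.0);
    assert_eq!(m.get(0, 2), 3.0);
    assert_eq!(m.get(2, 0), 3.0); // symmetry handled by key ordering
    println!("diagonal skipped ok");
}
```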
/// Add distance between i-th and j-th item
/// i must be smaller than j
fn insert(&mut self, i: usize, j: usize, dist: f64) {
    assert!(i < j, "i must be smaller than j");
I would make this a debug_assert!. The function is only used internally, so we wouldn't expect a failure; as such we don't want a performance hit when run in release mode.
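For reference, the Rust macro is debug_assert!: it checks the condition in debug builds but compiles to nothing when debug assertions are disabled (as in the default release profile). A minimal sketch, with a hypothetical wrapper function:

```rust
// Hypothetical sketch: the invariant is checked in debug builds only,
// so an internal hot path pays no cost in release mode.
fn insert_checked(i: usize, j: usize) -> (usize, usize) {
    debug_assert!(i < j, "i must be smaller than j");
    (i, j)
}

fn main() {
    assert_eq!(insert_checked(1, 2), (1, 2));
    println!("debug_assert ok");
}
```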
/// Delete distance between i-th and j-th item
fn delete(&mut self, i: usize, j: usize) {
    assert!(i != j, "DistanceMatrix doesn't store distance when i == j, because it is 0.0");
Same here, switch to debug_assert!.
/// Agglomerative clustering distances
#[derive(Debug)]
pub enum Metrics {
I'm not sure if this is relevant here but I would consider using a trait here instead of an enum. The upside is that users can define their own metrics and plug them into the model (without modifying rusty-machine source code). The downside is an extra generic parameter on the model.
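A hedged sketch of the trait-based alternative (all names here are my own assumptions, not rusty-machine's actual API): the model takes a generic parameter bounded by a small Metric trait, so users can supply their own distance without touching library code.

```rust
// Hypothetical sketch: a trait instead of an enum for distances.
trait Metric {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64;
}

struct Euclidean;

impl Metric for Euclidean {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
    }
}

// A user-defined metric plugs in without modifying library source.
struct Manhattan;

impl Metric for Manhattan {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
    }
}

// The cost: one extra generic parameter on the model.
struct AgglomerativeModel<M: Metric> {
    metric: M,
}

fn main() {
    let e = AgglomerativeModel { metric: Euclidean };
    let m = AgglomerativeModel { metric: Manhattan };
    assert_eq!(e.metric.distance(&[0.0, 0.0], &[3.0, 4.0]), 5.0);
    assert_eq!(m.metric.distance(&[0.0, 0.0], &[3.0, 4.0]), 7.0);
    println!("metric trait ok");
}
```

The enum keeps the model type simple; the trait trades that for extensibility, which is essentially the trade-off described above.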
You're right. Renamed.
Sorry I forgot to comment on this again. Thanks for making those changes - I think that this is probably ready to be merged now. Before I merge I'd like to have a play with it and convince myself it is working as expected. Do you have any recommendations on how I can do this? Any datasets that lend themselves to this kind of model? Edit: if you have no other suggestions then I'll merge #161 and try it out on the iris dataset.
iris is OK as test data. Here is an R example: https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html. I'm adding tests once #161 has been merged. Note: the current test data is derived from a Japanese textbook on R. Another point is the training API. Currently, it doesn't impl
This needs a discussion, because hierarchical clustering uses the same data for training and prediction. It doesn't fit very well with the current SupModel trait. Maybe we should have a train_predict method that uses the same data for training and prediction?
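To make the suggestion concrete, one hedged way such an API could look (the trait name, method signature, and stub model here are sketched assumptions, not settled rusty-machine design): a single call that consumes the data once and returns cluster assignments for those same points.

```rust
// Hypothetical sketch of a train-and-predict-in-one-call API for models
// (like hierarchical clustering) whose training data IS the predicted data.
trait TrainPredict {
    type Output;
    fn train_predict(&mut self, inputs: &[Vec<f64>]) -> Self::Output;
}

struct Agglomerative {
    n_clusters: usize,
}

impl TrainPredict for Agglomerative {
    type Output = Vec<usize>;

    // Stub: a real implementation would merge the closest clusters until
    // n_clusters remain; here we only show the call shape and return a
    // placeholder assignment of every point to cluster 0.
    fn train_predict(&mut self, inputs: &[Vec<f64>]) -> Vec<usize> {
        let _ = self.n_clusters; // would drive the merge loop
        inputs.iter().map(|_| 0).collect()
    }
}

fn main() {
    let data = vec![vec![1.0, 2.0], vec![1.1, 1.9], vec![8.0, 8.0]];
    let mut model = Agglomerative { n_clusters: 2 };
    let labels = model.train_predict(&data);
    assert_eq!(labels.len(), data.len()); // one label per training point
    println!("train_predict ok");
}
```

Compared with separate train and predict calls, this shape makes the "same data in, labels out" contract explicit and avoids a predict that silently ignores its argument.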