PAM does not implement PAM #85
Can you provide a link/reference to what you think is the correct way to do PAM?
BUILD and SWAP, as in the PAM paper.
BUILD is a greedy initialization that always adds the medoid that reduces TD most. This finds much better results than the naive k-means-style approach.
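For concreteness, here is a minimal sketch of what BUILD does as described above, assuming a precomputed pairwise distance matrix; the class and variable names (`BuildSketch`, `dist`, `nearest`) are illustrative only and are not JSAT code or code from the original paper:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BuildSketch
{
    /**
     * Greedy BUILD initialization: repeatedly add the candidate medoid that
     * yields the smallest total deviation TD (sum of distances from every
     * point to its nearest chosen medoid).
     *
     * @param dist precomputed pairwise distances, dist[i][j] = d(x_i, x_j)
     * @param k    number of medoids to select
     */
    public static List<Integer> build(double[][] dist, int k)
    {
        int n = dist.length;
        List<Integer> medoids = new ArrayList<>();
        double[] nearest = new double[n];    //distance from each point to its closest chosen medoid
        Arrays.fill(nearest, Double.POSITIVE_INFINITY);

        while (medoids.size() < k)
        {
            int best = -1;
            double bestTD = Double.POSITIVE_INFINITY;
            for (int c = 0; c < n; c++)      //every not-yet-selected point is a candidate
            {
                if (medoids.contains(c))
                    continue;
                double td = 0;
                for (int i = 0; i < n; i++)  //TD if c were added as a medoid
                    td += Math.min(nearest[i], dist[i][c]);
                if (td < bestTD)
                {
                    bestTD = td;
                    best = c;
                }
            }
            medoids.add(best);
            for (int i = 0; i < n; i++)      //update nearest-medoid distances
                nearest[i] = Math.min(nearest[i], dist[i][best]);
        }
        return medoids;
    }
}
```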
I think what is in the paper is equivalent to the code I have for PAM. BUILD basically describes just the seed selection step, where the first seed is the medoid of the whole data set, and subsequent seeds are selected in a kind of greedy fashion (it's not clear to me from the original paper exactly how that happens). While I don't have this implemented in the exact same style, I would suspect k-means++ selection to produce initializations of comparable quality. The SWAP step selects points that should move clusters with the most "profitable" pair, and "do not depend upon the order", which means it's not a greedy selection updated on-the-fly, but one made after checking all points. This is what my current code does with the "k-means" style, as you call it. Ignoring the exact nature of the BUILD step, since it's not super precisely specified, can you help me understand a scenario where what is implemented does not produce the same result as PAM as described in the paper?
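A rough sketch of the "k-means-style" alternation described in this comment, for contrast with SWAP further below; again the distance-matrix input and all names are assumptions for illustration, not the actual JSAT code:

```java
public class AlternateSketch
{
    /**
     * Two alternating steps: (1) assign each point to its nearest medoid,
     * (2) recompute each cluster's medoid as the member minimizing the sum
     * of distances to the other members. Repeats until assignments settle.
     */
    public static int[] alternate(double[][] dist, int[] medoids, int maxIter)
    {
        int n = dist.length, k = medoids.length;
        int[] assign = new int[n];
        for (int iter = 0; iter < maxIter; iter++)
        {
            boolean changed = false;
            for (int i = 0; i < n; i++)          //assignment step
            {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist[i][medoids[c]] < dist[i][medoids[best]])
                        best = c;
                if (assign[i] != best)
                {
                    assign[i] = best;
                    changed = true;
                }
            }
            if (!changed && iter > 0)
                break;                            //no reassignments, converged
            for (int c = 0; c < k; c++)           //medoid update step, per cluster
            {
                double bestCost = Double.POSITIVE_INFINITY;
                int bestPoint = medoids[c];
                for (int i = 0; i < n; i++)
                {
                    if (assign[i] != c)
                        continue;
                    double cost = 0;
                    for (int j = 0; j < n; j++)   //cost of i as this cluster's medoid
                        if (assign[j] == c)
                            cost += dist[i][j];
                    if (cost < bestCost)
                    {
                        bestCost = cost;
                        bestPoint = i;
                    }
                }
                medoids[c] = bestPoint;
            }
        }
        return medoids;
    }
}
```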
No, it is not equivalent.
My institution does not have access to Wiley; do you have any other references on the details? I find the original PAM paper very hard to parse on what the details are.
I’m still not understanding the difference. In the current code we re-assign points to their closest cluster, do ownership changes, then medoids are recomputed. The paper says order does not matter, so it sounds like it’s reassigning points to nearest clusters (“most profitable” as they state), and then we need to recompute medoids for things to make sense, yes?
On Dec 27, 2019, at 7:16 AM, Erich Schubert wrote:
No, it is not equivalent.
The SWAP procedure can reassign points to other clusters when swapping. Hence it finds much better improvements, in my experiments the results were even significantly better. Because of the discrete nature of the median, the k-means-style approach does not work well; it gets stuck easily.
BUILD is very clearly specified - greedily add the next medoid that minimizes TD - and it is also very important. BUILD itself usually finds a better solution than the entire k-means-style procedure, and the PAM authors suggested that often you will not need to run SWAP at all. If you are missing detail in the first paper, you can check out the corresponding chapter in the later book:
http://dx.doi.org/10.1002/9780470316801.ch2
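A naive sketch of SWAP as described in the quoted reply: evaluate every (medoid, non-medoid) pair, let points reassign to whichever medoid is then nearest, and commit only the single most profitable swap per pass, repeating until no swap improves TD. For clarity it recomputes TD from scratch for each candidate, which is slower than the bookkeeping in the original paper; all names here are illustrative assumptions, not JSAT code:

```java
public class SwapSketch
{
    /** TD: sum over all points of the distance to their nearest medoid. */
    static double totalDeviation(double[][] dist, int[] medoids)
    {
        double td = 0;
        for (int i = 0; i < dist.length; i++)
        {
            double best = Double.POSITIVE_INFINITY;
            for (int m : medoids)
                best = Math.min(best, dist[i][m]);
            td += best;
        }
        return td;
    }

    public static void swap(double[][] dist, int[] medoids)
    {
        int n = dist.length;
        double currentTD = totalDeviation(dist, medoids);
        while (true)
        {
            double bestTD = currentTD;
            int bestM = -1, bestX = -1;
            for (int m = 0; m < medoids.length; m++)  //each current medoid
            {
                for (int x = 0; x < n; x++)           //against each non-medoid candidate
                {
                    boolean isMedoid = false;
                    for (int mm : medoids)
                        if (mm == x)
                            isMedoid = true;
                    if (isMedoid)
                        continue;
                    int old = medoids[m];
                    medoids[m] = x;                   //tentative swap
                    double td = totalDeviation(dist, medoids);
                    medoids[m] = old;                 //undo
                    if (td < bestTD)
                    {
                        bestTD = td;
                        bestM = m;
                        bestX = x;
                    }
                }
            }
            if (bestM < 0)
                break;                                //no improving swap remains
            medoids[bestM] = bestX;                   //commit the best swap found
            currentTD = bestTD;
        }
    }
}
```

The key difference from the alternation sketched earlier is that each candidate swap is scored with cross-cluster reassignment taken into account, so a swap can be profitable even when every current medoid is locally optimal for its own cluster.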
The new medoid will always be chosen so as to cover all currently assigned points.
Erich, I’m just not on the same page in understanding how the SWAP procedure works. If your paper has this in more explicit detail I’m happy to add it to my reading list. At the moment, I don’t plan on doing the tests you suggest. It’s a good amount of work and I’ve got other items higher on my priority list for JSAT, especially since TRIKMEDS is implemented with the same kind of approach as the current PAM, but with n sqrt(n) complexity. Their paper also mentions that it is the same approach as PAM, so I'm getting conflicting information on this. If you have tested this previously and have this information well documented, and are interested in doing a pull request, I'm happy to work with you on that!
My suggestion is to rename the class for now, and adjust the documentation, to make clear it is not PAM. The algorithm you implemented is known in the facility location literature by the name "alternate", because it has these two alternating steps (and e.g. Teitz and Bart 1968, but also several other authors in that domain, observed that it performs rather poorly). In my experiments on the ORlib problems, the "alternate" algorithm with k-means++ initialization comes in at 37%. BUILD alone is at 3.14%, and BUILD+PAM at 0.51%; where 0% means always finding the optimum solution (which is known for the ORlib problems), and where 100% is the average of 100x picking medoids randomly. Doing the k-means++/alternate combination 10 times, and keeping the best, gets you down to 9% on average, but there is a large standard deviation on the data sets, too.
Thanks for pointing out the TRIKMEDS paper. I wasn't aware of it, but I have seen other papers by the authors, and they were very good; based on that and the abstract I can imagine how they use metric bounds here. They also confirm some observations that I made (that Park's initialization performs poorly), so I'll definitely cite this. I cannot include it in my current experiments, because my data is given as a graph, and supposedly does not fulfill the triangle inequality.
I just noticed that they also give an example (technically not really for k-medoids though) where CLARANS can reach a better optimum by swapping than the Lloyd-style approach. They use it to motivate using CLARANS for initializing k-means. Figure 1 in https://papers.nips.cc/paper/7104-k-medoids-for-k-means-seeding.pdf
I am not going to introduce a breaking change like renaming a file without understanding the situation. I really appreciate your interest, but I am simply not making any changes until I understand what is "wrong". Again, if you are aware of any resources that provide more detailed math/pseudocode as to the nature of SWAP and BUILD, I'm happy to add them to my reading list. I'm afraid from your descriptions in this thread/the original paper, I don't understand what the distinction is on SWAP or how I would implement it. Perhaps your background/training makes this all very obvious to you, and if you would like to write code to fix this and show the difference in results, I'm happy to work with you on merging it. But it is simply not obvious/clear to me from what I have read thus far.
Likely the best source remains the book by the authors... I'm sure you can find a way to access it. Given that your "PAM" currently isn't "PAM", I'd argue that renaming wouldn't break more than what already is broken. ;-) In particular since the quality of the results is pretty poor with the k-means-style algorithm.
Yes, PAM tries all of them, not randomly; and it doesn't include BUILD.
But you see how many wrong versions of PAM are in literature and software; which is really annoying when you want to benchmark and compare them, and half of them turn out to be a different algorithm, which is fast but yields much worse results. |
If you are trying to benchmark and compare, I totally understand the frustration. But please also understand from my side, I have no clarity on what the real PAM is at the moment. Should I trust you, or this book, or the poorly worded original paper, or some other paper? So I will not be changing anything for some time as I both 1) do my own research on this, and 2) address other items on my TODO list that are higher priorities.
You can check out the R version, which is as far as I can tell derived from the original Fortran code, but now includes extensions (pamonce > 0) and is not very readable (likely because of the Fortran-to-C code style, as you can tell from the use of "goto L60").
And sorry, I beg to disagree that the original paper is unclear or poorly worded. |
The PAM algorithm is quite different from the k-means-style approach you implemented.
Any idea what reference you used? I am trying to figure out why so many use the name "PAM" for a different algorithm than in the original work. Is it incorrect in some book?