Not reproducible results with find.clusters #335

Open
Deepak12Kaushik opened this issue Aug 1, 2022 · 1 comment

@Deepak12Kaushik

I am using the find.clusters function on phenotypic data from wheat (the data set is similar in structure to the USArrests dataset) to decide how many clusters to cut the dendrogram into. However, the cluster assignment changes on every run: if the first cluster has 4 members and the second has 2 on one run, repeating the call with identical settings gives a first cluster with, say, 5 members, and so on. The results are not reproducible.

# df is my dataset
foo.BIC <- find.clusters(df, max.n.clust = 20, n.pca = 200, scale = FALSE,
                         stat = "BIC", method = "kmeans")
plot(foo.BIC$Kstat, type = "o", xlab = "number of clusters (K)", ylab = "BIC",
     col = "green", main = "Detection based on BIC")
points(5, foo.BIC$Kstat[5], pch = "x", cex = 3)
mtext("'X' indicates the actual number of clusters", side = 3)

foo.BIC$size
foo.BIC$grp

@sanderdebacker

sanderdebacker commented Aug 19, 2024

Posting my findings here because I was looking for an answer to a similar problem myself. Hopefully this is useful for other users.

I've found this in another thread:

Odd shapes of the decrease of BIC can occur for several reasons. The possible explanations I can think of are:
a) there are no clearly identifiable clusters in the data.
b) there are clusters to be identified, but not enough information to disentangle different values of k. In your case this seems very likely: there are few SNPs, and if half of them are specific to one individual they are not informative in terms of clusters.

Original reference:
https://lists.r-forge.r-project.org/pipermail/adegenet-forum/2011-June/000303.html

Otherwise, it would be worth increasing the number of k-means runs (n.start, default 10) and the number of iterations per run (n.iter, default 1e5) to gain some stability. Hopefully that makes your analysis reproducible.

EDIT: just as an example, for my data the analysis stabilised for n.start=1000 and n.iter=1e9
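
Since k-means starts from randomly chosen centers, another thing that may help is fixing the random seed before each call. The following is a minimal sketch using only base R (`kmeans()` from the stats package on the built-in `USArrests` data, not adegenet); the seed value, the number of centers, and the `nstart`/`iter.max` settings are arbitrary choices for illustration.

```r
# Base R only: USArrests ships with R, kmeans() is in the stats package.
x <- scale(USArrests)

# Two runs started from the same seed draw the same random initial centers,
# so they should produce the same cluster assignments.
set.seed(42)
run1 <- kmeans(x, centers = 4, nstart = 10, iter.max = 100)
set.seed(42)
run2 <- kmeans(x, centers = 4, nstart = 10, iter.max = 100)
same_assignment <- identical(run1$cluster, run2$cluster)
print(same_assignment)

# A third run without reseeding continues from the current RNG state, so the
# cluster numbering (and possibly the memberships) may differ from run1.
run3 <- kmeans(x, centers = 4, nstart = 10, iter.max = 100)
print(table(run1 = run1$cluster, run3 = run3$cluster))
```

Assuming find.clusters with method = "kmeans" draws its random starts from the same R random number generator, calling set.seed() immediately before find.clusters (together with a larger n.start) should have the same effect there. I have not verified this against the data in this issue.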
