infinite loop when clustering #463
Hi |
Thanks for your reply! After running gdb on the core file that gcore generated from the stuck program, we think faiss might be stuck in the function knn_L2sqr_blas. The following is the stack info of the stuck program. By the way, could bug #150 and the "#pragma omp parallel" in Clustering::train be related to our problem? |
Hi |
Hi, thanks for your attention! We collected more data, in the following files:
2. const float *x in void Clustering::train (idx_t nx, const float *x_in, Index & index)
3. idx_t[nx]* assign and float[nx] * dis in void Clustering::train (idx_t nx, const float *x_in, Index & index)
4. std::vector cur_centroids in void Clustering::train (idx_t nx, const float *x_in, Index & index)
5. obj, which stores the err, in void Clustering::train (idx_t nx, const float *x_in, Index & index)
6. the args x, y and the local variables x_norms, y_norms in static void knn_L2sqr_blas (const float * x, const float * y, size_t d, size_t nx, size_t ny, float_maxheap_array_t * res, const DistanceCorrection &corr)
binary files:
The thread stack info obtained with gdb is shown below: |
First, please stop posting screenshots. I will look into the datafiles. |
OK, I'm sorry about that. I just thought that the data files do not show where the program got stuck, so... |
I loaded x-clus-train-262144 and xslice-dsub-16-ksub-256-n-16384-m-9-M-32, both cluster without problems. |
For reference, this is what I ran:
import faiss
import numpy as np
# x = np.fromfile('/tmp/xslice-dsub-16-ksub-256-n-16384-m-9-M-32', dtype='float32')
x = np.fromfile('/tmp/x-clus-train-262144', dtype='float32')
xslice = x.reshape(-1, 16)  # one 16-dimensional vector per row
print(xslice.shape)
# count bitwise-distinct training vectors
print("distinct vectors:", len(set(row.tobytes() for row in xslice)))
kmeans = faiss.Kmeans(16, 256, verbose=True)  # d=16, k=256 centroids
kmeans.train(xslice) |
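The distinct-vectors check in the snippet above can be run without faiss. The following is a minimal NumPy-only sketch, assuming (this is not stated in the thread) that the point of the check is to confirm the training set holds at least as many distinct vectors as requested centroids. The helper names `distinct_row_count` and `enough_points` are hypothetical, and the data is synthetic rather than the dumped files.

```python
import numpy as np


def distinct_row_count(flat, d):
    """Reshape a flat float32 buffer into rows of length d.

    Returns (number of rows, number of bitwise-distinct rows).
    """
    rows = np.asarray(flat, dtype=np.float32).reshape(-1, d)
    # Compare rows by their raw bytes, as the snippet above does.
    distinct = len({row.tobytes() for row in rows})
    return rows.shape[0], distinct


def enough_points(flat, d, k):
    """Hypothetical sanity check: at least k distinct vectors for k centroids."""
    _, distinct = distinct_row_count(flat, d)
    return distinct >= k


# Synthetic stand-in for a dumped training file: 300 random vectors,
# the first 100 of which are repeated once.
rng = np.random.default_rng(0)
base = rng.random((300, 16), dtype=np.float32)
flat = np.concatenate([base, base[:100]]).ravel()

n, distinct = distinct_row_count(flat, 16)
print("rows:", n, "distinct vectors:", distinct)
print("enough for k=256:", enough_points(flat, 16, 256))
```

To apply it to a dumped file, `flat` would be replaced by the result of `np.fromfile(path, dtype='float32')`, with `d` and `k` matching the values passed to `faiss.Kmeans`.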
Err... in fact, we still can't reproduce the problem reliably. It sometimes happens when we call GpuIndexIVFPQ::train repeatedly, i.e. when we train 20 or more IVFPQ indexes in a row. |
This sounds like a memory corruption, probably on your side. |
Oh, that sounds possible. |
Closing issue. Feel free to re-open if the bug can be tracked down to Faiss. |
Hi, when we run Clustering::train to train a PQ slice, we sometimes end up in an infinite loop, as shown in the attachment.
We think this may be because our faiss version is too old (commit 5ca0521). You have made many commits since then, and some of them may be related to this bug, but we are not sure, because some commit messages do not describe which bugs they fix (for example, commit 250a3d3).
So, would you mind helping us figure out which bugs may cause the problem?
Sincere thanks!