faiss crash when doing the search #596

GitHubProgress3 · 2018-09-19T11:51:00Z

I have 8 Tesla P4 cards in my machine. Each GPU holds three faiss::gpu::GpuIndexIVFPQ objects working on three databases; each database is 6250000 (number of features) * 128 (dimensions per feature) * sizeof(float) bytes.

1. The training code is:

m_vec_GpuIndexIVFPQ.at(idxId).get()->train(feat_num[idxId], vec_feats[idxId]);
m_vec_GpuIndexIVFPQ.at(idxId).get()->reset();
m_vec_GpuIndexIVFPQ.at(idxId).get()->add(feat_num[idxId], vec_feats[idxId]);

During training, the parameters are:

feat_num = 6250000;
FEATURE_DIM = 128;
ShardCount = 3;  // 3 faiss::gpu::GpuIndexIVFPQ objects
Cl_Centroid = 2000;
SubM = 64;
nProbe = 500;
TempMemoryFraction = 0.18 in StandardGpuResources
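
For orientation, the sizes implied by these parameters can be worked out as below. This is only a sketch: it assumes 8 bits per sub-quantizer code (not stated in the post) and counts just the raw float vectors and the PQ codes, so stored IDs, coarse centroids and the temporary memory reserved through TempMemoryFraction are left out. The function names are illustrative and not part of faiss.

```cpp
#include <cstdint>

static_assert(sizeof(float) == 4, "the arithmetic below assumes 4-byte floats");

// Raw size in bytes of a database of float vectors.
constexpr std::uint64_t rawFloatBytes(std::uint64_t numVectors, std::uint64_t dim) {
    return numVectors * dim * sizeof(float);
}

// Size in bytes of the PQ codes alone, rounding each vector's code up to whole bytes.
// bitsPerCode = 8 is an assumption; the post does not state it.
constexpr std::uint64_t pqCodeBytes(std::uint64_t numVectors,
                                    std::uint64_t subQuantizers,
                                    std::uint64_t bitsPerCode = 8) {
    return numVectors * ((subQuantizers * bitsPerCode + 7) / 8);
}

// One database as described: 6250000 vectors of 128 floats.
static_assert(rawFloatBytes(6250000, 128) == 3200000000ULL, "3.2e9 bytes of raw floats");
// The same database compressed with SubM = 64 sub-quantizers.
static_assert(pqCodeBytes(6250000, 64) == 400000000ULL, "4.0e8 bytes of PQ codes");
// Three such indexes on one GPU.
static_assert(3 * pqCodeBytes(6250000, 64) == 1200000000ULL, "1.2e9 bytes of PQ codes per GPU");
```

Under these assumptions the PQ codes account for roughly 1.2 GB per GPU, so the rest of the reported GPU memory usage would have to come from items this sketch does not count.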

2. The searching code is:

for (int i = 0; i < m_vec_GpuIndexIVFPQ.size(); i++)
{
    m_vec_GpuIndexIVFPQ[i].get()->search((size_t)feat_num, query_feats, k, res_dists + i*k*feat_num, res_nns + i*k*feat_num);
}
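
The loop writes each index's results into its own slice of two shared host buffers. A minimal sketch of that layout follows, using the search parameters from this post (feat_num = 240, k = 1000) and three indexes. The helper names are hypothetical; only the offset expression i*k*feat_num comes from the code above.

```cpp
#include <cstddef>

// Elements one index writes per buffer: k results for each query.
constexpr std::size_t resultsPerIndex(std::size_t numQueries, std::size_t k) {
    return numQueries * k;
}

// Element offset of index i's slice, matching res_dists + i*k*feat_num.
constexpr std::size_t sliceOffset(std::size_t i, std::size_t numQueries, std::size_t k) {
    return i * resultsPerIndex(numQueries, k);
}

// Minimum number of elements each shared buffer must hold for all indexes.
constexpr std::size_t requiredElements(std::size_t numIndexes,
                                       std::size_t numQueries,
                                       std::size_t k) {
    return numIndexes * resultsPerIndex(numQueries, k);
}

static_assert(resultsPerIndex(240, 1000) == 240000, "one slice holds 240000 elements");
static_assert(sliceOffset(1, 240, 1000) == 240000, "second index starts after the first slice");
static_assert(requiredElements(3, 240, 1000) == 720000, "three indexes need 720000 elements");
```

If either host buffer were allocated with fewer than requiredElements entries, the later slices would fall outside it, so this is a cheap thing to verify before looking at the GPU side.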

During searching, the parameters are:

feat_num = 240; k = 1000;

During searching the code crashes; it crashes while the second faiss::gpu::GpuIndexIVFPQ object is searching.

3. The gdb information is:

Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::fromDevice(T*, T*, size_t, cudaStream_t) [with T = float; size_t = long unsigned int; cudaStream_t = CUstream_st*] at utils/CopyUtils.cuh:69; details: CUDA error 77
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::fromDevice(T*, T*, size_t, cudaStream_t) [with T = float; size_t = long unsigned int; cudaStream_t = CUstream_st*] at utils/CopyUtils.cuh:69; details: CUDA error 77
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::fromDevice(T*, T*, size_t, cudaStream_t) [with T = float; size_t = long unsigned int; cudaStream_t = CUstream_st*] at utils/CopyUtils.cuh:69; details: CUDA error 77
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::fromDevice(T*, T*, size_t, cudaStream_t) [with T = float; size_t = long unsigned int; cudaStream_t = CUstream_st*] at utils/CopyUtils.cuh:69; details: CUDA error 77
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::fromDevice(T*, T*, size_t, cudaStream_t) [with T = float; size_t = long unsigned int; cudaStream_t = CUstream_st*] at utils/CopyUtils.cuh:69; details: CUDA error 77
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::fromDevice(T*, T*, size_t, cudaStream_t) [with T = float; size_t = long unsigned int; cudaStream_t = CUstream_st*] at utils/CopyUtils.cuh:69; details: CUDA error 77
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::fromDevice(T*, T*, size_t, cudaStream_t) [with T = float; size_t = long unsigned int; cudaStream_t = CUstream_st*] at utils/CopyUtils.cuh:69; details: CUDA error 77
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::fromDevice(T*, T*, size_t, cudaStream_t) [with T = float; size_t = long unsigned int; cudaStream_t = CUstream_st*] at utils/CopyUtils.cuh:69; details: CUDA error 77

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ff31c3ff700 (LWP 7991)]
0x00007ffff5849c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff5849c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff584d028 in __GI_abort () at abort.c:89
#2  0x00007ffff6482257 in faiss::gpu::fromDevice<float> (src=0x7ff38201e000, dst=0x7ff3018ae410, num=240000, stream=0x5adc81b0) at utils/CopyUtils.cuh:65
#3  0x00007ffff6480ff9 in faiss::gpu::fromDevice<float, 2> (src=..., dst=0x7ff3018ae410, stream=0x5adc81b0) at utils/CopyUtils.cuh:100
#4  0x00007ffff648dd7b in faiss::gpu::GpuIndexIVFPQ::searchImpl_ (this=0x5adc3ab0, n=240, x=0x5b6cbed0, k=1000, distances=0x7ff3018ae410, labels=0x7ff2cbde9810)
    at GpuIndexIVFPQ.cu:425
#5  0x00007ffff650823f in faiss::gpu::GpuIndex::search (this=0x5adc3ab0, n=240, x=0x5b6cbed0, k=1000, distances=0x7ff3018ae410, labels=0x7ff2cbde9810) at GpuIndex.cu:142
#6  0x00007ffff647254d in GpuCwKnnShardImpl::Search (this=0x4f7f76d0, query_feats=0x5b6cbed0, feat_num=240, feat_dim=<optimized out>, k=1000, res_dists=0x7ff3017c3e10,
    res_nns=0x7ff2cbc14c10) at GpuCwKnnShardImpl.cpp:246
#7  0x00007ffff646efc9 in Run (arg=0x4f7f7670) at GpuCwMultiKnnImpl.cpp:149
#8  0x00007ffff5be0184 in start_thread (arg=0x7ff31c3ff700) at pthread_create.c:312
#9  0x00007ffff590d37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
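
The pointer values in frames #4 and #6 can be cross-checked against the slice layout used by the search loop. The sketch below divides the distance between each base pointer and the pointer handed to faiss by the size of one index's result slice (240 queries * 1000 neighbours). It assumes 4-byte float distances and 8-byte labels (a 64-bit index type, which is an assumption about this build); the constant and function names are made up for illustration.

```cpp
#include <cstdint>

// Addresses copied from the backtrace above.
constexpr std::uint64_t kResDistsBase = 0x7ff3017c3e10ULL;  // res_dists in frame #6
constexpr std::uint64_t kDistances    = 0x7ff3018ae410ULL;  // distances in frame #4
constexpr std::uint64_t kResNnsBase   = 0x7ff2cbc14c10ULL;  // res_nns in frame #6
constexpr std::uint64_t kLabels       = 0x7ff2cbde9810ULL;  // labels in frame #4

constexpr std::uint64_t kNumQueries = 240;
constexpr std::uint64_t kK = 1000;

// Bytes occupied by one index's slice for a given element size.
constexpr std::uint64_t sliceBytes(std::uint64_t elemSize) {
    return elemSize * kNumQueries * kK;
}

// Which slice a pointer falls into, counted from the buffer base.
constexpr std::uint64_t sliceIndex(std::uint64_t base, std::uint64_t ptr, std::uint64_t elemSize) {
    return (ptr - base) / sliceBytes(elemSize);
}

// True when the pointer sits exactly on a slice boundary.
constexpr bool onSliceBoundary(std::uint64_t base, std::uint64_t ptr, std::uint64_t elemSize) {
    return (ptr - base) % sliceBytes(elemSize) == 0;
}

static_assert(sliceIndex(kResDistsBase, kDistances, 4) == 1, "distances point at slice 1");
static_assert(onSliceBoundary(kResDistsBase, kDistances, 4), "distances are slice-aligned");
static_assert(sliceIndex(kResNnsBase, kLabels, 8) == 1, "labels point at slice 1");
static_assert(onSliceBoundary(kResNnsBase, kLabels, 8), "labels are slice-aligned");
```

Both host pointers land exactly on slice 1, which agrees with the report that the crash occurs while the second index is searching and suggests the host-side offsets themselves are computed as intended.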

Other information:
a) The program does not crash when I use only 1 or 2 GPUs.
b) I have checked the CPU-side memory used by the cudaMemcpy (device to host); it has no errors, and I can read every address on the CPU side.
c) Sometimes the program runs successfully: if the first search succeeds, it keeps running through the following search loops. Sometimes it fails on the first search, with the debug information shown above.
d) If I use only two faiss::gpu::GpuIndexIVFPQ objects, each database holding 9375000 features, the program never crashes.
e) I checked the GPU memory during search; each GPU has about 1 GB free.

Platform

OS: Ubuntu 14.04
cuda 8.0
gcc 4.8.4

Running on:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.12                 Driver Version: 390.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:04:00.0 Off |                    0 |
| N/A   55C    P0    39W /  75W |   4859MiB /  7611MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:05:00.0 Off |                    0 |
| N/A   57C    P0    46W /  75W |   4857MiB /  7611MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P4            Off  | 00000000:08:00.0 Off |                    0 |
| N/A   57C    P0    45W /  75W |   4859MiB /  7611MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P4            Off  | 00000000:09:00.0 Off |                    0 |
| N/A   54C    P0    45W /  75W |   4859MiB /  7611MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P4            Off  | 00000000:85:00.0 Off |                    0 |
| N/A   57C    P0    42W /  75W |   4863MiB /  7611MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   54C    P0    45W /  75W |   4865MiB /  7611MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P4            Off  | 00000000:89:00.0 Off |                    0 |
| N/A   53C    P0    47W /  75W |   4865MiB /  7611MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P4            Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   52C    P0    45W /  75W |   4859MiB /  7611MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

Any help on fixing this?

wickedfoo · 2018-09-26T00:07:34Z

Can you rerun this using the cuda-memcheck program to see what errors it reports?

GitHubProgress3 · 2018-09-27T12:58:32Z

I tried cuda-memcheck, but the remote machine died during the run. The program runs for a very long time (more than 5 hours). It produces a huge number of reports in one function (getDeviceForAddress, if I remember correctly); whenever the address passed to this function is on the CPU side, cuda-memcheck logs an error. The program had not finished training the database, so it never reached the search.
However, I tried another approach: I issued exactly the same cudaMemcpy (device to host) call as the copyfrom functions do, before and after several functions. So far I know that the error happens in the call to runPQScanMultiPassNoPrecomputed in PQScanMultiPassNoPrecomputed.cu.
Is there any way to speed up finding out what goes wrong?

GitHubProgress3 · 2018-10-15T09:45:21Z

I have found a workaround; closing the issue.

GitHubProgress3 · 2018-10-15T09:46:19Z

Hi faiss authors: cuda-memcheck takes a really long time to run; I ran it for 3 days and it did not finish. But I found a workaround that avoids these errors. Thanks for your support. I have some other questions, if you can help:
1. Current faiss supports PQ <= 96; PQScanMultiPassNoPrecomputed.cu has the switch case RUN_PQ(96).
2. The current k (nprobe) must be less than or equal to 1024.
Is it possible to break these two limits? Do you have code for that? If not, could you give me some guidance on how to do it easily, and an estimate of how long the change might take?

wickedfoo · 2018-10-16T03:20:25Z

k or nprobe > 1024 will not be supported any time soon, if ever, for the GPU.
What dimension are your vectors? What PQ size do you want? The max PQ size may not change either, but this is easier to implement. I'm wondering if it is in fact useful for your case.
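
The limits discussed here can be turned into an up-front check, so that an unsupported configuration is rejected before any index is built. This is only a sketch: the three bounds are the ones quoted in this thread in 2018 (k <= 1024, nprobe <= 1024, PQ <= 96) and may differ in other faiss versions, and the constant and function names are hypothetical rather than faiss API.

```cpp
// Bounds as reported in this thread; treat them as assumptions for any other version.
constexpr int kMaxK = 1024;
constexpr int kMaxNprobe = 1024;
constexpr int kMaxSubQuantizers = 96;

// True when the requested GPU IVFPQ configuration stays within the quoted bounds.
constexpr bool withinReportedGpuLimits(int k, int nprobe, int subQuantizers) {
    return k >= 1 && k <= kMaxK &&
           nprobe >= 1 && nprobe <= kMaxNprobe &&
           subQuantizers >= 1 && subQuantizers <= kMaxSubQuantizers;
}

// The configuration from the original report (k = 1000, nProbe = 500, SubM = 64) passes.
static_assert(withinReportedGpuLimits(1000, 500, 64), "reported configuration is within bounds");
// Raising PQ to 128 sub-quantizers does not.
static_assert(!withinReportedGpuLimits(1000, 500, 128), "PQ 128 exceeds the quoted PQ bound");
```

A check like this only flags unsupported parameter combinations; it says nothing about the cause of the crash, since the reported configuration is within all three bounds.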

GitHubProgress3 · 2018-10-18T02:14:37Z

The vector dimension is 512. Currently I use PQ 64; I wonder whether PQ 128/256 would improve the search accuracy.

wickedfoo · 2018-10-19T19:10:42Z

PCA reduction to, say, 128 or 256 dimensions might be a better strategy than PQ on such a high dimensional vector. It is likely that the variation across the dimensions is very non-uniform anyways.

wickedfoo · 2018-10-19T19:11:47Z

float16 IVF flat would also be more efficient, faster and take the same amount of memory as PQ 256 on a 512 dimensional vector.

PQ is a form of lossy compression of the vectors anyways.
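
The per-vector storage behind this comparison can be checked with a little arithmetic. The sketch below assumes 8-bit PQ codes and 2 bytes per float16 component, and it ignores IDs and any other per-vector overhead; the function names are illustrative only. One reading of the comparison, and it is only an assumption here, is that the float16 vectors are taken after the PCA reduction to 128 dimensions suggested earlier.

```cpp
#include <cstddef>

// Bytes of PQ code per vector, rounded up to whole bytes.
// bitsPerCode = 8 is an assumption.
constexpr std::size_t pqBytesPerVector(std::size_t subQuantizers, std::size_t bitsPerCode = 8) {
    return (subQuantizers * bitsPerCode + 7) / 8;
}

// Bytes per vector when every component is stored as a 2-byte float16.
constexpr std::size_t float16BytesPerVector(std::size_t dim) {
    return dim * 2;
}

static_assert(pqBytesPerVector(64) == 64, "PQ 64: 64 bytes per vector");
static_assert(pqBytesPerVector(256) == 256, "PQ 256: 256 bytes per vector");
static_assert(float16BytesPerVector(128) == 256, "float16 at 128 dimensions: 256 bytes");
static_assert(float16BytesPerVector(512) == 1024, "float16 at 512 dimensions: 1024 bytes");
```

Under these assumptions PQ 256 and float16 flat storage both come to 256 bytes per vector when the float16 vectors have 128 dimensions, while unreduced 512-dimensional float16 vectors would need 1024 bytes each.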

vincentLk · 2020-05-12T02:49:28Z

Hi, I have the same problem. Have you fixed it? If so, how? I need your help.

thebirdgr · 2022-09-08T12:19:15Z

Hi @GitHubProgress3, how did you solve the issue in question? Thanks.

mdouze added the GPU label Sep 19, 2018

beauby assigned wickedfoo Sep 20, 2018

GitHubProgress3 closed this as completed Oct 15, 2018
