
HNSW indexing much slower on GPU than CPU #1348

Closed

lelayf opened this issue Aug 23, 2020 · 6 comments

lelayf commented Aug 23, 2020

Summary

Indexing a numpy array into a vanilla HNSWFlat index scales linearly on CPU (about 1000 vectors/second), but the same procedure degrades polynomially on GPU (over 20 minutes to index 3000 vectors).

Platform

OS: Linux x86_64

Faiss version: 1.6.3
CUDA version: 10.1
PyTorch version: 1.6.0-cuda101

Faiss compilation options: unknown, binary distribution

Running on:

  • [ ] CPU
  • [x] GPU

Interface:

  • [ ] C++
  • [x] Python

Reproduction instructions

$ pip install faiss-gpu
[...]
Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from faiss-gpu) (1.19.0)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.6.3

4 GPUs available (P100 on GCP Compute VM)

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    39W / 250W |   3168MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   39C    P0    26W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   44C    P0    30W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   42C    P0    26W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9966      C   ...e33745-ae59-4857-a071-791dc3043df5.json  1311MiB |
|    0     10173      C   /opt/conda/bin/python                       1847MiB |
+-----------------------------------------------------------------------------+

CPU test

import numpy as np
import pandas as pd
import faiss

wall_times =  []
dbsizes = range(100,5000,100)
for nb in dbsizes:
    d = 768                          # dimension    
    np.random.seed(1234)             # make reproducible
    xb = np.random.random((nb, d)).astype('float32')
    xb[:, 0] += np.arange(nb) / 1000.

    cpu_index = faiss.IndexHNSWFlat(d, 32)
    
    result = %timeit -n1 -r1 -o cpu_index.add(xb)
    wall_times.append(result.average)
28.3 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
29.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
47.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
68.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
147 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
361 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
503 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
377 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
744 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
620 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
747 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.06 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
960 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
841 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.07 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.04 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.24 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.53 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.53 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.45 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.93 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.96 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.02 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.43 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.61 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.62 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.98 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.03 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.03 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.73 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.29 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.43 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.52 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
4.04 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
4.32 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
4.25 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
4.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
4.98 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.25 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.11 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.94 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.73 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.86 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.99 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.76 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
6.21 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
6.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
g = pd.Series(wall_times,index=dbsizes)
g.plot()

On CPU all is well: indexing time scales linearly with the number of vectors to index.
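For reference, here is roughly the same benchmark written as a plain script, timing add() with time.perf_counter instead of the %timeit magic so it can run outside a notebook. This is a minimal sketch of the procedure above, not the original notebook code.

import time

import numpy as np
import faiss

d = 768                                     # embedding dimension used above
wall_times = []
dbsizes = range(100, 5000, 100)

for nb in dbsizes:
    np.random.seed(1234)                    # make reproducible
    xb = np.random.random((nb, d)).astype('float32')
    xb[:, 0] += np.arange(nb) / 1000.

    cpu_index = faiss.IndexHNSWFlat(d, 32)  # HNSW graph with M=32 links per node

    t0 = time.perf_counter()
    cpu_index.add(xb)                       # graph construction happens inside add()
    wall_times.append(time.perf_counter() - t0)
    print(f"{nb:5d} vectors: {wall_times[-1]:.2f} s")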

Single GPU test

import numpy as np
import faiss

ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)
# Output: "number of GPUs: 4"

res = faiss.StandardGpuResources()  # use a single GPU

wall_times =  []
dbsizes = range(100,3700,100)
for nb in dbsizes:
    d = 768                          # dimension    
    np.random.seed(1234)             # make reproducible
    xb = np.random.random((nb, d)).astype('float32')
    xb[:, 0] += np.arange(nb) / 1000.

    index = faiss.IndexHNSWFlat(d, 32)
    
    gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index)
    result = %timeit -n1 -r1 -o gpu_index_flat.add(xb)  # add vectors to the index
    wall_times.append(result.average)
number of GPUs: 4
292 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.56 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.08 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
9.05 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
23.3 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
27.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
25.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
52.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
46.3 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1min 13s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1min 26s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1min 34s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2min 17s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2min 35s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3min 30s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
4min 59s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
6min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
7min 9s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
6min 27s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
10min 54s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
11min 35s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
13min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
13min 49s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
16min 52s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
17min 44s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
20min 20s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
22min 20s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
20min 9s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
27min 58s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
22min 31s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
25min 40s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
25min 42s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
30min 36s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
34min 51s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
34min 17s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
37min 42s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

On GPU, indexing is intractable: time grows polynomially with the number of vectors to index.

Multi-GPU test

import numpy as np
import faiss

ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)
# Output: "number of GPUs: 4"

wall_times=[]

for nb in range(100,900,100):
    d = 768                          # dimension    
    np.random.seed(1234)             # make reproducible
    xb = np.random.random((nb, d)).astype('float32')
    xb[:, 0] += np.arange(nb) / 1000.

    index = faiss.IndexHNSWFlat(d, 32)
    
    gpu_index_flat = faiss.index_cpu_to_all_gpus(index)
    result = %timeit -n1 -r1 -o gpu_index_flat.add(xb)   # add vectors to the index
    wall_times.append(result.average)
number of GPUs: 4
301 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.55 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.25 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.84 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
9.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
24.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
28.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
27.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
53.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Performance collapses in roughly the same way on multi-GPU.
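As an aside: for index types that do have a GPU implementation (flat and IVF indexes), index_cpu_to_all_gpus accepts cloner options that control whether the data is replicated or sharded across devices. Below is a minimal sketch assuming an IndexFlatL2 rather than HNSW; it is not a workaround for HNSW, just the standard multi-GPU path.

import numpy as np
import faiss

d = 768
xb = np.random.random((10000, d)).astype('float32')

cpu_index = faiss.IndexFlatL2(d)      # brute-force index, which does have a GPU implementation

co = faiss.GpuMultipleClonerOptions()
co.shard = True                       # shard the vectors across GPUs instead of replicating them

gpu_index = faiss.index_cpu_to_all_gpus(cpu_index, co=co)
gpu_index.add(xb)

D, I = gpu_index.search(xb[:5], 4)    # sanity check: each query should return itself first
print(I[:, 0])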

lelayf (Author) commented Aug 23, 2020

FYI @thomwolf, I ran into this behavior when trying out FAISS indexing of embeddings in an nlp.Dataset.

lelayf (Author) commented Aug 23, 2020

Actually, I just noticed that faiss-gpu is a wheel maintained in another repo (https://github.com/kyamagu/faiss-wheels), so I will have to reinstall faiss through the conda channel described in the original repo and see if this behavior still holds.

lelayf (Author) commented Aug 23, 2020

So I did pip uninstall faiss-gpu and then reinstalled from the conda channel.

(base) jupyter@mdm-fine-tuning-ssd:~$ conda install faiss-gpu cudatoolkit=10.0 -c pytorch
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - cudatoolkit=10.0
    - faiss-gpu


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.8.4                |   py37hc8dfbb8_2         3.1 MB  conda-forge
    cudatoolkit-10.0.130       |                0       261.2 MB
    faiss-1.6.3                |py37hf70ad10_1_cuda         686 KB  conda-forge
    faiss-gpu-1.6.3            |           py37_1          18 KB  conda-forge
    libfaiss-1.6.3             |  hb17eacc_1_cuda        89.3 MB  conda-forge
    openssl-1.1.1g             |       h516909a_1         2.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:       356.4 MB

The following NEW packages will be INSTALLED:

  cudatoolkit        pkgs/main/linux-64::cudatoolkit-10.0.130-0
  faiss              conda-forge/linux-64::faiss-1.6.3-py37hf70ad10_1_cuda
  faiss-gpu          conda-forge/linux-64::faiss-gpu-1.6.3-py37_1
  libfaiss           conda-forge/linux-64::libfaiss-1.6.3-hb17eacc_1_cuda

The following packages will be UPDATED:

  conda                                4.8.3-py37hc8dfbb8_1 --> 4.8.4-py37hc8dfbb8_2
  openssl                                 1.1.1g-h516909a_0 --> 1.1.1g-h516909a_1


Proceed ([y]/n)? 


Downloading and Extracting Packages
openssl-1.1.1g       | 2.1 MB    | ########## | 100%
libfaiss-1.6.3       | 89.3 MB   | ########## | 100%
conda-4.8.4          | 3.1 MB    | ########## | 100%
faiss-1.6.3          | 686 KB    | ########## | 100%
cudatoolkit-10.0.130 | 261.2 MB  | ########## | 100%
faiss-gpu-1.6.3      | 18 KB     | ########## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

I re-ran the CPU and GPU tests above and observed:

  • CPU seems faster, with 5000 vectors indexed in about 1.5 s.
  • GPU indexing still collapses.

mdouze (Contributor) commented Aug 25, 2020

HNSW is not implemented on GPU, so index_cpu_to_gpu just copies the index (it is on our todo list to provide more accurate error messages to flag this). Thus it is not clear why HNSW should be slower on GPU than on CPU.
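One way to see this from Python is to inspect what index_cpu_to_gpu actually returns. A small diagnostic sketch (faiss.downcast_index resolves the concrete class behind the generic Index handle; exact class names may vary by build):

import faiss

d = 768
res = faiss.StandardGpuResources()

hnsw = faiss.IndexHNSWFlat(d, 32)
moved = faiss.index_cpu_to_gpu(res, 0, hnsw)
# Expected per the comment above: still a CPU-side HNSW class, since there is
# no GPU implementation to convert it to.
print(type(faiss.downcast_index(moved)).__name__)

flat = faiss.IndexFlatL2(d)
moved_flat = faiss.index_cpu_to_gpu(res, 0, flat)
# Flat indexes do have a GPU implementation, so this should report a GpuIndexFlat* class.
print(type(faiss.downcast_index(moved_flat)).__name__)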

lelayf (Author) commented Aug 25, 2020

Ha! That would explain it :) As you say, I see a cloning call. I am not sure I am reading the execution flow correctly, but wouldn't the ensuing cloning op exhibit unexpected recursion?

wickedfoo (Contributor) commented

No, the storage object is a separate sub-index that actually contains the vector data (e.g., an IndexFlat).
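A minimal sketch of that layout in the Python bindings (attribute names follow the standard Faiss API; the exact storage class reported may differ between builds):

import faiss

d = 768
index = faiss.IndexHNSWFlat(d, 32)

# The graph structure lives in index.hnsw; the raw vectors live in a separate
# flat sub-index referenced by index.storage.
storage = faiss.downcast_index(index.storage)
print(type(storage).__name__)                    # expected: a flat index, e.g. IndexFlat
print("efConstruction =", index.hnsw.efConstruction,
      "efSearch =", index.hnsw.efSearch)         # HNSW construction/search parameters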
