Is it possible to compress 1 billion sentence embeddings (d=384) in an index under 4.5 GB? What type to use? #3541
Replies: 3 comments
-
IVF65536_HNSW32,PQ32 uses at least 40G for 1B vectors (32 bytes for the PQ + 8 bytes for the vector ID), so it is unclear where the 14GB number comes from. 1B vectors in 4.5G means you allocate 4.5 bytes per sentence, which is very little. This may be possible only if you can group the vectors in a meaningful way (e.g. if there are many small variations of the same vector).
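To make the arithmetic concrete, here is a back-of-the-envelope sketch (plain Python, no Faiss needed) of the per-vector budget these numbers imply; it ignores the HNSW coarse quantizer and inverted-list overheads, so real indexes are somewhat larger:

```python
# Rough IVF-PQ memory estimate: PQ code + 8-byte vector ID per vector.
# Coarse-quantizer (HNSW) and inverted-list overheads are ignored.
N = 1_000_000_000                       # 1B vectors

for pq_bytes in (32, 16, 8, 4):         # bytes per PQ code
    per_vector = pq_bytes + 8           # code + 64-bit ID
    print(f"PQ{pq_bytes}: {per_vector} B/vector -> ~{N * per_vector / 1e9:.0f} GB")

# A 4.5 GB budget for 1B vectors leaves 4.5e9 / 1e9 = 4.5 bytes per
# vector -- less than the 8 bytes the vector IDs alone would take.
print(f"budget: {4.5e9 / N} B/vector")
```

Even the smallest PQ code shown (4 bytes) lands at ~12 GB once IDs are counted, which is why the 4.5 GB target is hard to meet with a plain IVF-PQ layout.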
-
Hi @mdouze Thank you so much for getting back to me on this! Is there any other advice for what we might try, if you can think of any? Thank you again for the help so far!
-
Thanks!
-
Summary
Hi, thanks to the Faiss team for making this library available!
We have a use case with nearly 1 billion sentence-embedding vectors of dimension 384 each. We need to build an index over all of them under a memory constraint of 4.5 GB max index size (ideally a little smaller, since the dataset grows daily).
From my understanding, an index built with the config IVF65536_HNSW32,PQ32 would give us the smallest memory footprint [ref: https://towardsdatascience.com/ivfpq-hnsw-for-billion-scale-similarity-search-89ff2f89d90e], but when I do this the index size is still ~14GB.
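For context, a minimal sketch of how such an index might be built and measured, with random placeholder data standing in for the real embeddings (the sizes here are toy values; real training would use a much larger sample of actual data):

```python
import numpy as np
import faiss

d = 384                                          # embedding dimensionality
index = faiss.index_factory(d, "IVF65536_HNSW32,PQ32")

# Toy placeholder data; Faiss warns if there are too few training
# points per centroid, and a real setup would train on far more.
xt = np.random.rand(200_000, d).astype("float32")
index.train(xt)
index.add(xt)

# The on-disk size is a reasonable proxy for the in-memory footprint.
faiss.write_index(index, "ivf_pq.index")
```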
Is there any other combination we should try? Or is an index of 4.5 GB not possible given how big our vectors/dataset are?
Thank you!
Faiss version: faiss-cpu 1.7.3
Installed from: PyPI [https://pypi.org/project/faiss-cpu/]
Reproduction instructions
NA
*Please let me know if I should include more details, such as the exact Faiss version and environment I am using. Since this is a general question, I omitted them for brevity.