Is it possible to (efficiently) query different subsets of the database vectors for different query vectors? #3580

tomleung1996 · 2023-10-17T07:48:18Z

I would like to calculate the similarities between the embedding of a paper and the embeddings of its cited references, and I have a lot of them (~30 million).

By using multiple GPUs, I was able to calculate pair-wise similarities between all papers, but I could only save the top K most similar results. The problem is that the cited references of a paper are not always the most similar papers in terms of semantic distance. Even though I have calculated all pair-wise similarities, I cannot obtain my desired results.

Therefore, I am wondering whether it is possible to (efficiently) query a different subset of the database vectors for each query vector. Or is there a smarter way to achieve my goal?

mdouze (Collaborator) · 2023-10-24T10:11:02Z

This looks like a filtered search problem. The answer depends on what the filtering criterion is. If the dataset is clustered, then you can build one index per cluster.
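For the specific case in the question, where the filtering criterion is an explicit citation list, one possible alternative is to skip the search step and score only the known (citing, cited) pairs. The following is a minimal NumPy sketch of that idea, not a Faiss API and not something proposed in the thread. The function name `cited_similarities`, its parameters, and the chunk size are hypothetical, and it assumes the embeddings fit in memory as a single array and that cosine similarity is the metric of interest.

```python
import numpy as np


def cited_similarities(embeddings, citing_ids, cited_ids, chunk_size=1_000_000):
    """Cosine similarity for each (citing paper, cited reference) pair.

    embeddings: (n_papers, dim) float array, one row per paper.
    citing_ids, cited_ids: equal-length integer sequences; pair i means
    paper citing_ids[i] cites paper cited_ids[i].
    (Hypothetical helper for illustration only.)
    """
    citing_ids = np.asarray(citing_ids, dtype=np.int64)
    cited_ids = np.asarray(cited_ids, dtype=np.int64)

    # Normalize once so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)

    sims = np.empty(len(citing_ids), dtype=np.float32)
    # Process the pairs in chunks so the gathered rows stay small.
    for start in range(0, len(citing_ids), chunk_size):
        stop = start + chunk_size
        a = unit[citing_ids[start:stop]]
        b = unit[cited_ids[start:stop]]
        # Row-wise dot product: one similarity per pair.
        sims[start:stop] = np.einsum("ij,ij->i", a, b)
    return sims


# Toy usage: paper 0 cites papers 1 and 4, paper 3 cites paper 2.
rng = np.random.default_rng(0)
emb = rng.standard_normal((5, 8)).astype(np.float32)
print(cited_similarities(emb, [0, 0, 3], [1, 4, 2]))
```

The cost of this approach scales with the number of citation pairs rather than with the square of the number of papers, so it avoids computing and discarding the all-pairs similarities. It only applies when the allowed subset for each query is known in advance; a general filtering criterion would still need a filtered search.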
