core: improve performance of InMemoryVectorStore #27538

Merged

Conversation

@VMinB12 (Contributor) commented Oct 22, 2024

Description: We improve the performance of the InMemoryVectorStore.
Issue: Originally, similarity was computed document by document:

for doc in self.store.values():
    vector = doc["vector"]
    similarity = float(cosine_similarity([embedding], [vector]).item(0))

This is inefficient and does not make use of numpy vectorization.
This PR computes the similarity in one vectorized go:

docs = list(self.store.values())
similarity = cosine_similarity([embedding], [doc["vector"] for doc in docs])

Dependencies: None
Twitter handle: @b12_consulting, @Vincent_Min
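
For intuition (and not part of the PR itself), here is a rough standalone sketch of the kind of vectorized computation the new code path relies on; the helper name cosine_similarity_matrix and the shapes below are illustrative only:

import numpy as np


def cosine_similarity_matrix(queries, corpus):
    """Pairwise cosine similarity between rows of `queries` and rows of `corpus`."""
    q = np.asarray(queries, dtype=float)
    c = np.asarray(corpus, dtype=float)
    # Normalize rows once, then a single matrix product yields every similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return q @ c.T


embedding = np.random.rand(384)         # one query embedding
vectors = np.random.rand(10_000, 384)   # embeddings of the whole store
scores = cosine_similarity_matrix([embedding], vectors)[0]  # shape (10000,)
top_k = scores.argsort()[::-1][:4]      # indices of the 4 closest documents

A single matrix product like this stays inside numpy's compiled code, whereas the per-document loop crosses the Python/numpy boundary once for every document.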

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. Ɑ: vector store Related to vector store module labels Oct 22, 2024
@eyurtsev (Collaborator) commented:

@VMinB12 this makes the code more complex and the in memory implementations are mainly meant as simple reference implementations.

Could you please provide some benchmarks so it's possible to get a sense of what kind of improvement this makes? Does it make a substantial difference with 1,000 or 10,000 docs?

metadata=doc_dict["metadata"],
)
]
if filter is None or filter(doc)
Collaborator

Should we filter prior to applying any computation?

Contributor Author

Whether one should depends on whether cosine_similarity or the filter takes longer. Since cosine_similarity can be vectorized, I assumed that generally (although not always) cosine_similarity would be quicker and that it is preferable to filter on the prefetched subset. Note that filter can be any callable, so we have no control over how fast filter is.
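
A rough sketch of the two orderings under discussion, reusing the store layout from the PR (dicts with "vector", "text" and "metadata" keys); the helper names are illustrative, not part of the PR:

from langchain_core.documents import Document
from langchain_core.vectorstores.utils import _cosine_similarity as cosine_similarity


def filter_then_score(store_values, embedding, filter, k=4):
    # Apply the (arbitrary, possibly slow) Python filter first, then make one
    # vectorized similarity call over the survivors only.
    docs = [
        doc
        for doc in store_values
        if filter is None
        or filter(Document(page_content=doc["text"], metadata=doc["metadata"]))
    ]
    if not docs:
        return []
    sims = cosine_similarity([embedding], [doc["vector"] for doc in docs])[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]


def score_then_filter(store_values, embedding, filter, k=4):
    # Make one vectorized similarity call over everything, then run the filter
    # only while walking the ranked results; cheaper when the filter dominates.
    docs = list(store_values)
    if not docs:
        return []
    sims = cosine_similarity([embedding], [doc["vector"] for doc in docs])[0]
    hits = []
    for i in sims.argsort()[::-1]:
        doc = docs[i]
        if filter is None or filter(Document(page_content=doc["text"], metadata=doc["metadata"])):
            hits.append(doc)
            if len(hits) == k:
                break
    return hits

Which ordering wins depends on the corpus size, how selective the filter is, and how expensive the filter callable itself is.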

@VMinB12 (Contributor Author) commented Oct 22, 2024

Thanks for the feedback; I see your point. I will try to find some time to do benchmarks. We can postpone the merge until then. Given the focus on simplicity, I will likely remove the prefilter_k_multiplier argument and just filter up front.

@VMinB12 (Contributor Author) commented Oct 22, 2024

@eyurtsev Ok, I found time now.

I simplified the PR; I think it is as simple to read and understand as the previous implementation.

Results from benchmark:

InMemoryVectorStore
Corpus size:         10, Search speed: 9.1238 +/- 0.3630 milliseconds
Corpus size:        100, Search speed: 12.1048 +/- 0.2929 milliseconds
Corpus size:       1000, Search speed: 44.2421 +/- 11.1799 milliseconds
Corpus size:      10000, Search speed: 401.4741 +/- 67.1619 milliseconds
FasterInMemoryVectorStore
Corpus size:         10, Search speed: 8.8437 +/- 0.1215 milliseconds
Corpus size:        100, Search speed: 9.9804 +/- 0.3290 milliseconds
Corpus size:       1000, Search speed: 21.1376 +/- 1.5299 milliseconds
Corpus size:      10000, Search speed: 123.6145 +/- 4.1343 milliseconds

Benchmark script, run with langchain-core==0.3.12:

from typing import Any, Callable, Optional
from time import time
import statistics
from tqdm import tqdm
import random
import string

from langchain_core.documents import Document
from langchain_core.vectorstores.utils import _cosine_similarity as cosine_similarity
from langchain_core.vectorstores.in_memory import InMemoryVectorStore

from langchain_huggingface.embeddings import HuggingFaceEmbeddings


class FasterInMemoryVectorStore(InMemoryVectorStore):
    def _similarity_search_with_score_by_vector(
        self,
        embedding: list[float],
        k: int = 4,
        filter: Optional[Callable[[Document], bool]] = None,
        **kwargs: Any,
    ) -> list[tuple[Document, float, list[float]]]:
        # get all docs with fixed order in list
        docs = list(self.store.values())

        if filter is not None:
            docs = [
                doc
                for doc in docs
                if filter(Document(page_content=doc["text"], metadata=doc["metadata"]))
            ]

        if not docs:
            return []

        similarity = cosine_similarity([embedding], [doc["vector"] for doc in docs])[0]

        # get the indices ordered by similarity score
        top_k_idx = similarity.argsort()[::-1][:k]

        return [
            (
                Document(
                    id=doc_dict["id"],
                    page_content=doc_dict["text"],
                    metadata=doc_dict["metadata"],
                ),
                float(similarity[idx].item()),
                doc_dict["vector"],
            )
            for idx in top_k_idx
            for doc_dict in [docs[idx]]
        ]


model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name)


def generate_random_string(length: int = 1000):
    return "".join(random.choices(string.ascii_letters, k=length))


def benchmark(
    vector_store_class: type[InMemoryVectorStore], corpus_size: int, n_iterations: int = 100
):
    texts = [generate_random_string() for _ in range(corpus_size)]
    query = generate_random_string()

    vectorstore = vector_store_class.from_texts(
        texts=texts,
        embedding=embeddings,
    )

    search_times = []
    for _ in tqdm(range(n_iterations), desc="Search"):
        start = time()
        vectorstore.similarity_search(query, k=5)
        end = time()
        search_times.append(end - start)

    print(
        f"Corpus size: {corpus_size:10}, Search speed: {1000 * statistics.mean(search_times):.4f} +/- {1000 * statistics.stdev(search_times):.4f} milliseconds"
    )


print("InMemoryVectorStore")
benchmark(InMemoryVectorStore, 10)
benchmark(InMemoryVectorStore, 100)
benchmark(InMemoryVectorStore, 1000)
benchmark(InMemoryVectorStore, 10000)

print("FasterInMemoryVectorStore")
benchmark(FasterInMemoryVectorStore, 10)
benchmark(FasterInMemoryVectorStore, 100)
benchmark(FasterInMemoryVectorStore, 1000)
benchmark(FasterInMemoryVectorStore, 10000)

            )
            for idx in top_k_idx
            # Assign using walrus operator to avoid multiple lookups
            if (doc_dict := docs[idx])
Collaborator

@VMinB12 hope you're OK with it: I swapped this to use a walrus operator, which I think is less surprising than a double comprehension.
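
For context, the double comprehension from the benchmark sketch above then becomes, roughly:

        return [
            (
                Document(
                    id=doc_dict["id"],
                    page_content=doc_dict["text"],
                    metadata=doc_dict["metadata"],
                ),
                float(similarity[idx].item()),
                doc_dict["vector"],
            )
            for idx in top_k_idx
            # Assign using walrus operator to avoid multiple lookups
            if (doc_dict := docs[idx])
        ]

The walrus assignment binds doc_dict once per index, so the comprehension needs neither a second for clause nor repeated docs[idx] lookups.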

Contributor Author

Perfect 👍

@eyurtsev eyurtsev merged commit 7bc4e32 into langchain-ai:master Oct 25, 2024
78 checks passed