core: improve performance of InMemoryVectorStore #27538

Merged

Conversation

@VMinB12 (Contributor) commented Oct 22, 2024

Description: We improve the performance of the InMemoryVectorStore.
Issue: Originally, similarity was computed document by document:

for doc in self.store.values():
    vector = doc["vector"]
    similarity = float(cosine_similarity([embedding], [vector]).item(0))

This is inefficient and does not make use of numpy vectorization.
This PR computes the similarity in one vectorized go:

docs = list(self.store.values())
similarity = cosine_similarity([embedding], [doc["vector"] for doc in docs])

Dependencies: None
Twitter handle: @b12_consulting, @Vincent_Min
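
For intuition (and not part of the PR itself), here is a rough standalone sketch of the kind of vectorized computation the new code path relies on; the helper name cosine_similarity_matrix and the shapes below are illustrative only:

import numpy as np


def cosine_similarity_matrix(queries, corpus):
    """Pairwise cosine similarity between rows of `queries` and rows of `corpus`."""
    q = np.asarray(queries, dtype=float)
    c = np.asarray(corpus, dtype=float)
    # Normalize rows once, then a single matrix product yields every similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return q @ c.T


embedding = np.random.rand(384)         # one query embedding
vectors = np.random.rand(10_000, 384)   # embeddings of the whole store
scores = cosine_similarity_matrix([embedding], vectors)[0]  # shape (10000,)
top_k = scores.argsort()[::-1][:4]      # indices of the 4 closest documents

A single matrix product like this stays inside numpy's compiled code, whereas the per-document loop crosses the Python/numpy boundary once for every document.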

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. Ɑ: vector store Related to vector store module labels Oct 22, 2024
@eyurtsev (Collaborator) commented:

@VMinB12 this makes the code more complex and the in memory implementations are mainly meant as simple reference implementations.

Could you please provide some benchmarks so it's possible to get a sense of what kind of improvement this makes? Does it make a substantial difference with 1,000 or 10,000 docs?

metadata=doc_dict["metadata"],
)
]
if filter is None or filter(doc)
Collaborator

Should we filter prior to applying any computation?

Contributor Author

Whether one should depends on whether cosine_similarity or the filter takes longer. Since cosine_similarity can be vectorized, I assumed that generally (although not always) cosine_similarity would be quicker and that it is preferable to filter on the prefetched subset. Note that filter can be any callable, so we have no control over how fast filter is.
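
A rough sketch of the two orderings under discussion, reusing the store layout from the PR (dicts with "vector", "text" and "metadata" keys); the helper names are illustrative, not part of the PR:

from langchain_core.documents import Document
from langchain_core.vectorstores.utils import _cosine_similarity as cosine_similarity


def filter_then_score(store_values, embedding, filter, k=4):
    # Apply the (arbitrary, possibly slow) Python filter first, then make one
    # vectorized similarity call over the survivors only.
    docs = [
        doc
        for doc in store_values
        if filter is None
        or filter(Document(page_content=doc["text"], metadata=doc["metadata"]))
    ]
    if not docs:
        return []
    sims = cosine_similarity([embedding], [doc["vector"] for doc in docs])[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]


def score_then_filter(store_values, embedding, filter, k=4):
    # Make one vectorized similarity call over everything, then run the filter
    # only while walking the ranked results; cheaper when the filter dominates.
    docs = list(store_values)
    if not docs:
        return []
    sims = cosine_similarity([embedding], [doc["vector"] for doc in docs])[0]
    hits = []
    for i in sims.argsort()[::-1]:
        doc = docs[i]
        if filter is None or filter(Document(page_content=doc["text"], metadata=doc["metadata"])):
            hits.append(doc)
            if len(hits) == k:
                break
    return hits

Which ordering wins depends on the corpus size, how selective the filter is, and how expensive the filter callable itself is.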

@VMinB12 (Contributor Author) commented Oct 22, 2024

Thanks for the feedback; I see your point. I will try to find some time to do benchmarks. We can postpone the merge until then. Given the focus on simplicity, I will likely remove the prefilter_k_multiplier argument and just filter up front.

@VMinB12 (Contributor Author) commented Oct 22, 2024

@eyurtsev Ok, I found time now.

I simplified the PR; I think it is as simple to read and understand as the previous implementation.

Results from benchmark:

InMemoryVectorStore
Corpus size:         10, Search speed: 9.1238 +/- 0.3630 milliseconds
Corpus size:        100, Search speed: 12.1048 +/- 0.2929 milliseconds
Corpus size:       1000, Search speed: 44.2421 +/- 11.1799 milliseconds
Corpus size:      10000, Search speed: 401.4741 +/- 67.1619 milliseconds
FasterInMemoryVectorStore
Corpus size:         10, Search speed: 8.8437 +/- 0.1215 milliseconds
Corpus size:        100, Search speed: 9.9804 +/- 0.3290 milliseconds
Corpus size:       1000, Search speed: 21.1376 +/- 1.5299 milliseconds
Corpus size:      10000, Search speed: 123.6145 +/- 4.1343 milliseconds

Benchmark script, run with langchain-core==0.3.12:

from typing import Any, Callable, Optional
from time import time
import statistics
from tqdm import tqdm
import random
import string

from langchain_core.documents import Document
from langchain_core.vectorstores.utils import _cosine_similarity as cosine_similarity
from langchain_core.vectorstores.in_memory import InMemoryVectorStore

from langchain_huggingface.embeddings import HuggingFaceEmbeddings


class FasterInMemoryVectorStore(InMemoryVectorStore):
    def _similarity_search_with_score_by_vector(
        self,
        embedding: list[float],
        k: int = 4,
        filter: Optional[Callable[[Document], bool]] = None,
        **kwargs: Any,
    ) -> list[tuple[Document, float, list[float]]]:
        # get all docs with fixed order in list
        docs = list(self.store.values())

        if filter is not None:
            docs = [
                doc
                for doc in docs
                if filter(Document(page_content=doc["text"], metadata=doc["metadata"]))
            ]

        if not docs:
            return []

        similarity = cosine_similarity([embedding], [doc["vector"] for doc in docs])[0]

        # get the indices ordered by similarity score
        top_k_idx = similarity.argsort()[::-1][:k]

        return [
            (
                Document(
                    id=doc_dict["id"],
                    page_content=doc_dict["text"],
                    metadata=doc_dict["metadata"],
                ),
                float(similarity[idx].item()),
                doc_dict["vector"],
            )
            for idx in top_k_idx
            for doc_dict in [docs[idx]]
        ]


model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name)


def generate_random_string(length: int = 1000):
    return "".join(random.choices(string.ascii_letters, k=length))


def benchmark(
    vector_store_class: type[InMemoryVectorStore], corpus_size: int, n_iterations: int = 100
):
    texts = [generate_random_string() for _ in range(corpus_size)]
    query = generate_random_string()

    vectorstore = vector_store_class.from_texts(
        texts=texts,
        embedding=embeddings,
    )

    search_times = []
    for _ in tqdm(range(n_iterations), desc="Search"):
        start = time()
        vectorstore.similarity_search(query, k=5)
        end = time()
        search_times.append(end - start)

    print(
        f"Corpus size: {corpus_size:10}, Search speed: {1000 * statistics.mean(search_times):.4f} +/- {1000 * statistics.stdev(search_times):.4f} milliseconds"
    )


print("InMemoryVectorStore")
benchmark(InMemoryVectorStore, 10)
benchmark(InMemoryVectorStore, 100)
benchmark(InMemoryVectorStore, 1000)
benchmark(InMemoryVectorStore, 10000)

print("FasterInMemoryVectorStore")
benchmark(FasterInMemoryVectorStore, 10)
benchmark(FasterInMemoryVectorStore, 100)
benchmark(FasterInMemoryVectorStore, 1000)
benchmark(FasterInMemoryVectorStore, 10000)

            )
            for idx in top_k_idx
            # Assign using walrus operator to avoid multiple lookups
            if (doc_dict := docs[idx])
Collaborator

@VMinB12 hope you're OK with it: I swapped this to use a walrus operator, which I think is less surprising than a double comprehension.
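
For context, the double comprehension from the benchmark sketch above then becomes, roughly:

        return [
            (
                Document(
                    id=doc_dict["id"],
                    page_content=doc_dict["text"],
                    metadata=doc_dict["metadata"],
                ),
                float(similarity[idx].item()),
                doc_dict["vector"],
            )
            for idx in top_k_idx
            # Assign using walrus operator to avoid multiple lookups
            if (doc_dict := docs[idx])
        ]

The walrus assignment binds doc_dict once per index, so the comprehension needs neither a second for clause nor repeated docs[idx] lookups.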

Contributor Author

Perfect 👍

@eyurtsev eyurtsev merged commit 7bc4e32 into langchain-ai:master Oct 25, 2024
78 checks passed