
🐛 Bug Report: Inconsistency in recorded data across different vector databases #1870

Open
LakshmiN5 opened this issue Aug 19, 2024 · 3 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@LakshmiN5

Which component is this bug for?

All Packages

📜 Description

Tried Traceloop version 0.26.4 with different vector databases while running a RAG application (watsonx + langchain) and observed inconsistent behaviour. I expected the span information to be uniform across all vector DBs. I tested Milvus, Pinecone and Chroma: Milvus and Chroma were both tested using the in-memory option with langchain, and for Pinecone I used the managed instance.
Observations:

  • Chroma - captures less information in the vector-db-related spans: it records the embedding count and a similarity value, but does not include all 4 retrieved chunks. It returns just one chunk, and specifying parameters in the as_retriever() method does not seem to have any effect on the span information collected. Reference for as_retriever: https://api.python.langchain.com/en/v0.1/vectorstores/langchain_astradb.vectorstores.AstraDBVectorStore.html#langchain_astradb.[…]Store.as_retriever
  • Pinecone - does not capture the embedding count or a similarity value, but I could see the top 4 retrieved documents as part of another span.
  • Milvus - does not capture the embedding count or a similarity value, and there also seems to be some problem with the retrieved context.
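The observations above can be summarised as a per-database gap analysis. This is a minimal stdlib sketch; the attribute names ("embedding_count", "similarity_score", "retrieved_chunks") are illustrative placeholders, not Traceloop's actual span attribute keys.

```python
# Hypothetical attribute names used only to illustrate the inconsistency;
# they do not correspond to Traceloop's real span attribute conventions.
EXPECTED = {"embedding_count", "similarity_score", "retrieved_chunks"}

OBSERVED = {
    "chroma":   {"embedding_count", "similarity_score"},  # only 1 of 4 chunks recorded
    "pinecone": {"retrieved_chunks"},                     # chunks appear in a separate span
    "milvus":   set(),                                    # retrieved context also problematic
}

def missing_attributes(observed):
    """Return, per vector DB, the expected span attributes that were not captured."""
    return {db: EXPECTED - attrs for db, attrs in observed.items()}

gaps = missing_attributes(OBSERVED)
```

For example, `gaps["milvus"]` contains all three expected attributes, while `gaps["chroma"]` contains only `"retrieved_chunks"` — which is the inconsistency this issue describes.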

👟 Reproduction steps

The issue can be reproduced by running the RAG sample from langchain with different vector databases. The LLM used is from watsonx via the langchain framework.

https://python.langchain.com/v0.1/docs/use_cases/question_answering/quickstart/

👍 Expected behavior

Ideally, the following information should be captured consistently across all vector DBs:

  1. Embedding details, such as the count (for the stored knowledge base) and any additional information.
  2. Query embeddings and other details.
  3. Retrieved context information: the number of chunks matched; all matched chunks should be returned as per the configuration parameters set for the retriever (see item 4).
  4. Retrieval parameters configured should influence the actual results generated, e.g. the similarity algorithm used to search the query against the stored docs, the number of documents to retrieve, the similarity threshold, etc.
  5. Any insights on the chunk(s) used for the final answer generation.
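The expectations above amount to a consistency check one could run against exported spans. Below is a minimal stdlib sketch of such a check; the attribute keys (including "embedding.model", reflecting the update in the first comment) are assumptions for illustration, not the actual OpenLLMetry/Traceloop semantic conventions.

```python
# Hypothetical required span attributes, mirroring expected-behaviour items 1-5.
REQUIRED_KEYS = [
    "embedding.model",    # embedding model information (see author's update below)
    "embedding.count",    # item 1: count for the stored knowledge base
    "query.embedding",    # item 2: query embeddings
    "retrieval.chunks",   # item 3: all matched chunks, not just one
    "retrieval.top_k",    # item 4: configured retrieval parameters
]

def check_span(attrs):
    """Return a list of problems found in a span's attribute dict."""
    problems = [k for k in REQUIRED_KEYS if k not in attrs]
    chunks = attrs.get("retrieval.chunks")
    top_k = attrs.get("retrieval.top_k")
    # Item 3/4: the number of recorded chunks should match the configured top_k.
    if chunks is not None and top_k is not None and len(chunks) != top_k:
        problems.append(f"expected {top_k} chunks, span recorded {len(chunks)}")
    return problems
```

For instance, a Chroma-like span that records only one of four configured chunks would be flagged with "expected 4 chunks, span recorded 1", while a Milvus-like empty span would be flagged as missing every required key.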

👎 Actual Behavior with Screenshots

Most of this information is missing, and the behaviour is definitely not consistent across vector databases.

🤖 Python Version

3.10

📃 Provide any additional context for the Bug.

No response

👀 Have you spent some time to check if this bug has been raised before?

  • I checked and didn't find similar issue

Are you willing to submit PR?

None

@nirga nirga added good first issue Good for newcomers help wanted Extra attention is needed labels Aug 19, 2024
@LakshmiN5
Author

An update to Expected Behaviour point 1: we should be able to capture the embedding model information as well. Thank you.

@cu8code

cu8code commented Sep 22, 2024

@nirga is this issue still open? I would like to work on it.

@nirga
Member

nirga commented Sep 22, 2024

Yes @cu8code!
