ContextualCompressionRetriever._get_relevant_documents() returns a list of _DocumentWithState instead of a list of Document #28511

Open
matteo-rusconi opened this issue Dec 4, 2024 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

matteo-rusconi commented Dec 4, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_openai import AzureOpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import create_kv_docstore  # needed to wrap the byte store below
from langchain.storage.file_system import LocalFileStore
from langchain_community.document_transformers.embeddings_redundant_filter import EmbeddingsRedundantFilter
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain.retrievers.document_compressors.base import DocumentCompressorPipeline
from uuid import uuid4


embedder = AzureOpenAIEmbeddings(model='text-embedding-3-large')

vectorstore = Chroma(collection_name="docs", 
                     embedding_function=embedder, 
                     persist_directory="data/vector_db/")

retriever = MultiVectorRetriever(vectorstore=vectorstore,
                                 docstore=create_kv_docstore(LocalFileStore("data/retriever_data/")),
                                 id_key='doc_id')

compression_retriever = ContextualCompressionRetriever(
    base_compressor=DocumentCompressorPipeline(transformers=[
        EmbeddingsRedundantFilter(embeddings=embedder, 
                                  similarity_threshold=0.999)
        ]
    ), 
    base_retriever=retriever)

documents = '''list of documents to embed and store in the vectorstore'''

doc_ids = [str(uuid4()) for _ in documents]

docs = [
    Document(page_content=s, metadata={'doc_id': doc_ids[i]})
    for i, s in enumerate(documents)
]

retriever.vectorstore.add_documents(docs)


retrieved_docs = compression_retriever.invoke('''query''')

Error Message and Stack Trace (if applicable)

No response

Description

According to LangChain's documentation, retrieved_docs should be a list of Document objects.

Instead, it is a list of _DocumentWithState objects, which are similar to Document but also carry the embedded representation of each document.

In my case this is a problem because the embedding vectors are large, and passing them to an LLM in the generation phase of a RAG application is not ideal.

The problem originates in the EmbeddingsRedundantFilter.transform_documents() method, which returns:

return [stateful_documents[i] for i in sorted(included_idxs)]

These _DocumentWithState objects are then forwarded unchanged to the retriever output.
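
A possible workaround until this is fixed is to rebuild plain Document objects from the retriever output before passing them to the LLM. The sketch below is untested and only assumes the standard Document class from langchain_core; it simply drops the extra state carried by _DocumentWithState.

from langchain_core.documents import Document

def strip_state(docs):
    # Rebuild plain Documents so the embedded state (e.g. vectors) is not
    # forwarded to the generation step of the RAG chain.
    return [Document(page_content=d.page_content, metadata=d.metadata) for d in docs]

retrieved_docs = strip_state(compression_retriever.invoke('''query'''))

On the library side, a cleaner fix might be for the filter to convert the stateful documents back to plain Documents before returning them (for example via the to_document() helper that _DocumentWithState appears to define), but I have not verified that change.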

System Info

System Information

OS: Linux
OS Version: #1 SMP Fri Mar 29 23:14:13 UTC 2024
Python Version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]

Package Information

langchain_core: 0.3.10
langchain: 0.3.3
langchain_community: 0.3.2
langsmith: 0.1.129
langchain_chroma: 0.1.4
langchain_huggingface: 0.1.0
langchain_openai: 0.2.2
langchain_text_splitters: 0.3.0

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.5
async-timeout: 4.0.3
chromadb: 0.5.13
dataclasses-json: 0.6.7
fastapi: 0.115.2
httpx: 0.27.0
huggingface-hub: 0.25.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.51.2
orjson: 3.10.7
packaging: 24.1
pydantic: 2.8.2
pydantic-settings: 2.5.2
PyYAML: 6.0.1
requests: 2.32.3
sentence-transformers: 3.2.0
SQLAlchemy: 2.0.34
tenacity: 8.2.3
tiktoken: 0.8.0
tokenizers: 0.20.1
transformers: 4.45.2
typing-extensions: 4.12.2

@dosubot added the 🤖:bug label (Related to a bug, vulnerability, unexpected error with an existing feature) on Dec 4, 2024
@gauravmindzk
Hi @matteo-rusconi, were you able to resolve the issue?
