[Bug]: KeyError in retriever.retrieve when sent index does not include all nodes in document store #7684
Comments
🤖 Hi Guy,

Thank you for bringing this issue to our attention. It seems you've encountered a KeyError because the retriever is trying to process nodes that are not covered by the specific index you've provided. This is a known issue, and a similar problem was addressed in issue #1769. The solution suggested there was to pass the list of node ids from the vector index to the retriever, so it knows which subset of nodes to process. Here's the code snippet from that solution:

```python
def as_retriever(self, **kwargs: Any) -> BaseRetriever:
    # NOTE: lazy import
    from llama_index.indices.vector_store.retrievers import VectorIndexRetriever

    return VectorIndexRetriever(
        self, doc_ids=list(self.index_struct.nodes_dict.values()), **kwargs
    )
```

This ensures that the retriever only processes nodes included in the provided index, which should prevent the KeyError you're experiencing. However, please note that this is more of a workaround than a permanent solution: the underlying issue is that the retrieve function does not verify that only nodes covered by the provided index are processed. We will consider addressing this in a future update.

As for your question about changes made in the LlamaIndex codebase after version 0.8.5.post2 that could address this issue, I wasn't able to find any. However, I recommend updating to the latest version (v0.8.27) to benefit from the most recent bug fixes and improvements.

I hope this helps! Let me know if you have any other questions.

Best,
The bot is right! If you aren't using …
@logan-markewich I am facing this issue and I added the code snippet provided by the bot in retriever.py, but it's still giving me this error. This is my code:

```python
from llama_index.llms import OpenAI

documents = SimpleDirectoryReader(
embed_model = TextEmbeddingsInference(
llm = OpenAI(temperature=0, model="gpt-3.5-turbo", max_tokens=1024)
```
Bug Description
Hi,
I have created multiple indexes, each covering a different set of nodes. I am then using one of the indexes to retrieve nodes based on a query. However, it seems that the retriever tries to process nodes that are outside the scope of the specific index I provided, causing a KeyError exception in this line of llama_index/indices/vector_store/retrievers/retriever.py:

```python
self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
```

I noticed that it is possible to send a list of node_ids to retrieve in order to constrain it. I haven't tried it yet, but I guess it will work. However, I believe it would be better if the retrieve function itself verified that only nodes covered by the provided index are processed.
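The check suggested here can be sketched as a plain function rather than a patch to retriever.py. `map_query_ids` is a hypothetical helper name, not part of LlamaIndex; it stands in for the guarded version of the lookup in the line quoted above.

```python
def map_query_ids(nodes_dict, query_ids):
    """Map vector-store result ids through nodes_dict, silently skipping
    ids that the current index does not cover instead of raising KeyError.
    """
    return [nodes_dict[idx] for idx in query_ids if idx in nodes_dict]


nodes_dict = {"v1": "n1", "v2": "n2"}            # this index covers v1, v2
print(map_query_ids(nodes_dict, ["v1", "v3"]))   # ['n1']; 'v3' is ignored
```

Whether skipping is the right behavior (versus logging or raising a clearer error) is a design choice for the maintainers; the sketch only shows that the membership check removes the crash.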
Thanks!
Guy
Version
0.8.5.post2
Steps to Reproduce
If the query returns nodes that are excluded from the provided index, a KeyError exception occurs.
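The failure mode can be reproduced in miniature without LlamaIndex at all; the ids below are made up, but the shape of the bug is the same: a dict covering a subset of ids, indexed with ids from outside that subset.

```python
# Minimal stand-alone model of the bug: the index's nodes_dict covers a
# subset of the document store, but the vector query can return ids from
# outside that subset.
nodes_dict = {"id-1": "node-1"}        # this index only knows id-1
query_result_ids = ["id-1", "id-2"]    # id-2 belongs to a different index

try:
    [nodes_dict[idx] for idx in query_result_ids]
except KeyError as exc:
    print("KeyError:", exc)            # prints: KeyError: 'id-2'
```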
Relevant Logs/Tracebacks
No response