Description
Do you need to file an issue?
- I have searched the existing issues and this bug is not already filed.
- My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
- The source ids cited in the `response` attribute and the `id` series of the `sources` dataframe of the DRIFT `SearchResult` object appear to be consistently off by -1. For instance, if the `human_readable_id` value of a given text unit in `text_units` is '2', the corresponding `id` value in the `sources` dataframe is '1'.
- The `response` attribute of the DRIFT `SearchResult` object appears to occasionally hallucinate source ids. For instance, `response` may reference "[Data: Sources (1)]" where there is no corresponding '1' in the `id` series of any of the `sources` dataframes of the DRIFT `SearchResult` object.
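A rough way to spot the hallucinated ids is to pull every id cited in a "[Data: Sources (...)]" group out of the response text and compare it against the ids that were actually returned as context. The sketch below is only an illustration: the helper names `cited_source_ids` and `unmatched_source_ids` are hypothetical (not part of GraphRAG), and it assumes the response is available as a plain string and the `id` values of the `sources` dataframes have been collected into a list.

```python
import re

# Matches one citation group, e.g. "[Data: Sources (1, 3); Entities (7)]".
CITATION_RE = re.compile(r"\[Data:[^\]]*\]")
# Matches the "Sources (...)" part inside a citation group.
SOURCES_RE = re.compile(r"Sources\s*\(([^)]*)\)")


def cited_source_ids(response: str) -> set[str]:
    """Return every source id cited in the response text, as strings."""
    ids: set[str] = set()
    for group in CITATION_RE.findall(response):
        for match in SOURCES_RE.finditer(group):
            # Keep only numeric tokens; this skips markers such as "+more".
            ids.update(re.findall(r"\d+", match.group(1)))
    return ids


def unmatched_source_ids(response: str, source_ids) -> set[str]:
    """Return cited ids that do not appear among the context source ids."""
    available = {str(source_id) for source_id in source_ids}
    return cited_source_ids(response) - available


if __name__ == "__main__":
    # Hypothetical response text and context ids, for illustration only.
    example = "Some claim [Data: Sources (1, 3)]. Another claim [Data: Sources (4)]."
    print(unmatched_source_ids(example, [3, 4]))
```

Any id returned by `unmatched_source_ids` is cited in the response without a matching row in the context data.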
Steps to reproduce
- Execute a DRIFT search query
- For a given text unit, inspect the corresponding `id` value of the `sources` dataframes of the `SearchResult.context_data` dict
- For the same text unit, inspect the corresponding `human_readable_id` of `text_units.parquet`
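The comparison in the last two steps can be sketched as follows. This is a minimal illustration, not GraphRAG code: `id_offsets` is a hypothetical helper, and it assumes both tables expose a `text` column with identical chunk text so that rows can be matched on it. The rows are plain dicts here; with pandas they could be produced with `DataFrame.to_dict("records")`.

```python
def id_offsets(sources_rows, text_unit_rows):
    """Map each matched chunk text to (sources id - human_readable_id).

    Assumption: rows are dicts; sources rows have "id" and "text",
    text unit rows have "human_readable_id" and "text".
    """
    readable_id_by_text = {
        row["text"]: int(row["human_readable_id"]) for row in text_unit_rows
    }
    offsets = {}
    for row in sources_rows:
        text = row["text"]
        if text in readable_id_by_text:
            offsets[text] = int(row["id"]) - readable_id_by_text[text]
    return offsets


if __name__ == "__main__":
    # Hypothetical rows that mirror the behaviour described in this report.
    text_units = [
        {"human_readable_id": 1, "text": "first chunk"},
        {"human_readable_id": 2, "text": "second chunk"},
    ]
    sources = [
        {"id": "0", "text": "first chunk"},
        {"id": "1", "text": "second chunk"},
    ]
    print(id_offsets(sources, text_units))
```

If every offset comes out as -1, the ids are shifted consistently rather than mismatched at random.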
Expected Behavior
- For a given text unit, the `id` series of the resulting `sources` dataframes of the `SearchResult.context_data` dict should be the same as the `human_readable_id` of `text_units.parquet`.
- Only source ids used for context should be referenced in the generated response.
GraphRAG Config Used
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/
### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.
models:
  default_chat_model:
    type: openai_chat # or azure_openai_chat
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: gpt-4o-mini
    # deployment_name: <azure_model_deployment_name>
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0 # set to 0 to disable rate limiting
    requests_per_minute: 0 # set to 0 to disable rate limiting
  default_embedding_model:
    type: openai_embedding # or azure_openai_embedding
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY}
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: text-embedding-3-small
    # deployment_name: <azure_model_deployment_name>
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0 # set to 0 to disable rate limiting
    requests_per_minute: 0 # set to 0 to disable rate limiting
vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True
embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store
### Input settings ###
input:
  type: file # or blob
  file_type: text # [csv, text, json]
  base_dir: "input"
chunks:
  size: 600
  overlap: 50
  group_by_columns: [id]
### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided
cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"
reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"
output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"
### Workflow settings ###
extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [school of thought, hypothesis, concept, contention,
    knowledge gap, policy, country, gender, capability, livelihood, social harm,
    ecological harm, technology, norm]
  max_gleanings: 1
summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500
extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]
extract_claims:
  enabled: true
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1
community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000
cluster_graph:
  max_cluster_size: 10
embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)
snapshots:
  graphml: true
  embeddings: false
### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query
local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"
global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"
drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"
basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"
Logs and screenshots
No response
Additional Information
- GraphRAG Version: v2.1.0
- Operating System: Windows 11
- Python Version: 3.12.9
- Related Issues: