ValueError("Columns must be same length as key") #455

Closed
BovineOverlord opened this issue Jul 9, 2024 · 26 comments
Labels
community_support Issue handled by community members

Comments

@BovineOverlord

Describe the bug

```
{"type": "error", "data": "Error executing verb \"cluster_graph\" in create_base_entity_graph: Columns must be same length as key", "source": "Columns must be same length as key", "details": null}

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
  File "C:\Program Files\Python310\lib\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

{"type": "error", "data": "Error running pipeline!", "source": "Columns must be same length as key", "details": null}

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
  File "C:\Program Files\Python310\lib\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
  File "C:\Program Files\Python310\lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
  File "C:\Program Files\Python310\lib\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
```
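For reference, the failing line in cluster_graph.py unpacks a result DataFrame into two columns. A minimal sketch of how pandas raises this exact error when the clustering result is empty (the column names here are illustrative, not necessarily the ones GraphRAG uses):

```python
import pandas as pd

# Minimal reproduction: assigning a zero-column DataFrame to a
# two-key selection, as happens when clustering yields no rows.
output_df = pd.DataFrame({"id": [1, 2]})
try:
    # pd.DataFrame([]) has 0 columns, but the key names 2 columns
    output_df[["level", "clustered_graph"]] = pd.DataFrame([])
except ValueError as e:
    print(e)  # Columns must be same length as key
```

This is why an empty entity graph upstream surfaces as a pandas shape error here rather than a clearer message.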

Steps to reproduce

I was running the tool with a local Ollama model. It ran fine and loaded the test file before the error occurred.

Expected Behavior

The tool should have proceeded to the next step, "create_base_text_units", rather than stopping. It appears to be a bug in the graph clustering function.

GraphRAG Config Used

```yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: command-r-plus:104b-q4_0
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 2000
  request_timeout: 180.0
  api_base: http://localhost:11434/v1
  api_version: 2024-02-15-preview
  organization: <organization_id>
  deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 150_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 1
  max_retry_wait: 10.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 1 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  # parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: qwen2:7b-instruct
    # api_base: http://localhost:11434/api
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 1
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 1 # the number of parallel inflight requests that may be made
    # batch_size: 1 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
```
The rest of the settings file is unchanged.

Logs and screenshots

[screenshot: error output]

Additional Information

  • GraphRAG Version: current as of this posting
  • Operating System: Windows 10
  • Python Version: 3.10
  • Related Issues:
@BovineOverlord added the bug (Something isn't working) and triage (Default label assignment; indicates new issue needs review by a maintainer) labels on Jul 9, 2024
@AlonsoGuevara
Contributor

Hi!
My general rule of thumb when facing these issues is:

  • Check the output of entity extraction; this will show whether the graph is empty
  • If the graph is empty, the cause is either faulty (unparseable) LLM responses or failed LLM calls

Can you please check your cache entries for Entity Extraction to check if the LLM is providing faulty responses?
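If it helps anyone doing this check, here is a small sketch for counting cached entity-extraction responses. The `./ragtest` root and the `cache/entity_extraction` subdirectory are assumptions based on a default GraphRAG project layout; adjust the path to your own setup:

```python
from pathlib import Path

def count_cache_entries(root: str) -> int:
    # Count cached LLM responses under <root>/cache/entity_extraction
    # (assumed default layout). Zero entries suggests the LLM calls
    # themselves failed; non-empty but garbled entries suggest the
    # model returned unparseable output.
    cache_dir = Path(root) / "cache" / "entity_extraction"
    if not cache_dir.exists():
        return 0
    return sum(1 for p in cache_dir.iterdir() if p.is_file())

print(count_cache_entries("./ragtest"))
```

If the count is zero, look at the indexing-engine log for connection or auth errors before suspecting the clustering step.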

@BovineOverlord
Author

The entity extraction directory is empty. I tried two other models and got the same result.

@zubu007

zubu007 commented Jul 12, 2024

Facing the same thing. cache/entity_extraction is empty. same exact error in the logs.

@huangyuanzhuo-coder

same error

@flikeok

flikeok commented Jul 12, 2024

same error

@menghongtao

same error

@CyanMystery

same error:

this is my indexing-engine.log:
indexing-engine.log

@Xls1994

Xls1994 commented Jul 16, 2024

same error:
this is my log:
indexing-engine.log

The entity_extraction directory is not empty.


@BochenYIN

same error, and the entity extraction directory is empty.

@chenfujv

same error:
But entity_extraction directory is not empty.

@chenfujv

settings.yaml: [screenshot]

@Bai1026

Bai1026 commented Jul 19, 2024

same error lol
But entity_extraction and summarize_descriptions directories are also not empty.

@yinjianjie

same error
why

@yurochang

same problem.

@ayanjiushishuai

+1

@kiljos

kiljos commented Jul 22, 2024

+1

@natoverse
Collaborator

Consolidating alternate model issues here: #657

@natoverse closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 22, 2024
@natoverse added the community_support label (Issue handled by community members) and removed the bug and triage labels on Jul 22, 2024
@night666e

Facing the same thing. cache/entity_extraction is empty. Exact same error in the logs.

Has this been resolved?

@night666e

The entity extraction directory is empty. I tried two other models and got the same result.

Has this been resolved?

@night666e
[re-post of the full issue description (error, steps to reproduce, config), machine-translated to Chinese]

Has this been resolved, brother?

@night666e
Same error. This is my log: indexing-engine.log

The entity_extraction directory is not empty.

[screenshot]

Have you solved it?

@night666e
Same error, but the entity_extraction directory is not empty. [screenshot]

Has this been resolved?

@teneous

teneous commented Aug 9, 2024

I use OpenAI GPT-4o-mini. After I reduced the chunk size from 1000 to 200 and decreased the overlap to 10, it worked for me!

```yaml
chunks:
  size: 200
  overlap: 10
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
```

@Friman04

Friman04 commented Aug 9, 2024

same

@maverick001

Same issue here. I used gpt-4o-mini with the default text-embedding-3-small and max_tokens set to 1700.
Any official solution yet?

@FULLK

FULLK commented Dec 24, 2024

I also encountered this issue; the root cause is that the extraction results from your model are not good enough. You can either switch to a more capable model, or reduce llm: max_tokens in settings.yaml, or reduce chunks: size and overlap as well.
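On top of improving extraction quality, the failing assignment can also be guarded downstream. This is a hypothetical patch sketch, not the upstream fix; the column names and function are illustrative, but it shows how an empty clustering result can fail with a clear message instead of the opaque pandas error:

```python
import pandas as pd

def assign_clusters(output_df: pd.DataFrame, results: list) -> pd.DataFrame:
    # Build the result frame with explicit column names so its shape
    # is well-defined even when `results` is empty.
    clustered = pd.DataFrame(results, columns=["level", "clustered_graph"])
    if clustered.empty:
        # An empty graph reached clustering: surface the real problem
        # instead of letting the two-key unpack raise
        # "Columns must be same length as key".
        raise ValueError(
            "clustering produced no communities; "
            "check the entity extraction output"
        )
    output_df[["level", "clustered_graph"]] = clustered
    return output_df
```

The explicit `columns=` argument alone already avoids the shape mismatch; the `empty` check turns a silent failure into an actionable error message.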
