Text segmentation and cleaning failed #2549

limuan123123 · 2024-02-23T09:25:10Z

Self Checks

I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
Please do not modify this template :) and fill in all the required fields.

Provide a description of the help you need

Text segmentation and cleaning: the document uploads successfully, but text segmentation and cleaning fails.
Traceback (most recent call last):
  File "/usr/local/dify/api/core/indexing_runner.py", line 67, in run
    self._load_segments(dataset, dataset_document, documents)
  File "/usr/local/dify/api/core/indexing_runner.py", line 782, in _load_segments
    doc_store.add_documents(documents)
  File "/usr/local/dify/api/core/docstore/dataset_docstore.py", line 85, in add_documents
    raise ValueError("doc must be a Document")
ValueError: doc must be a Document
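
The traceback shows the error being raised inside `add_documents` in `dataset_docstore.py`. A minimal sketch of how a check like this can produce that message, assuming it is a plain `isinstance` test on each item; the `Document` class and `add_documents` function below are hypothetical stand-ins, not Dify's actual code:

```python
class Document:
    """Hypothetical stand-in for the class the docstore expects."""

    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata if metadata is not None else {}


def add_documents(docs):
    """Hypothetical docstore method: reject anything that is not a Document."""
    stored = []
    for doc in docs:
        if not isinstance(doc, Document):
            raise ValueError("doc must be a Document")
        stored.append(doc)
    return stored


# An instance of the expected class passes the check.
ok = add_documents([Document(page_content="hello")])

# A dict with the same fields does not, and yields the reported message.
try:
    add_documents([{"page_content": "hello", "metadata": {}}])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Under this assumption, any item in `documents` that is not an instance of the expected class (a dict, or an object built from a different class that happens to share the name `Document`) would trigger the error.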

dosubot · 2024-02-23T09:25:24Z

🤖

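Hello, @limuan123123! Good to see you again. 👋😊

Based on the error message you provided, the problem may lie in the SegmentService.segment_create_args_validate(args['segments'], document) function. This function expects its second argument to be a Document object, but it did not receive one. In the code you provided, the document object is obtained through DocumentService.get_document(dataset.id, document_id). If that function does not return a Document object (for example, because no document exists with the given ID), a NotFound('Document not found.') error is raised. If the function returns a non-Document object, however, you get the error you are seeing.

To resolve this, make sure that DocumentService.get_document(dataset.id, document_id) always returns a Document object. If the function can return a non-Document object, add a type check after the call to confirm that the returned object is a Document, and raise an appropriate error if it is not.

Here is an example of how to add the type check: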

document = DocumentService.get_document(dataset.id, document_id)
if not document:
    raise NotFound('Document not found.')
if not isinstance(document, Document):
    raise TypeError('Expected a Document object, but got a different type.')

Replace Document with the actual class name of your Document object.
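
The same check can be wrapped in a small helper so that it runs on its own. This is only a sketch: `NotFound`, `Document`, `require_document`, and `describe` are hypothetical stand-ins defined here for illustration, not Dify or werkzeug classes.

```python
class NotFound(Exception):
    """Hypothetical stand-in for a 'not found' exception."""


class Document:
    """Hypothetical stand-in for the expected document class."""

    def __init__(self, doc_id):
        self.id = doc_id


def require_document(candidate):
    """Return candidate unchanged if it is a Document, otherwise raise."""
    if candidate is None:
        raise NotFound("Document not found.")
    if not isinstance(candidate, Document):
        raise TypeError(
            f"Expected a Document object, but got {type(candidate).__name__}."
        )
    return candidate


def describe(candidate):
    """Report which branch of the check a value ends up in."""
    try:
        require_document(candidate)
        return "ok"
    except NotFound:
        return "not found"
    except TypeError as exc:
        return str(exc)


results = [describe(Document("d1")), describe(None), describe({"id": "d1"})]
```

The sketch tests `candidate is None` rather than `not candidate`, so that a valid object which happens to be falsy is not reported as missing. Including the actual type name in the `TypeError` message makes the offending value easier to trace.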

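Additionally, regarding the structure of the Document object: it is a class that inherits from the BaseModel class of the pydantic library and stores a piece of text together with its associated metadata. Its structure is as follows:

page_content: a string attribute that stores the actual content of the document.
metadata: an optional dictionary attribute that stores arbitrary metadata about the page content, such as the document's source or its relationships to other documents. If no metadata is provided, it defaults to an empty dictionary.

This is the code of the Document class: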

from typing import Optional

from pydantic import BaseModel, Field


class Document(BaseModel):
    """Class for storing a piece of text and associated metadata."""

    page_content: str

    # Arbitrary metadata about the page content (e.g., source, relationships
    # to other documents, etc.).
    metadata: Optional[dict] = Field(default_factory=dict)
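
To illustrate the "defaults to an empty dictionary" behavior without requiring pydantic, here is a sketch that uses a standard-library dataclass as a stand-in. It is an assumption made for illustration: `dataclasses.field(default_factory=dict)` plays the role of pydantic's `Field(default_factory=dict)`, and this class is not the one used by Dify.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """Stand-in with the same two fields as the class above."""

    page_content: str
    # default_factory builds a fresh dict for every instance.
    metadata: Optional[dict] = field(default_factory=dict)


first = Document(page_content="chunk one")
second = Document(page_content="chunk two", metadata={"source": "a.pdf"})

# Mutating one instance's metadata must not leak into later instances.
first.metadata["source"] = "b.pdf"
third = Document(page_content="chunk three")
```

Because the default comes from a factory rather than a shared literal, `third.metadata` is still empty after `first.metadata` is modified.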

I hope this information helps you resolve the problem. If you have any other questions, feel free to ask.


About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


crazywoola closed this as completed Feb 23, 2024
