Text segmentation and cleaning fails #2549

Closed
3 tasks done
limuan123123 opened this issue Feb 23, 2024 · 1 comment

Comments

@limuan123123

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

Provide a description of the help you need

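In the "Text segmentation and cleaning" step, the document uploads successfully, but text segmentation and cleaning then fails.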
Traceback (most recent call last):
  File "/usr/local/dify/api/core/indexing_runner.py", line 67, in run
    self._load_segments(dataset, dataset_document, documents)
  File "/usr/local/dify/api/core/indexing_runner.py", line 782, in _load_segments
    doc_store.add_documents(documents)
  File "/usr/local/dify/api/core/docstore/dataset_docstore.py", line 85, in add_documents
    raise ValueError("doc must be a Document")
ValueError: doc must be a Document


dosubot bot commented Feb 23, 2024

🤖

Hello, @limuan123123! Nice to see you again. 👋😊

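Based on the error message you provided, the problem may lie in the function SegmentService.segment_create_args_validate(args['segments'], document). This function expects its second argument to be a Document object, but it did not receive one. In the code you provided, the document object is obtained through DocumentService.get_document(dataset.id, document_id). If that function does not return a Document object (for example, because no document exists for the given ID), a NotFound('Document not found.') error is raised. If it returns an object that is not a Document, however, you get the error you are seeing.

To resolve this, make sure that DocumentService.get_document(dataset.id, document_id) always returns a Document object. If the function can return something else, add a type check after the call to confirm that the returned object is a Document, and raise an appropriate error if it is not.

Here is an example of how to add the type check: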

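from werkzeug.exceptions import NotFound  # assumption: the NotFound used here is werkzeug's

# Look up the document; a missing document yields a falsy result.
document = DocumentService.get_document(dataset.id, document_id)
if not document:
    raise NotFound('Document not found.')
# Guard against any other unexpected return type.
if not isinstance(document, Document):
    raise TypeError('Expected a Document object, but got a different type.')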

Replace Document with the actual class name of your Document object.
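
The traceback above shows the failing check inside add_documents, which receives a list rather than a single object, so the same isinstance guard can be applied to the whole batch before it reaches the docstore. The following is only a minimal sketch: the Document dataclass is a stand-in and find_non_documents is a hypothetical helper, neither of which is Dify's actual code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Stand-in for the real Document class (an assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def find_non_documents(docs: list[Any]) -> list[tuple[int, str]]:
    """Return (index, type name) for every item that is not a Document."""
    return [
        (i, type(doc).__name__)
        for i, doc in enumerate(docs)
        if not isinstance(doc, Document)
    ]


# A batch with one valid Document, one plain dict, and one None.
batch = [Document("first segment"), {"page_content": "second segment"}, None]
bad_items = find_non_documents(batch)
print(bad_items)  # [(1, 'dict'), (2, 'NoneType')]
```

Reporting the index and type of each offending item, instead of raising on the first one, may make it easier to see which step produced the wrong objects.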

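As for the structure of the Document object: it is a class that inherits from pydantic's BaseModel and stores a piece of text together with its associated metadata. Its structure is as follows:

  • page_content: a string attribute that stores the actual content of the document.

  • metadata: an optional dictionary attribute that stores arbitrary metadata about the page content, such as the source of the document or its relationships to other documents. If no metadata is provided, it defaults to an empty dictionary.

Here is the code for the Document class: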

from typing import Optional

from pydantic import BaseModel, Field


class Document(BaseModel):
    """Class for storing a piece of text and associated metadata."""

    page_content: str

    """Arbitrary metadata about the page content (e.g., source, relationships to other
        documents, etc.).
    """
    metadata: Optional[dict] = Field(default_factory=dict)
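
To illustrate the default for metadata, here is a small sketch that uses a standard-library dataclass as a stand-in for the pydantic model (an assumption made so the example runs without pydantic installed). In both cases default_factory=dict gives every instance its own empty dictionary.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """Dataclass stand-in for the pydantic Document model (illustrative only)."""
    page_content: str
    metadata: Optional[dict] = field(default_factory=dict)


plain = Document(page_content="hello")
tagged = Document(page_content="hello", metadata={"source": "a.txt"})

# Mutating one instance's metadata does not leak into a later instance.
plain.metadata["seen"] = True
other = Document(page_content="world")

print(plain.metadata)   # {'seen': True}
print(other.metadata)   # {}
print(tagged.metadata)  # {'source': 'a.txt'}
```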

I hope this information helps you resolve the problem. If you have any other questions, feel free to ask.
