[BUG] Online knowledge base crawl fails when the document name exceeds 128 characters #706

Closed
or-less opened this issue Jul 4, 2024 · 1 comment

or-less commented Jul 4, 2024

Contact information

No response

MaxKB version

v1.2.1

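Problem description

When uploading online documents to a knowledge base, the upload fails if the website URL is too long (the front end reports success, but the document never appears). The Docker logs show the cause: during crawling, the website URL is used as the default document name, and that name is subject to a length check.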

{'name': [ErrorDetail(string='【文档名称】请确保此字段的字符数不超过 128 个。', code='max_length')]}

(The message reads: "[Document name] Ensure this field has no more than 128 characters.")

The problem was found when crawling wiki sites with URLs like https://test.com/dir1/dir1_1/.../file. When directory names are in Chinese, the limit is easily exceeded.
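One plausible reason Chinese directory names hit the limit so quickly is percent-encoding: if the crawled link is stored in its encoded form, each Chinese character becomes nine characters (three UTF-8 bytes, each written as `%XX`). The report does not say whether MaxKB stores the encoded or decoded URL, so treat this as an assumption. The path and the helper name `encoded_length` below are made up for illustration.

```python
from urllib.parse import quote

MAX_NAME_LENGTH = 128  # limit reported in the validation error


def encoded_length(url_path: str) -> int:
    """Return the length of a URL path after percent-encoding (slashes kept)."""
    return len(quote(url_path, safe="/"))


# Hypothetical wiki path with three Chinese directory names of 7 characters each.
path = "/" + "/".join(["知识库文档目录"] * 3)

print(len(path))             # length of the readable path
print(encoded_length(path))  # length once percent-encoded
print(encoded_length(path) > MAX_NAME_LENGTH)
```

Under this assumption, a path that looks short on screen can already be well past 128 characters in its encoded form.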

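Steps to reproduce

Crawl a link whose length exceeds the limit and upload it to the knowledge base; the problem occurs.

Expected result

The default document name should either be customizable or be truncated to the allowed length. Otherwise, content from sites with long URLs cannot be crawled.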
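The truncation option could look roughly like the sketch below. This is not the fix that was committed to MaxKB (the referenced commit is not shown in this thread). The helper name `default_document_name` is hypothetical, and keeping the tail of the URL rather than the head is a design assumption, made because the end of a path usually identifies the page.

```python
MAX_NAME_LENGTH = 128  # limit taken from the reported validation error


def default_document_name(url: str, max_length: int = MAX_NAME_LENGTH) -> str:
    """Derive a document name from a URL, at most max_length characters long.

    Hypothetical helper: keeps the end of the URL and marks the cut with "...".
    """
    if max_length < 4:
        raise ValueError("max_length must leave room for the '...' marker")
    name = url.strip()
    if len(name) <= max_length:
        return name
    # Keep the last (max_length - 3) characters so the result fits exactly.
    return "..." + name[-(max_length - 3):]


long_url = "https://test.com/" + "/".join(["directory"] * 20) + "/file"
name = default_document_name(long_url)
print(len(name), name)
```

Applying such a helper before the serializer validates the `name` field would keep long links from being rejected, at the cost of names that no longer show the full URL.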

Relevant log output

File "/opt/maxkb/app/apps/dataset/serializers/document_serializers.py", line 655, in handler
    DocumentSerializers.Create(data={'dataset_id': dataset_id}).save(
  File "/opt/maxkb/app/apps/common/util/common.py", line 63, in run
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/maxkb/app/apps/dataset/serializers/document_serializers.py", line 625, in save
    DocumentInstanceSerializer(data=instance).is_valid(raise_exception=True)
  File "/opt/py3/lib/python3.11/site-packages/rest_framework/serializers.py", line 235, in is_valid
    raise ValidationError(self.errors)
rest_framework.exceptions.ValidationError: {'name': [ErrorDetail(string='【 文档名称】请确保此字段的字符数不超过 128 个。', code='max_length')]}

Additional information

No response

or-less changed the title from "[BUG]" to "[BUG] Online knowledge base crawl error" Jul 4, 2024
baixin513 (Contributor) commented

Thanks for the feedback. This looks like a bug; it will be fixed in a later release.

baixin513 added this to the v1.3.1 milestone Jul 5, 2024
baixin513 changed the title from "[BUG] Online knowledge base crawl error" to "[BUG] Online knowledge base crawl fails when the document name exceeds 128 characters" Jul 15, 2024
shaohuzhang1 added a commit that referenced this issue Jul 16, 2024