[Bug] Chunking fails for some uploaded PDF files #3551

Closed
havelhuang opened this issue Aug 22, 2024 · 36 comments · Fixed by #5732
Labels
🐛 Bug Something isn't working | 缺陷 Inactive No response in 30 days | 超过 30 天未活跃 released

Comments

@havelhuang

📦 Deployment environment

Vercel

📌 Software version

1.12.3

💻 System environment

Windows

🌐 Browser

Chrome

🐛 Problem description

Some PDF files cannot be chunked after upload; the error shown in the screenshot appears.
Screenshot 2024-08-22 165740

📷 Steps to reproduce

No response

🚦 Expected results

No response

📝 Supplementary information

No response

@havelhuang havelhuang added the 🐛 Bug Something isn't working | 缺陷 label Aug 22, 2024
@github-project-automation github-project-automation bot moved this to Roadmap - Chat 1.x in Lobe Chat Routine Aug 22, 2024

@lobehubbot
Member

👀 @havelhuang

Thank you for raising an issue. We will look into the matter and get back to you as soon as possible.
Please make sure you have given us as much context as possible.

@arvinxx
Contributor

arvinxx commented Aug 22, 2024

Could you attach a file so I can take a look?

@havelhuang
Author

Could you attach a file so I can take a look?

synergy.pdf
For example, this paper.

@gaarry
Contributor

gaarry commented Aug 22, 2024

image same

@Xiaokai6880

+1

@havelhuang
Author

Could you attach a file so I can take a look?

synergy.pdf For example, this paper.

Could it be a problem with your API service? For example, check whether the text-embedding-3-small model can be called normally. I tested the first three pages of your paper, and chunking worked without problems. image

Other PDF files chunk fine; only a small number of PDF files fail.

@lobehubbot
Member

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Could you attach a file so I can take a look?

synergy.pdf For example, this paper.

Could it be a problem with your API service? For example, check whether the text-embedding-3-small model can be called normally. I tested the first three pages of your paper, and chunking worked without problems. image

Other PDF files chunk fine; only a small number of PDF files fail.

(If you are on Windows) You can try using the "Print" function to output this PDF as a new PDF, then upload that file and chunk it.

@Sun-drenched

Could you attach a file so I can take a look?

synergy.pdf For example, this paper.

Could it be a problem with your API service? For example, check whether the text-embedding-3-small model can be called normally. I tested the first three pages of your paper, and chunking worked without problems. image

Other PDF files chunk fine; only a small number of PDF files fail.

Well, it is indeed a "problem" with this document. I have traced the failure to page 4 of your paper (complex figures mixed with mathematical formulas), which fails to parse. As I recall, pages like this are hard to handle.
image

@arvinxx
Contributor

arvinxx commented Aug 22, 2024

@Sun-drenched For cases like this, we can try again later, once I have enabled the Unstructured.io variables. In my earlier tests, Unstructured parsed files with this kind of complex formatting without problems.

@Sun-drenched

@Sun-drenched For cases like this, we can try again later, once I have enabled the Unstructured.io variables. In my earlier tests, Unstructured parsed files with this kind of complex formatting without problems.

I hope so. When I tested some science and engineering papers earlier, my impression was that none of the mainstream open-source solutions handled them well (for example, pages like this one, with multiple complex figures in which the positional relationships matter for understanding the result, or flowcharts that express complex mathematical relationships).

Also, the error "[error]: invalid byte sequence for encoding "UTF8": 0x00" seems to be related to the PostgreSQL database.
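
The PostgreSQL connection mentioned above matches a documented PostgreSQL restriction: text values cannot store the NUL character (code zero). A minimal sketch of one possible mitigation, assuming the extracted chunk text is available as strings before it is written to the database. `stripNullChars` and `sanitizeChunks` are hypothetical helper names, not LobeChat functions, and this is not necessarily what the eventual fix does:

```typescript
// Hypothetical helpers (not part of LobeChat): remove NUL characters that
// PostgreSQL refuses to store in text columns.
function stripNullChars(text: string): string {
  return text.replace(/\u0000/g, '');
}

function sanitizeChunks(chunks: string[]): string[] {
  return chunks
    .map(stripNullChars)
    // Drop chunks that are empty or whitespace-only after cleaning.
    .filter((chunk) => chunk.trim().length > 0);
}

const cleaned = sanitizeChunks(['Figure 4\u0000 caption', '\u0000\u0000', 'plain text']);
console.log(cleaned); // [ 'Figure 4 caption', 'plain text' ]
```

Stripping the character keeps the rest of the page's text, at the cost of silently discarding whatever the parser meant by the NUL.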

@alphandbelt1

Same with this PDF of mine.
image
hackingciphers.pdf

@ProblemFactory

Could you offer an option to embed a PDF directly into the chat message via text extraction or page screenshots, instead of vectorizing it?

@arvinxx
Contributor

arvinxx commented Aug 23, 2024

@ProblemFactory We will do this in the future.

@nexusdiscord

I got the same thing, using Vercel, Neon DB, and MinIO
image
[Chunking Error] { body: { detail: 'invalid byte sequence for encoding "UTF8": 0x00' }, name: 'error' }
nextjs-docs.pdf

dosubot bot commented Aug 27, 2024

Hey @nexusdiscord! I'm here to help you with any bugs, questions, or contributions you have while waiting for a human maintainer. Let's tackle this together!

The error message "invalid byte sequence for encoding 'UTF8': 0x00" indicates that the file contains invalid UTF-8 byte sequences, which is causing the chunking process to fail.

To address this issue, you can try the following steps:

  1. Check File Encoding: Ensure that the file is properly encoded in UTF-8. You can use tools like iconv or text editors that support encoding conversion to re-encode the file.

  2. Handle Non-UTF-8 Encodings: Modify the chunking process to handle files with different encodings. This might involve detecting the file encoding and converting it to UTF-8 before processing.

  3. Update Chunking Logic: If the chunking logic does not currently handle non-UTF-8 encodings, you may need to update it. This can be done in the ContentChunk class, specifically in the chunkContent method, to include encoding detection and conversion.

Here is a relevant part of the ContentChunk class where you might need to make changes:

import * as iconv from 'iconv-lite';

// Method of the ContentChunk class
async chunkContent(params: ChunkContentParams): Promise<ChunkResult> {
  // Detect and convert encoding to UTF-8 if necessary
  const content = convertToUTF8(params.content);

  if (this.isUsingUnstructured(params))
    return await this.chunkByUnstructured(params.filename, content);

  return await this.chunkByLangChain(params.filename, content);
}

function convertToUTF8(content: Uint8Array): Uint8Array {
  // detectEncoding is a placeholder that you must implement yourself,
  // for example with an encoding-detection library
  const detectedEncoding = detectEncoding(content);
  // Decode from the detected encoding, then re-encode as UTF-8
  const text = iconv.decode(Buffer.from(content), detectedEncoding);
  return iconv.encode(text, 'utf-8');
}

By implementing encoding detection and conversion, you can ensure that the chunking process handles files with different encodings correctly.

For more details on the chunking process, you can refer to the ContentChunk class in the src/server/modules/ContentChunk/index.ts file [1].
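
One caveat on the suggestion above: byte 0x00 encodes U+0000 and is itself valid UTF-8, so re-encoding a file may leave it in place. A small sketch for checking both conditions, assuming the extracted content is available as bytes. `isValidUtf8` and `containsNulByte` are hypothetical names used only for illustration:

```typescript
// Hypothetical diagnostics, not part of the ContentChunk class.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

function containsNulByte(bytes: Uint8Array): boolean {
  return bytes.includes(0x00);
}

const sample = new Uint8Array([0x61, 0x00, 0x62]); // "a", NUL, "b"
console.log(isValidUtf8(sample), containsNulByte(sample)); // true true
```

If content passes the first check but fails the second, the encoding is fine and the NUL byte is the thing to look at.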

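@CXwudi

CXwudi commented Aug 27, 2024

Some PDFs can be chunked if you first split them into several smaller PDFs and chunk each one separately. For example, Head First Java is 45 MB and fails when chunked directly, but if you split that 45 MB file into 5 smaller PDFs, each under 15 MB, LobeChat can chunk them.
image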
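
The splitting workaround can be planned before opening a PDF tool. A rough sketch, assuming pages are of roughly similar size (often untrue for scanned or image-heavy books, so treat the result as a starting point). `planSplit` is a hypothetical helper, the 700-page count is made up for the example, and the 15 MB figure comes only from the comment above, not from a documented limit:

```typescript
interface PageRange {
  start: number; // first page, 1-based, inclusive
  end: number; // last page, 1-based, inclusive
}

// Hypothetical helper: split `totalPages` into contiguous ranges so that each
// part is estimated to stay strictly under `maxPartBytes`.
function planSplit(totalPages: number, totalBytes: number, maxPartBytes: number): PageRange[] {
  if (totalPages <= 0 || totalBytes <= 0 || maxPartBytes <= 0) return [];
  // floor + 1 keeps every part strictly below the size limit.
  const parts = Math.min(totalPages, Math.floor(totalBytes / maxPartBytes) + 1);
  const pagesPerPart = Math.ceil(totalPages / parts);
  const ranges: PageRange[] = [];
  for (let start = 1; start <= totalPages; start += pagesPerPart) {
    ranges.push({ start, end: Math.min(start + pagesPerPart - 1, totalPages) });
  }
  return ranges;
}

const MB = 1024 * 1024;
const plan = planSplit(700, 45 * MB, 15 * MB);
console.log(plan.map((r) => `${r.start}-${r.end}`).join(', '));
```

The resulting ranges can then be fed to whatever PDF splitting tool is at hand.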

@nexusdiscord

When I split the PDF, I found a single page that triggers the 'invalid byte sequence for encoding "UTF8": 0x00' error.
nextjs-docs-1-100-71-100-2-4-6-1.pdf

@Edward-lyz

This is still the problem of PDFs being hard to split. If chunking fails, I recommend converting the PDF to Markdown first and uploading that, filtering out the figures and tables so that only the text remains.

@wensley

wensley commented Oct 25, 2024

Vectorization failed, please check and try again. Failure reason:
[Embedding Error]:{}

This happens with txt, PDF, and document files alike. How can I solve this? Do I need to use a specific model?

@Miku2G

Miku2G commented Oct 29, 2024

@ProblemFactory We will do this in the future.

It still doesn't work. Importing PDFs with complex figures, image-only PDFs (which report "no chunk"), and PDFs containing code snippets still fails at chunking.

@codeyourwayup

Same problem! Has it been solved?

@lobehubbot lobehubbot added the Inactive No response in 30 days | 超过 30 天未活跃 label Jan 7, 2025
@github-project-automation github-project-automation bot moved this from Roadmap - Chat 1.x to Done in Lobe Chat Routine Feb 4, 2025
@lobehubbot
Member

@havelhuang

This issue is closed. If you have any questions, you can comment and reply.

@lobehubbot
Member

🎉 This issue has been resolved in version 1.50.4 🎉

The release is available on:

Your semantic-release bot 📦🚀
