[Feature Request]: Russian language support #1987
Comments
Yes, I'm also interested in this!
@Said-Apollo I tried to do it, but it removed almost all the spaces in the text. Here is a test document in Russian:
When I add the PDF to my knowledge base, it produces only a single chunk containing a few words. After converting the PDF to a docx file, however, I got around 18 chunks. Looking closer at the result, the chunks look roughly correct to me (although I'm not a Russian expert). Unfortunately, the file is not shown next to them. I guess this is not supported for docx files yet.
### What problem does this PR solve?

[#1987](#1987)

When scanning PDF files character by character, the parser dropped spaces if the string did not match a regex. Text from [Russian documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf) needs spaces, but it does not match the regex because it uses a different alphabet. As a result, PDFs were parsed incorrectly and were almost unusable as a source. This PR fixes that by adding the Russian alphabet to the regex.

There might be problems with other languages that use different alphabets. I additionally tested a [PDF in Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816), and the old [a-zA-Z...] regex parses it correctly with spaces.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
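To illustrate the effect described above, here is a minimal sketch. It is not the actual deepdoc code: the character classes are simplified stand-ins for the elided [a-zA-Z...] pattern, and `join_chars` is a hypothetical helper that assumes a space is kept only when both neighbouring characters match the pattern.

```python
import re

# Hypothetical, simplified character classes (the project's real pattern is longer).
LATIN_ONLY = re.compile(r"[0-9a-zA-Z]")
LATIN_AND_CYRILLIC = re.compile(r"[0-9a-zA-Zа-яА-ЯёЁ]")


def join_chars(chars, keep_space_pattern):
    """Join characters, keeping a space only if both neighbours match the pattern."""
    out = []
    for i, ch in enumerate(chars):
        if ch == " ":
            prev_ch = chars[i - 1] if i > 0 else ""
            next_ch = chars[i + 1] if i + 1 < len(chars) else ""
            # Assumed rule: drop the space unless it sits between matching characters.
            if not (keep_space_pattern.match(prev_ch) and keep_space_pattern.match(next_ch)):
                continue
        out.append(ch)
    return "".join(out)


russian = list("договор оферта")
print(join_chars(russian, LATIN_ONLY))          # spaces lost
print(join_chars(russian, LATIN_AND_CYRILLIC))  # spaces kept
```

With the Latin-only class the Russian words run together, while the extended class preserves the space. Any alphabet missing from the class would presumably hit the same problem.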
I made a quick-and-dirty fix for the spaces problem, for the Russian language only. The deepdoc component, and especially its pdf_parser class, no longer removes spaces in Russian text. You can observe this if you turn off layout recognition and parse the example "dogovor_oferta.pdf" or any other Russian PDF document. Unfortunately, it still removes spaces if you leave layout recognition turned on.
There are a lot of places in the project (link1, link2) where a string is processed differently depending on whether it matches the [0-9a-zA-Z...] regex for English. If we want to support multiple languages, this logic should be changed, because matching the [0-9a-zA-Z...] regex is not the only case in which spaces should be kept.
My proposal is to replace the [0-9a-zA-Z...] regex with something like a "should_remove_spaces(str)" function and use it in the rmSpace(str) function. I can add the fasttext-langdetect dependency for that. If the language is not recognized as Chinese or Japanese, spaces should not be removed.
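A rough sketch of what that could look like. To keep it dependency-free, it swaps the proposed fasttext-langdetect call for a plain Unicode-range heuristic, so it only approximates "recognized as Chinese or Japanese". The `rm_space` wrapper, the `threshold` parameter, and the chosen ranges are assumptions for illustration, not the project's code.

```python
import re

# Assumption: a Unicode-range check stands in for real language detection.
CJK_CHAR = re.compile(
    "["
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uff66-\uff9f"  # Halfwidth Katakana
    "]"
)


def should_remove_spaces(text: str, threshold: float = 0.5) -> bool:
    """Return True if the text looks mostly Chinese or Japanese."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False  # nothing to judge by, so keep spaces to be safe
    cjk = sum(1 for ch in letters if CJK_CHAR.match(ch))
    return cjk / len(letters) >= threshold


def rm_space(text: str) -> str:
    """Hypothetical counterpart of rmSpace: strip whitespace only for CJK text."""
    if should_remove_spaces(text):
        return re.sub(r"\s+", "", text)
    return text


print(rm_space("中 文 文 本"))
print(rm_space("Привет мир"))
```

The default here is to keep spaces unless the text is clearly CJK, which matches the rule in the proposal and avoids having to list every other alphabet explicitly.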
…flow#2427)
Is there an existing issue for the same feature request?
Describe the feature you'd like
We have a lot of scientific materials that exist only in Russian (physics, psychology, etc.), and we would like to build a knowledge base and a chatbot on top of them. Do you plan to support the Russian language? Is there any way to add it myself?