[Feature Request]: Russian language support #1987

Open
1 task done
netandreus opened this issue Aug 17, 2024 · 6 comments

@netandreus
Contributor

netandreus commented Aug 17, 2024

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Describe the feature you'd like

We have a lot of scientific materials that exist only in Russian (physics, psychology, etc.), and we would like to build a knowledge base and a chatbot on top of them. Do you plan to support the Russian language? Is there any way to add it myself?

@Cricricrikets

Yes, I'm also interested in this!

@Said-Apollo

When creating a knowledge base, there is an option to activate "Layout Analysis". Since this uses a visual language model (for images, or when a chunk does not contain enough text), it might work for Russian, although the result could definitely be improved.
image

Maybe you could try changing the "threshold" at which the visual model is used to interpret the text.

@netandreus
Contributor Author

@Said-Apollo I tried to do it, but it removed almost all the spaces in the text.

Here is a test document in Russian:
dogovor_oferta.pdf

And here are the parsing results:
Screenshot 2024-08-19 at 15 46 27

@Said-Apollo

Said-Apollo commented Aug 21, 2024

> @Said-Apollo I tried to do it, but it removed almost all the spaces in the text.
>
> Here is test document in Russian: dogovor_oferta.pdf
>
> And here are parsing results: Screenshot 2024-08-19 at 15 46 27

When I add the PDF to my knowledge base, I only get a single chunk with a few words. However, after converting the PDF to a DOCX file, I got around 18 chunks.
image

Looking closer at the results, they seem mostly correct to me (although I'm not a Russian expert). Unfortunately, the file is not shown next to the chunks. I guess this is not supported for DOCX files yet.
image
Maybe you could try this workaround until Russian is supported? If you have lots of PDFs and are on Linux, I would recommend simply running this command in a terminal:

lowriter --convert-to docx *.pdf
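
For a large folder, the same conversion could be scripted. The following is only a rough sketch built around the command above: the helper names `build_convert_command` and `convert_all` are made up for illustration, and it assumes `lowriter` is on the PATH and accepts the standard LibreOffice `--outdir` option. It checks for the output file rather than trusting the exit code.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir, binary="lowriter"):
    """Build the argument list that converts one PDF to DOCX (hypothetical helper)."""
    return [binary, "--convert-to", "docx", "--outdir", str(out_dir), str(pdf_path)]


def convert_all(src_dir, out_dir, binary="lowriter"):
    """Convert every *.pdf in src_dir and return the PDFs that produced no DOCX."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        subprocess.run(build_convert_command(pdf, out_dir, binary))
        # Treat a missing output file as a failed conversion.
        if not (out_dir / (pdf.stem + ".docx")).exists():
            failed.append(pdf)
    return failed
```

Calling `convert_all("pdfs", "docx_out")` would then return the list of files that still need attention.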

Hyperb0t added a commit to Hyperb0t/ragflow that referenced this issue Sep 14, 2024
KevinHuSh pushed a commit that referenced this issue Sep 14, 2024
### What problem does this PR solve?

[#1987](#1987)

When scanning PDF files character by character, the parser dropped
spaces if the string did not match the regex. Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs spaces, but it does not match the regex because it uses a different
alphabet. As a result, such PDFs were parsed incorrectly and were almost
unusable as a source. Fixed by adding the Russian alphabet to the regex.

There might be problems with other languages that use different
alphabets. I additionally tested a [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816)
and the old [a-zA-Z...] regex parses it correctly, with spaces.
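
The effect described above can be reproduced with a simplified model. This is not the actual pdf_parser code: the two patterns and the `join_chars` helper are assumptions made for illustration, and the real character class is only shown elided as [a-zA-Z...] in this PR.

```python
import re

# Hypothetical stand-ins for the character class before and after the fix.
LATIN_ONLY = re.compile(r"[0-9a-zA-Z,.:;!?]")
WITH_RUSSIAN = re.compile(r"[0-9a-zA-Zа-яА-ЯёЁ,.:;!?]")


def join_chars(chars, keep_space_after):
    """Concatenate characters, keeping a space only when the previously
    kept character matches the given pattern."""
    out = []
    for ch in chars:
        if ch == " " and not (out and keep_space_after.match(out[-1])):
            continue  # drop the space: the preceding character is not in the class
        out.append(ch)
    return "".join(out)


print(join_chars("hello world", LATIN_ONLY))
print(join_chars("договор оферты", LATIN_ONLY))
print(join_chars("договор оферты", WITH_RUSSIAN))
```

With the Latin-only class the Russian words are glued together, while the extended class keeps the space. Note that `ё`/`Ё` fall outside the `а-я`/`А-Я` ranges and have to be listed separately.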

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
@Hyperb0t
Contributor

I made a quick-and-dirty fix for the space problem, for Russian only. The Deepdoc component, and especially its pdf_parser class, no longer removes spaces in Russian text.
The fix was merged in v0.11 and is already in the demo.

You can observe it if you turn off layout recognition and parse the example "dogovor_oferta.pdf" or any other Russian PDF document.

turn-off-layot-recog
spaces

Unfortunately, it still removes spaces if you leave layout recognition turned on.
I think this happens when the stored text chunks are returned from the backend via the REST API, not during parsing.
I am going to resolve this problem as well, probably by changing the rmSpace function.

@Hyperb0t
Contributor

There are many places in the project (link1, link2) where a string is processed differently depending on whether it matches the [0-9a-zA-Z...] regex for English.
One of these differences is the removal of space characters.
If the regex considers the string English, spaces are kept; otherwise they are removed.

If we want to support multiple languages, this logic should be changed, because matching the [0-9a-zA-Z...] regex is not the only case in which spaces should be kept.
There are other non-Latin languages, or groups of languages with other alphabets and writing systems, that need spaces:

  • Greek (Α α, Β β, Γ γ, Δ δ, Ε ε, Ζ ζ, Η η, Θ θ, Ι ι, Κ κ, Λ λ, Μ μ, Ν ν, Ξ ξ, Ο ο, Π π, Ρ ρ, Σ σ/ς, Τ τ, Υ υ, Φ φ, Χ χ, Ψ ψ, Ω ω.)
  • Cyrillic (Russian, Ukrainian, Serbian etc.) (А ,А̀ ,А̂ ,А̄ ,Ӓ ,Б ,В ,Г, Ґ ,Д ,Ђ ,Ѓ ,Е ,Ѐ ,Е̄ ,Е̂Ё ,Є ,Ж ,З ,З́ ,Ѕ ,И ,І, Ї ,Ꙇ ,Ѝ ,И̂ ,Ӣ ,Й ,Ј ,К, Л ,Љ ,М ,Н ,Њ ,О ,О̀ ,О̂, Ō ,Ӧ ,П ,Р ,С ,С́ ,Т ,Ћ, Ќ ,У ,У̀ ,У̂ ,Ӯ ,Ў ,Ӱ ,Ф, Х ,Ц ,Ч ,Џ ,Ш ,Щ ,Ꙏ ,Ъ, Ъ̀ ,Ы ,Ь ,Ѣ ,Э ,Ю ,Ю̀ ,Я, Я̀ )
  • Hebrew (א,ב,ג,ד,ה,ו,ז,ח,ט,י,כ,ל,מ,נ,ס,ע,פ,צ,ק,ר,ש,ת)
  • Arabic (ا,ب,ت,ث,ج,ح,خ,د,ذ,ر,ز,س,ش,ص,ض,ط,ظ,ع,غ,ف,ق,ك,ل,م,ن,ه,و,ي,ﺀ)
  • Devanagari (Indian) (ा,ि,ु,े,ो,क,ग,च,ज,ट,ड,त,द,न,प,ब,म,य,र,ल,व,स,ह,ृ,क्ष,ज्ञ,में,अ,इ,उ,ए,ओ,क्,ग्,च्,ज्,ट्,ड्,त्,द्,न्,प्,ब्,म्,य्,र्,ल्,व्,स्,ह्,़,क्ष्,ज्ञ्,है,ः,ी,ू,े,ो,ख,घ,छ,झ,ठ,ढ,थ,ध,ं,फ,भ,ण,ळ,,ञ,ङ,श,ष,ॆ,त्र,श्र,मैं,आ,ई,ऊ,ऐ,औ,ख्,घ्,छ्,झ्,ठ्,ढ्,थ्,ध्,ँ,फ्,भ्,ण्,ळ्,क्र,ञ्,ङ्,श्,ष्,,त्र्,श्र्,हूँ)
  • Korean (ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ)

My proposal is to replace the [0-9a-zA-Z...] regex with something like a "should_remove_spaces(str)" function and use it in the rmSpace(str) function.
To my knowledge, spaces should only be removed in Chinese and Japanese.

I can add the fasttext-langdetect dependency and use it for that. If the language is not recognized as Chinese or Japanese, spaces should not be removed.
This Python library (fasttext-langdetect) could also be useful in the future for other multilingual tasks.
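
A rough sketch of what such a function might look like. To stay dependency-free it swaps fasttext-langdetect for a plain code-point check, so it is a different technique from the one proposed above. The ranges, the 0.5 threshold, and the `rm_space` wrapper are assumptions for illustration, not the project's actual rmSpace.

```python
import re

# Han ideographs (incl. Extension A), Hiragana and Katakana.
# Assumed ranges, not exhaustive.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\u3040-\u309f\u30a0-\u30ff]")


def should_remove_spaces(text, threshold=0.5):
    """Return True if at least `threshold` of the non-space characters
    look Chinese or Japanese."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    cjk = sum(1 for c in chars if CJK_CHAR.match(c))
    return cjk / len(chars) >= threshold


def rm_space(text):
    """Strip whitespace only for text judged to be Chinese or Japanese
    (hypothetical replacement for the regex-gated logic)."""
    if should_remove_spaces(text):
        return re.sub(r"\s+", "", text)
    return text
```

With this check, Cyrillic, Greek, Hebrew, Arabic, Devanagari and Hangul text all keep their spaces by default, because only Han and kana characters count toward the threshold. A language-detection library would likely handle mixed-language chunks better than a fixed ratio.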

Halfknow pushed a commit to Halfknow/ragflow that referenced this issue Nov 11, 2024
…flow#2427)


5 participants