[Feature Request]: Russian language support #1987

netandreus · 2024-08-17T09:18:37Z

Is there an existing issue for the same feature request?

I have checked the existing issues.

Describe the feature you'd like

We have a lot of scientific materials that are available only in Russian (physics, psychology, etc.), and we would like to build a knowledge base and a chatbot on top of them. Do you plan to support the Russian language? Is there any way to add it myself?

Cricricrikets · 2024-08-19T08:07:03Z

Yes, I'm also interested in this!

Said-Apollo · 2024-08-19T11:17:00Z

When creating a knowledge base, there is an option to activate "Layout Analysis". Since this uses a visual language model (for images, or when a chunk does not contain enough text), it might work for Russian (although it can definitely be improved).

You could also try changing the "threshold" at which the visual model is used to interpret the text.

netandreus · 2024-08-19T11:48:21Z

@Said-Apollo I tried that, but it removed almost all the spaces in the text.

Here is a test document in Russian:
dogovor_oferta.pdf

And here are the parsing results:

Said-Apollo · 2024-08-21T12:36:34Z

> @Said-Apollo I tried to do it, but it removed almost all the spaces in the text.
>
> Here is test document in Russian: dogovor_oferta.pdf
>
> And here are parsing results:

When I added the PDF to my knowledge base, it produced only a single chunk with a few words. However, after converting the PDF to a docx file, it produced around 18 chunks.

Looking closer at the results, they look mostly correct to me (although I'm no expert in Russian). Unfortunately, the file is not shown next to them; I guess this is not supported for docx files yet.

You could try this as a workaround until Russian is supported. If you have lots of PDFs and are on Linux, I would recommend simply running this command in a terminal:

lowriter --convert-to docx *.pdf

### What problem does this PR solve?

[#1987](#1987)

When scanning PDF files character by character, the parser excluded spaces if the string did not match the regex. Text from [Russian documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf) needs spaces, but it does not match the regex because it uses a different alphabet. That's why PDFs were parsed incorrectly and were almost unusable as a source. Fixed that by adding the Russian alphabet to the regex. There might be problems with other languages that use different alphabets. I additionally tested a [PDF in Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816), and the old [a-zA-Z...] regex parses it correctly with spaces.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
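A minimal sketch of the behaviour described above, for illustration only. The two patterns and the `join_chars` helper are hypothetical and much simpler than the parser's real code; the sketch only assumes that a space is kept when the character before it matches the character class.

```python
import re

# Hypothetical patterns for illustration; the regex in the real parser is longer.
LATIN_ONLY = re.compile(r"[0-9a-zA-Z,.:;!?]")
# Same class extended with the Russian alphabet. "ё" and "Ё" lie outside
# the а-я / А-Я code point ranges, so they are listed explicitly.
LATIN_AND_RUSSIAN = re.compile(r"[0-9a-zA-Zа-яА-ЯёЁ,.:;!?]")


def join_chars(chars, keep_space_pattern):
    """Join characters one by one, keeping a space only when the
    previously kept character matches keep_space_pattern."""
    out = []
    for ch in chars:
        if ch == " ":
            if out and keep_space_pattern.match(out[-1]):
                out.append(ch)
            continue  # the space is dropped when the pattern does not match
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    text = "договор оферты"
    print(join_chars(text, LATIN_ONLY))         # spaces are lost
    print(join_chars(text, LATIN_AND_RUSSIAN))  # spaces are kept
```

With the Latin-only class, the Cyrillic words are glued together; with the extended class, the space survives.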

Hyperb0t · 2024-09-15T19:42:19Z

I made a quick and dirty fix for the spaces problem, for Russian only. The Deepdoc component, and its pdf_parser class in particular, no longer removes spaces in Russian text.
The fix was merged in v0.11 and is already in the demo.

You can observe it if you turn off layout recognition and parse the example "dogovor_oferta.pdf" or any other Russian PDF document.

Unfortunately, it still removes spaces if you leave layout recognition turned on.
I think this happens when the stored text chunks are returned from the backend via the REST API, not during parsing.
I am going to resolve this problem as well, probably by changing the rmSpace function.

Hyperb0t · 2024-09-15T19:48:32Z

There are a lot of places in the project (link1, link2) where a string is processed differently depending on whether it matches the [0-9a-zA-Z...] regex for English.
One of these differences is the removal of space characters.
If the regex considers the string English, spaces are kept; otherwise they are removed.

If we want to support multiple languages, this logic should be changed, because matching the [0-9a-zA-Z...] regex is not the only case where spaces should be kept.
There are other non-Latin languages and language groups, with other alphabets and writing systems, where spaces are needed:

Greek (Α α, Β β, Γ γ, Δ δ, Ε ε, Ζ ζ, Η η, Θ θ, Ι ι, Κ κ, Λ λ, Μ μ, Ν ν, Ξ ξ, Ο ο, Π π, Ρ ρ, Σ σ/ς, Τ τ, Υ υ, Φ φ, Χ χ, Ψ ψ, Ω ω.)
Cyrillic (Russian, Ukrainian, Serbian etc.) (А ,А̀ ,А̂ ,А̄ ,Ӓ ,Б ,В ,Г, Ґ ,Д ,Ђ ,Ѓ ,Е ,Ѐ ,Е̄ ,Е̂Ё ,Є ,Ж ,З ,З́ ,Ѕ ,И ,І, Ї ,Ꙇ ,Ѝ ,И̂ ,Ӣ ,Й ,Ј ,К, Л ,Љ ,М ,Н ,Њ ,О ,О̀ ,О̂, Ō ,Ӧ ,П ,Р ,С ,С́ ,Т ,Ћ, Ќ ,У ,У̀ ,У̂ ,Ӯ ,Ў ,Ӱ ,Ф, Х ,Ц ,Ч ,Џ ,Ш ,Щ ,Ꙏ ,Ъ, Ъ̀ ,Ы ,Ь ,Ѣ ,Э ,Ю ,Ю̀ ,Я, Я̀ )
Hebrew (א,ב,ג,ד,ה,ו,ז,ח,ט,י,כ,ל,מ,נ,ס,ע,פ,צ,ק,ר,ש,ת)
Arabic (ا,ب,ت,ث,ج,ح,خ,د,ذ,ر,ز,س,ش,ص,ض,ط,ظ,ع,غ,ف,ق,ك,ل,م,ن,ه,و,ي,ﺀ)
Devanagari (Indian) (ा,ि,ु,े,ो,क,ग,च,ज,ट,ड,त,द,न,प,ब,म,य,र,ल,व,स,ह,ृ,क्ष,ज्ञ,में,अ,इ,उ,ए,ओ,क्,ग्,च्,ज्,ट्,ड्,त्,द्,न्,प्,ब्,म्,य्,र्,ल्,व्,स्,ह्,़,क्ष्,ज्ञ्,है,ः,ी,ू,े,ो,ख,घ,छ,झ,ठ,ढ,थ,ध,ं,फ,भ,ण,ळ,,ञ,ङ,श,ष,ॆ,त्र,श्र,मैं,आ,ई,ऊ,ऐ,औ,ख्,घ्,छ्,झ्,ठ्,ढ्,थ्,ध्,ँ,फ्,भ्,ण्,ळ्,क्र,ञ्,ङ्,श्,ष्,,त्र्,श्र्,हूँ)
Korean (ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ)
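As a rough illustration of the list above, the sketch below takes one letter from each script and checks that none of them matches a Latin-only character class. The `LATIN_ONLY` pattern, the `SAMPLES` table, and the `script_of` helper are assumptions made for this example, and reading the script from the first word of the Unicode character name is only a heuristic, not something the project does.

```python
import re
import unicodedata

# Hypothetical, shortened stand-in for the [0-9a-zA-Z...] regex.
LATIN_ONLY = re.compile(r"[0-9a-zA-Z]")

# One sample letter per script from the list above.
SAMPLES = {
    "Greek": "α",
    "Cyrillic": "Я",
    "Hebrew": "א",
    "Arabic": "ب",
    "Devanagari": "क",
    "Korean": "ㄱ",
}


def script_of(ch: str) -> str:
    """Heuristic: the first word of the Unicode character name."""
    return unicodedata.name(ch).split()[0]


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        matched = LATIN_ONLY.search(ch) is not None
        print(f"{label}: {script_of(ch)}, matches Latin-only regex: {matched}")
```

None of the sample letters matches, so under the current logic text in any of these scripts would be treated as text whose spaces can be removed.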

My proposal is to replace the [0-9a-zA-Z...] regex with something like a "should_remove_spaces(str)" function and use it in the rmSpace(str) function.
To my knowledge, spaces should only be removed in Chinese and Japanese.

I can add the fasttext-langdetect dependency and use it for that. If the language is not recognized as Chinese or Japanese, spaces should not be removed.
This Python library (fasttext-langdetect) could also be useful in the future for other multilingual tasks.
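One possible shape for this proposal is sketched below. Note that it swaps in a plain code point check for fasttext-langdetect, so that the example has no dependencies: a string counts as Chinese or Japanese when at least half of its letters are CJK ideographs, Hiragana, or Katakana. The `threshold` parameter, the `CJK_CHAR` ranges, and the `rm_space` body are assumptions for illustration, not the project's rmSpace implementation.

```python
import re

# Code point ranges: Hiragana and Katakana, CJK Extension A, CJK Unified Ideographs.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]")


def should_remove_spaces(text: str, threshold: float = 0.5) -> bool:
    """Return True when the text looks Chinese or Japanese.

    Heuristic stand-in for a language detector: compare the number of
    CJK/kana letters with the number of all letters in the string.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False  # digits or punctuation only: keep spaces
    cjk = sum(1 for ch in letters if CJK_CHAR.match(ch))
    return cjk / len(letters) >= threshold


def rm_space(text: str) -> str:
    """Remove whitespace only from text detected as Chinese or Japanese."""
    if should_remove_spaces(text):
        return re.sub(r"\s+", "", text)
    return text


if __name__ == "__main__":
    for sample in ["你好 世界", "Привет мир", "한국어 문장", "hello world"]:
        print(repr(rm_space(sample)))
```

A real language detector would handle mixed or very short strings better; the point here is only that the decision moves from "matches the English regex" to "is Chinese or Japanese".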


KevinHuSh added the Feature label Aug 19, 2024

Hyperb0t added a commit to Hyperb0t/ragflow that referenced this issue Sep 14, 2024

fix parsing spaces in russian language PDFs (infiniflow#1987)

8f3df0c

Hyperb0t mentioned this issue Sep 14, 2024

fix parsing spaces in russian language PDFs (#1987) #2427

Merged
