fix parsing spaces in russian language PDFs (infiniflow#1987) (infiniflow#2427)

### What problem does this PR solve?

[infiniflow#1987](infiniflow#1987)

When scanning PDF files character by character, the parser dropped a
space unless the last character of the preceding text matched a regex.
Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs spaces, but Cyrillic characters did not match the regex, which
only covered Latin letters, digits, and some punctuation. As a result,
Russian PDFs were parsed without spaces and were almost unusable as a
source. Fixed by adding the Russian alphabet to the regex.
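
A minimal sketch of the difference between the two character classes, assuming only the `re.match` check on the last character matters. The helper `keeps_space` is hypothetical and exists only for illustration; the two patterns are copied from the diff in this commit.

```python
import re

# Character classes copied from the diff in this commit.
OLD_PATTERN = re.compile(r"[0-9a-zA-Z,.?;:!%%]")
NEW_PATTERN = re.compile(r"[0-9a-zA-Zа-яА-Я,.?;:!%%]")


def keeps_space(pattern, preceding_text):
    """Return True if a space following `preceding_text` would be kept.

    Hypothetical helper: it mirrors only the check on the last character,
    not the rest of the parser.
    """
    if not preceding_text:
        return False
    return pattern.match(preceding_text[-1]) is not None


print(keeps_space(OLD_PATTERN, "contract"))  # Latin text, old pattern
print(keeps_space(OLD_PATTERN, "договор"))   # Cyrillic text, old pattern
print(keeps_space(NEW_PATTERN, "договор"))   # Cyrillic text, new pattern
```

With the old pattern the space after a Cyrillic word is dropped, which is why words in the Russian document ran together.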

Other languages that use different alphabets might have the same
problem. I also tested a [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816)
and the old [a-zA-Z...] regex parses it correctly, with spaces.
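
To make the concern about other alphabets concrete, the sketch below checks a few characters against the new pattern and against a script-based check. The function `is_spaced_char` and the `SPACED_SCRIPTS` list are assumptions for illustration only; they are not part of this commit or of the parser.

```python
import re
import unicodedata

# New character class, copied from the diff in this commit.
NEW_PATTERN = re.compile(r"[0-9a-zA-Zа-яА-Я,.?;:!%%]")

# Assumption: an illustrative list of scripts that separate words with spaces.
SPACED_SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")
PUNCTUATION = set(",.?;:!%")


def is_spaced_char(ch):
    """Hypothetical alternative check based on the Unicode character name."""
    if ch in "0123456789" or ch in PUNCTUATION:
        return True
    # unicodedata.name returns e.g. "CYRILLIC SMALL LETTER IO" for "ё";
    # the default "" covers characters that have no name.
    return unicodedata.name(ch, "").startswith(SPACED_SCRIPTS)


for ch in ["я", "ё", "ñ", "α", "中"]:
    print(ch, bool(NEW_PATTERN.match(ch)), is_spaced_char(ch))
```

The range `а-яА-Я` does not include `ё` or `Ё`, and accented Latin letters such as `ñ` fall outside `a-zA-Z`, so a space directly after those characters would still be dropped by the regex in this commit.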

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
Hyperb0t authored Sep 14, 2024
1 parent bf81dec commit 78a30bd
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion deepdoc/parser/pdf_parser.py
@@ -299,7 +299,7 @@ def __ocr(self, pagenum, img, chars, ZM=3):
                 self.lefted_chars.append(c)
                 continue
             if c["text"] == " " and bxs[ii]["text"]:
-                if re.match(r"[0-9a-zA-Z,.?;:!%%]", bxs[ii]["text"][-1]):
+                if re.match(r"[0-9a-zA-Zа-яА-Я,.?;:!%%]", bxs[ii]["text"][-1]):
                     bxs[ii]["text"] += " "
             else:
                 bxs[ii]["text"] += c["text"]
