Hebrew text displayed backwards #97

Open
dotancohen opened this issue Jun 27, 2024 · 3 comments
Labels
pdfminer (Issue in pdfminer)

Comments

@dotancohen

This tool is terrific, thank you.

Highlighted and underlined Hebrew text is displayed backwards. Interestingly, the title blurb preceding the highlighted text is not backwards.

@dotancohen
Author

Find attached a PDF file, created in LibreOffice Writer, with the following structure:

שלום, עולם.

# כותרת
זה קובץ לבדיקה, אני סתם כותב משהו כאן.

## עוד כותרת
אין שמש גשם יש רק צל.

pdfannots.pdf

‪I've then highlighted the text אני סתם כותב under the first heading and שמש גשם under the second heading. I used Okular (KDE PDF viewer) for the annotations:
pdfannots.pdf

Here is the output:

$ pdfannots pdfannots.pdf
## Highlights

 * Page #1 (כותרת): "בתוכ םתס ינא"

 * Page #1 (עוד כותרת): "םשג שמש"

Note that כותרת and עוד כותרת are displayed properly, but אני סתם כותב and שמש גשם are backwards.

$ pdfannots --version
pdfannots 0.4
$ uname -a
Linux nefora 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

@0xabu
Owner

0xabu commented Jun 28, 2024

Thanks for the report!

In this case the headings are extracted correctly because they come as a string from the PDF metadata. The problem is that pdfminer's text extraction routines don't support right-to-left text: pdfminer/pdfminer.six#515

There are also some similar assumptions in pdfannots that affect things like the relative order in which two annotations are reported when they appear on the same line of text. I could probably fix that, but the bigger issue is the one linked above.
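
As a stopgap while text extraction lacks right-to-left support, the reversed highlights reported earlier in this issue can be flipped back in post-processing. The following is only a minimal sketch, not something pdfannots or pdfminer does: `looks_rtl` and `visual_to_logical` are hypothetical helper names, and the approach assumes each snippet is a single line of purely right-to-left text. Mixed Hebrew and Latin text, digits, or multi-line highlights would need a proper bidirectional algorithm instead.

```python
import unicodedata

# Bidirectional classes that mark right-to-left characters
# (R: Hebrew and similar scripts, AL: Arabic letters).
RTL_CLASSES = {"R", "AL"}


def looks_rtl(text: str) -> bool:
    """Return True if any character in text has an RTL bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def visual_to_logical(text: str) -> str:
    """Naively undo visual-order extraction by reversing the whole string.

    Assumption: the snippet is a single line of purely RTL text.
    Strings without any RTL character are returned unchanged.
    """
    return text[::-1] if looks_rtl(text) else text


# The two highlights exactly as they appear in the reported output.
for extracted in ["בתוכ םתס ינא", "םשג שמש"]:
    print(visual_to_logical(extracted))
```

Reversing the whole string is the simplest possible fix and only works because the reported highlights contain nothing but Hebrew letters and spaces.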

0xabu added the pdfminer (Issue in pdfminer) label Jun 28, 2024
@dotancohen
Author

Thank you.

That bug report points to a fork, PdfMiner.RTL, which has experimental RTL support:
https://pypi.org/project/pdfminer.rtl/

I tried it and in general the tool works well.
