page.get_text('blocks') outputs two pieces of very similar text with different bboxes #4026
Comments
Let me ask you this:
But anyway. You indeed have an example PDF page which contains lots of hidden text. Versions of PyMuPDF (and MuPDF) up to and including the current one, 1.24.13, extract all text, hidden text included. So all I can offer is to wait for the next PyMuPDF version.
Thanks for your reply. I know the cause of the problem.
There is no defined date yet, but it should be fairly soon. The MuPDF team is about to publish a release candidate for MuPDF 1.25.0 (which PyMuPDF requires for its new version).
Okay, I'll keep an eye out for the new version. Thanks again.
Description of the bug
When I use page.get_text('blocks'), I get two blocks of very similar text with different bboxes. The output for page 5 (counting from 1) is as follows:
The associated page looks as follows:
The raw PDF is 00b3ad2ad0af97ec4a85274510343e04.pdf
I think block 12 is the redundant one.
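As a stopgap, redundant blocks like this one could be flagged heuristically by comparing block texts. The following is only a minimal sketch, not a PyMuPDF feature: the helper name find_near_duplicate_blocks and the 0.9 similarity threshold are assumptions made for illustration. It relies on the documented tuple layout of get_text('blocks'), which is (x0, y0, x1, y1, text, block_no, block_type), and it does not decide which of the two blocks is the hidden one.

```python
from difflib import SequenceMatcher


def find_near_duplicate_blocks(blocks, threshold=0.9):
    """Return (block_no_a, block_no_b, ratio) for text blocks with nearly identical text.

    `blocks` is a list of tuples shaped like the items of page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type).
    `threshold` is an arbitrary cut-off chosen for this sketch.
    """
    # Keep only text blocks (block_type == 0) and collapse whitespace,
    # so that different line breaks do not hide a match.
    text_blocks = [
        (b[5], " ".join(b[4].split()))
        for b in blocks
        if b[6] == 0 and b[4].strip()
    ]

    pairs = []
    for idx, (no_a, text_a) in enumerate(text_blocks):
        for no_b, text_b in text_blocks[idx + 1:]:
            ratio = SequenceMatcher(None, text_a, text_b).ratio()
            if ratio >= threshold:
                pairs.append((no_a, no_b, ratio))
    return pairs


# Possible usage with a real page (file name is a placeholder):
# import pymupdf
# page = pymupdf.open("input.pdf")[4]          # page 5, zero-based index 4
# for a, b, ratio in find_near_duplicate_blocks(page.get_text("blocks")):
#     print(f"blocks {a} and {b} look alike (ratio {ratio:.2f})")
```

The comparison is quadratic in the number of blocks, which should be acceptable for a single page. Pairs reported this way still need a manual look at their bboxes before dropping one of them.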
Also, my Python version is actually 3.8.19, but I selected 3.9 because the available choices start from 3.9.
How to reproduce the bug
PyMuPDF version
1.24.6
Operating system
Linux
Python version
3.9