page.get_text('blocks') output two piece of very similar text with different bbox #4026

Open
qianyue76 opened this issue Nov 6, 2024 · 4 comments
Labels
enhancement enhancement-upstream to be implemented by MuPDF fix developed release schedule to be determined

Comments

@qianyue76

qianyue76 commented Nov 6, 2024

Description of the bug

When I call page.get_text('blocks'), I get two blocks with very similar text but different bboxes.
The output for page 5 (counting from 1) is as follows:
[image: printed block output]
The corresponding page looks like this:
[image: rendered page]

The raw pdf is
00b3ad2ad0af97ec4a85274510343e04.pdf
I think block 12 is the redundant one.

Also, my Python version is actually 3.8.19, but I selected 3.9 because the available choices start at 3.9.

How to reproduce the bug

import fitz  # PyMuPDF

with open("./00b3ad2ad0af97ec4a85274510343e04.pdf", "rb") as f:
    pdf_bytes = f.read()
document = fitz.open(stream=pdf_bytes, filetype="pdf")

# Page 5 in 1-based counting has the 0-based index 4.
page = document.load_page(4)
blocks = page.get_text("blocks")
# Use a separate name for the block index so it does not shadow a page index.
for block_no, block in enumerate(blocks):
    print(f"block {block_no}:", block)
    print("\n")
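
To spot redundant blocks like block 12 without comparing the printout by eye, the extracted tuples can be compared pairwise. The following is only a minimal sketch: the helper name `find_near_duplicate_blocks`, the 0.9 threshold, and the sample data are assumptions made for illustration and are not part of PyMuPDF. It relies on each item of page.get_text("blocks") being a tuple (x0, y0, x1, y1, text, block_no, block_type), with block_type 0 for text.

```python
from difflib import SequenceMatcher


def find_near_duplicate_blocks(blocks, threshold=0.9):
    """Return (i, j, ratio) for pairs of text blocks with very similar text.

    Hypothetical helper. Each block is assumed to be a tuple shaped like the
    items of page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type), block_type 0 meaning text.
    """
    # Keep only text blocks and normalise whitespace before comparing.
    texts = [
        (idx, " ".join(b[4].split()))
        for idx, b in enumerate(blocks)
        if b[6] == 0
    ]
    pairs = []
    for pos, (i, text_i) in enumerate(texts):
        for j, text_j in texts[pos + 1:]:
            if not text_i or not text_j:
                continue  # empty blocks carry no information to compare
            ratio = SequenceMatcher(None, text_i, text_j).ratio()
            if ratio >= threshold:
                pairs.append((i, j, ratio))
    return pairs


# Made-up sample data, NOT taken from the PDF attached to this issue.
sample_blocks = [
    (72.0, 100.0, 300.0, 120.0, "Results of the first experiment\n", 0, 0),
    (72.0, 130.0, 300.0, 150.0, "A completely different paragraph\n", 1, 0),
    (72.5, 400.0, 300.5, 420.0, "Results of the first experiment.\n", 2, 0),
]
for i, j, ratio in find_near_duplicate_blocks(sample_blocks):
    print(f"blocks {i} and {j} look alike (ratio {ratio:.2f})")
```

With real data, `blocks` from the script above would be passed in place of `sample_blocks`. This only flags candidates. It cannot tell which of the two blocks is the visible one.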

PyMuPDF version

1.24.6

Operating system

Linux

Python version

3.9

@JorjMcKie
Collaborator

Let me ask you this:

  • When you upgraded from your now unsupported Python version 3.8 ... you did not choose the current Python 3.13? Why?
  • Why did you not also upgrade PyMuPDF?

But anyway. Your example PDF page indeed contains a lot of hidden text. Older versions of PyMuPDF (and MuPDF), including the current version 1.24.13, extract all text, including hidden text.
The next version of MuPDF, 1.25.0, allows deselecting text that is clipped away, as happens on your page.
PyMuPDF 1.25.0 will choose this option automatically and show only 5 text blocks on this page. Older versions like yours and the current 1.24.13 show 15 instead.

So all I can offer you is waiting for the next PyMuPDF version.
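
Since the behaviour described above depends on the installed version, a script could check the version before relying on the reduced block count. This is a rough sketch under stated assumptions: the helper names are hypothetical, the version string is assumed to be plain dotted numbers such as "1.24.6", and the 1.25.0 threshold is taken from the comment above rather than verified against a released build.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.24.6' into a tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])


def drops_clipped_text_by_default(version):
    """Hypothetical check: True if version is at least 1.25.0.

    The threshold follows the maintainer's statement in this thread.
    """
    return parse_version(version) >= (1, 25, 0)


# In a real script the string would come from the installed package,
# for example fitz.VersionBind. It is hard-coded here so the sketch
# runs without PyMuPDF installed.
for v in ("1.24.6", "1.24.13", "1.25.0"):
    print(v, drops_clipped_text_by_default(v))
```

Comparing integer tuples rather than raw strings matters here, because as strings "1.24.13" would sort before "1.24.6".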

@JorjMcKie JorjMcKie added enhancement fix developed release schedule to be determined labels Nov 6, 2024
@qianyue76
Author

Thanks for your reply. Now I understand the cause of the problem.
My answers to your questions are below:

  • Python 3.13 is still too new for me. I chose version 3.8 for this project, which was still supported at the time.
  • When I first used PyMuPDF, I installed the latest version available then, and I only just learned that there is a newer one.

I will consider using a newer Python version and will wait for the next PyMuPDF version.
By the way, when will the new PyMuPDF version be released?

@JorjMcKie
Collaborator

By the way, when will the new PyMuPDF version be released?

There is no defined date yet. But it should be fairly soon. The MuPDF team is already about to publish a release candidate for MuPDF 1.25.0 (which PyMuPDF requires for its new version).

@qianyue76
Author

Okay, I'll keep an eye out for the new version. Thanks again.

@julian-smith-artifex-com julian-smith-artifex-com added the upstream bug bug outside this package label Nov 15, 2024
@JorjMcKie JorjMcKie added enhancement-upstream to be implemented by MuPDF and removed upstream bug bug outside this package labels Nov 17, 2024