page.get_text('blocks') outputs two pieces of very similar text with different bboxes #4026

qianyue76 · 2024-11-06T08:49:35Z

Description of the bug

When I call page.get_text('blocks'), I get two blocks with very similar text but different bboxes.
The output for page 5 (counting from 1) is as follows:

The associated page looks like this:

The raw PDF is
00b3ad2ad0af97ec4a85274510343e04.pdf
I think block 12 is the redundant one.

Also, my Python version is actually 3.8.19, but I selected 3.9 because the available choices start at 3.9.

How to reproduce the bug

import fitz  # PyMuPDF

# Read the PDF into memory and open it from the byte stream.
with open("./00b3ad2ad0af97ec4a85274510343e04.pdf", "rb") as f:
    pdf_bytes = f.read()
document = fitz.open(stream=pdf_bytes, filetype="pdf")

# Page 5 in 1-based numbering has the 0-based index 4.
page = document.load_page(4)
blocks = page.get_text("blocks")
for block_no, block in enumerate(blocks):
    print(f"block {block_no}:", block)
    print('\n')

PyMuPDF version

1.24.6

Operating system

Linux

Python version

3.9

JorjMcKie · 2024-11-06T10:14:22Z

Let me ask you this:

When you upgraded from your now unsupported Python version 3.8 ... why did you not choose the current Python 3.13?
Why did you not also upgrade PyMuPDF?

But anyway. You do indeed have an example PDF page that contains lots of hidden text. Older versions of PyMuPDF (and MuPDF), including the current 1.24.13, extract all text, including hidden text.
The next MuPDF version, 1.25.0, allows de-selecting text that is clipped away, as happens on your page.
PyMuPDF 1.25.0 will automatically choose this option and show only 5 text blocks on this page. Older versions like yours and the current 1.24.13 show 15 instead.

So all I can offer you is to wait for the next PyMuPDF version.
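
Until then, one way to at least spot such near-duplicate blocks on an older version is to compare the block texts pairwise. The following is only a rough sketch, not something PyMuPDF provides: the helper name `find_similar_blocks`, the 0.9 threshold, and the sample data are all made up for illustration. It assumes the tuple layout that page.get_text("blocks") returns, namely (x0, y0, x1, y1, text, block_no, block_type), and uses the standard library's difflib for the comparison.

```python
from difflib import SequenceMatcher


def find_similar_blocks(blocks, threshold=0.9):
    """Return (i, j, ratio) for pairs of text blocks with nearly identical text.

    Hypothetical helper. Each block is assumed to be a tuple shaped like the
    items of page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type).
    """
    # Keep only text blocks (block_type == 0) and normalize whitespace.
    texts = [
        (idx, " ".join(b[4].split()))
        for idx, b in enumerate(blocks)
        if b[6] == 0
    ]
    pairs = []
    for a in range(len(texts)):
        for b in range(a + 1, len(texts)):
            i, text_i = texts[a]
            j, text_j = texts[b]
            if not text_i or not text_j:
                continue  # skip blocks without any visible characters
            ratio = SequenceMatcher(None, text_i, text_j).ratio()
            if ratio >= threshold:
                pairs.append((i, j, ratio))
    return pairs


# Made-up sample data, NOT the blocks from the reported PDF.
sample_blocks = [
    (50.0, 60.0, 300.0, 80.0, "Quarterly results overview\n", 0, 0),
    (50.0, 100.0, 300.0, 140.0, "Revenue grew in all regions.\n", 1, 0),
    (50.0, 400.0, 300.0, 440.0, "Revenue grew in all regions\n", 2, 0),
]

for i, j, ratio in find_similar_blocks(sample_blocks):
    print(f"block {i} and block {j} look alike (ratio {ratio:.2f})")
```

This only flags candidate pairs. It cannot tell which of the two blocks is the hidden one, so the bboxes still need to be checked against the rendered page, and the threshold would need tuning for real documents.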

qianyue76 · 2024-11-06T10:36:02Z

Thanks for your reply. I now understand the cause of the problem.
I will respond to your questions below.

Python 3.13 is too new for me. I chose version 3.8 for this project, and it was still supported at the time.
When I used PyMuPDF for the first time, I installed what was then the latest version, and I have only just learned that there is a newer one.
I will consider moving to a newer Python version and will wait for the next PyMuPDF version.
By the way, when will the new PyMuPDF version be released?

JorjMcKie · 2024-11-06T10:41:42Z

There is no defined date yet. But it should be fairly soon. The MuPDF team is already about to publish a release candidate for MuPDF 1.25.0 (which PyMuPDF requires for its new version).

qianyue76 · 2024-11-06T11:04:31Z

Okay, I'll keep an eye out for the new version. Thanks again.

JorjMcKie added the enhancement and fix developed (release schedule to be determined) labels on Nov 6, 2024

julian-smith-artifex-com added the upstream bug (bug outside this package) label on Nov 15, 2024

JorjMcKie added the enhancement-upstream (to be implemented by MuPDF) label and removed the upstream bug (bug outside this package) label on Nov 17, 2024
