relative=True in page.extract_text() not working #391

LiutongZhou · 2021-03-20T06:55:31Z

The Bug

Setting relative box coordinates in crop and then calling extract_text does not work: page.crop(box_coordinates, relative=True)

Code to reproduce the problem

import pdfplumber

# Notebook shell command (Jupyter/Colab); drop the leading "!" to run it in a terminal.
!wget https://s22.q4cdn.com/407748750/files/doc_financials/2020/ar/2020-Proxy-Card.pdf -O Some.pdf --no-check-certificate

with pdfplumber.open("Some.pdf") as pdf:
    page = pdf.pages[0]
    box_coordinates = (0, 0, 1.0, 1.0)
    crop = page.crop(box_coordinates, relative=True)
    text = crop.extract_text()

assert text, "Not Working"

PDF file

https://s22.q4cdn.com/407748750/files/doc_financials/2020/ar/2020-Proxy-Card.pdf

Expected behavior

Return the text of the page

Actual behavior

Return nothing

Environment

pdfplumber version: 0.5.27
Python version: 3.8
OS: Linux


samkit-jain · 2021-03-21T13:20:56Z

Hi @LiutongZhou, the issue you are facing is not necessarily a bug. You are getting None when extracting text from the cropped page because the cropped region contains no text. The cropped region is just a single point, as you can see from the image representation of the saved page below.

The bounding box (0, 0, 1, 1) is just a square of area 1.

LiutongZhou · 2021-03-22T20:52:04Z

The bounding box (0, 0, 1, 1) is just a square of area 1 @samkit-jain

Hi @samkit-jain, is the above statement still true even if I set relative=True in the crop method? If so, what differentiates relative=True from relative=False?

I was expecting page.crop((a, b, c, d), relative=True) to be equivalent to page.crop((a * width, b * height, c * width, d * height))

Please help me understand it.

Thanks

jsvine · 2021-03-23T01:25:07Z

Hi @LiutongZhou, I think you may be misunderstanding the units of a bounding box. (0, 0, 1., 1.) suggests you may be trying to get the full width and height (since in some domains, 1 is equivalent to 100%). If that's your goal, then you will want something like this: (0, 0, page.width, page.height)
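For anyone who does want to describe a crop box as fractions of the page, here is a minimal sketch of the conversion. The helper name fractional_to_absolute is hypothetical and not part of pdfplumber; it only does arithmetic on the width and height you pass in, and the 612 x 792 page size is an assumed example value.

```python
def fractional_to_absolute(frac_bbox, page_width, page_height):
    """Scale an (x0, top, x1, bottom) box given as fractions of the page
    into absolute units. Hypothetical helper, not a pdfplumber API."""
    x0, top, x1, bottom = frac_bbox
    return (
        x0 * page_width,
        top * page_height,
        x1 * page_width,
        bottom * page_height,
    )


# Assumed example page size (US Letter, in points).
full_page = fractional_to_absolute((0, 0, 1.0, 1.0), 612, 792)
top_half = fractional_to_absolute((0, 0, 1.0, 0.5), 612, 792)
print(full_page)
print(top_half)
```

With a real page, the result of fractional_to_absolute(box, page.width, page.height) could then be passed to page.crop(...) with the default relative=False.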

LiutongZhou · 2021-03-23T01:34:03Z

Hi @jsvine. Thank you for your explanation. But I was hoping that crop((0, 0, 1., 1.), relative=True) would give me the same result as crop((0, 0, page.width, page.height), relative=False).

I assumed this was the intention behind the optional parameter relative.

Is my understanding wrong?

jsvine · 2021-03-23T01:38:26Z

Ah, now I better understand your question. Thank you for clarifying. Here is an explanation of the relative parameter, from the documentation:

If relative=True, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See Issue #245 for a visual example and explanation.)

LiutongZhou · 2021-03-23T01:51:43Z

Okay, this is confusing :D

I would not have been able to understand it if I hadn’t read the whole of issue #245.

So relative=True is equivalent to relative=False when cropping from the whole page. relative=True makes a difference only when crop is called on an already-cropped page, in which case the coordinates are relative to the origin (top-left corner) of the cropped region of the original page.

jsvine · 2021-03-23T01:54:51Z

So relative=True is equivalent to relative=False when the whole page is being cropped. relative=True makes a difference only when it is called from a cropped page, in which case coordinates are relative to the origin (top left corner) of the cropped region of the original page.

Yep, exactly! That's a great summary.
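To make that summary concrete, here is a rough sketch of the offset arithmetic in plain Python. It is not pdfplumber's implementation: relative_to_absolute is a hypothetical helper, the box values are made up, and it assumes the uncropped page's bounding box starts at (0, 0).

```python
def relative_to_absolute(rel_bbox, parent_bbox):
    """Shift an (x0, top, x1, bottom) box, given relative to the parent's
    top-left corner, into absolute page coordinates.
    Hypothetical helper illustrating the semantics only."""
    x0, top, x1, bottom = rel_bbox
    parent_x0, parent_top = parent_bbox[0], parent_bbox[1]
    return (
        x0 + parent_x0,
        top + parent_top,
        x1 + parent_x0,
        bottom + parent_top,
    )


whole_page = (0, 0, 612, 792)    # assumed page bounding box
cropped = (100, 200, 400, 600)   # assumed bounding box of an earlier crop
box = (10, 20, 110, 120)

# On the whole page the offset is zero, so relative and absolute agree.
on_page = relative_to_absolute(box, whole_page)
# On a cropped page the box is shifted by the crop's top-left corner.
on_crop = relative_to_absolute(box, cropped)
print(on_page)  # (10, 20, 110, 120)
print(on_crop)  # (110, 220, 210, 320)
```

Note that nothing here scales by the page size, which is why (0, 0, 1, 1) with relative=True still describes a 1 x 1 region.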

situchen · 2021-04-16T11:00:37Z

Hi @jsvine, when I extract the text in a specified area, images in that area affect the accuracy of the extraction, and the spaces in the specified area are removed. If I use Adobe Acrobat to delete the image and then extract the specified area, the extraction works normally. Please help, thank you very much.

jsvine · 2021-04-20T13:17:29Z

Hi @situchen, your inquiry seems unrelated to this issue thread. Please instead open a discussion here and provide as much detail as possible (including the original PDF, the code you're using, etc.), so that we can best help you.

LiutongZhou added the bug label Mar 20, 2021

samkit-jain self-assigned this Mar 21, 2021

samkit-jain added the troubleshooting label (issues that seek assistance with parsing specific PDFs) and removed the bug label Mar 21, 2021

samkit-jain closed this as completed Mar 21, 2021
