relative=True in page.extract_text() not working #391

Closed
LiutongZhou opened this issue Mar 20, 2021 · 9 comments
Labels
troubleshooting Issues that seek assistance with parsing specific PDFs

@LiutongZhou

The Bug

Passing relative box coordinates to crop and then calling extract_text does not work, e.g. page.crop(box_coordinates, relative=True).

Code to reproduce the problem

import pdfplumber

# Download the PDF first. In a Jupyter notebook:
# !wget https://s22.q4cdn.com/407748750/files/doc_financials/2020/ar/2020-Proxy-Card.pdf -O Some.pdf --no-check-certificate

with pdfplumber.open("Some.pdf") as pdf:
    page = pdf.pages[0]
    # Bounding box is (x0, top, x1, bottom)
    box_coordinates = (0, 0, 1.0, 1.0)
    crop = page.crop(box_coordinates, relative=True)
    text = crop.extract_text()

# Fails: no text is returned for this crop
assert text, "Not Working"

PDF file

https://s22.q4cdn.com/407748750/files/doc_financials/2020/ar/2020-Proxy-Card.pdf

Expected behavior

Return the text of the page

Actual behavior

Return nothing

Environment

  • pdfplumber version: 0.5.27
  • Python version: 3.8
  • OS: Linux

samkit-jain self-assigned this on Mar 21, 2021
samkit-jain added the troubleshooting label (Issues that seek assistance with parsing specific PDFs) and removed the bug label on Mar 21, 2021

@samkit-jain
Collaborator

Hi @LiutongZhou, what you are seeing is not necessarily a bug. extract_text() returns None on the cropped page because the cropped region contains no text. The crop covers only a tiny 1 × 1 square at the top-left corner of the page, as you can see from the image of the cropped page below.
image

The bounding box (0, 0, 1, 1) is just a square of area 1.
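
To make the scale concrete, here is a small sketch of the arithmetic. The helper name bbox_size and the 612 × 792 page size (US Letter, in PDF points) are assumptions for illustration; the actual page in this issue may have different dimensions.

```python
def bbox_size(bbox):
    """Return (width, height) of an (x0, top, x1, bottom) bounding box."""
    x0, top, x1, bottom = bbox
    return (x1 - x0, bottom - top)


# Assumed page size for illustration only (US Letter, in PDF points).
PAGE_WIDTH, PAGE_HEIGHT = 612, 792

crop_width, crop_height = bbox_size((0, 0, 1, 1))

# Fraction of the assumed page area that the crop covers.
covered = (crop_width * crop_height) / (PAGE_WIDTH * PAGE_HEIGHT)

print(crop_width, crop_height)
print(f"fraction of page covered: {covered:.8f}")
```

A region that small is very unlikely to contain any characters, which is consistent with the empty result.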

@LiutongZhou
Author

LiutongZhou commented Mar 22, 2021

@samkit-jain wrote: "The bounding box (0, 0, 1, 1) is just a square of area 1."

Hi @samkit-jain, is the above statement still true even if I set relative=True in the crop method? If so, what differentiates relative=True from relative=False?

I was expecting page.crop((a, b, c, d), relative=True) to be equivalent to page.crop((a * width, b * height, c * width, d * height)).

Please help me understand it.

Thanks

@jsvine
Owner

jsvine commented Mar 23, 2021

Hi @LiutongZhou, I think you may be misunderstanding the units of a bounding box. (0, 0, 1., 1.) sounds like you may be trying to get the full width and height (since in some domains, 1 is equivalent to 100%). Bounding boxes here are expressed in page units, not fractions, so if that's your goal, you will want something like this: (0, 0, page.width, page.height)
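
If fractional coordinates are more convenient, one option is to scale them to page units yourself before cropping. The sketch below is only an illustration: fractions_to_points is a hypothetical helper, not part of pdfplumber, and the 612 × 792 page size is an assumed stand-in for page.width and page.height.

```python
def fractions_to_points(frac_bbox, page_width, page_height):
    """Scale a fractional (x0, top, x1, bottom) box to page units.

    Hypothetical helper: each coordinate is a fraction in [0, 1] of the
    page width (x values) or page height (top/bottom values).
    """
    fx0, ftop, fx1, fbottom = frac_bbox
    return (
        fx0 * page_width,
        ftop * page_height,
        fx1 * page_width,
        fbottom * page_height,
    )


# Assumed dimensions; with pdfplumber these would come from
# page.width and page.height.
full_page = fractions_to_points((0, 0, 1.0, 1.0), 612, 792)
top_half = fractions_to_points((0, 0, 1.0, 0.5), 612, 792)

print(full_page)
print(top_half)
```

The resulting tuple could then be passed to page.crop(...) as an ordinary bounding box.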

@LiutongZhou
Author

Hi @jsvine. Thank you for your explanation. But I was hoping that crop((0, 0, 1., 1.), relative=True) would give me the same result as crop((0, 0, page.width, page.height), relative=False).

I assume this is the intention for having this optional parameter relative.

Is my understanding wrong?

@jsvine
Owner

jsvine commented Mar 23, 2021

Ah, now I better understand your question. Thank you for clarifying. Here is an explanation of the relative parameter, from the documentation:

If relative=True, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See Issue #245 for a visual example and explanation.)

@LiutongZhou
Author

Okay, this is confusing :D

I would not have understood it without reading the whole of issue #245.

So relative=True is equivalent to relative=False when the whole page is being cropped. relative=True makes a difference only when it is called on a cropped page, in which case the coordinates are relative to the origin (top-left corner) of the cropped region of the original page.

@jsvine
Owner

jsvine commented Mar 23, 2021

@LiutongZhou wrote: "So relative=True is equivalent to relative=False when the whole page is being cropped. relative=True makes a difference only when it is called on a cropped page, in which case the coordinates are relative to the origin (top-left corner) of the cropped region of the original page."

Yep, exactly! That's a great summary.
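
The offset behavior summarized above can be modeled in a few lines. This is a sketch of the documented semantics, not pdfplumber's actual source: resolve_relative_bbox is a hypothetical name, the coordinates are made up, and it assumes the uncropped page's own bounding box starts at (0, 0).

```python
def resolve_relative_bbox(parent_bbox, bbox):
    """Shift bbox by the top-left corner of parent_bbox.

    Both boxes are (x0, top, x1, bottom). Models relative=True: the
    requested box is treated as an offset from the parent's top-left.
    """
    parent_x0, parent_top, _, _ = parent_bbox
    x0, top, x1, bottom = bbox
    return (x0 + parent_x0, top + parent_top, x1 + parent_x0, bottom + parent_top)


# Illustrative values only.
page_bbox = (0, 0, 612, 792)         # assumed full page
cropped_bbox = (100, 200, 400, 600)  # an earlier crop of that page
request = (10, 20, 110, 120)         # box passed with relative=True

# On the full page, the offset is zero, so nothing changes.
on_page = resolve_relative_bbox(page_bbox, request)

# On the cropped page, the box is shifted by the crop's top-left corner.
on_crop = resolve_relative_bbox(cropped_bbox, request)

print(on_page)
print(on_crop)
```

Under these assumptions, the same request resolves to itself on the full page and to a shifted box on the cropped page, which is why (0, 0, 1, 1) with relative=True still selects a 1 × 1 region.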

@situchen

image

Hi @jsvine, when I extract the text in a specified area, images in that area reduce the accuracy of the extraction, and the spaces in the specified area are removed. If I delete the image with Adobe Acrobat and then extract the same area, the extraction works normally. Please help, thank you very much.

@jsvine
Owner

jsvine commented Apr 20, 2021

Hi @situchen, your inquiry seems unrelated to this issue thread. Please open a discussion here instead and provide as much detail as possible (including the original PDF, the code you're using, etc.), so that we can best help you.
