Extracting text from PDFs with encodings- Identity-H, Roman fails, gives a blank response. #1132

Closed
nishantkumar21stjul opened this issue Apr 25, 2024 · 1 comment

@nishantkumar21stjul

Describe the bug

Extracting text from PDFs that use the Identity-H and Roman encodings fails: the result is blank instead of containing the text from the PDF (sample PDF: 5092.CID_Overview.pdf).

Output: [screenshot]

Sample code:

import pdfplumber


def pdfProcessing_AgendaExtraction(pdfPath):
    """Collect the bold text of each page and keep the items that start with "Agenda"."""
    bold_content = []
    agendaList = []
    with pdfplumber.open(pdfPath) as pdf:
        # Visit every page, including the last one.
        for page in pdf.pages:
            # Keep only characters whose font name contains "Bold".
            bold_page = page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
            bold_content.append(bold_page.extract_text())

    for item in bold_content:
        if item.startswith("Agenda"):
            agendaList.append(item)
        elif item.startswith("Annexure"):
            # Stop at the first page whose bold text starts with "Annexure".
            break

    return agendaList

pdfPath = "C:\\Users\\nkumar34\\Desktop\\demo\\demo\\data\\5092.CID_Overview.pdf"
data = pdfProcessing_AgendaExtraction(pdfPath)
print(data)

After troubleshooting and a bit of research, I found the following likely cause (a similar issue has been reported for a different library [LINK]):

  • The Identity-H and Roman encodings cause issues. The same code has been validated against a PDF with ASCII encoding, and it works fine.
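
One way to narrow this down, as a minimal sketch rather than a confirmed diagnosis: the filter in the sample code only keeps characters whose font name contains "Bold", so it helps to see which font names a page actually reports. The helpers `font_name_counts` and `has_bold_font` are hypothetical names, and `sample_chars` is stand-in data; with pdfplumber, the list would come from `page.chars`, where each character is a dict with a `"fontname"` key.

```python
from collections import Counter


def font_name_counts(chars):
    """Count characters per font name; chars is a list of dicts with a "fontname" key."""
    return Counter(char["fontname"] for char in chars)


def has_bold_font(chars, marker="Bold"):
    """True if any font name contains marker, the same test the sample filter uses."""
    return any(marker in name for name in font_name_counts(chars))


# Stand-in data (an assumption for illustration); with pdfplumber this would be page.chars.
sample_chars = [
    {"fontname": "ABCDEF+Arial-BoldMT", "text": "A"},
    {"fontname": "ABCDEF+ArialMT", "text": "b"},
    {"fontname": "ABCDEF+ArialMT", "text": "c"},
]

print(font_name_counts(sample_chars))
print(has_bold_font(sample_chars))
```

If no font name on any page contains "Bold", the filtered pages would be empty regardless of encoding, which would point away from Identity-H as the cause.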

Have you tried repairing the PDF?

Yes, it didn't work.

Expected behavior

The output should contain all the bold text in the PDF document.

Actual behavior

The output is an empty list instead of a list of the extracted bold text.

Environment

  • pdfplumber version: 0.11.0

  • Python version: 3.11.6

  • OS: Windows 10 Enterprise

@jsvine
Owner

jsvine commented May 16, 2024

As far as I can tell from the code shared above, the blank result here does not come from a problem with pdfplumber but rather the fact that the attached PDF does not contain the phrase "Agenda". If I've misunderstood, please update this issue with a simpler example that demonstrates the bug you're seeing.
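
To tell the two explanations in this thread apart (nothing was extracted, versus text was extracted but no item starts with "Agenda"), a minimal sketch might look like the following. The names `diagnose_empty_result` and `bold_text_per_page` are hypothetical helpers, not part of pdfplumber; the second one reuses the `filter` and `extract_text` calls from the sample code and imports pdfplumber lazily, so the first can be tried on plain strings.

```python
def diagnose_empty_result(bold_pages, prefix="Agenda"):
    """Explain why filtering bold_pages by prefix might give an empty list.

    bold_pages: one string per page, as returned by extract_text().
    """
    non_empty = [text for text in bold_pages if text and text.strip()]
    if not non_empty:
        return "extraction returned no bold text"
    if not any(text.startswith(prefix) for text in non_empty):
        return "text was extracted, but no page starts with " + repr(prefix)
    return "at least one page matches"


def bold_text_per_page(pdf_path):
    """Hypothetical helper: the bold text of each page, extracted with pdfplumber."""
    import pdfplumber  # lazy import, so diagnose_empty_result runs without it

    with pdfplumber.open(pdf_path) as pdf:
        return [
            page.filter(
                lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"]
            ).extract_text()
            for page in pdf.pages
        ]
```

Calling `diagnose_empty_result(bold_text_per_page(pdfPath))` on the sample PDF would then show which of the two cases applies before any encoding question needs to be investigated.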

@jsvine jsvine closed this as completed May 16, 2024