Extracting text from PDFs with encodings- Identity-H, Roman fails, gives a blank response. #1132

nishantkumar21stjul · 2024-04-25T15:58:59Z

Describe the bug

Extracting text from PDFs that use the Identity-H or Roman encodings fails: pdfplumber returns a blank response instead of the text in the PDF (sample PDF: 5092.CID_Overview.pdf).

Output-

Sample Code-

import pdfplumber
def pdfProcessing_AgendaExtraction(pdfPath):
    bold_content = []
    agendaList = []
    with pdfplumber.open(pdfPath) as pdf:
        totalpages = len(pdf.pages)
        #print(totalpages)
        for page in range(totalpages):  # range(0, totalpages-1) skipped the last page
            text = pdf.pages[page]
            #print(pdf.pages[0])
            #print(text)
            clean_text = text.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
            #print(clean_text.extract_text())
            bold_content.append(clean_text.extract_text())

    #print(bold_content)
    for item in bold_content:
        #print(item)
        #print("*"*20)
        if item.startswith("Agenda"):
            agendaList.append(item)
        else:
            if item.startswith("Annexure"):
                break
            #continue

    #print("-"*50)
    return agendaList

pdfPath = "C:\\Users\\nkumar34\\Desktop\\demo\\demo\\data\\5092.CID_Overview.pdf"
data = pdfProcessing_AgendaExtraction(pdfPath)
print(data)

After some troubleshooting and research, I found the likely cause below (a similar issue has been reported for a different library [LINK]):

The Identity-H and Roman encodings appear to cause the issue. The same code works fine on a PDF with ASCII encoding.
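One way to check whether the encoding is really the cause is to look at what pdfplumber reports for each character before any filtering. The sketch below is only a diagnostic aid, not part of the original report: `summarize_fonts` and `inspect_pdf` are hypothetical helper names, and it assumes each entry in `page.chars` is a dict with `fontname` and `text` keys. If readable text shows up next to a font name containing "Bold", extraction itself is working and the problem lies in the later filtering.

```python
from collections import defaultdict


def summarize_fonts(chars, sample_len=40):
    """Group char dicts by font name and return a short text sample for each."""
    samples = defaultdict(str)
    for ch in chars:
        name = ch.get("fontname", "")
        # Stop adding to a sample once it is long enough to eyeball.
        if len(samples[name]) < sample_len:
            samples[name] += ch.get("text", "")
    return dict(samples)


def inspect_pdf(pdf_path):
    """Print the fonts and a text sample for every page (requires pdfplumber)."""
    import pdfplumber  # imported lazily so summarize_fonts works without it

    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            for fontname, sample in summarize_fonts(page.chars).items():
                print(f"page {number}: {fontname!r} -> {sample!r}")
```

Calling `inspect_pdf(pdfPath)` on the sample file would list every font name per page along with the first few characters extracted in that font.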

Have you tried repairing the PDF?

Yes, it didn't work.

Expected behavior

The output should contain all the bold text in the PDF document.

Actual behavior

The output is an empty list instead of a list of the extracted bold text.

Environment

pdfplumber version: 0.11.0

Python version: 3.11.6

OS: Windows 10 Enterprise

jsvine · 2024-05-16T16:14:40Z

As far as I can tell from the code shared above, the blank result here does not come from a problem with pdfplumber but rather the fact that the attached PDF does not contain the phrase "Agenda". If I've misunderstood, please update this issue with a simpler example that demonstrates the bug you're seeing.
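To illustrate the point, here is a minimal sketch of just the selection loop from the sample code, separated from any PDF handling. `select_agenda_items` is a hypothetical name, and the input strings are made up for illustration; they are not the contents of the attached PDF. The function returns an empty list whenever none of the extracted strings starts with "Agenda", regardless of how the text was encoded.

```python
def select_agenda_items(bold_content):
    """Mirror the loop in the report: keep items starting with "Agenda",
    and stop at the first item starting with "Annexure"."""
    agenda_list = []
    for item in bold_content:
        if item.startswith("Agenda"):
            agenda_list.append(item)
        elif item.startswith("Annexure"):
            break
    return agenda_list


# Hypothetical per-page strings, standing in for extract_text() results.
print(select_agenda_items(["Overview", "Introduction"]))
print(select_agenda_items(["Agenda 1", "Notes", "Annexure A", "Agenda 2"]))
```

The first call prints an empty list even though text was extracted successfully, which matches the behavior described in the report.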

nishantkumar21stjul added the bug label Apr 25, 2024

jsvine closed this as completed May 16, 2024
