Extracting text from PDFs that use the encodings Identity-H, Roman fails: pdfplumber returns a blank result instead of the text in the PDF [sample PDF: 5092.CID_Overview.pdf].
Output-
Sample Code-
import pdfplumber


def pdfProcessing_AgendaExtraction(pdfPath):
    bold_content = []
    agendaList = []
    with pdfplumber.open(pdfPath) as pdf:
        # Visit every page, including the last one.
        for page in pdf.pages:
            # Keep only characters whose font name marks them as bold.
            clean_text = page.filter(
                lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"]
            )
            # Fall back to "" if a page yields no bold text.
            bold_content.append(clean_text.extract_text() or "")
    for item in bold_content:
        if item.startswith("Agenda"):
            agendaList.append(item)
        elif item.startswith("Annexure"):
            # Stop at the first page whose bold text starts with "Annexure".
            break
    return agendaList


pdfPath = "C:\\Users\\nkumar34\\Desktop\\demo\\demo\\data\\5092.CID_Overview.pdf"
data = pdfProcessing_AgendaExtraction(pdfPath)
print(data)
After troubleshooting and a bit of research, I found the cause below (a similar issue has been reported for a different library [LINK]):
The Identity-H, Roman encoding causes the issue. The same code has been validated against a PDF with ASCII encoding, and it works fine.
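One way to test the encoding hypothesis separately from the bold filter is to look for unmapped glyphs in the raw extracted text. The following is only a minimal sketch. It assumes that pdfminer.six, which pdfplumber builds on, renders characters it cannot map to Unicode as `(cid:N)` placeholders. The helper name `unmapped_glyph_ratio` is hypothetical and not part of pdfplumber.

```python
import re

# Placeholder form assumed for glyphs that could not be mapped to Unicode.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def unmapped_glyph_ratio(text):
    """Return the fraction of glyphs in `text` that appear as (cid:N) placeholders."""
    if not text:
        return 0.0
    placeholders = CID_PATTERN.findall(text)
    # Count each placeholder as one glyph; everything else is ordinary text.
    remaining = CID_PATTERN.sub("", text)
    total = len(placeholders) + len(remaining)
    return len(placeholders) / total if total else 0.0
```

Calling this on `page.extract_text()` for each page (without any font filter) would suggest whether the text layer is being decoded at all. A ratio near zero with readable text would point away from the encoding and toward the filtering logic.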
As far as I can tell from the code shared above, the blank result here does not come from a problem with pdfplumber but rather the fact that the attached PDF does not contain the phrase "Agenda". If I've misunderstood, please update this issue with a simpler example that demonstrates the bug you're seeing.
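To check this against a given PDF, it can help to see which font names the characters carry and what the bold-only text actually looks like before applying the "Agenda" test. The sketch below works on a list of character dicts shaped like pdfplumber's `page.chars` (each with `"text"` and `"fontname"` keys). The helper names and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def font_name_counts(chars):
    """Tally characters per font name; `chars` is a list of dicts like page.chars."""
    return Counter(c["fontname"] for c in chars)


def bold_text(chars):
    """Join the text of characters whose font name contains "Bold"."""
    return "".join(c["text"] for c in chars if "Bold" in c["fontname"])


# Hypothetical sample standing in for page.chars on a real page.
sample_chars = [
    {"text": "A", "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": "g", "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": "x", "fontname": "ABCDEF+ArialMT"},
]

print(font_name_counts(sample_chars))
print(bold_text(sample_chars))
```

If no font name in the tally contains "Bold", or the bold text on no page begins with "Agenda", the sample code returns an empty list regardless of the encoding. Note that joining characters this way ignores spacing and line layout, so it is a rough inspection aid rather than a replacement for `extract_text()`.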
Have you tried repairing the PDF?
Yes, it didn't work.
Expected behavior
The output should contain all the bold text in the PDF document.
Actual behavior
The output is an empty list instead of a list of the extracted bold text.
Environment