Text Extraction Yields cid and Fails on Mixed Content Pages in PDF #1036

Closed
hrhktkbzyy opened this issue Aug 22, 2024 · 1 comment

@hrhktkbzyy

Issue:

When attempting to extract text from the attached PDF, several pages return cid values instead of readable text. Additionally, pages containing mixed content (text and images) do not return any text at all.

Affected PDF:

The Phantom of the Opera.pdf

Code Sample:

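from pdfminer.high_level import extract_text

def get_text_from_pdf_by_pdfminer(file_path):
    """Return (text, number_of_pages) for a pathlib.Path, or None on failure."""
    try:
        text = extract_text(file_path.absolute())
        # pdfminer ends each page with a form feed, so counting them gives the page count.
        number_of_pages = text.count('\f')
        return text, number_of_pages
    except Exception as e:
        print(e)
        return None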

Output:

The extracted content includes cid values such as:

(cid:11550)(cid:450)(cid:5509)
(cid:12720)(cid:450)(cid:1275)
(cid:20)(cid:450)(cid:55)(cid:75)(cid:72)(cid:3)(cid:71)(cid:68)(cid:81)(cid:70)(cid:72)(cid:85)(cid:86)
(cid:20)(cid:714)(cid:14414)(cid:17528)(cid:9540)(cid:2696)(cid:1308)
(cid:21)(cid:450)(cid:55)(cid:75)(cid:72)(cid:3)(cid:71)(cid:76)(cid:85)(cid:72)(cid:70)(cid:87)(cid:82)(cid:85)(cid:86)(cid:3)(cid:82)(cid:73)(cid:3)(cid:87)(cid:75)(cid:72)(cid:3)(cid:50)(cid:83)(cid:72)(cid:85)(cid:68)(cid:3)(cid:43)(cid:82)(cid:88)(cid:86)(cid:72)
...
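
To see which pages are affected, the extracted string can be split on the form feed characters and each page checked separately. This is only a rough diagnostic sketch: `summarize_pages` and the keys it returns are made-up names, and it assumes the text comes from `extract_text` with one `\f` at the end of every page.

```python
import re

# Matches the placeholder pdfminer prints when a glyph has no Unicode mapping.
CID_TOKEN = re.compile(r"\(cid:\d+\)")

def summarize_pages(text):
    """Hypothetical helper: report cid tokens and readable characters per page."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty string after the last page; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    summary = []
    for number, page in enumerate(pages, start=1):
        cid_count = len(CID_TOKEN.findall(page))
        readable = CID_TOKEN.sub("", page).strip()
        summary.append({
            "page": number,
            "cid_tokens": cid_count,
            "readable_chars": len(readable),
            "empty": cid_count == 0 and not readable,
        })
    return summary
```

Pages with a high `cid_tokens` count are the ones with unmapped glyphs, and pages flagged `empty` are the ones that returned no text at all.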
@hrhktkbzyy changed the title from "extract_text got" to "Text Extraction Yields cid and Fails on Mixed Content Pages in PDF" on Aug 22, 2024
@dhdaines
Contributor

This PDF has completely arbitrary and corrupt ToUnicode character mappings, so it's unlikely that pdfminer can do much about it. You can see the problem by copying and pasting text out of it in your browser's PDF viewer (Chrome, in my case). Even the English text is corrupted. For example, "The dancers" on page 3 comes out as:

7KHGDQFHUV
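
Comparing that string with the cid output in the report suggests a pattern for the Latin text in this sample: the pasted characters are the raw cid numbers read as code points ('7' is 55, 'K' is 75), and adding 29 to each gives the intended letter (55 + 29 = 84, which is 'T'). The sketch below applies that shift. The offset is inferred from this one line only, not something pdfminer provides, and `decode_latin_cids` is a hypothetical helper. It will not recover the CJK text, whose cid values do not follow this pattern.

```python
import re

CID_PATTERN = re.compile(r"\(cid:(\d+)\)")
LATIN_OFFSET = 29  # assumption inferred from the sample: cid 55 -> 'T' (U+0054)

def decode_latin_cids(line, offset=LATIN_OFFSET):
    """Replace (cid:N) tokens with chr(N + offset) when that is printable ASCII."""
    def replace(match):
        code = int(match.group(1)) + offset
        # Only map into printable ASCII; leave every other token untouched.
        if 0x20 <= code <= 0x7E:
            return chr(code)
        return match.group(0)
    return CID_PATTERN.sub(replace, line)

sample = ("(cid:55)(cid:75)(cid:72)(cid:3)(cid:71)(cid:68)"
          "(cid:81)(cid:70)(cid:72)(cid:85)(cid:86)")
print(decode_latin_cids(sample))  # The dancers
```

Tokens such as `(cid:450)` fall outside the ASCII range after the shift and are left as they are, so this is at best a workaround for the English portions of this particular file.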
