Showing squares with hex values instead of text in some PDFs #15289

Closed
jeremyn opened this issue Aug 8, 2022 · 25 comments · Fixed by #15290 or #15900

Comments

@jeremyn

jeremyn commented Aug 8, 2022

Attach (recommended) or Link to PDF file here:

  • Any O'Reilly book preview PDF at Humble Bundle seems to have the problem. You can find a current bundle here with many direct links. Please note the bundles only last for a limited time, and direct links to preview PDFs have a "ttl" query string parameter so I suppose they expire. I'm unwilling to upload an example here since I don't own the content. It would be ideal if a pdf.js dev could download test cases themselves while the bundle is still active.

Configuration:

  • Web browser and its version: Firefox 103.0.1 (64-bit).
  • Operating system and its version: Windows 10 Pro 10.0.19044.
  • PDF.js version: 2.15.129 according to the developer tools console.
  • Is a browser extension: No.

Steps to reproduce the problem:

  1. Open one of the O'Reilly book preview PDFs on Humble Bundle and see the problem. It also happens if I save the preview PDF locally and then open it. Others have reported the problem on Reddit here. It only started recently, in the past month or so. If I copy and paste text from the broken preview PDF into Notepad, the text looks fine.
  2. The problem doesn't happen if I:
  • open a complete O'Reilly non-preview PDF that is local on my system
  • open an O'Reilly preview PDF in Firefox on Linux
  • open an O'Reilly preview PDF in Chrome on Windows
  • open a Humble Bundle preview PDF from another publisher (Packt)

What is the expected behavior? (add screenshot)

  • PDF is readable.

What went wrong? (add screenshot)

  • PDF is unreadable because all the text is replaced by squares with hex values. On some but not all broken PDFs I see errors like this in the console:

Warning: Failed to load font 'g_d0_f3': 'SyntaxError: An invalid or illegal string was specified'. pdf.js:446:13
downloadable font: CFF : Failed to parse Global Subrs INDEX (font-family: "g_d0_f3" style:normal weight:400 stretch:100 src index:0) source: (invalid URI)
downloadable font: CFF : Failed to parse table (font-family: "g_d0_f3" style:normal weight:400 stretch:100 src index:0) source: (invalid URI)
downloadable font: rejected by sanitizer (font-family: "g_d0_f3" style:normal weight:400 stretch:100 src index:0) source: (invalid URI)
downloadable font: font load failed (font-family: "g_d0_f3" style:normal weight:400 stretch:100 src index:0) source: (invalid URI)

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

  • N/A.
@jeremyn
Author

jeremyn commented Aug 8, 2022

Also it seems that some but not all of the preview PDFs in the current Essential Classic Fantasy RPG Collection bundle, not from O'Reilly, have the problem. An interesting example is the preview PDF for Let's Get Kraken where only the header text which is supposed to be "PART ONE: ADVENTURE OVERVIEW" is squares-with-hex-values, with the rest of the text readable.

@Snuffleupagus
Collaborator

Snuffleupagus commented Aug 8, 2022

Attaching a preview here, which is hopefully OK, since this isn't really easily actionable otherwise: issue15289.pdf

PDF f8951e869f298f8d2652f80b3c02c490 [1.7 GPL Ghostscript 9.56.1 / AH CSS Formatter V6.0 MR2 for Linux64 : 6.0.2.5372 (2012/05/16 18:26JST)] (PDF.js: 2.15.322) [viewer.js:1531:13](resource://pdf.js/web/viewer.js)
Warning: Failed to load font 'g_d0_f2': 'SyntaxError: An invalid or illegal string was specified'. [pdf.js:456:13](resource://pdf.js/build/pdf.js)
downloadable font: CFF : Failed to parse Global Subrs INDEX (font-family: "g_d0_f2" style:normal weight:400 stretch:100 src index:0) source: (invalid URI)
downloadable font: CFF : Failed to parse table (font-family: "g_d0_f2" style:normal weight:400 stretch:100 src index:0) source: (invalid URI)
downloadable font: rejected by sanitizer (font-family: "g_d0_f2" style:normal weight:400 stretch:100 src index:0) source: (invalid URI)
downloadable font: font load failed (font-family: "g_d0_f2" style:normal weight:400 stretch:100 src index:0) source: (invalid URI)
Warning: Out of bounds subrIndex for callgsubr 30 [pdf.worker.js:1123:13](resource://pdf.js/build/pdf.worker.js)

All of the affected fonts are Type1/Type1C, i.e. CFF fonts.
Unfortunately there appear to be multiple different issues with the font data in that PDF document:

  • The regular font, i.e. "JZKJAS+MinionPro-Regular", which renders with hex-outlines. There are no error or warning messages for this one; however, it renders correctly with disableFontFace=true set.
  • The italic font, i.e. "JCVSVF+MinionPro-It", which renders only partially. This font is rejected by OTS, see above, and it's not helped by the disableFontFace option.
  • The bold font, i.e. "YZOBJZ+MyriadPro-SemiboldCond", which doesn't render at all. There's a warning message for this one; however, it renders correctly with disableFontFace=true set.

@jeremyn
Author

jeremyn commented Aug 8, 2022

@Snuffleupagus I can confirm that your uploaded example reproduces the issue for me.

Please note that I said some but not all of the examples have errors in the console. Both the preview PDFs for Robust Python in the O'Reilly bundle I first linked, and Let's Get Kraken from the RPG bundle I linked in my first comment, have hex squares but no console errors. Also I looked more at the current Packt bundle here and contrary to my initial description where I said the Packt books were okay, the preview for the book Machine Learning with PyTorch and Scikit-Learn does have hex squares (and no console errors).

@calixteman
Contributor

calixteman commented Aug 8, 2022

About the sanitizer issue, it's very likely because of:

return [0, 0, 0];

The specs say:

An empty INDEX is represented by a count field with a 0 value
and no additional fields. Thus, the total size of an empty INDEX
is 2 bytes.

and a bug was fixed in the sanitizer 13 years ago:
khaledhosny/ots@9c33fbf
whereas the patch for the CFFParser is 11 years old:
ce53b1b#diff-20c7b0dcebfcf76bb7b0da0ea5fd10dd0a9b146507a7751dc16d16944dbf7cbc

The fix for OTS is likely to replace [0, 0, 0] by [0, 0].

That said I have no idea for the other issues.
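
To make the quoted spec text concrete, here is a minimal sketch of serializing a CFF INDEX so that the empty case comes out as exactly two bytes. The function name `compileIndex` and its input shape (an array of byte arrays) are assumptions for illustration; this is not the pdf.js CFF compiler.

```javascript
// Hypothetical helper, not the actual pdf.js code.
// Serializes a CFF INDEX from an array of byte arrays.
function compileIndex(objects) {
  const count = objects.length;
  if (count === 0) {
    // Empty INDEX: a two-byte count of 0 and no additional fields.
    return [0, 0];
  }
  // Offsets are 1-based, and there are count + 1 of them.
  const offsets = [1];
  for (const obj of objects) {
    offsets.push(offsets[offsets.length - 1] + obj.length);
  }
  const lastOffset = offsets[offsets.length - 1];
  // Pick the smallest offSize (1 to 4 bytes) that can hold the last offset.
  let offSize = 1;
  while (offSize < 4 && lastOffset >= 2 ** (8 * offSize)) {
    offSize++;
  }
  // Header: count (2 bytes, big-endian) followed by offSize (1 byte).
  const out = [(count >> 8) & 0xff, count & 0xff, offSize];
  // Offset array, each entry big-endian in offSize bytes.
  for (const offset of offsets) {
    for (let i = offSize - 1; i >= 0; i--) {
      out.push((offset >>> (8 * i)) & 0xff);
    }
  }
  // Object data, concatenated in order.
  for (const obj of objects) {
    out.push(...obj);
  }
  return out;
}
```

The three-byte form `[0, 0, 0]` would add an offSize byte that the spec says an empty INDEX must not have, which fits the sanitizer complaint about the Global Subrs INDEX.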

@calixteman
Contributor

On macOS, the rendering of page 1 is ok except for the italic font.
And the italic font issue is fixed thanks to the patch shown in my previous comment.
As @Snuffleupagus said, everything is ok with disableFontFace=true, so it's very likely an issue with the font rendering engine.

@jeremyn
Author

jeremyn commented Aug 8, 2022

I can confirm the hex squares become text with pdfjs.disableFontFace set to true in about:config in Firefox, so that's good.

I'm not sure what disableFontFace does but if in your experience there is some common change PDF creators can make to avoid this problem with the default false setting, I can open a support issue with Humble Bundle about it. It would need to be specific advice though, not "Firefox can't display your PDFs, here's a bug report" but "the developers think you should embed fonts/change some list of numbers/etc". Even then I don't know if it will help but I'd like to try multiple approaches, especially if this will take a while on the pdf.js side (talking about 11 year old fixes).

@calixteman calixteman linked a pull request Aug 8, 2022 that will close this issue
@calixteman
Contributor

I filed a bug for Firefox:
https://bugzilla.mozilla.org/show_bug.cgi?id=1783740

@jeremyn
Author

jeremyn commented Aug 8, 2022

Thanks for the quick turnaround! For what it's worth however, though I can reproduce the hex square ("tofu", I guess) with the plop.html and plop.ttf attached to the Firefox bug, changing pdfjs.disableFontFace to true does not fix the problem with those files. If that's expected then all right, I just wanted to report that here.

rousek pushed a commit to signosoft/pdf.js that referenced this issue Aug 10, 2022
@jeremyn
Author

jeremyn commented Aug 19, 2022

@calixteman I'm not really sure of the status of this issue. You submitted a PR (thanks!) to this repository about two weeks ago and closed this issue. The related Firefox bug is still open. That bug is marked as "Version: Firefox 105". The current stable version of Firefox is v103. Does that mean this pdf.js bug should definitely be fixed in Firefox v105?

@marco-c
Contributor

marco-c commented Aug 22, 2022

The fix for a subset of this issue (#15290) landed in Firefox Nightly in https://bugzilla.mozilla.org/show_bug.cgi?id=1784537.
The rest of the issues you mentioned are covered by https://bugzilla.mozilla.org/show_bug.cgi?id=1783740; until that bug is fixed, you'll still see them.

@jeremyn
Author

jeremyn commented Aug 22, 2022

@marco-c I don't know any of the details of how pdf.js works or how it plugs into Firefox. As an end user the issue I reported is that when I open a subset of PDFs from a specific publisher (Humble Bundle), some of the text in those PDFs is unreadable.

Looking at all these issues and bugs what I see as an end user is:

  1. this GitHub issue we're on is closed but the PDFs are still broken in Firefox

  2. there is a discrepancy that I mentioned above between the test case in the Firefox issue and the behavior I'm seeing in this GitHub issue, so I'm unclear if the Firefox bug really matches this GitHub issue

  3. the Firefox issue appears to have stalled out almost two weeks ago with some open questions which might be answered by pointing the asker back to this GitHub issue

  4. there's no clear sign when this should be fixed in Firefox, so even if I ignore all this other stuff as strictly internal, I don't have any specific point when I could say "the bug tracker said this was fixed but it is not"

To be clear I'm not trying to rush anybody. If this were at a point of "yes, we have all the info we need, we'll get to it when we get to it" then that's fine. At the moment though there are still some open questions and uncertainty on my side whether the correct problem is being tracked, so I'd like to resolve those before leaving things alone. In fact it feels a little like the various devs here have been sidetracked on different problems but the core problem of "I can't read these PDFs" has gotten lost.

@marco-c
Contributor

marco-c commented Aug 22, 2022

@marco-c I don't know any of the details of how pdf.js works or how it plugs into Firefox. As an end user the issue I reported is that when I open a subset of PDFs from a specific publisher (Humble Bundle), some of the text in those PDFs is unreadable.

@jeremyn there were actually different root issues affecting the PDFs you shared with us; one class of issues has been fixed as part of #15290 (which closed this issue). The rest of the issues are unrelated to pdf.js itself and are due to Firefox's internal graphics engine; they are tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=1783740.

there's no clear sign when this should be fixed in Firefox, so even if I ignore all this other stuff as strictly internal, I don't have any specific point when I could say "the bug tracker said this was fixed but it is not"

Until https://bugzilla.mozilla.org/show_bug.cgi?id=1783740 is fixed, you will still be able to reproduce some (if not all) of the issues you mentioned initially.

the Firefox issue appears to have stalled out almost two weeks ago with some open questions which might be answered by pointing the asker back to this GitHub issue

Thanks, I'll point Jonathan to this issue. @calixteman is away, or he would have answered him.

@jeremyn
Author

jeremyn commented Aug 22, 2022

@marco-c Thanks.

Do you have thoughts on the workaround of setting pdfjs.disableFontFace to true in Firefox? I've read several issues and discussions about this setting and still don't understand it. I think one setting has the OS render fonts, and the other setting keeps font rendering in PDF.js/Firefox but I'm not sure which is which. Also some people say they get different breakages depending on the setting.

Is PDF.js/Firefox treating this as a specific problem for a few fonts or as a systemic problem? As my earlier comments say this is widespread across Humble Bundle PDFs from a variety of publishers. It would be unfortunate for me for this issue to take a long time to resolve only to find out it was some hyper-specific fix for the one sample PDF uploaded here.

Also about Humble Bundle, in an earlier comment #15289 (comment) I asked here if there is some useful request I can make to their support group. If they are generating PDFs in some bad way then I can just ask them to stop. Do you have any advice about that?

@marco-c
Contributor

marco-c commented Aug 23, 2022

@jeremyn it seems to be related to these PDFs and not a widespread problem. It could be useful to ask them to answer all of @jfkthame's questions from https://bugzilla.mozilla.org/show_bug.cgi?id=1783740 (and maybe he has more after reading this thread).

@jeremyn
Author

jeremyn commented Aug 23, 2022

@marco-c I created an issue with Humble Bundle support and directed them to the Bugzilla issue. I can't say what the escalation process is between their support people and whoever deals with this sort of problem on their side. I want to get out of the middle here so as far as that all goes, this is not my issue anymore.

Do you have any info about pdfjs.disableFontFace? I'll let it go after this, but since it makes the problem go away I'm curious if it's something that I can just set to true and forget about, or what.

@marco-c
Contributor

marco-c commented Aug 25, 2022

@jeremyn I'm not familiar with that option, but if it isn't the default it must mean that it has downsides that exceed the improvements, so I would keep it to false.

@humble-cburnham

Chris from Humble Bundle here.

It looks like our process for making PDF previews is to take the full PDF and run it through GhostScript to truncate the PDF.
We use -sDEVICE=pdfwrite, and only include a few pages starting from the first chapter (skipping the title page and table of contents).
We also potentially reduce the resolution in some cases.

The full book PDFs do render just fine in Firefox. It's clear something in this process is triggering the bug in Firefox, but I'm not sure what. If I get some free time, I can try some alternate arguments for ghostscript to work around this issue going forward. Maybe something useful in the stripped out pages is getting lost?

I also looked at Bugzilla, and they've got a pretty minimal test case to demonstrate the issue as well.

@jfkthame

jfkthame commented Sep 9, 2022

Given that the "full book PDFs do render just fine in Firefox", it appears that GhostScript is damaging the font in some way, perhaps during the process of subsetting to include only the characters present in the selected pages.

I don't think this is really a Firefox bug as such; note that https://bugzilla.mozilla.org/show_bug.cgi?id=1783740 indicates that the font similarly fails to load in Edge.

Maybe -dSubsetFonts=false would make a difference to how it behaves? It's pretty hard to diagnose exactly why the Windows API is rejecting the subsetted font, when OTS is happy with it and macOS seems to accept it fine, but my best guess is that GS's subsetting operation is doing something slightly questionable, and DirectWrite doesn't like it.
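
A sketch of how that experiment could be scripted, assuming the preview is produced from Node.js. The exact command line Humble Bundle uses is not given in this thread, so the helper name `buildPreviewArgs`, the page range, and the file names are placeholders; only `-sDEVICE=pdfwrite` and `-dSubsetFonts=false` come from the comments above. Whether disabling subsetting avoids the problem is untested here.

```javascript
// Hypothetical helper: builds a Ghostscript argument list for a page-limited preview.
// Pass subsetFonts: false to add the -dSubsetFonts=false switch suggested above.
function buildPreviewArgs({ input, output, firstPage, lastPage, subsetFonts = true }) {
  const args = [
    "-sDEVICE=pdfwrite",
    `-dFirstPage=${firstPage}`,
    `-dLastPage=${lastPage}`,
  ];
  if (!subsetFonts) {
    args.push("-dSubsetFonts=false");
  }
  // -o sets the output file; the input file comes last.
  args.push("-o", output, input);
  return args;
}
```

The resulting array can be handed to `execFile("gs", args)` from `node:child_process`. Producing one preview with and one without the switch from the same full PDF would give the pair of files needed for the comparison.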

I suppose if you can share a "full" PDF that works, along with a truncated preview (created from the same document) where the font fails, we can try to extract the corresponding font resources from each and compare them, though CFF is a fearsomely complex format and it may be hard to identify exactly what is triggering the failure.

@sergei-harbour

My issue was linked to this one so I tried to review all the comments and links here. But I'm not sure that it's the same. @Snuffleupagus pointed me to this comment in this thread, but my issue is not Firefox specific; also, even in the latest version of Firefox the issue is still reproducible.
Sorry if I missed something obvious here.

@jeremyn
Author

jeremyn commented Oct 6, 2022

@sergei-harbour As I understand it, this issue is only partially fixed, with the rest moved to the Bugzilla tracker. See #15289 (comment).

Also, your test file is broken for me in Firefox 105.0.2 with pdfjs.disableFontFace set to false (the default) but works if I set that to true, a workaround discussed in earlier comments, which suggests the two issues are related. However note #15289 (comment) which says this is probably not a good permanent workaround.

@sergei-harbour

sergei-harbour commented Oct 12, 2022

but works if I set that to true

True, it works, but after some research it looks like it stops working in cases where a PDF has a non-standard font that is not embedded in the doc. It doesn't fall back to the system font. Maybe some retry logic can be a workaround here, something like:

  1. Render with disableFontFace: true, stopAtErrors: true
  2. If the previous step fails fall back to disableFontFace: false
  3. Hope no one on earth uses a doc with a set of non-standard embedded and unembedded fonts at the same time
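
The first two steps could be sketched as below. `renderWithFallback` and `renderFn` are hypothetical names: `renderFn` stands for your own code that loads and renders a document with pdf.js using the given options (`disableFontFace` and `stopAtErrors` are the option names from the list above). Whether a missing font actually surfaces as an error in step 1 is an assumption, not something confirmed in this thread.

```javascript
// `renderFn(source, options)` is a hypothetical callback wrapping your own
// pdf.js loading and rendering code; it should reject when rendering fails.
async function renderWithFallback(renderFn, source) {
  try {
    // Step 1: built-in glyph rendering, failing fast on the first error.
    return await renderFn(source, { disableFontFace: true, stopAtErrors: true });
  } catch (firstError) {
    // Step 2: retry with the default font loading path.
    return await renderFn(source, { disableFontFace: false });
  }
}
```

If the second attempt also fails, its error propagates to the caller, so a document that breaks under both settings is still reported rather than silently dropped.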

The thing is that I deal with tons of PDF docs in my system and I can't control how they are created. Maybe I need to preprocess the docs somehow and replace/embed the fonts that don't play nice with pdf.js.

@calixteman
Contributor

After digging into the font I finally found that it's because of ExpansionFactor set to 0 in the private font dict.
It seems to be fixed by just removing this entry.
I've no idea about what this parameter is supposed to mean, there is nothing helpful in the CFF specifications:
https://adobe-type-tools.github.io/font-tech-notes/pdfs/5176.CFF.pdf

@jfkthame

jfkthame commented Jan 7, 2023

Oh, interesting! Congratulations on tracking this down.

Neither the Adobe CFF specification nor the OpenType spec for CFF2 seems to give any clue what this means; they just mention a default value of 0.06, but not a word about what other values would be valid or what effect it's supposed to have. shrug

Removing it in pdf.js should fix the immediate issue with rendering PDFs that contain such fonts, but we could also consider removing it in OTS, so that if a "bad" font is used as a webfont (independently of PDF embedding) it would also resolve that case. Though maybe such fonts only arise as a result of some (faulty?) PDF-generating workflows.

@jfkthame

jfkthame commented Jan 7, 2023

Ah - looks like this is inherited from the old Type 1 spec. See page 45 in https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf for information.

@calixteman I'm just wondering, if you reset the value to 0.06 (the default) instead of removing it, does that also resolve the failure? If so, maybe that would be the lowest-risk approach, just in case any rendering engine expects the entry to be present.

@calixteman
Contributor

I updated my PR to set the property to 0.06.
I'd say that OTS should do the job too, because I wouldn't bet one euro that nobody uses such a font as a webfont.
Or if 0 is a legal value, we should ask some MS people to fix the bug on Windows, since it doesn't appear to be a problem on macOS and Linux.
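
For readers following along, the idea of resetting the value can be sketched as follows. This is an illustration only, not the code from the PR: the function name `normalizeExpansionFactor` and the representation of the Private DICT as a plain `Map` are assumptions.

```javascript
// Default value mentioned in the CFF and OpenType specs.
const DEFAULT_EXPANSION_FACTOR = 0.06;

// `privateDict` is a hypothetical Map of parsed Private DICT entries.
// Returns true when the entry was rewritten.
function normalizeExpansionFactor(privateDict) {
  if (privateDict.get("ExpansionFactor") === 0) {
    // A value of 0 is what made the font fail on Windows in this issue,
    // so reset it to the default instead of dropping the entry.
    privateDict.set("ExpansionFactor", DEFAULT_EXPANSION_FACTOR);
    return true;
  }
  return false;
}
```

Resetting rather than deleting keeps the entry present, which matches the lower-risk approach suggested above in case a rendering engine expects it to exist.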
