Saving a PDF (Print>Save as PDF) turns it unsearchable #14277

Closed
maverick74 opened this issue Nov 15, 2021 · 28 comments

@maverick74

maverick74 commented Nov 15, 2021

Saving a PDF (Print>Save as PDF) turns it unsearchable!

Attach (recommended) or Link to PDF file here:
https://africau.edu/images/default/sample.pdf (https://web.archive.org/web/20220531122837/http://www.africau.edu/images/default/sample.pdf)

Configuration:

  • Web browser and its version: 94.0
  • Operating system and its version: Neon Linux

Steps to reproduce the problem:

  1. Go to https://africau.edu/images/default/sample.pdf
  2. Click the print button
  3. Select "save as PDF"
  4. Save the file
  5. Reopen it in Firefox
  6. Try to select the text or search for any text

What is the expected behavior?
Text should be selectable and searchable

What went wrong?
Text is not selectable or searchable
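
A rough way to check a saved file for this symptom without opening a viewer is to look for text-showing operators in its content streams. The following is a heuristic sketch, not part of pdf.js or Firefox: the function name `has_text_operators` is hypothetical, it assumes that an unsearchable file stores its pages as images with no text operators, and it only handles uncompressed or Flate-compressed streams (no object streams, encryption, or other filters).

```python
import re
import zlib

# Body of every "stream ... endstream" pair; the lookbehind skips the
# word "stream" that is embedded in "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)

# Text-showing operators: "(string) Tj", "<hex> Tj" and "[array] TJ".
TEXT_OP_RE = re.compile(rb"\)\s*Tj|>\s*Tj|\]\s*TJ")


def has_text_operators(pdf_bytes: bytes) -> bool:
    """Return True if any stream appears to draw text (heuristic only)."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing end-of-line bytes
            # that sit between the data and "endstream".
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate-compressed; scan the bytes as-is
        if TEXT_OP_RE.search(data):
            return True
    return False


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(has_text_operators(handle.read()))
```

Run against the original download and the "Save as PDF" copy, the two results can be compared. A `False` result is only a hint, since binary data inside fonts or images can also produce false positives.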

@marco-c
Contributor

marco-c commented Dec 7, 2021

Is this the same as https://bugzilla.mozilla.org/show_bug.cgi?id=1274502?

@maverick74
Author

maverick74 commented Dec 8, 2021

> Is this the same as https://bugzilla.mozilla.org/show_bug.cgi?id=1274502?

Yes, they seem to report the same problem.

But that raises a question: is this a Firefox problem (which should then be reported on the 6-year-old bug you linked?!) or a PDF.js problem (which should be reported here)?

I'm OK with closing this bug as long as it is submitted in the right place.

(I think it has better chances here... But then again... I have bugs ignored for years too...)

@Snuffleupagus
Collaborator

This, at least to me, sounds like a roundabout way of saving a PDF document that's opened with the Firefox PDF Viewer.

Why not directly use e.g. the download button (in the viewer), the Cmd/Ctrl+S keyboard shortcut, or the "Save Page As..." entry in the "File" menu (of the browser), rather than going through the printing process?
By invoking the download directly you'd get the original PDF document, and it'd be faster too.

@maverick74
Author

maverick74 commented Dec 8, 2021

> Why not directly use e.g. the download button (in the viewer), the Cmd/Ctrl+S keyboard shortcut, or the "Save Page As..." entry in the "File" menu (of the browser), rather than going through the printing process? By invoking the download directly you'd get the original PDF document, and it'd be faster too.

You're right, except when you have a multi-page PDF and want just one page.

One example is invoices: there is software that prints 4 copies of the same invoice in the same document. Now imagine you want to send your client just the original...

Another example: repair orders. Imagine you receive a piece of equipment to repair and the software generates 2 pages of the equipment "ID/profile", one for you and one for the client. When we have to send the equipment ID sheet for a brand cross-check and there are restrictions on the file size, we need to reduce the file to just one sheet.

At the moment, with Firefox as your PDF reader, the only way to get just the first sheet is through the print option, but that makes it unsearchable, which is a problem!

@marco-c
Contributor

marco-c commented Dec 9, 2021

Basically what you want is a feature to split a PDF

@maverick74
Author

maverick74 commented Dec 9, 2021

> Basically what you want is a feature to split a PDF

Not exactly, because Firefox already lets me save only the pages I need, just not in a searchable format!

What I want is to be able to save only the pages I need in a searchable format (because it's a requirement my job imposes)!

I really hate to say this, but in Microsoft Edge, for example, I can do this.

I don't know how they do it, but it just works...

@Snuffleupagus
Collaborator

> > Basically what you want is a feature to split a PDF
>
> Not exactly! What I want is to be able to save just the pages I need in a searchable format (because it's a requirement my job imposes)!

Well, technically speaking that's essentially what this would amount to :-)

Supporting such a use-case would require adding (more) arbitrary editing of PDF documents (currently we only support saving of form data), which is not really a small/simple thing to implement in general (and was never a goal of the project).

@maverick74
Author

maverick74 commented Dec 9, 2021

So, let me clarify, @Snuffleupagus:

You are saying that being able to save just the pages the user needs would involve a lot of work?
I thought that, since you already have the original file, which is searchable, this would be a straightforward detail to implement...

(I was also under the impression that PDF.js was a lot more powerful and feature-rich than Google's PDFium.)

@Snuffleupagus
Collaborator

Snuffleupagus commented Dec 9, 2021

> You are saying that being able to save just the pages the user needs would involve a lot of work?

Yes, it'd require creating a new PDF document from the specified pages.

Given how PDF documents are structured internally (it's a fairly old format), there's in general no easy way to just "pick" a couple of pages and directly create a new valid PDF document from that. First of all, you'd probably need to remove e.g. font and graphics resources no longer needed in order to reduce the file size of the new PDF document. Secondly, you'd need to create a valid XRef (i.e. cross reference) table such that the new PDF document can be successfully opened in viewers.

Please note that, as mentioned above, arbitrary PDF editing has (thus far) never been a goal of this library, since it's a fairly complex topic given e.g. all the weird/corrupt data-structures found in real-world PDF documents.
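
To illustrate the cross-reference requirement described above, here is a minimal sketch that writes a one-page PDF by hand and records each object's byte offset in the XRef table. It is not how pdf.js works: `build_pdf` and the object layout are assumptions made for illustration, and real-world documents add compression, incremental updates, and shared resources on top of this.

```python
def build_pdf(objects: list[bytes]) -> bytes:
    """Serialize objects numbered 1..N and append a cross-reference table."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 heads the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


content = b"BT /F1 24 Tf 72 720 Td (Dummy PDF file) Tj ET"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
    b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
    b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
]
pdf = build_pdf(objects)

if __name__ == "__main__":
    with open("one_page.pdf", "wb") as handle:
        handle.write(pdf)
```

Removing or reordering any object shifts every later offset, which is why extracting pages means rewriting the table (and pruning unused resources) rather than cutting bytes out of the original file.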

@maverick74
Author

maverick74 commented Dec 9, 2021

OK.

This basically means "no quick solution on the horizon any time soon (if ever)."

I would still like to leave this open, if you agree... (as I believe it is an important feature to businesses)

@marco-c
Contributor

marco-c commented Dec 9, 2021

@maverick74 we are giving some thought to printing issues, so this might change soon.

@maverick74
Author

@marco-c That's great news!
We're running into a couple of printing-related problems.

The other issues have already been reported, however.

We intend to use Firefox not only as our default browser but also as our only PDF reader.

@marco-c
Contributor

marco-c commented Jul 6, 2022

This is fixed in latest Firefox Nightly.

@marco-c marco-c closed this as completed Jul 6, 2022
@marco-c
Contributor

marco-c commented Jul 6, 2022

Thanks to https://hg.mozilla.org/mozilla-central/rev/7d9376649d6d (https://bugzilla.mozilla.org/show_bug.cgi?id=1777209).

@maverick74
Author

maverick74 commented Jul 7, 2022

@marco-c I found a bug in the implementation!

Easy steps to reproduce:

  1. Open: PDF example
  2. Get to Print Dialog (CTRL+P) and Save as PDF
  3. Open Saved File
  4. Select and copy Text (Dummy PDF file) from the saved pdf file
  5. Paste it somewhere (notepad, kate, whatever)

Result: Unrecognized characters
Expected: "Dummy PDF file" text

If you prefer, I can file a separate bug report.

@marco-c
Contributor

marco-c commented Jul 7, 2022

Thanks, I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1778484.

Did you see this with other PDFs too?

@calixteman
Contributor

It's very likely caused by #9340.

@maverick74
Author

> Did you see this with other PDFs too?

Yes.
I originally noticed it on an "internal" receipt PDF.
Because the receipt uses a "weird" font, I went on to try other, more normal PDFs to be sure it wasn't a document-specific problem.

But in the documents I've tried, the result was always the same.

@marco-c
Contributor

marco-c commented Jul 14, 2022

@maverick74 the issue is fixed in latest Nightly, please let us know if you see other problems.

@maverick74
Author

I can confirm it's working OK now.
As soon as I have a bit of free time I'll do some extra tests with more complex PDFs.
If I find anything worth mentioning I'll post it back here.

Thank you all :)

@cksgh1224

This was reported as fixed in the latest Firefox Nightly, but when I tested it on Nightly 106.0a1 (2022-08-23), the problem didn't seem to be resolved.

From https://mozilla.github.io/pdf.js/web/viewer.html, I used Print > Save as PDF and then reopened the saved PDF file.

I searched for the text "Trace", but it can't be found, and the text can't be selected by dragging.

Has the Nightly build you mentioned not been released yet?

@maverick74
Author

@cksgh1224 works for me.

I've tested it on the latest 104 (release) and on the latest Nightly 106 and in both cases it worked as it was supposed to.

Prior to this, I've always tested in the official Nightly.

@marco-c
Contributor

marco-c commented Aug 25, 2022

@cksgh1224 on what PDF could you still reproduce the problem?

@cksgh1224

@marco-c

https://mozilla.github.io/pdf.js/web/viewer.html

When I tested with the PDF file shown there, searching and selecting by dragging did not work.

When I tested with the PDF from the original report (https://africau.edu/images/default/sample.pdf) and with a sample PDF of my own, the text could be searched and selected...

Is this a problem with the PDF at https://mozilla.github.io/pdf.js/web/viewer.html?

@marco-c
Contributor

marco-c commented Aug 26, 2022

What you're using when you load https://mozilla.github.io/pdf.js/web/viewer.html is the web viewer of pdf.js, not the version included in Firefox itself. The version included in Firefox is using some internal Firefox APIs to be able to print correctly.
You can test by loading https://raw.githubusercontent.com/mozilla/pdf.js/master/test/pdfs/tracemonkey.pdf directly in Firefox and printing it to PDF.

@cksgh1224

I tested it as you said and it works fine!!

Is the reason it doesn't work at https://mozilla.github.io/pdf.js/web/viewer.html that the version of pdf.js used there is older?

@marco-c
Contributor

marco-c commented Aug 29, 2022

No, the reason is what I mentioned above: the PDF reader in Firefox itself is using internal Firefox APIs to print in a better way, while the viewer you see on the page is just a normal web page and so unable to use Firefox internal APIs.

@cksgh1224

@marco-c Thanks for the kind explanation~!!!
