Title incorrectly set as "PDF.js viewer" on some PDFs #3372

robertknight · 2021-05-05T12:47:27Z

Steps to reproduce

Annotate https://arxiv.org/pdf/2105.01635.pdf through Via

Expected behaviour

When the annotation is created the client should use the document title embedded in the PDF in the POST /api/annotations request or omit it if not present.

Actual behaviour

Via's PDF viewer currently has a generic title which reads "PDF.js viewer" and this gets submitted as the title.

Additional details

This particular PDF has an empty title string in its document metadata. You can see this if you evaluate window.PDFViewerApplication.documentInfo in the PDF.js frame and look at the Title property.

The title is extracted by the PDFMetadata class which has generic logic that uses the document.title property if the PDF does not contain an embedded title or it is empty:

client/src/annotator/integrations/pdf-metadata.js

Line 136 in 6267210

let title = document.title;

Depending on the PDF viewer, the document title from document.title may or may not be a sensible title. In the Chrome extension document.title gets populated with the PDF filename for this document. We should look into what code is doing that.

The text was updated successfully, but these errors were encountered:

robertknight · 2021-05-05T13:11:32Z

It looks like PDF.js has logic that updates document.title: https://github.com/mozilla/pdf.js/blob/2067bccf09b8a82a943b9fdfa7df95f294f0ba1b/web/app.js#L771

This logic is not used however if the isViewerEmbedded flag is set, and this flag gets set automatically whenever PDF.js is embedded in an iframe: https://github.com/mozilla/pdf.js/blob/2067bccf09b8a82a943b9fdfa7df95f294f0ba1b/web/app.js#L252

In legacy Via, PDF.js was the top-level frame, so this flag was automatically set to false and PDF.js did set document.title. In the current Via 3 implementation, PDF.js is embedded in an iframe, so this flag is automatically set to true, and document.title does not get set.

I think to err on the side of caution we should just ignore document.title and only consider the embedded title. On the server side we can auto-generate a title from the URL if a document doesn't have one.

If a PDF has no embedded title, don't use `document.title` as a fallback. In top-level frames PDF.js sets the title to the filename in the URL, so this worked out OK. When PDF.js was embedded in an iframe however it did not modify `document.title` and it would typically be set to a generic default such as "PDF.js viewer" (the default title of the PDF.js viewer application). The one case where we might actually want to use `document.title` is in a custom viewer where the publisher has taken care to set it appropriately based on some other source. This is a tiny minority of usage currently. Fixes #3372

judell · 2021-05-05T15:16:26Z

Related: hypothesis/lms#1371

dwhly · 2021-05-10T11:50:50Z

“ On the server side we can auto-generate a title from the URL if a document doesn't have one.”. Was thinking the same. The filename would be better than nothing.

If a PDF has no embedded title, don't use `document.title` as a fallback. In top-level frames PDF.js sets the title to the filename in the URL, so this worked out OK. When PDF.js was embedded in an iframe however it did not modify `document.title` and it would typically be set to a generic default such as "PDF.js viewer" (the default title of the PDF.js viewer application). The one case where we might actually want to use `document.title` is in a custom viewer where the publisher has taken care to set it appropriately based on some other source. This is a tiny minority of usage currently. Fixes #3372

Replace the usage of `document.title` as a way to get the document title if the PDF has no embedded title in either its _document info dictionary_ or _metadata stream_. In top-level frames using `document.title` (where `document` is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from: 1. The embedded title 2. The filename from the `Content-Disposition` header 3. The last segment of the URL's path (eg. "test.pdf" in "https://example.com/test.pdf") When PDF.js is embedded in an iframe however, it does not set `document.title` by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title. This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set `document.title`, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The `Content-Disposition` filename is not exposed as a public property on `PDFViewerApplication`, so `PDFMetadata#getMetadata` was refactored to call the `pdfDocument.getMetadata` instead. Fixes #3372

robertknight mentioned this issue May 5, 2021

Refactor PDFMetadata tests #3373

Merged

robertknight mentioned this issue May 5, 2021

Replace document.title as a fallback in PDFs without embedded title metadata #3374

Merged

robertknight closed this as completed in #3374 May 11, 2021

mkdir-washington-edu mentioned this issue May 25, 2021

SPIKE: Fix previously-saved "PDF.js" doc titles so when PDFs are re-used in the LMS app we capture a better file name hypothesis/support-legacy#206

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Title incorrectly set as "PDF.js viewer" on some PDFs #3372

Title incorrectly set as "PDF.js viewer" on some PDFs #3372

robertknight commented May 5, 2021 •

edited

Loading

robertknight commented May 5, 2021

judell commented May 5, 2021

dwhly commented May 10, 2021

Title incorrectly set as "PDF.js viewer" on some PDFs #3372

Title incorrectly set as "PDF.js viewer" on some PDFs #3372

Comments

robertknight commented May 5, 2021 • edited Loading

Steps to reproduce

Expected behaviour

Actual behaviour

Additional details

robertknight commented May 5, 2021

judell commented May 5, 2021

dwhly commented May 10, 2021

robertknight commented May 5, 2021 •

edited

Loading