Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Title incorrectly set as "PDF.js viewer" on some PDFs #3372

Closed
robertknight opened this issue May 5, 2021 · 3 comments · Fixed by #3374
Closed

Title incorrectly set as "PDF.js viewer" on some PDFs #3372

robertknight opened this issue May 5, 2021 · 3 comments · Fixed by #3374

Comments

@robertknight
Copy link
Member

robertknight commented May 5, 2021

Steps to reproduce

Annotate https://arxiv.org/pdf/2105.01635.pdf through Via

Expected behaviour

When the annotation is created the client should use the document title embedded in the PDF in the POST /api/annotations request or omit it if not present.

Actual behaviour

Via's PDF viewer currently has a generic title which reads "PDF.js viewer" and this gets submitted as the title.

Additional details

This particular PDF has an empty title string in its document metadata. You can see this if you evaluate window.PDFViewerApplication.documentInfo in the PDF.js frame and look at the Title property.

The title is extracted by the PDFMetadata class which has generic logic that uses the document.title property if the PDF does not contain an embedded title or it is empty:

let title = document.title;

Depending on the PDF viewer, the document title from document.title may or may not be a sensible title. In the Chrome extension document.title gets populated with the PDF filename for this document. We should look into what code is doing that.

@robertknight
Copy link
Member Author

It looks like PDF.js has logic that updates document.title: https://github.com/mozilla/pdf.js/blob/2067bccf09b8a82a943b9fdfa7df95f294f0ba1b/web/app.js#L771

This logic is not used however if the isViewerEmbedded flag is set, and this flag gets set automatically whenever PDF.js is embedded in an iframe: https://github.com/mozilla/pdf.js/blob/2067bccf09b8a82a943b9fdfa7df95f294f0ba1b/web/app.js#L252

In legacy Via, PDF.js was the top-level frame, so this flag was automatically set to false and PDF.js did set document.title. In the current Via 3 implementation, PDF.js is embedded in an iframe, so this flag is automatically set to true, and document.title does not get set.

I think to err on the side of caution we should just ignore document.title and only consider the embedded title. On the server side we can auto-generate a title from the URL if a document doesn't have one.

robertknight added a commit that referenced this issue May 5, 2021
If a PDF has no embedded title, don't use `document.title` as a
fallback. In top-level frames PDF.js sets the title to the filename in the URL,
so this worked out OK. When PDF.js was embedded in an iframe however it
did not modify `document.title` and it would typically be set to a
generic default such as "PDF.js viewer" (the default title of the PDF.js
viewer application). The one case where we might actually want to use
`document.title` is in a custom viewer where the publisher has taken
care to set it appropriately based on some other source. This is a tiny
minority of usage currently.

Fixes #3372
@judell
Copy link
Contributor

judell commented May 5, 2021

Related: hypothesis/lms#1371

@dwhly
Copy link
Member

dwhly commented May 10, 2021

“ On the server side we can auto-generate a title from the URL if a document doesn't have one.”. Was thinking the same. The filename would be better than nothing.

robertknight added a commit that referenced this issue May 11, 2021
If a PDF has no embedded title, don't use `document.title` as a
fallback. In top-level frames PDF.js sets the title to the filename in the URL,
so this worked out OK. When PDF.js was embedded in an iframe however it
did not modify `document.title` and it would typically be set to a
generic default such as "PDF.js viewer" (the default title of the PDF.js
viewer application). The one case where we might actually want to use
`document.title` is in a custom viewer where the publisher has taken
care to set it appropriately based on some other source. This is a tiny
minority of usage currently.

Fixes #3372
robertknight added a commit that referenced this issue May 11, 2021
Replace the usage of `document.title` as a way to get the document title
if the PDF has no embedded title in either its _document info
dictionary_ or _metadata stream_.

In top-level frames using `document.title` (where `document` is the
global HTML document, not the PDF) works because PDF.js sets the title
based on the first non-empty value from:

 1. The embedded title
 2. The filename from the `Content-Disposition` header
 3. The last segment of the URL's path (eg. "test.pdf" in
    "https://example.com/test.pdf")

When PDF.js is embedded in an iframe however, it does not set
`document.title` by default. As a result, documents were ending up in
Hypothesis with a generic "PDF.js viewer" title.

This commit implements (roughly) the same logic that PDF.js uses to
determine the value used to set `document.title`, in the case where the
PDF has no embedded title. This means implementing steps (2) and (3)
from the above list. The `Content-Disposition` filename is not exposed
as a public property on `PDFViewerApplication`, so
`PDFMetadata#getMetadata` was refactored to call the
`pdfDocument.getMetadata` instead.

Fixes #3372
robertknight added a commit that referenced this issue May 11, 2021
Replace the usage of `document.title` as a way to get the document title
if the PDF has no embedded title in either its _document info
dictionary_ or _metadata stream_.

In top-level frames using `document.title` (where `document` is the
global HTML document, not the PDF) works because PDF.js sets the title
based on the first non-empty value from:

 1. The embedded title
 2. The filename from the `Content-Disposition` header
 3. The last segment of the URL's path (eg. "test.pdf" in
    "https://example.com/test.pdf")

When PDF.js is embedded in an iframe however, it does not set
`document.title` by default. As a result, documents were ending up in
Hypothesis with a generic "PDF.js viewer" title.

This commit implements (roughly) the same logic that PDF.js uses to
determine the value used to set `document.title`, in the case where the
PDF has no embedded title. This means implementing steps (2) and (3)
from the above list. The `Content-Disposition` filename is not exposed
as a public property on `PDFViewerApplication`, so
`PDFMetadata#getMetadata` was refactored to call the
`pdfDocument.getMetadata` instead.

Fixes #3372
robertknight added a commit that referenced this issue May 11, 2021
Replace the usage of `document.title` as a way to get the document title
if the PDF has no embedded title in either its _document info
dictionary_ or _metadata stream_.

In top-level frames using `document.title` (where `document` is the
global HTML document, not the PDF) works because PDF.js sets the title
based on the first non-empty value from:

 1. The embedded title
 2. The filename from the `Content-Disposition` header
 3. The last segment of the URL's path (eg. "test.pdf" in
    "https://example.com/test.pdf")

When PDF.js is embedded in an iframe however, it does not set
`document.title` by default. As a result, documents were ending up in
Hypothesis with a generic "PDF.js viewer" title.

This commit implements (roughly) the same logic that PDF.js uses to
determine the value used to set `document.title`, in the case where the
PDF has no embedded title. This means implementing steps (2) and (3)
from the above list. The `Content-Disposition` filename is not exposed
as a public property on `PDFViewerApplication`, so
`PDFMetadata#getMetadata` was refactored to call the
`pdfDocument.getMetadata` instead.

Fixes #3372
robertknight added a commit that referenced this issue May 11, 2021
Replace the usage of `document.title` as a way to get the document title
if the PDF has no embedded title in either its _document info
dictionary_ or _metadata stream_.

In top-level frames using `document.title` (where `document` is the
global HTML document, not the PDF) works because PDF.js sets the title
based on the first non-empty value from:

 1. The embedded title
 2. The filename from the `Content-Disposition` header
 3. The last segment of the URL's path (eg. "test.pdf" in
    "https://example.com/test.pdf")

When PDF.js is embedded in an iframe however, it does not set
`document.title` by default. As a result, documents were ending up in
Hypothesis with a generic "PDF.js viewer" title.

This commit implements (roughly) the same logic that PDF.js uses to
determine the value used to set `document.title`, in the case where the
PDF has no embedded title. This means implementing steps (2) and (3)
from the above list. The `Content-Disposition` filename is not exposed
as a public property on `PDFViewerApplication`, so
`PDFMetadata#getMetadata` was refactored to call the
`pdfDocument.getMetadata` instead.

Fixes #3372
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants