-
Notifications
You must be signed in to change notification settings - Fork 198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Title incorrectly set as "PDF.js viewer" on some PDFs #3372
Comments
It looks like PDF.js has logic that updates This logic is not used however if the In legacy Via, PDF.js was the top-level frame, so this flag was automatically set to I think to err on the side of caution we should just ignore |
If a PDF has no embedded title, don't use `document.title` as a fallback. In top-level frames PDF.js sets the title to the filename in the URL, so this worked out OK. When PDF.js was embedded in an iframe however it did not modify `document.title` and it would typically be set to a generic default such as "PDF.js viewer" (the default title of the PDF.js viewer application). The one case where we might actually want to use `document.title` is in a custom viewer where the publisher has taken care to set it appropriately based on some other source. This is a tiny minority of usage currently. Fixes #3372
Related: hypothesis/lms#1371 |
“ On the server side we can auto-generate a title from the URL if a document doesn't have one.”. Was thinking the same. The filename would be better than nothing. |
If a PDF has no embedded title, don't use `document.title` as a fallback. In top-level frames PDF.js sets the title to the filename in the URL, so this worked out OK. When PDF.js was embedded in an iframe however it did not modify `document.title` and it would typically be set to a generic default such as "PDF.js viewer" (the default title of the PDF.js viewer application). The one case where we might actually want to use `document.title` is in a custom viewer where the publisher has taken care to set it appropriately based on some other source. This is a tiny minority of usage currently. Fixes #3372
Replace the usage of `document.title` as a way to get the document title if the PDF has no embedded title in either its _document info dictionary_ or _metadata stream_. In top-level frames using `document.title` (where `document` is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from: 1. The embedded title 2. The filename from the `Content-Disposition` header 3. The last segment of the URL's path (eg. "test.pdf" in "https://example.com/test.pdf") When PDF.js is embedded in an iframe however, it does not set `document.title` by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title. This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set `document.title`, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The `Content-Disposition` filename is not exposed as a public property on `PDFViewerApplication`, so `PDFMetadata#getMetadata` was refactored to call the `pdfDocument.getMetadata` instead. Fixes #3372
Replace the usage of `document.title` as a way to get the document title if the PDF has no embedded title in either its _document info dictionary_ or _metadata stream_. In top-level frames using `document.title` (where `document` is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from: 1. The embedded title 2. The filename from the `Content-Disposition` header 3. The last segment of the URL's path (eg. "test.pdf" in "https://example.com/test.pdf") When PDF.js is embedded in an iframe however, it does not set `document.title` by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title. This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set `document.title`, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The `Content-Disposition` filename is not exposed as a public property on `PDFViewerApplication`, so `PDFMetadata#getMetadata` was refactored to call the `pdfDocument.getMetadata` instead. Fixes #3372
Replace the usage of `document.title` as a way to get the document title if the PDF has no embedded title in either its _document info dictionary_ or _metadata stream_. In top-level frames using `document.title` (where `document` is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from: 1. The embedded title 2. The filename from the `Content-Disposition` header 3. The last segment of the URL's path (eg. "test.pdf" in "https://example.com/test.pdf") When PDF.js is embedded in an iframe however, it does not set `document.title` by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title. This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set `document.title`, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The `Content-Disposition` filename is not exposed as a public property on `PDFViewerApplication`, so `PDFMetadata#getMetadata` was refactored to call the `pdfDocument.getMetadata` instead. Fixes #3372
Replace the usage of `document.title` as a way to get the document title if the PDF has no embedded title in either its _document info dictionary_ or _metadata stream_. In top-level frames using `document.title` (where `document` is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from: 1. The embedded title 2. The filename from the `Content-Disposition` header 3. The last segment of the URL's path (eg. "test.pdf" in "https://example.com/test.pdf") When PDF.js is embedded in an iframe however, it does not set `document.title` by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title. This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set `document.title`, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The `Content-Disposition` filename is not exposed as a public property on `PDFViewerApplication`, so `PDFMetadata#getMetadata` was refactored to call the `pdfDocument.getMetadata` instead. Fixes #3372
Steps to reproduce
Annotate https://arxiv.org/pdf/2105.01635.pdf through Via
Expected behaviour
When the annotation is created the client should use the document title embedded in the PDF in the
POST /api/annotations
request or omit it if not present.Actual behaviour
Via's PDF viewer currently has a generic title which reads "PDF.js viewer" and this gets submitted as the title.
Additional details
This particular PDF has an empty title string in its document metadata. You can see this if you evaluate
window.PDFViewerApplication.documentInfo
in the PDF.js frame and look at theTitle
property.The title is extracted by the
PDFMetadata
class which has generic logic that uses thedocument.title
property if the PDF does not contain an embedded title or it is empty:client/src/annotator/integrations/pdf-metadata.js
Line 136 in 6267210
Depending on the PDF viewer, the document title from
document.title
may or may not be a sensible title. In the Chrome extensiondocument.title
gets populated with the PDF filename for this document. We should look into what code is doing that.The text was updated successfully, but these errors were encountered: