Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace document.title as a fallback in PDFs without embedded title metadata #3374

Merged
merged 1 commit into from
May 11, 2021

Conversation

robertknight
Copy link
Member

@robertknight robertknight commented May 5, 2021

(Edit: Updated to reflect new implementation)

Replace the usage of document.title as a way to get the document title if the PDF has no embedded title in either its document info dictionary or metadata stream [1]

In top-level frames using document.title (where document is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from:

  1. The embedded title
  2. The filename from the Content-Disposition header
  3. The last segment of the URL's path (eg. "test.pdf" in
    "https://example.com/test.pdf")

When PDF.js is embedded in an iframe however, it does not set document.title by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title.

This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set document.title, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The Content-Disposition filename is not exposed as a public property on PDFViewerApplication, so PDFMetadata#getMetadata was refactored to call pdfDocument.getMetadata instead.

Fixes #3372

[1] See section 14.3, "Metadata" in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf for more information about these parts of a PDF.


Testing:

For each of the following cases, visit the URL, create an annotation and check the document.title field in the payload of the request sent to the server:

  1. For http://localhost:3000/pdf/gatsby the embedded title "The Great Gatsby" should be used
  2. For http://localhost:3000/pdf/nils-olav the last part of the URL, nils-olav, should be used
  3. Apply the following diff and revisit http://localhost:3000/pdf/nils-olav. The title "Use this filename.pdf" should be used:
diff --git a/dev-server/serve-dev.js b/dev-server/serve-dev.js
index 9201b9a82..42eb33d6c 100644
--- a/dev-server/serve-dev.js
+++ b/dev-server/serve-dev.js
@@ -78,7 +78,11 @@ function serveDev(port, config) {
 
   // Serve static PDF files out of the PDF directory, but serve under
   // `/pdf-source/` — these are needed by PDF JS viewer
-  app.use('/pdf-source', express.static(PDF_PATH));
+  app.use('/pdf-source', express.static(PDF_PATH, {
+    setHeaders(res) {
+      res.set('Content-Disposition', 'attachment; filename=Use%20this%20filename.pdf');
+    }
+  }));
 
   // Enable CORS for assets so that cross-origin font loading works.
   app.use(function (req, res, next) {

@robertknight
Copy link
Member Author

robertknight commented May 10, 2021

The logic that PDF.js uses to set document.title after the PDF is downloaded is here: https://github.com/mozilla/pdf.js/blob/f07d50f8eeced1cd9cc967e938485893d584fc32/web/app.js#L1499. The document title is set to the first non-empty value from:

  1. The dc:title metadata field within the PDF
  2. The Title property from the document information dictionary (see section 9.2.1 in the PDF specification)
  3. The filename directive from the Content-Disposition header of the response used when retrieving the file

While the file is loading, and after it has loaded if none of the above exists, the title is set based on the filename extracted from the URL. This is the last element of the URL's path after the final /.

This seems like a reasonable set of sources to consider for the title. The Hypothesis client currently implements the logic to check for (1) and (2) itself. It doesn't implement (3) or (4) but implicitly gets that as a result of its fallback to document.title, except when PDF.js is embedded in an iframe and doesn't set document.title.

In this PR I removed the fallback to use document.title, but that will mean losing the Content-Disposition-based title. Also the other parts of Hypothesis that render document titles don't currently have the same filename-from-URL logic that PDF.js used to set document.title, which will mean that parts of Hypothesis that show document titles will change.

I think the safest change here would be to implement the filename-from-Content-Disposition and filename-from-URL fallbacks ourselves in the PDFMetadata class. This will mean that the results will be the same for users as before, but it won't be reliant on PDF.js setting document.title.

@robertknight
Copy link
Member Author

I have pushed a change which modifies the logic used to generate the title property returned by PDFMetadata#getMetadata to match how PDF.js sets document.title, in the case where the PDF has no embedded title metadata.

With this change the document title in the new Via service should always be set the same way it is for legacy Via and in the Chrome extension.

A caveat with the current implementation is that it relies on a private PDFViewerApplication._contentDispositionFilename property. It would be preferable to avoid this reference.

@codecov
Copy link

codecov bot commented May 10, 2021

Codecov Report

Merging #3374 (b16b216) into master (5ffeba9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head b16b216 differs from pull request most recent head 0775ac5. Consider uploading reports for the commit 0775ac5 to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##           master    #3374   +/-   ##
=======================================
  Coverage   98.43%   98.43%           
=======================================
  Files         213      213           
  Lines        7729     7739   +10     
  Branches     1751     1754    +3     
=======================================
+ Hits         7608     7618   +10     
  Misses        121      121           
Impacted Files Coverage Δ
src/annotator/integrations/pdf-metadata.js 100.00% <100.00%> (ø)
src/sidebar/services/groups.js 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ffeba9...0775ac5. Read the comment docs.

Base automatically changed from refactor-pdf-metadata-tests to master May 11, 2021 08:51
@robertknight robertknight force-pushed the do-not-use-document-dot-title-in-pdfs branch 2 times, most recently from bfa2fa5 to f67f486 Compare May 11, 2021 08:59
title = app.documentInfo.Title;
} else if (app._contentDispositionFilename) {
title = app._contentDispositionFilename;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a private property that was recently renamed from contentDispositionFilename. I want to find a non-private way to get this.

@robertknight robertknight force-pushed the do-not-use-document-dot-title-in-pdfs branch from b16b216 to 0ca222b Compare May 11, 2021 12:03
@robertknight robertknight changed the title Do not use document.title as a fallback for title in PDFs Replace document.title as a fallback in PDFs without embedded title metadata May 11, 2021
@robertknight robertknight force-pushed the do-not-use-document-dot-title-in-pdfs branch from 0ca222b to 6d7d29b Compare May 11, 2021 12:29
Replace the usage of `document.title` as a way to get the document title
if the PDF has no embedded title in either its _document info
dictionary_ or _metadata stream_.

In top-level frames using `document.title` (where `document` is the
global HTML document, not the PDF) works because PDF.js sets the title
based on the first non-empty value from:

 1. The embedded title
 2. The filename from the `Content-Disposition` header
 3. The last segment of the URL's path (eg. "test.pdf" in
    "https://example.com/test.pdf")

When PDF.js is embedded in an iframe however, it does not set
`document.title` by default. As a result, documents were ending up in
Hypothesis with a generic "PDF.js viewer" title.

This commit implements (roughly) the same logic that PDF.js uses to
determine the value used to set `document.title`, in the case where the
PDF has no embedded title. This means implementing steps (2) and (3)
from the above list. The `Content-Disposition` filename is not exposed
as a public property on `PDFViewerApplication`, so
`PDFMetadata#getMetadata` was refactored to call the
`pdfDocument.getMetadata` instead.

Fixes #3372
@robertknight robertknight force-pushed the do-not-use-document-dot-title-in-pdfs branch from 6d7d29b to 0775ac5 Compare May 11, 2021 12:29
@robertknight robertknight marked this pull request as ready for review May 11, 2021 12:41
@robertknight robertknight requested a review from lyzadanger May 11, 2021 12:42
Copy link
Contributor

@lyzadanger lyzadanger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes appear to do what they claim! A couple of notes for posterity in case anyone else tries to work through the testing steps in the PR description:

  • You'll need to restart your local webserver after applying the patch to set Content-Disposition headers on static PDFs
  • I had to disable cache in my browser to see the Content-Disposition headers

I struggled a little here with the type naming, especially with the two Metadata types. And the complexity of the test fakes continue to make me a little bit nervous, but nothing new there!

Otherwise, well documented and logically clear.

link.push({ href: url });
}

return {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a bit to figure out that the Metadata type that this returns is not the new Metadata type that was defined in these changes—that's a bit confusing!

* Document metadata parsed from the PDF's _metadata stream_.
*
* See `Metadata` class from `display/metadata.js` in PDF.js.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this type could have a more descriptive name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, well Metadata is what the class is called in PDF.js. I suppose you could alias it when importing it elsewhere.

//
// This logic is similar to how PDF.js sets `document.title`.
let title;
if (metadata?.has('dc:title') && metadata.get('dc:title') !== 'Untitled') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good example where a long chain of conditionals makes sense for readability 👍🏻

@robertknight
Copy link
Member Author

Thanks for the feedback Lyza. I don't immediately have a better scheme for naming regarding the various Metadata-related types, so I'm going to get this merged and revisit later.

@robertknight robertknight merged commit 40eaf5c into master May 11, 2021
@robertknight robertknight deleted the do-not-use-document-dot-title-in-pdfs branch May 11, 2021 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Title incorrectly set as "PDF.js viewer" on some PDFs
2 participants