Replace `document.title` as a fallback in PDFs without embedded title metadata #3374

robertknight · 2021-05-05T14:33:26Z

(Edit: Updated to reflect new implementation)

Replace the usage of document.title as a way to get the document title if the PDF has no embedded title in either its document info dictionary or metadata stream [1]

In top-level frames using document.title (where document is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from:

The embedded title
The filename from the Content-Disposition header
The last segment of the URL's path (eg. "test.pdf" in
"https://example.com/test.pdf")

When PDF.js is embedded in an iframe however, it does not set document.title by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title.

This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set document.title, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The Content-Disposition filename is not exposed as a public property on PDFViewerApplication, so PDFMetadata#getMetadata was refactored to call pdfDocument.getMetadata instead.

Fixes #3372

[1] See section 14.3, "Metadata" in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf for more information about these parts of a PDF.

Testing:

For each of the following cases, visit the URL, create an annotation and check the document.title field in the payload of the request sent to the server:

For http://localhost:3000/pdf/gatsby the embedded title "The Great Gatsby" should be used
For http://localhost:3000/pdf/nils-olav the last part of the URL, nils-olav, should be used
Apply the following diff and revisit http://localhost:3000/pdf/nils-olav. The title "Use this filename.pdf" should be used:

diff --git a/dev-server/serve-dev.js b/dev-server/serve-dev.js
index 9201b9a82..42eb33d6c 100644
--- a/dev-server/serve-dev.js
+++ b/dev-server/serve-dev.js
@@ -78,7 +78,11 @@ function serveDev(port, config) {
 
   // Serve static PDF files out of the PDF directory, but serve under
   // `/pdf-source/` — these are needed by PDF JS viewer
-  app.use('/pdf-source', express.static(PDF_PATH));
+  app.use('/pdf-source', express.static(PDF_PATH, {
+    setHeaders(res) {
+      res.set('Content-Disposition', 'attachment; filename=Use%20this%20filename.pdf');
+    }
+  }));
 
   // Enable CORS for assets so that cross-origin font loading works.
   app.use(function (req, res, next) {

robertknight · 2021-05-10T11:38:26Z

The logic that PDF.js uses to set document.title after the PDF is downloaded is here: https://github.com/mozilla/pdf.js/blob/f07d50f8eeced1cd9cc967e938485893d584fc32/web/app.js#L1499. The document title is set to the first non-empty value from:

The dc:title metadata field within the PDF
The Title property from the document information dictionary (see section 9.2.1 in the PDF specification)
The filename directive from the Content-Disposition header of the response used when retrieving the file

While the file is loading, and after it has loaded if none of the above exists, the title is set based on the filename extracted from the URL. This is the last element of the URL's path after the final /.

This seems like a reasonable set of sources to consider for the title. The Hypothesis client currently implements the logic to check for (1) and (2) itself. It doesn't implement (3) or (4) but implicitly gets that as a result of its fallback to document.title, except when PDF.js is embedded in an iframe and doesn't set document.title.

In this PR I removed the fallback to use document.title, but that will mean losing the Content-Disposition-based title. Also the other parts of Hypothesis that render document titles don't currently have the same filename-from-URL logic that PDF.js used to set document.title, which will mean that parts of Hypothesis that show document titles will change.

I think the safest change here would be to implement the filename-from-Content-Disposition and filename-from-URL fallbacks ourselves in the PDFMetadata class. This will mean that the results will be the same for users as before, but it won't be reliant on PDF.js setting document.title.

robertknight · 2021-05-10T12:40:06Z

I have pushed a change which modifies the logic used to generate the title property returned by PDFMetadata#getMetadata to match how PDF.js sets document.title, in the case where the PDF has no embedded title metadata.

With this change the document title in the new Via service should always be set the same way it is for legacy Via and in the Chrome extension.

A caveat with the current implementation is that it relies on a private PDFViewerApplication._contentDispositionFilename property. It would be preferable to avoid this reference.

codecov · 2021-05-10T12:40:15Z

Codecov Report

Merging #3374 (b16b216) into master (5ffeba9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head b16b216 differs from pull request most recent head 0775ac5. Consider uploading reports for the commit 0775ac5 to get more accurate results

@@           Coverage Diff           @@
##           master    #3374   +/-   ##
=======================================
  Coverage   98.43%   98.43%           
=======================================
  Files         213      213           
  Lines        7729     7739   +10     
  Branches     1751     1754    +3     
=======================================
+ Hits         7608     7618   +10     
  Misses        121      121

Impacted Files	Coverage Δ
src/annotator/integrations/pdf-metadata.js	`100.00% <100.00%> (ø)`
src/sidebar/services/groups.js	`100.00% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ffeba9...0775ac5. Read the comment docs.

robertknight · 2021-05-11T09:22:00Z

src/annotator/integrations/pdf-metadata.js

        title = app.documentInfo.Title;
+      } else if (app._contentDispositionFilename) {
+        title = app._contentDispositionFilename;


This is a private property that was recently renamed from contentDispositionFilename. I want to find a non-private way to get this.

Replace the usage of `document.title` as a way to get the document title if the PDF has no embedded title in either its _document info dictionary_ or _metadata stream_. In top-level frames using `document.title` (where `document` is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from: 1. The embedded title 2. The filename from the `Content-Disposition` header 3. The last segment of the URL's path (eg. "test.pdf" in "https://example.com/test.pdf") When PDF.js is embedded in an iframe however, it does not set `document.title` by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title. This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set `document.title`, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The `Content-Disposition` filename is not exposed as a public property on `PDFViewerApplication`, so `PDFMetadata#getMetadata` was refactored to call the `pdfDocument.getMetadata` instead. Fixes #3372

lyzadanger

These changes appear to do what they claim! A couple of notes for posterity in case anyone else tries to work through the testing steps in the PR description:

You'll need to restart your local webserver after applying the patch to set Content-Disposition headers on static PDFs
I had to disable cache in my browser to see the Content-Disposition headers

I struggled a little here with the type naming, especially with the two Metadata types. And the complexity of the test fakes continue to make me a little bit nervous, but nothing new there!

Otherwise, well documented and logically clear.

lyzadanger · 2021-05-11T16:23:58Z

src/annotator/integrations/pdf-metadata.js

+      link.push({ href: url });
+    }
+
+    return {


It took me a bit to figure out that the Metadata type that this returns is not the new Metadata type that was defined in these changes—that's a bit confusing!

lyzadanger · 2021-05-11T16:24:25Z

src/types/pdfjs.js

+ * Document metadata parsed from the PDF's _metadata stream_.
+ *
+ * See `Metadata` class from `display/metadata.js` in PDF.js.
+ *


I'm wondering if this type could have a more descriptive name.

Hmm, well Metadata is what the class is called in PDF.js. I suppose you could alias it when importing it elsewhere.

lyzadanger · 2021-05-11T16:25:21Z

src/annotator/integrations/pdf-metadata.js

+    //
+    // This logic is similar to how PDF.js sets `document.title`.
+    let title;
+    if (metadata?.has('dc:title') && metadata.get('dc:title') !== 'Untitled') {


This is a good example where a long chain of conditionals makes sense for readability 👍🏻

robertknight · 2021-05-11T19:27:49Z

Thanks for the feedback Lyza. I don't immediately have a better scheme for naming regarding the various Metadata-related types, so I'm going to get this merged and revisit later.

Base automatically changed from refactor-pdf-metadata-tests to master May 11, 2021 08:51

robertknight force-pushed the do-not-use-document-dot-title-in-pdfs branch 2 times, most recently from bfa2fa5 to f67f486 Compare May 11, 2021 08:59

robertknight commented May 11, 2021

View reviewed changes

robertknight force-pushed the do-not-use-document-dot-title-in-pdfs branch from b16b216 to 0ca222b Compare May 11, 2021 12:03

robertknight changed the title ~~Do not use document.title as a fallback for title in PDFs~~ Replace document.title as a fallback in PDFs without embedded title metadata May 11, 2021

robertknight force-pushed the do-not-use-document-dot-title-in-pdfs branch from 0ca222b to 6d7d29b Compare May 11, 2021 12:29

robertknight force-pushed the do-not-use-document-dot-title-in-pdfs branch from 6d7d29b to 0775ac5 Compare May 11, 2021 12:29

robertknight marked this pull request as ready for review May 11, 2021 12:41

robertknight requested a review from lyzadanger May 11, 2021 12:42

lyzadanger approved these changes May 11, 2021

View reviewed changes

robertknight merged commit 40eaf5c into master May 11, 2021

robertknight deleted the do-not-use-document-dot-title-in-pdfs branch May 11, 2021 19:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace `document.title` as a fallback in PDFs without embedded title metadata #3374

Replace `document.title` as a fallback in PDFs without embedded title metadata #3374

robertknight commented May 5, 2021 •

edited

Loading

robertknight commented May 10, 2021 •

edited

Loading

robertknight commented May 10, 2021

codecov bot commented May 10, 2021 •

edited

Loading

robertknight May 11, 2021

lyzadanger left a comment

lyzadanger May 11, 2021

lyzadanger May 11, 2021

robertknight May 11, 2021

lyzadanger May 11, 2021

robertknight commented May 11, 2021

Replace document.title as a fallback in PDFs without embedded title metadata #3374

Replace document.title as a fallback in PDFs without embedded title metadata #3374

Conversation

robertknight commented May 5, 2021 • edited Loading

robertknight commented May 10, 2021 • edited Loading

robertknight commented May 10, 2021

codecov bot commented May 10, 2021 • edited Loading

Codecov Report

robertknight May 11, 2021

Choose a reason for hiding this comment

lyzadanger left a comment

Choose a reason for hiding this comment

lyzadanger May 11, 2021

Choose a reason for hiding this comment

lyzadanger May 11, 2021

Choose a reason for hiding this comment

robertknight May 11, 2021

Choose a reason for hiding this comment

lyzadanger May 11, 2021

Choose a reason for hiding this comment

robertknight commented May 11, 2021

Replace `document.title` as a fallback in PDFs without embedded title metadata #3374

Replace `document.title` as a fallback in PDFs without embedded title metadata #3374

robertknight commented May 5, 2021 •

edited

Loading

robertknight commented May 10, 2021 •

edited

Loading

codecov bot commented May 10, 2021 •

edited

Loading