Replace `document.title` as a fallback in PDFs without embedded title metadata #3374
The logic that PDF.js uses to set
While the file is loading, and after it has loaded if none of the above exists, the title is set based on the filename extracted from the URL. This is the last element of the URL's path after the final
This seems like a reasonable set of sources to consider for the title. The Hypothesis client currently implements the logic to check for (1) and (2) itself. It doesn't implement (3) or (4) but implicitly gets that as a result of its fallback to
In this PR I removed the fallback to use
I think the safest change here would be to implement the filename-from-Content-Disposition and filename-from-URL fallbacks ourselves in the
I have pushed a change which modifies the logic used to generate the
With this change the document title in the new Via service should always be set the same way it is for legacy Via and in the Chrome extension. A caveat with the current implementation is that it relies on a private
Codecov Report
@@ Coverage Diff @@
## master #3374 +/- ##
=======================================
Coverage 98.43% 98.43%
=======================================
Files 213 213
Lines 7729 7739 +10
Branches 1751 1754 +3
=======================================
+ Hits 7608 7618 +10
Misses 121 121
title = app.documentInfo.Title;
} else if (app._contentDispositionFilename) {
title = app._contentDispositionFilename;
This is a private property that was recently renamed from `contentDispositionFilename`. I want to find a non-private way to get this.
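One possible non-private route, which the later commit message in this thread describes, is to read the filename from the object that `pdfDocument.getMetadata()` resolves to. The sketch below assumes that object carries a `contentDispositionFilename` field (I believe recent PDF.js releases do this, but it is worth checking against the bundled version). `getContentDispositionFilename` and the `fakePdfDocument` stub are hypothetical names used only for illustration, not code from this PR.

```javascript
// Hypothetical helper: read the Content-Disposition filename through the
// `getMetadata()` promise instead of a private viewer property.
// Assumption: the resolved object has a `contentDispositionFilename` field.
async function getContentDispositionFilename(pdfDocument) {
  const { contentDispositionFilename } = await pdfDocument.getMetadata();
  // Normalize "missing" to null so callers can fall through to other sources.
  return contentDispositionFilename ?? null;
}

// Stand-in for a PDF.js document object, for illustration only.
const fakePdfDocument = {
  getMetadata: async () => ({
    info: {},
    metadata: null,
    contentDispositionFilename: 'report.pdf',
  }),
};
```

A stub like `fakePdfDocument` would also keep a test fake smaller than one that mimics the whole viewer application.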
Replace the usage of `document.title` as a way to get the document title if the PDF has no embedded title in either its _document info dictionary_ or _metadata stream_.

In top-level frames, using `document.title` (where `document` is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from:

1. The embedded title
2. The filename from the `Content-Disposition` header
3. The last segment of the URL's path (e.g. "test.pdf" in "https://example.com/test.pdf")

When PDF.js is embedded in an iframe, however, it does not set `document.title` by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title.

This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set `document.title`, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The `Content-Disposition` filename is not exposed as a public property on `PDFViewerApplication`, so `PDFMetadata#getMetadata` was refactored to call `pdfDocument.getMetadata` instead.

Fixes #3372
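The fallback order in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `pickDocumentTitle` and its parameter names are hypothetical, and it assumes the embedded title and `Content-Disposition` filename have already been extracted by the caller.

```javascript
// Hypothetical helper mirroring the fallback order in the commit message.
// `embeddedTitle` and `contentDispositionFilename` may be null or empty.
function pickDocumentTitle({ embeddedTitle, contentDispositionFilename, url }) {
  // (1) Title embedded in the PDF itself.
  if (embeddedTitle) {
    return embeddedTitle;
  }
  // (2) Filename from the `Content-Disposition` header.
  if (contentDispositionFilename) {
    return contentDispositionFilename;
  }
  // (3) Last segment of the URL's path; query string and fragment are
  // excluded because only `pathname` is used.
  const lastSegment = new URL(url).pathname.split('/').pop();
  try {
    return decodeURIComponent(lastSegment);
  } catch {
    // Malformed percent-encoding: fall back to the raw segment.
    return lastSegment;
  }
}

// Usage: no embedded title and no header filename, so the URL is used.
pickDocumentTitle({
  embeddedTitle: '',
  contentDispositionFilename: null,
  url: 'https://example.com/test.pdf',
}); // → "test.pdf"
```

Taking plain values rather than the viewer application object keeps the sketch independent of any private PDF.js properties.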
These changes appear to do what they claim! A couple of notes for posterity in case anyone else tries to work through the testing steps in the PR description:
- You'll need to restart your local webserver after applying the patch to set `Content-Disposition` headers on static PDFs
- I had to disable cache in my browser to see the `Content-Disposition` headers

I struggled a little here with the type naming, especially with the two `Metadata` types. And the complexity of the test fakes continues to make me a little bit nervous, but nothing new there!
Otherwise, well documented and logically clear.
link.push({ href: url });
}

return {
It took me a bit to figure out that the `Metadata` type that this returns is not the new `Metadata` type that was defined in these changes. That's a bit confusing!
* Document metadata parsed from the PDF's _metadata stream_.
*
* See `Metadata` class from `display/metadata.js` in PDF.js.
*
I'm wondering if this type could have a more descriptive name.
Hmm, well `Metadata` is what the class is called in PDF.js. I suppose you could alias it when importing it elsewhere.
//
// This logic is similar to how PDF.js sets `document.title`.
let title;
if (metadata?.has('dc:title') && metadata.get('dc:title') !== 'Untitled') {
This is a good example where a long chain of conditionals makes sense for readability 👍🏻
Thanks for the feedback Lyza. I don't immediately have a better naming scheme for the various `Metadata`-related types, so I'm going to get this merged and revisit later.
(Edit: Updated to reflect new implementation)
Replace the usage of `document.title` as a way to get the document title if the PDF has no embedded title in either its document info dictionary or metadata stream [1].

In top-level frames, using `document.title` (where `document` is the global HTML document, not the PDF) works because PDF.js sets the title based on the first non-empty value from:

1. The embedded title
2. The filename from the `Content-Disposition` header
3. The last segment of the URL's path (e.g. "test.pdf" in "https://example.com/test.pdf")

When PDF.js is embedded in an iframe, however, it does not set `document.title` by default. As a result, documents were ending up in Hypothesis with a generic "PDF.js viewer" title.

This commit implements (roughly) the same logic that PDF.js uses to determine the value used to set `document.title`, in the case where the PDF has no embedded title. This means implementing steps (2) and (3) from the above list. The `Content-Disposition` filename is not exposed as a public property on `PDFViewerApplication`, so `PDFMetadata#getMetadata` was refactored to call `pdfDocument.getMetadata` instead.

Fixes #3372
[1] See section 14.3, "Metadata" in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf for more information about these parts of a PDF.
Testing:
For each of the following cases, visit the URL, create an annotation and check the `document.title` field in the payload of the request sent to the server:
`nils-olav`, should be used