Google Books .pdf document equivalence problem #7884

Open
chrisaldrich opened this issue Mar 12, 2023 · 1 comment

@chrisaldrich

I've noticed on a couple of .pdf documents from Google books that their fingerprints, lack thereof, or some other glitch in creating document equivalency all seem to clash creating orphans.

For example, the downloadable .pdf of Geyer's Stationer (1904) found at
https://www.google.com/books/edition/Geyer_s_Stationer/L507AQAAMAAJ?hl=en&gbpv=0 currently has 109 orphaned annotations caused by this issue.

See also a specific annotation on this document: https://hypothes.is/a/vNmUHMB3Ee2VKgt4yhjofg

@robertknight
Member

I've noticed on a couple of .pdf documents from Google books that their fingerprints, lack thereof, or some other glitch in creating document equivalency all seem to clash creating orphans.

For debugging issues like this, the fingerprint is displayed under Help => About this version.

I don't see a PDF download option on that page. Instead there is a "Preview unavailable" message. This might be related to us being in different locations.

I was, however, able to reproduce your problem by searching for two different freely available books ("Great Expectations" and "Oliver Twist") and clicking the "Download PDF" link for the first freely available result. In both cases the books had different URLs and content but the same fingerprint value ("ca474facea1eb6917376bd8394b060ad"). This looks like an MD5 hash of some value, but I don't know of what. From the browser console, there is some more info about how the PDF was created:

PDF ca474facea1eb6917376bd8394b060ad [1.4 Google Books PDF Converter (rel 3 12/12/14) / -] (PDF.js: 2.14.137)
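When comparing several affected documents, it can help to pull the fields out of that console line programmatically. The following is a minimal sketch, assuming the line has the layout `PDF <fingerprint> [<PDF version> <producer> / <creator>] (PDF.js: <version>)`; the field names, the `PdfLogInfo` type and the `parse_pdf_log_line` helper are hypothetical and not part of any Hypothesis or PDF.js API.

```python
import re
from typing import NamedTuple, Optional


class PdfLogInfo(NamedTuple):
    fingerprint: str
    pdf_version: str
    producer: str
    creator: str
    pdfjs_version: str


# Assumed layout (an interpretation of the console line, not a documented format):
#   PDF <fingerprint> [<version> <producer> / <creator>] (PDF.js: <version>)
LOG_PATTERN = re.compile(
    r"^PDF (?P<fingerprint>[0-9a-f]+) "
    r"\[(?P<pdf_version>\S+) (?P<producer>.*) / (?P<creator>.*)\] "
    r"\(PDF\.js: (?P<pdfjs_version>[^)]+)\)$"
)


def parse_pdf_log_line(line: str) -> Optional[PdfLogInfo]:
    """Split one console line into its fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return PdfLogInfo(**match.groupdict())


line = (
    "PDF ca474facea1eb6917376bd8394b060ad "
    "[1.4 Google Books PDF Converter (rel 3 12/12/14) / -] (PDF.js: 2.14.137)"
)
info = parse_pdf_log_line(line)
```

Collecting `info.fingerprint` and `info.producer` for a handful of downloads would show quickly whether every file from the same converter release shares one fingerprint.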

The PDF converter release date mentioned is quite old, so it is possible that the issue has been fixed in a more recent version. It would be worth checking some more recently published books to see whether that is the case. If the problem still exists, it would be worth reporting to Google, since the fingerprint is a standard part of the PDF specification, where it is called the "File ID".
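To check newer downloads without opening each one in a viewer, the File ID can be read straight from the file: it is stored in the PDF trailer dictionary as an `/ID` array of two byte strings. Below is a rough sketch, assuming the ID is written as hex strings (`<...>`) and is visible in the raw bytes; `extract_file_id` is a hypothetical helper, not a real PDF parser, and it will miss IDs written as literal strings.

```python
import re
from typing import Optional

# Matches a trailer entry of the form: /ID [<hex> <hex>]
ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f\s]*)>\s*<([0-9A-Fa-f\s]*)>\s*\]"
)


def extract_file_id(pdf_bytes: bytes) -> Optional[str]:
    """Return the first element of the /ID array as lowercase hex, or None."""
    matches = ID_PATTERN.findall(pdf_bytes)
    if not matches:
        return None
    # Assumption: if the file was updated incrementally, the last trailer is current.
    first, _second = matches[-1]
    hex_id = b"".join(first.split()).decode("ascii").lower()
    return hex_id or None


# Tiny hand-written fragment for illustration only; not a complete PDF.
sample = b"%PDF-1.4\ntrailer\n<< /Size 1 /ID [<CA474FAC> <00FF>] >>\n%%EOF"
file_id = extract_file_id(sample)
```

Running this over two different downloads and comparing the results would show whether the files themselves carry identical IDs or whether no ID is present at all.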

As for workarounds, we could perhaps do something like checking for known-bad PDF generation tools and substituting some other fingerprint. This would break existing annotation links though.
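A sketch of what that workaround could look like, written in Python for brevity. Everything here is an assumption for illustration: the `KNOWN_BAD_PRODUCERS` list, the `effective_fingerprint` helper, and the choice of a SHA-256 content hash as the substitute are hypothetical and do not describe how the Hypothesis client behaves.

```python
import hashlib

# Hypothetical deny-list of generator names whose fingerprints are not unique.
KNOWN_BAD_PRODUCERS = ("Google Books PDF Converter",)


def effective_fingerprint(reported: str, producer: str, content: bytes) -> str:
    """Return the reported fingerprint unless the producer is known to be unreliable.

    For deny-listed producers, fall back to a hash of the file content
    (one possible substitute among several).
    """
    if any(producer.startswith(bad) for bad in KNOWN_BAD_PRODUCERS):
        return hashlib.sha256(content).hexdigest()
    return reported


shared = "ca474facea1eb6917376bd8394b060ad"
producer = "Google Books PDF Converter (rel 3 12/12/14)"
book_a = effective_fingerprint(shared, producer, b"contents of book A")
book_b = effective_fingerprint(shared, producer, b"contents of book B")
```

With this substitution the two books no longer collide, but annotations already stored against the shared fingerprint would not match the new value, which is the link breakage mentioned above.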
