This repository has been archived by the owner on Jun 15, 2023. It is now read-only.
PDF files pointed to by other PDF files need not have a .pdf extension to be identified as such. I had to apply the following patch to be able to download PDFs recursively (in my case, they had no extension):
```diff
diff --git a/pdfx/__init__.py b/pdfx/__init__.py
index 6042e26..8411235 100644
--- a/pdfx/__init__.py
+++ b/pdfx/__init__.py
@@ -194,7 +194,7 @@ class PDFx(object):
             logger.debug("- Saved metadata to '%s'" % fn_json)
         # Download references
-        urls = [ref.ref for ref in self.get_references("pdf")]
+        urls = [ref.ref for ref in self.get_references()]
         if not urls:
             return
```
Of course, this quick fix brings problems of its own: pdfx will try (and fail) to download `mailto:` links, and it will download whatever websites happen to be linked. The point is that pdfx should allow a custom regex, or something similar, to pick out the desirable files among the references. It could also allow a posteriori file checking: download a file, check whether it is a PDF, and delete it if it is not.
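To make the suggestion concrete, here is a rough sketch of what such filtering could look like. None of these helpers exist in pdfx: `is_candidate_url`, `looks_like_pdf`, `remove_if_not_pdf` and `ALLOWED_SCHEMES` are hypothetical names, and the `%PDF-` check is only a heuristic on the first bytes of the file. Note that this sketch also rejects references without a URL scheme, which may or may not be what you want.

```python
import os
import re
from urllib.parse import urlparse

# Hypothetical helpers for illustration; they are not part of pdfx.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_candidate_url(url, pattern=None):
    """Return True if `url` is worth trying to download."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        return False  # drops mailto: links and scheme-less references
    if pattern is not None and not re.search(pattern, url):
        return False  # user-supplied regex did not match
    return True


def looks_like_pdf(data):
    """Heuristic: PDF files carry a '%PDF-' marker near the start."""
    return b"%PDF-" in data[:1024]


def remove_if_not_pdf(path):
    """Delete the file at `path` unless it looks like a PDF.

    Returns True if the file was kept, False if it was removed.
    """
    with open(path, "rb") as fh:
        head = fh.read(1024)
    if looks_like_pdf(head):
        return True
    os.remove(path)
    return False
```

With something like this, the patched line could filter `self.get_references()` through `is_candidate_url` before downloading, and each downloaded file could be passed to `remove_if_not_pdf` afterwards.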
theiostream changed the title from "PDFs should not be treated as such based on extension" to "PDF references should not be treated as such based on extension" on Oct 12, 2018.