This repository has been archived by the owner on Jun 15, 2023. It is now read-only.

PDF references should not be treated as such based on extension #30

Open
theiostream opened this issue Oct 12, 2018 · 0 comments


@theiostream

PDF files referenced by other PDF files do not need a `.pdf` extension to be valid PDFs, so pdfx should not rely on the extension to identify them. I had to apply the following patch to download PDFs recursively (in my case, the referenced files had no extension):

```diff
diff --git a/pdfx/__init__.py b/pdfx/__init__.py
index 6042e26..8411235 100644
--- a/pdfx/__init__.py
+++ b/pdfx/__init__.py
@@ -194,7 +194,7 @@ class PDFx(object):
         logger.debug("- Saved metadata to '%s'" % fn_json)

         # Download references
-        urls = [ref.ref for ref in self.get_references("pdf")]
+        urls = [ref.ref for ref in self.get_references()]
         if not urls:
             return
```

Of course, this quick fix brings its own problems: pdfx will try (and fail) to download `mailto:` links, and it will download any arbitrary website that happens to be linked. The point is that pdfx should offer some way, such as a custom regex, to select the desired files among the references. It could also support a posteriori checking: download a file, check whether it is a PDF, and delete it if it is not.
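
A rough sketch of what both ideas could look like, in plain Python rather than pdfx's actual API. The names `is_candidate_url` and `looks_like_pdf` are hypothetical, the list of allowed schemes is an assumption, and the content check assumes the file begins with the `%PDF-` header:

```python
import re
from urllib.parse import urlparse

# Assumption: only these URL schemes are worth trying to download.
DOWNLOADABLE_SCHEMES = ("http", "https", "ftp")

# Header bytes that a PDF file is expected to start with.
PDF_MAGIC = b"%PDF-"


def is_candidate_url(url, pattern=None):
    """Return True if `url` is downloadable and matches the optional regex."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in DOWNLOADABLE_SCHEMES:
        # Rejects mailto: links and references without a scheme.
        return False
    return pattern is None or re.search(pattern, url) is not None


def looks_like_pdf(data):
    """A posteriori check: does the downloaded content start with the PDF header?"""
    return data.startswith(PDF_MAGIC)


if __name__ == "__main__":
    refs = [
        "https://example.org/download?id=42",  # a PDF with no extension
        "mailto:someone@example.org",
        "https://example.com/some-page",
    ]
    # Keep only references on example.org (hypothetical user-supplied regex).
    print([r for r in refs if is_candidate_url(r, r"example\.org/")])
```

A downloader could apply `is_candidate_url` before fetching and `looks_like_pdf` on the first bytes of each response, deleting any file that fails the second check.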
theiostream changed the title from "PDFs should not be treated as such based on extension" to "PDF references should not be treated as such based on extension" on Oct 12, 2018