PDF references should not be treated as such based on extension #30

theiostream · 2018-10-12T19:46:17Z

A PDF referenced from another PDF does not necessarily have a `.pdf` extension, so the extension should not be what identifies it as a PDF. I had to apply the following patch to download referenced PDFs recursively (in my case, the URLs had no extension):

```diff
diff --git a/pdfx/__init__.py b/pdfx/__init__.py
index 6042e26..8411235 100644
--- a/pdfx/__init__.py
+++ b/pdfx/__init__.py
@@ -194,7 +194,7 @@ class PDFx(object):
         logger.debug("- Saved metadata to '%s'" % fn_json)

         # Download references
-        urls = [ref.ref for ref in self.get_references("pdf")]
+        urls = [ref.ref for ref in self.get_references()]
         if not urls:
             return
```

Of course, this quick fix brings its own problems: pdfx will try (and fail) to download `mailto:` links, and it will download arbitrary web pages that the PDF links to. The point is that pdfx should offer some way, such as a user-supplied regex, to select the desired files among the references. It could also support a posteriori checking: download a file, check whether it is a PDF, and delete it if not.
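
To make the suggestion concrete, here is a rough sketch of both ideas in Python 3: filtering references by URL scheme plus an optional regex, and checking the downloaded bytes for the `%PDF-` marker before saving. None of this is existing pdfx code; the names `is_candidate_url`, `looks_like_pdf`, `download_if_pdf` and `ALLOWED_SCHEMES` are hypothetical, and the sketch checks the content before writing to disk rather than deleting afterwards.

```python
import re
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical allow-list; anything else (mailto:, javascript:, ...) is skipped.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_candidate_url(url, pattern=None):
    """Return True if url has a downloadable scheme and matches the optional regex."""
    if urlparse(url).scheme.lower() not in ALLOWED_SCHEMES:
        return False
    return pattern is None or re.search(pattern, url) is not None


def looks_like_pdf(data):
    """Return True if the PDF marker appears near the start of the data."""
    # Many readers tolerate leading junk, so search the first 1024 bytes.
    return b"%PDF-" in data[:1024]


def download_if_pdf(url, dest, timeout=30):
    """Fetch url and write it to dest only if the content looks like a PDF."""
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    if not looks_like_pdf(data):
        return False  # not a PDF: nothing is written to disk
    with open(dest, "wb") as fh:
        fh.write(data)
    return True
```

With something like this, the reference list from `self.get_references()` could be passed through `is_candidate_url` first, so `mailto:` links are never attempted, and ordinary web pages would be dropped by `looks_like_pdf` after the fetch.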

theiostream changed the title ~~PDFs should not be treated as such based on extension~~ PDF references should not be treated as such based on extension Oct 12, 2018
