Added support for PDF attachments. #177
Is there any interest in this feature? |
Hi Colin. Thanks a lot for contributing! The code looks good at a quick glance (though I’ll need more time to review it.) But first, about the feature itself. Can you explain a bit the use case? Why do you want this, how is it useful? What kind of files do you expect to attach? I’ve never heard of |
Hi Simon, thanks for this great tool 👍 ! PDF files can embed arbitrary files, which are accessible either through clickable annotations or a global file list (the paperclip). We use it to add data files to PDF reports at the office. For example Adobe Reader allows you to attach files as annotations in the toolbox, Prince provides an I chose the Adding support for attachment annotations seems easy enough and I'd update the patch if you're considering it for inclusion, the use case being files related to a section of a document. If there are any problems with the patch, I'd be happy to change what's necessary. |
I think adding global attachments from "out of band" parameters (i.e. in the Python API or with command-line flags) is fine, but I’m less certain about the HTML links. Are the links something you need, or just something that seemed nice/easy to add? As to "annotation attachments", do they appear in a specific position in the document? In that case anything but HTML links might be hard. |
The OOB attachments are the most important of course. I wouldn't mind leaving the Annotation attachments should be rendered like any other link, but clicking one of these links would open the viewer's "Save as" dialog. It would require similar treatment as the internal links on the PDF level. They have the |
What’s OOB?
Ok, so instead of linking to the URL. |
out-of-band |
I fixed the important TODOs and a couple of bugs, enhanced the testcase, tested with Python 2 and 3 (MD5 sums of the generated PDFs match) and checked an advanced output with the following readers:
What I didn't test is how the filesystem encoding might influence the filenames (all test files have been generated on Linux), but as they all go through I kept the python3 -m weasyprint http://colin.de/test.html output.pdf I'll add the annotation attachments in a different commit. |
I think the patches are ready for review 😓 . |
Looks like great work, thanks! I still need to take time for the review, sorry… |
@@ -70,11 +70,15 @@ class HTML(object):
Defaults to ``'print'``. **Note:** In some cases like
``HTML(string=foo)`` relative URLs will be invalid if ``base_url``
is not provided.
:param attachments: A list of tuples, where each element describes an
attachment to the document. The tuple contains a URL and a description,
Please mention PDF in this description.
Correct me if I’m wrong, but this sounds like you’re not gonna use this bit of feature. I’m still uncomfortable with WeasyPrint supporting non-standard HTML that not only is not in the HTML spec, but is not described in any spec anywhere. So unless you actually want to use this feature, "it feels just right" is not good enough. Please remove the |
… of the `Document` class
This patch honors the filename key of a fetched resource, which can be set by the `Content-Disposition` or `Content-Type` headers and uses `mimetypes.guess_extension` for resources that lack any indication of a filename.
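The filename fallback described in this commit message can be sketched with the standard library alone. This is a minimal illustration, not the code from the patch: the function name `guess_attachment_filename` and the `default_stem` parameter are hypothetical, and the headers are assumed to arrive as a plain dictionary.

```python
import mimetypes
from email.message import Message


def guess_attachment_filename(headers, default_stem="attachment"):
    """Pick a filename for a fetched resource (hypothetical helper).

    `headers` is assumed to be a dict of HTTP response headers.
    """
    message = Message()
    for name, value in headers.items():
        message[name] = value
    # get_filename() checks the `filename` parameter of Content-Disposition
    # first, then the `name` parameter of Content-Type.
    filename = message.get_filename()
    if filename:
        return filename
    # No filename given: derive an extension from the MIME type, if known.
    extension = mimetypes.guess_extension(message.get_content_type()) or ""
    return default_stem + extension
```

A resource served as `application/pdf` with no filename in either header would end up named after `default_stem` plus the guessed extension.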
Alright, let’s do this. Starting a review of a9fd32c and earlier commits: In the Python API ( This also does the right thing for command-line arguments: a string is interpreted as a URL if it looks like an absolute URL, a filename otherwise. With that, you can remove the URL manipulation code in I have a slight preference for I’m still not convinced by Rather than having a In Regarding the issue of multiple links with the same URL but different descriptions: I think that’s OK. Regarding the rectangles for links and CSS transforms, please open a separate issue. I don’t know what In tests, why |
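The "URL if it looks like an absolute URL, a filename otherwise" rule from the review can be illustrated roughly as follows. This is only a sketch of the idea: `looks_like_absolute_url` and `attachment_source_to_url` are hypothetical names, and WeasyPrint's own check may differ in detail.

```python
from pathlib import Path
from urllib.parse import urlsplit


def looks_like_absolute_url(value):
    """Return True if `value` starts with a URL scheme (hypothetical check)."""
    scheme = urlsplit(value).scheme
    # Require more than one character so that a Windows drive letter
    # such as "C:" is not mistaken for a scheme.
    return len(scheme) > 1


def attachment_source_to_url(value):
    """Pass absolute URLs through; turn anything else into a file: URL."""
    if looks_like_absolute_url(value):
        return value
    return Path(value).resolve().as_uri()
```

With this rule a command-line argument such as `http://colin.de/test.html` is kept as-is, while `report/data.csv` is resolved against the current directory.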
…nt for `write_pdf`
… type with no `get_filename` method
…'s actually necessary to special case the unquoted result
I'll have to take some time to understand the implications of converting the tuples to a guessed source.
Sounds good. I removed the attachments attribute and added an argument to
I moved the special handling to
My git-fu is not good enough to see this through. Is it OK to cherry-pick that commit into this branch and remerge it with the rest of this patch?
|
I had to fix the filename logic for Python 2 in a8a951b. Should I move that logic into |
That’s ok. As said earlier, encoding of URLs and filenames in WeasyPrint overall is busted and needs to be rethought. This works for this PR.
Yeah, that works. We’ll end up with a duplicate commit in the history, which is not ideal but meh.
Yeah, actually regardless of handles vs. names, you should use
Yeah, the idea of
|
…b_get_charset` and `urllib_get_filename`.
…instead of the URL/description tuples
I hope the changes to support the Unfortunately I had to sacrifice the filename detection because |
Good job! I pushed the merged commit now that I had it after resolving conflicts, but one remaining issue is that I also fixed some minor stylistic issues:
Flake8 detects most of this automatically. |
Fixed in 9b0488c. |
Sorry, I totally missed that during the refactoring. Thanks for merging this feature! |
Let me know if you want to have this in a PyPI release. |
That's a kind offer, but until a fix for #132 has been merged I still have to use a patched version of WeasyPrint anyway so no reason to hurry 😄 . |
This commit adds support for PDF (global) attachments, which can be used from the command-line tool with the `-a` option, the `HTML` constructor and `<link rel=attachment>` elements.
The attachment's data is compressed and an MD5 checksum is included in the object stream. The implementation avoids seeking in the PDF stream and copies the data directly without reading the whole resource into memory.
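The streaming approach described here (compress and checksum in one pass, without loading the whole resource) might look roughly like the sketch below. It is not the PR's code: `copy_compressed` and its `chunk_size` parameter are hypothetical, and the PDF object framing around the stream is left out.

```python
import hashlib
import zlib


def copy_compressed(source, target, chunk_size=4096):
    """Copy `source` to `target` deflate-compressed, chunk by chunk.

    Returns (md5_hexdigest, uncompressed_length, compressed_length).
    Both arguments are assumed to be binary file-like objects.
    """
    md5 = hashlib.md5()
    compressor = zlib.compressobj()
    uncompressed_length = 0
    compressed_length = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
        uncompressed_length += len(chunk)
        data = compressor.compress(chunk)
        compressed_length += len(data)
        target.write(data)
    # Flush whatever the compressor is still buffering.
    data = compressor.flush()
    compressed_length += len(data)
    target.write(data)
    return md5.hexdigest(), uncompressed_length, compressed_length
```

Because the lengths and the checksum are only known once the copy has finished, a writer that must not seek would have to record them after the stream data rather than before it.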
I have tested the feature to work with evince and Adobe Reader 9 on Linux.
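For the `<link rel=attachment>` route, collecting the attachments from a document could be sketched with the standard library's HTML parser. The class name `AttachmentLinkCollector` is hypothetical, and using the `title` attribute as the description is an assumption made for this illustration, not something stated in this thread.

```python
from html.parser import HTMLParser


class AttachmentLinkCollector(HTMLParser):
    """Collect (href, description) pairs from <link rel=attachment> elements."""

    def __init__(self):
        super().__init__()
        self.attachments = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # `rel` is a space-separated list of keywords, compared case-insensitively.
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "attachment" in rel_values and href:
            # Assumption: the title attribute carries the description.
            self.attachments.append((href, attributes.get("title")))
```

Feeding a document to the collector with `feed()` leaves the pairs in `attachments`, in document order.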
Things that need testing include
TODOs:
- … `data:` URLs. Unfortunately `data-` attributes are reserved for document authors and `<link>` has no other suitable attribute. This is of course an obscure use case.
- … `http:` and `file:`. If the filename can be deduced from the URL, the attachment tuples could be simplified to `(url, description)`. This should probably be changed.
- … `url_fetcher` to the PDF writer.
- … `<a rel=attachment>` and `<area rel=attachment>` elements. This would require some bookkeeping and special handling inside the post fixup of links, but shouldn't be too hard to do. I'll probably take a shot at it if this feature is accepted.