Risks if generating PDF from user supplied HTML? #50
Comments
Hi @MartyMcMartface
Personally, if you have to use user-supplied HTML, I would run each conversion in something like a locked-down Docker container. Given that arbitrary code execution is unlikely, it should be extremely difficult to escape such a container.
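As a rough illustration of the "locked-down container per use" idea, the sketch below only assembles a `docker run` command line with networking disabled and memory, CPU, and process limits applied. The class name `SandboxedRender`, the image name passed in, the `/work` mount point, and the specific limit values are all assumptions made for this example; they are not part of openhtmltopdf.

```java
import java.nio.file.Path;
import java.util.List;

final class SandboxedRender {
    /**
     * Builds a `docker run` command for a single, throwaway conversion.
     * The image is assumed (hypothetically) to read input HTML from /work
     * and write the resulting PDF back to /work.
     */
    static List<String> dockerCommand(String image, Path workDir) {
        return List.of(
            "docker", "run", "--rm",                  // remove the container when it exits
            "--network", "none",                      // no outgoing requests from the renderer
            "--memory", "512m",                       // cap memory (assumed value)
            "--cpus", "1",                            // cap CPU (assumed value)
            "--pids-limit", "128",                    // cap process/thread count (assumed value)
            "--read-only",                            // read-only root filesystem
            "--tmpfs", "/tmp",                        // writable scratch space in memory only
            "--cap-drop", "ALL",                      // drop all Linux capabilities
            "--security-opt", "no-new-privileges",    // block privilege escalation
            "--volume", workDir.toAbsolutePath() + ":/work",
            image);
    }
}
```

The list can be handed to `new ProcessBuilder(command)`. A wall-clock limit is not covered here and would have to be enforced separately.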
Thanks Daniel. The fact that there is some sort of memory leak bug in either PDFBox or ImageIO is evidence enough to be wary of it. If it only happens with embedded URI-encoded images I can test the input for that, but if not it could cause problems even without malicious user intent. It also confirms that ImageIO uses native code to read a simple JPEG file, which is a bit surprising.
Could you please elaborate on what you mean by "outgoing links"? Are you referring to images in the supplied HTML that point to third-party servers? If so, I was planning to prevent that issue by getting the user's browser to fetch all the required images and post them to the server. If that's not what you meant, I'd be interested to know what other risk you are referring to. Also, how could the user cause an endless loop?
If I decide not to include the PDF generation functionality in the main application, do you know of any provider of an HTML to PDF web service that uses openhtmltopdf as a back end? That might be more cost-effective than creating one. I'm aware of docraptor.com (using a different back end) but it's a little pricey for my project.
Thanks again for taking the time to respond.
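The idea of testing the input for embedded images and third-party references could be sketched roughly as below. This is a coarse regex pre-filter written for illustration only, not a real HTML parser: it does not look at CSS `url(...)` references or `srcset`, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class HtmlPrecheck {
    // Matches src attributes whose value starts with a data: URI (embedded image).
    private static final Pattern DATA_URI_SRC = Pattern.compile(
        "\\bsrc\\s*=\\s*[\"']?\\s*data:", Pattern.CASE_INSENSITIVE);

    // Matches src/href attributes pointing at http://, https:// or protocol-relative // URLs.
    private static final Pattern REMOTE_REF = Pattern.compile(
        "\\b(?:src|href)\\s*=\\s*[\"']?\\s*(?:https?:)?//", Pattern.CASE_INSENSITIVE);

    static boolean hasDataUriImage(String html) {
        return DATA_URI_SRC.matcher(html).find();
    }

    static boolean hasRemoteReference(String html) {
        return REMOTE_REF.matcher(html).find();
    }
}
```

A check like this can only flag suspicious input early. It is not a substitute for isolating the renderer itself.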
Yes, outgoing links could be fetched either from your server with the [...]
I don't know of anyone providing open-source PDF renderers as a service (they mostly use Prince, I believe). However, another option is to use a serverless architecture. I think AWS Lambda allows you to set a memory limit and a time limit per run. The Lambda environment may be reused between runs, so it is not a protection against arbitrary code execution bugs, but it should prevent the other problems. Even better, it is free for small to medium workloads.
The only issue may be returning the PDF. Lambda doesn't allow binary data to be returned, so you would have to either base64-encode the result or upload it to S3 and fetch it from there.
Best of luck with your project,
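The base64 option mentioned above needs nothing beyond the JDK. The following is a minimal sketch; `PdfTransport` and its method names are hypothetical, and how the resulting string is returned from the function is left out.

```java
import java.util.Base64;

final class PdfTransport {
    /** Encodes the rendered PDF bytes as a base64 string so it can travel in a text response. */
    static String encode(byte[] pdfBytes) {
        return Base64.getEncoder().encodeToString(pdfBytes);
    }

    /** Decodes the base64 string back into the original PDF bytes on the receiving side. */
    static byte[] decode(String base64Pdf) {
        return Base64.getDecoder().decode(base64Pdf);
    }
}
```

Base64 inflates the payload by roughly a third, so uploading to S3 may be the better fit for large documents.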
Hi Daniel,
I want to use openhtmltopdf to generate PDFs that are partially based on user supplied HTML on a corporate website. I'm wondering if you could comment about any security issues that might raise, as it has the security team raising their eyebrows.
For example, there might be a bug in the image rendering code that allows code to be executed on the server if some specially crafted image is supplied to it. Personally I think that scenario is very unlikely, but I'd like to hear your take on it. The image rendering code is part of the PDFBox library, isn't it? If that is pure Java, then it's not really vulnerable to a stack buffer overflow attack, is it? To me the probability of such an attack is exceedingly small, but I have to convince the security team and I don't have any real knowledge of how that part of the library is implemented.
Are you aware of any other risks and how they can be mitigated when rendering user supplied HTML?
Thanks in advance
Martin