Large HTML File conversion to PDF hangs. #180
Comments
Sounds silly to ask, but how much memory are you allocating to your JVM? Try setting a higher limit with -Xmx. When there is not enough RAM, the generator will hang while eating all of your CPU time.
How long does it hang for? Could it be that it is hitting disk to use the swap space? As @dilworks asks, how much memory are you allocating to Java and how much physical memory is available to the machine? Hanging is obviously unacceptable, so I'm keen to get to the bottom of this one. I'll also investigate the memory/disk options of PDFBox (currently the document is constructed completely in memory) and reply here.
Hi, thanks for the replies @dilworks and @danfickle. We have hosted it on an AWS t2.micro instance, where it never finishes (it hangs indefinitely). We have provided the following options: initial JVM heap size: 256m. On my local machine it hangs for more than 20 minutes and eats all the CPU. Free physical memory on my local machine is around 4GB and the heap size is 256m. I will try increasing the heap as @dilworks suggested, but I feel it would be better to construct the PDF directly on disk instead of in memory, which should give better performance. @danfickle, please investigate and implement the solution. Meanwhile I am also investigating the PDFBox options for constructing it on disk and will post if I find something useful.
Hi @dilworks, I tried setting -Xmx to 2048 and it did not resolve the problem; it still hangs. @danfickle, it is hitting the disk for swap space. Please check below.
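When a heap setting seems to make no difference, it can help to confirm that the JVM actually picked it up. The flag is case-sensitive (`-Xmx`, for example `-Xmx2048m`). The following is a minimal sketch that prints the heap limits the running JVM reports; the class name `HeapCheck` and its helper methods are made up for illustration and are not part of the library.

```java
public class HeapCheck {
    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use (allocated minus free), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB):  " + maxHeapMiB());
        System.out.println("Used heap (MiB): " + usedHeapMiB());
    }
}
```

Running it as `java -Xmx2048m HeapCheck` should report a maximum close to the requested value; a much smaller number suggests the option never reached the JVM that does the conversion (for example, because an application server sets its own options).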
+ Allow user to create their own PDDocument with memory settings of their choice. + Fix silly bug in bidi splitter that was taking more than half the time in my sample document (according to VisualVM).
Thanks @rajaningle. I added a builder method to pass in your own PDDocument, which can be configured in its constructor with a MemoryUsageSetting to control how much memory/disk is used by PDFBox. However, in my simple testing with a large document this didn't fix the problem, so I am now profiling with VisualVM to find CPU/memory hogs. I've already found a major CPU hog, as discussed in #170. Thanks for your patience, and hopefully we can get this fixed.
Thanks @danfickle. I was checking the PDFBox options and came across the doc.saveIncremental(outStream) method. Please check whether we can use it and whether it resolves our problem. Thanks.
Hi, today we hit the same issue as @rajaningle when trying to convert an HTML document of about 400 pages with version 0.0.1-RC12. After reading this issue, we tried the SNAPSHOT version. Are you planning to do a new release? Thanks!
Hi @danfickle, I tried with MemoryUsageSetting.setupTempFileOnly() and it did not solve the problem; it is still hogging the CPU/memory.
OK, I generated a large (inline only) document with this code:
// Needs java.io.FileOutputStream, java.io.IOException and java.io.PrintWriter.
private static void createLargeInlineDoc() throws IOException {
    // try-with-resources closes the writer (and the underlying stream) even if writing fails.
    try (PrintWriter pw = new PrintWriter(new FileOutputStream("/Users/me/Documents/pdf-issues/issue-180.htm"))) {
        pw.println("<html>");
        pw.println("<head>");
        pw.println("</head>");
        pw.println("<body>");
        // 100,000 lines of inline-only content.
        for (int i = 0; i < 100000; i++) {
            pw.println("Normal <strong>Bold</strong> <i>Italic</i>");
        }
        pw.println("</body>");
        pw.println("</html>");
    }
}
After fixing the two BIDI performance bugs it is down to 11 seconds on my machine, from a staggering 400 seconds before! Next up in improving performance, according to the profiler, is this monstrosity (finally one that's not mine):
private static String collapseWhitespace(InlineBox iB, IdentValue whitespace, String text, boolean collapseLeading) {
if (whitespace == IdentValue.NORMAL || whitespace == IdentValue.NOWRAP) {
text = linefeed_space_collapse.matcher(text).replaceAll(EOL);
} else if (whitespace == IdentValue.PRE) {
text = space_before_linefeed_collapse.matcher(text).replaceAll(EOL);
}
if (whitespace == IdentValue.NORMAL || whitespace == IdentValue.NOWRAP) {
text = linefeed_to_space.matcher(text).replaceAll(SPACE);
text = tab_to_space.matcher(text).replaceAll(SPACE);
text = space_collapse.matcher(text).replaceAll(SPACE);
} else if (whitespace == IdentValue.PRE || whitespace == IdentValue.PRE_WRAP) {
int tabSize = (int) iB.getStyle().asFloat(CSSName.TAB_SIZE);
char[] tabs = new char[tabSize];
Arrays.fill(tabs, ' ');
text = tab_to_space.matcher(text).replaceAll(new String(tabs));
} else if (whitespace == IdentValue.PRE_LINE) {
text = tab_to_space.matcher(text).replaceAll(SPACE);
text = space_collapse.matcher(text).replaceAll(SPACE);
}
if (whitespace == IdentValue.NORMAL || whitespace == IdentValue.NOWRAP) {
// collapse first space against prev inline
if (text.startsWith(SPACE) &&
collapseLeading) {
text = text.substring(1, text.length());
}
}
return text;
}
Note that text in normal mode goes through four regular expression replaces and a substring. Unless someone else provides a replacement without regular expressions, I'll work on it tomorrow, and then do the release.
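For illustration, here is a rough sketch of what a regex-free, single-pass replacement for the normal and nowrap branches might look like. The patterns used above (linefeed_space_collapse, linefeed_to_space, tab_to_space, space_collapse) are not shown in this thread, so the sketch assumes their net effect in these modes is to turn every run of spaces, tabs and line breaks into one space. The class and method names are hypothetical, and this is not the code that went into the project.

```java
public class WhitespaceCollapse {
    /**
     * Collapses each run of spaces, tabs and line breaks into a single space
     * in one pass, optionally dropping a leading space afterwards.
     * Assumed to approximate the NORMAL / NOWRAP branches only.
     */
    static String collapseNormal(String text, boolean collapseLeading) {
        StringBuilder sb = new StringBuilder(text.length());
        boolean lastWasSpace = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isSpace = c == ' ' || c == '\t' || c == '\n' || c == '\r';

            if (!isSpace) {
                sb.append(c);
                lastWasSpace = false;
            } else if (!lastWasSpace) {
                // First whitespace character of a run: emit exactly one space.
                sb.append(' ');
                lastWasSpace = true;
            }
            // Further whitespace in the same run is skipped.
        }

        // Collapse the first space against the previous inline, if requested.
        if (collapseLeading && sb.length() > 0 && sb.charAt(0) == ' ') {
            sb.deleteCharAt(0);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapseNormal("  Normal \t\n Bold  ", true) + "]");
    }
}
```

The point of the design is that each character is visited once and only one output buffer is allocated, instead of four regex passes that each build an intermediate string.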
I've decided to follow your steps and profile everything on my setup, using one of my RAM-eating please-have-mercy testcases: a rather simple table-based report (complete with headers and footers) that easily gets into the thousands of pages (it's a transaction log report for an entire year, and for a mid-sized customer it goes over 5000 pages). This was the reason why I was forced to fiddle with -Xmx (apparently this flaw was inherited from FS). This report in particular is rather CPU-bound... until it's time to generate the PDF, when my JSF-generated XHTML brings FS/OH to its knees, this time massively eating RAM. A logging statement on this: openhtmltopdf/openhtmltopdf-core/src/main/java/com/openhtmltopdf/css/style/derived/LengthValue.java Line 179 in 65b7faa
...is causing the JBoss/WildFly logging subsystem to go insane and drain a not insignificant slice of CPU time! Leaving my code aside, this single logging call ends up eating almost half of the CPU time. (And if you were wondering: no, I never got my 5000+ page PDF. Profiling makes everything go much slower, plus I was testing with some real data that easily ate the 3GB limit I had set.)
…ons with more performant loop for normal and no-wrap mode white-space settings.
…ERE. SEVERE is too severe for a common warning.
Thanks @dilworks. The only thing I can think of that would cause a slowdown is the fact that it was logging as SEVERE. Could WildFly be set up to do something special with SEVERE log messages? Anyway, I have downgraded it to WARNING to be consistent with other CSS warnings. I also released RC-13, so we'll make the next release focused on performance and memory. Much work is needed to get 5000+ page documents running smoothly!
Loving that couple of fixes. After some quick tests, performance is now on par with FS, and it even beats it a few times with the same 18-page test doc I had attached. But then, that's just the beginning. Thank you very much for the improvements @danfickle!
Thanks @danfickle. There is some performance improvement with the current fixes, but it still hangs while generating huge documents of 5000+ pages, so I am hoping for further improvement there. I have this open defect and need to resolve it ASAP, because all the documents we export are huge and the functionality breaks while generating a huge PDF. Please see if you can find a solution to this hang.
@rajaningle This may not solve your problem, but for those 5000+ pages you have many DOM nodes in memory, and you need a lot of memory for the DOM nodes alone. You could try eXist-db to address this memory problem. eXist-db allows you to store large amounts of XML in a persistent file. It also allows you to query it very quickly using XQuery (this is what I used eXist-db for in another project ten years ago...). All of its nodes also implement org.w3c.dom. Something like this could work:
If I understand the documentation correctly, XMLResource.getContentAsDOM() gives you the content as org.w3c.dom nodes that are lazily loaded from the database, so that only the nodes needed at a given time are in memory. You could then feed the DOM into the PdfRendererBuilder using withW3cDocument(). I cannot guarantee that this will work correctly and really reduce the memory pressure, but it is at least something you could try.
I've just created a testcase for this problem, see #194. On my MacBook Pro 2014 (16 GB RAM), using JDK 1.7.0_52, it takes 5m 49s to convert an 18.5 MB HTML file into a 232 MB PDF with 12694 pages. That is, I cannot reproduce the problem; it works for me. @rajaningle, what JDK are you using on what OS? Please look at the testcase; it only contains text and tables. What other stuff are you using in your report?
Managed to find some time to test one of my huge files. I'm attaching a sample (a couple of pages with test data; the real production report only has longer strings and bigger numbers, but nothing else) so you guys can check the layout. It's rather simple, as I've already said, yet it can easily grow in size since the report is a transaction log: with one of my datasets it generates a ~4300-page PDF. I'm testing with -Xmx3g (my laptop only has 6GB RAM; thankfully our production setups either never have enough data to push things to the limits, or have at least 8GB dedicated to our app).
Test case here: ...thankfully that's the only heavyweight report in my app (and the least used one, yet... due to $REGULATIONS our customers have to generate it at least a couple of times per year). The first run with an empty transaction log easily goes over 100 pages (one per department).
I've integrated parts of your sample in #194 and just wrapped a big freemarker loop around it. Something strange is going on here in the bidi splitter. There is one ParagraphSplitter$Paragraph object with 2.2 million entries in the textRuns hash map... this seems to be the root object? At least it seems strange to me that one Paragraph object can have so many entries. @danfickle, you should be able to investigate that in my #194 pull request. Did you disable the logging with XRLog.setLoggingEnabled(false)? The logging causes some overhead even if the logger does not write the log messages anywhere, because the messages are generated anyway.
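The point about log messages being generated even when nothing is written can be shown with plain java.util.logging. The sketch below is an assumption-laden illustration and does not use the library's XRLog: the class `LazyLoggingDemo`, its counter and its methods are invented for the demo. It compares building the message eagerly with passing a `Supplier`, which the logger only calls if the level is enabled.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLoggingDemo {
    // Counts how many times a log message was actually constructed.
    static int built = 0;

    static String expensiveMessage(int i) {
        built++;
        return "Laid out box " + i;
    }

    /** Message is built before the logger can decide to discard it. */
    static int countEager(Logger log, int n) {
        built = 0;
        for (int i = 0; i < n; i++) {
            log.log(Level.FINE, expensiveMessage(i));
        }
        return built;
    }

    /** Message is built only if the level is loggable (Supplier overload, Java 8+). */
    static int countLazy(Logger log, int n) {
        built = 0;
        for (int i = 0; i < n; i++) {
            final int k = i;
            log.log(Level.FINE, () -> expensiveMessage(k));
        }
        return built;
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        log.setLevel(Level.OFF); // nothing is ever written
        System.out.println("eager messages built: " + countEager(log, 1000));
        System.out.println("lazy messages built:  " + countLazy(log, 1000));
    }
}
```

With logging switched off, the eager variant still pays for every string concatenation, while the lazy variant builds none, which is the kind of overhead described above.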
Regarding the bidi splitter: it currently defines a paragraph as a block element. It should define a paragraph as anything block-like, for example a table cell. I meant to make this trivial fix in RC-13 but somehow forgot. @dilworks
Testcase big document #180 with perf improvements.
…splitter. Also define a paragraph as anything block-like or out-of-flow.
I’ve been thinking about the painting side. The core algorithm is:
This leads to a method call count of |
Also beginning of style caching with which conditions will disable it.
…w page. Tests that rotated text on overflow page entirely clipped out by the page margin should not generate an overflow page as such page will be visually empty.
…ng a larger replaced text. On two vertical pages and one overflow page.
… does not generate a horizontal overflow page.
…oes not generate a horizontal overflow page.
…T output table header, footer or caption on every page.
…le header and footer on every page (but caption only on first page).
… them) despite being in overflow hidden containers. Also enable test demonstrating this scenario.
…overflow pages. Plus enable test from previous commit with this scenario.
… [ci skip] Tests that a nested float in a fixed element renders correctly. Appears to be an ordering issue of the layers as if header and footer swap element order everything works.
Some boxes can be laid out many times (to satisfy page constraints, for example). If this happens we just mark our old layer for deletion and create a new layer. Not sure this is right, but it doesn't break any correct tests. Yes, this is a particularly hackish solution. This fix also brought up the correct response for the positioning-absolute test, so I altered the html to match the expected output. (I hadn't noticed the missing box when I committed the expected test result previously.)
…o the sum of its child boxes using border-box sizing.
Also cleaned up ContentFunctionFactory class.
With leader function, attr function, target-counter function and overflow page in the middle.
Yeah, RC18 is finally released with a usable fast renderer.
I am now using version 1.0.2, but the PDF build still hangs. The test HTML is attached. My code for PDF generation is as follows:
Hi,
I am trying to convert a large HTML file of approximately 600 pages; the conversion does not complete and hangs.
The following is my observation after debugging the core.
The PdfRendererBuilder class has the following method call.
2. renderer.createPDF(); // This action is not completing its execution and hangs the process.
When I looked into it, renderer.createPDF() builds the entire PDF in memory (document) and only starts writing to the OutputStream after completion.
Can we write it directly to the OutputStream page by page? I think this might solve the problem.
The following is my code snippet; please check whether I am doing anything wrong here.
In the above code snippet, builder.run() does not complete and the process hangs.
Please help me with the solution.
Thanks in advance.