Generation of pdf is too slow for large html #506

Open
Infinity821 opened this issue Jun 22, 2020 · 5 comments

Comments

@Infinity821

I am now using version 1.0.2, but the PDF build still hangs.
The size of the HTML is 13241929.
I have tried many times and increased the heap size to 4G.
My machine is an i5 4460 with 16G RAM.

The test HTML is attached:
test.txt

My code for PDF generation is as follows:

    public byte[] generateFromHtml(String html) throws Exception {
        try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.useFont(getFont(PMingLiU), "PMingLiU");
            builder.useFont(getFont(PMingLiUExtB), "PMingLiU-ExtB");
            builder.useFont(getFont(seguiemj), "Segoe UI Emoji");
            builder.withHtmlContent(html, null);
            builder.useFastMode();
            builder.toStream(byteArrayOutputStream);
            builder.run();
            return byteArrayOutputStream.toByteArray();
        }
    }

Originally posted by @Infinity821 in #180 (comment)

@syjer
Contributor

syjer commented Jun 22, 2020

Hi @Infinity821,

Using the master branch and version 1.0.3, I was able to generate the PDF from the attached test HTML.

Code for PDF generation (note: I was not able to find the correct font for PMingLiU-ExtB, so PMINGLIU.ttf is registered under both names; I don't think it has an effect):

    try (OutputStream os = new FileOutputStream("out.pdf")) {
        PdfRendererBuilder builder = new PdfRendererBuilder();
        builder.useFont(new File("PMINGLIU.ttf"), "PMingLiU");
        builder.useFont(new File("PMINGLIU.ttf"), "PMingLiU-ExtB");
        builder.useFont(new File("seguiemj.ttf"), "Segoe UI Emoji");
        builder.useFastMode();
        builder.withFile(new File("test.html"));
        builder.toStream(os);
        builder.run();
    }

Resulting PDF:
out.pdf

By the way, have you tried version 1.0.3?

(Using a Ryzen 1700 PC with 16 GB RAM, Java 11, default heap configuration; execution time 5716 ms.)

@danfickle
Owner

I've noticed that with heavily mixed-font text, up to 80% of CPU self-time is spent initialising the IllegalArgumentException that PDFBox uses to indicate that the current font does not support the characters passed in. Switching to a canDisplayUpTo-style method could therefore be a large performance gain, but it would require work on PDFBox as well as on this project.

P.S. The figure above is according to VisualVM.

@Infinity821, can you try CPU sampling with VisualVM and post a screenshot of the hotspots?
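
To illustrate the canDisplayUpTo idea mentioned above, here is a minimal sketch that contrasts the exception-based check with a query-based one. `GlyphCoverage` is a hypothetical stand-in for a font's glyph table, not a PDFBox or openhtmltopdf class; the method name and return convention are borrowed from `java.awt.Font#canDisplayUpTo`. How such a method would actually be wired into PDFBox is not covered here.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a font's glyph coverage (an assumption, not a real PDFBox type). */
final class GlyphCoverage {
    private final Set<Integer> supported = new HashSet<>();

    /** Registers every code point of the given string as supported. */
    GlyphCoverage(String supportedChars) {
        supportedChars.codePoints().forEach(supported::add);
    }

    /** Exception-based style: throws on the first unsupported code point. */
    void encodeOrThrow(String text) {
        text.codePoints().forEach(cp -> {
            if (!supported.contains(cp)) {
                throw new IllegalArgumentException(
                        String.format("U+%04X is not available in this font", cp));
            }
        });
    }

    /**
     * Query-based style: returns the char index of the first unsupported
     * code point, or -1 if the whole string is supported. No exception
     * object is created, so no stack trace has to be captured.
     */
    int canDisplayUpTo(String text) {
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!supported.contains(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

A caller could then split a run at the returned index and fall back to another font for the remainder, instead of catching an exception per unsupported character.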

@olivergg

olivergg commented Mar 2, 2021

I have the same issue with some very (very) large HTML files (up to 600 MB). Several of them end in an OOM, so I had to test with smaller files (~22 MB).

I can confirm that many IllegalArgumentExceptions are raised, as seen in the following screenshot (from a JFR recording):
image
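
As background on why so many raised exceptions can hurt: constructing a Java exception normally captures the current stack trace, and that work grows with call depth, which is considerable in a recursive layout. The sketch below only demonstrates this JVM behaviour; `StacklessException` and `StackDepthDemo` are hypothetical names, and whether PDFBox could use such an exception type is an assumption, not something confirmed in this thread.

```java
/** Hypothetical exception that skips stack trace capture (writableStackTrace = false). */
final class StacklessException extends RuntimeException {
    StacklessException(String message) {
        // Arguments: message, cause, enableSuppression, writableStackTrace
        super(message, null, false, false);
    }
}

final class StackDepthDemo {
    /**
     * Recurses 'depth' times, then creates an exception and reports how many
     * stack frames it recorded. A regular exception records the whole call
     * chain; the stackless variant records none.
     */
    static int traceDepthAt(int depth, boolean stackless) {
        if (depth > 0) {
            return traceDepthAt(depth - 1, stackless);
        }
        RuntimeException e = stackless
                ? new StacklessException("unsupported character")
                : new IllegalArgumentException("unsupported character");
        return e.getStackTrace().length;
    }
}
```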

Unfortunately I can't test a larger file due to the memory limitation (-Xmx13g -XX:+UseG1GC).

Here are some other useful metrics:

image

Is there any way to prevent the OOM (even if the generation takes longer)?

@danfickle I'm willing to provide some HTML samples by PM if you need them.

@rudolphi

rudolphi commented Dec 6, 2021

The biggest problem seems to be caused by the numerous zerowidthspace (U+200B) characters inserted for whitespace contained within the HTML. The character is not available in Helvetica, and its width should simply be zero (as the name says). I checked the HTML for any zerowidthspaces that I could remove, but they seem to be inserted internally. 🤷‍♂️

java.lang.IllegalArgumentException: U+200B ('zerowidthspace') is not available in this font Helvetica encoding: WinAnsiEncoding
    at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:427)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
    at org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:364)
    at com.openhtmltopdf.pdfboxout.PdfBoxTextRenderer.getWidth(PdfBoxTextRenderer.java:337)
    at com.openhtmltopdf.layout.Breaker.lambda$doBreakText$1(Breaker.java:526)
    at com.openhtmltopdf.layout.Breaker.doBreakTextWords(Breaker.java:560)
    at com.openhtmltopdf.layout.Breaker.doBreakText(Breaker.java:531)
    at com.openhtmltopdf.layout.Breaker.doBreakText(Breaker.java:317)
    at com.openhtmltopdf.layout.Breaker.breakText(Breaker.java:188)
    at com.openhtmltopdf.layout.InlineBoxing.layoutText(InlineBoxing.java:1126)
    at com.openhtmltopdf.layout.InlineBoxing.startInlineText(InlineBoxing.java:410)
    at com.openhtmltopdf.layout.InlineBoxing.layoutContent(InlineBoxing.java:192)
    at com.openhtmltopdf.render.BlockBox.layoutInlineChildren(BlockBox.java:1227)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1208)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableRowBox.layoutCell(TableRowBox.java:452)
    at com.openhtmltopdf.newtable.TableRowBox.layoutChildren(TableRowBox.java:206)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableRowBox.layout(TableRowBox.java:95)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:103)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.newtable.TableSectionBox.layoutChildren(TableSectionBox.java:137)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableSectionBox.layout(TableSectionBox.java:278)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:109)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.newtable.TableBox.layoutChildren(TableBox.java:316)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.newtable.TableBox.layoutTable(TableBox.java:281)
    at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:240)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:109)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableRowBox.layoutCell(TableRowBox.java:452)
    at com.openhtmltopdf.newtable.TableRowBox.layoutChildren(TableRowBox.java:206)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableRowBox.layout(TableRowBox.java:95)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:103)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.newtable.TableSectionBox.layoutChildren(TableSectionBox.java:137)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableSectionBox.layout(TableSectionBox.java:278)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:103)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.newtable.TableBox.layoutChildren(TableBox.java:316)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.newtable.TableBox.layoutTable(TableBox.java:281)
    at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:240)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:109)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.layout(PdfBoxRenderer.java:346)
    at com.openhtmltopdf.pdfboxout.PdfRendererBuilder.run(PdfRendererBuilder.java:45)
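
The point above, that U+200B should simply contribute zero width, can be sketched as a thin wrapper that drops zero-width spaces before asking the font for a width. This is only an illustration: `ZeroWidthAwareWidth` and the `fontWidth` function are hypothetical, and the real measurement in the trace goes through `PdfBoxTextRenderer.getWidth` and `PDFont.getStringWidth`, which this sketch does not touch.

```java
import java.util.function.ToDoubleFunction;

/** Hypothetical helper; not part of openhtmltopdf or PDFBox. */
final class ZeroWidthAwareWidth {
    static final int ZERO_WIDTH_SPACE = 0x200B;

    /** Returns the text without U+200B; returns the same instance if none is present. */
    static String stripZeroWidthSpaces(String text) {
        if (text.indexOf(ZERO_WIDTH_SPACE) < 0) {
            return text; // fast path: nothing to remove, no allocation
        }
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> cp != ZERO_WIDTH_SPACE)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    /**
     * Measures text with the supplied width function after removing U+200B,
     * so a font lacking that glyph is never asked to encode it.
     */
    static double width(String text, ToDoubleFunction<String> fontWidth) {
        return fontWidth.applyAsDouble(stripZeroWidthSpaces(text));
    }
}
```

Because the stripped character has zero advance by definition, the measured width is unchanged for fonts that do support it, and fonts that do not support it never hit the exception path for this character.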

@mhmmdgamal

I'm trying to convert a 30 MB HTML file and it takes around 50 seconds. Is there any way to improve this?
