Creating PDF/A-3 document raises NPE #401

Open
ivanbogicevickg opened this issue Oct 31, 2019 · 10 comments

@ivanbogicevickg

When I try to generate a PDF/A-3 document from HTML, I get the following exception:

Exception in thread "main" java.lang.NullPointerException
	at org.apache.pdfbox.cos.COSArray.add(COSArray.java:62)
	at com.openhtmltopdf.pdfboxout.PdfBoxAccessibilityHelper.finishNumberTree(PdfBoxAccessibilityHelper.java:744)
	at com.openhtmltopdf.pdfboxout.PdfBoxFastOutputDevice.finish(PdfBoxFastOutputDevice.java:875)
	at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.writePDFFast(PdfBoxRenderer.java:661)
	at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPdfFast(PdfBoxRenderer.java:550)
	at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPDF(PdfBoxRenderer.java:468)
	at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPDFWithoutClosing(PdfBoxRenderer.java:395)
	at com.dm.reviscan.emails.EmailToPDF.main(EmailToPDF.java:90)

This is the code snippet I'm using to generate the PDF:

PdfRendererBuilder builder = new PdfRendererBuilder();
builder.usePDDocument(pdfDoc);
builder.withW3cDocument(new W3CDom().fromJsoup(htmlDoc), outFile.toURI().toURL().toString());
builder.useFastMode();
builder.useDefaultPageSize(210, 297, BaseRendererBuilder.PageSizeUnits.MM);
builder.useHttpStreamImplementation(new OkHttpStreamFactory());
builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
builder.usePdfVersion(1.5f);
builder.usePdfUaAccessbility(false);
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "ArialMT");
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "Arial-BoldMT");
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "Times-Roman");
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "Times-Bold");
try (InputStream colorProfile = EmailToPDF.class.getResourceAsStream("/sRGB.icc")) {
  byte[] colorProfileBytes = IOUtils.toByteArray(colorProfile);
  builder.useColorProfile(colorProfileBytes);
}

If I comment out builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A); the document is generated, but without PDF/A conformance, which is not what I want.

@ivanbogicevickg ivanbogicevickg changed the title Create PDF/A-3 document raises NPE Creating PDF/A-3 document raises NPE Oct 31, 2019
@danfickle
Owner

Hi @ivanbogicevickg,

It looks like the renderer is trying to create a structure element without a parent. This is a bug, but it is very hard to track down without a (hopefully minimal) sample HTML document.

The culprit should be something that is not used in the sample document. I hope that helps to narrow down the offending tag.

@ivanbogicevickg
Author

Hi @danfickle,

Here is the smallest HTML sample I could produce:
email-attach-1.txt
This only happens when I turn on PDF/A conformance; otherwise the PDF is generated without a problem.

@danfickle
Owner

Hi @ivanbogicevickg,

I couldn't replicate this, even going back to the version 1 release. Here is the code I used:

    public static void main(String... args) throws Exception {
        PdfRendererBuilder builder = new PdfRendererBuilder();
        File inFile = new File("/Users/me/Documents/pdf-issues/issue-401.htm");
        org.jsoup.nodes.Document doc = Jsoup.parse(inFile, "UTF-8");
        builder.withW3cDocument(new W3CDom().fromJsoup(doc), inFile.toURI().toURL().toString());
        // DON'T DO THIS (not closing stream): Throw-away code ahead!
        builder.toStream(new FileOutputStream("/Users/me/Documents/pdf-issues/output/issue-401.pdf"));
        builder.useFastMode();
        builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
        builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
        builder.usePdfVersion(1.5f);
        builder.useFont(new File("/Users/me/Documents/pdf-issues/fonts/JustAnotherHand.ttf"), "default");
        builder.run();
    }

and the only change I made to the HTML was to add: style="font-family: 'default';" to the body element. Are you perhaps using a stylesheet that is changing things?

@swarl

swarl commented Apr 6, 2020

Hi @danfickle

I can reproduce the error with the following HTML, which comes from our production server. It is generated by Domino and has some styles added on our side:

<html>
  <head>
    <meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type">
    <style>
      * {
        font-family: 'Liberation Sans';
      }
    </style>
  </head>
  <body>
    <ul><b>To:</b></ul>
  </body>
</html>

Noto Fonts: https://www.google.com/get/noto/

builder.withW3cDocument(new W3CDom().fromJsoup(Jsoup.parse(htmlContent)), "");
[...]
builder.useFont(() -> PdfRenderer.class.getClassLoader().getResourceAsStream("org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"),
            "Liberation Sans");

Problem: the ul tag has no li children. The NPE occurs when any font is combined with this (in my opinion invalid) HTML. I have no clue what is wrong with this combination.
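
The condition described here, a list element whose element children are not li, can be checked on the W3C DOM before it is handed to the renderer. The following is a minimal sketch using only the JDK's org.w3c.dom API. The class and method names are hypothetical and are not part of openhtmltopdf, and the sketch assumes tag names without a namespace prefix.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of openhtmltopdf.
final class ListChildCheck {

    /** Returns every ul/ol element that has an element child other than li. */
    static List<Element> findListsWithStrayChildren(Document doc) {
        List<Element> offenders = new ArrayList<>();
        Element root = doc.getDocumentElement();
        if (root != null) {
            collect(root, offenders);
        }
        return offenders;
    }

    private static void collect(Element element, List<Element> offenders) {
        boolean isList = hasName(element, "ul") || hasName(element, "ol");
        boolean reported = false;
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (!(child instanceof Element)) {
                continue; // this sketch ignores text and comment nodes
            }
            Element childElement = (Element) child;
            if (isList && !reported && !hasName(childElement, "li")) {
                offenders.add(element); // report each list only once
                reported = true;
            }
            collect(childElement, offenders);
        }
    }

    // Assumption: tag names carry no namespace prefix.
    private static boolean hasName(Element element, String name) {
        return name.equalsIgnoreCase(element.getTagName());
    }
}
```

For the HTML above, passing the document produced by new W3CDom().fromJsoup(...) should report the single ul element, since its only element child is b.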

Is this fixable?

Thanks and Greetings

@swarl

swarl commented Apr 7, 2020

Workaround: turn off fast mode

//builder.useFastMode();

With this change, the PDF is generated correctly.

@danfickle
Owner

Hi @swarl ,

Thanks for the reproducible example. However, I'm not sure what to do with this one. As you suggest, the HTML is incorrect, and a pretty explicit log message is provided:

com.openhtmltopdf.general WARNING:: Trying to add incompatible child to parent item: child type=GenericStructualElement, parent type=ListStructualElement, expected child type=ListItemStructualElement. Document will not be PDF/UA compliant.

The question in my mind is: given the resources working on this project (i.e. not much), is it reasonable to try to produce a document from any number of combinations of invalid input?

In this case, we could produce a visually valid PDF, but it would not be PDF/A 3a valid as it would not have the structure tree required by screen readers and the standard.

P.S-1 Adding the font causes the NPE to trigger as without the font, no text can be output. Remember, the PDF/A standards disable the built-in fonts.

P.S-2 Turning off fast mode means the document will not be PDF/A compliant as it is only implemented in the newer fast renderer.

@swarl

swarl commented Apr 7, 2020

Hi @danfickle

One option would be to remove the faulty tag, but not its content so that I would end up with

<body>
    <b>To:</b>
</body>

In my eyes, that would be a better solution than not rendering anything.
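
Until something like this exists in the library, the same unwrapping can be approximated as a pre-processing step on the W3C DOM before it is passed to builder.withW3cDocument. This is only a sketch under stated assumptions: the class and method names are hypothetical, tag names are assumed to carry no namespace prefix, and only lists with no li child at all are unwrapped (an empty list is simply removed).

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical pre-processing helper, not part of openhtmltopdf.
final class ListUnwrapper {

    /** Replaces each ul/ol that has no li child with its own children; returns how many were unwrapped. */
    static int unwrapListsWithoutItems(Document doc) {
        // Copy the matches first: the NodeList is live and changes while the tree is edited.
        List<Element> targets = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (isList(element) && !hasListItem(element) && element.getParentNode() instanceof Element) {
                targets.add(element);
            }
        }
        for (Element list : targets) {
            Node parent = list.getParentNode();
            while (list.getFirstChild() != null) {
                parent.insertBefore(list.getFirstChild(), list); // moves the child, keeping order
            }
            parent.removeChild(list);
        }
        return targets.size();
    }

    private static boolean isList(Element element) {
        String name = element.getTagName();
        return "ul".equalsIgnoreCase(name) || "ol".equalsIgnoreCase(name);
    }

    private static boolean hasListItem(Element list) {
        for (Node child = list.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element && "li".equalsIgnoreCase(((Element) child).getTagName())) {
                return true;
            }
        }
        return false;
    }
}
```

Lists that mix li and other children are left untouched here, because unwrapping them would also discard a list that is partly valid.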

One question about turning off fast mode: why does Acrobat Reader still tell me that the file is PDF/A when it is not?
image

Thanks for your work. I appreciate it very much.
Joe

@swarl

swarl commented Apr 7, 2020

Another "solution" could be to change the parent tag into <div...>, which can take any content.

@swarl

swarl commented Apr 10, 2020

Hi @danfickle
For me, logging a message when the logic will later throw an NPE is not really transparent behavior.

What about a straightforward solution:

        @Override
        void addChild(AbstractTreeItem child) {
            if (child instanceof ListItemStructualElement) {
                listItems.add((ListItemStructualElement) child);
            } else {
                ListItemStructualElement listItemStructualElement = new ListItemStructualElement();
                listItemStructualElement.addChild(child);
                listItems.add(listItemStructualElement);
                logIncompatibleChild(this, child, ListItemStructualElement.class);
            }
        }
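
The same idea can also be applied on the caller's side, to the input DOM rather than to the structure tree. The sketch below wraps every non-li element child of a ul or ol in a new li before rendering. It is an illustration only: the class and method names are hypothetical, tag names are assumed to carry no namespace prefix, and reusing the list's namespace URI for the new li is an assumption about what the renderer expects.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical pre-processing helper, not part of openhtmltopdf.
final class ListItemWrapper {

    /** Wraps each non-li element child of a ul/ol in its own li; returns the number of wrapped children. */
    static int wrapStrayListChildren(Document doc) {
        // Collect first, because the NodeList is live and inserting li elements would change it.
        List<Element> strays = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            Node parent = element.getParentNode();
            if (parent instanceof Element && isList((Element) parent) && !hasName(element, "li")) {
                strays.add(element);
            }
        }
        for (Element stray : strays) {
            Node list = stray.getParentNode();
            // Assumption: the new li should live in the same namespace as the list element.
            Element li = doc.createElementNS(list.getNamespaceURI(), "li");
            list.replaceChild(li, stray); // li takes the stray child's position
            li.appendChild(stray);        // the stray child becomes the li's content
        }
        return strays.size();
    }

    private static boolean isList(Element element) {
        return hasName(element, "ul") || hasName(element, "ol");
    }

    private static boolean hasName(Element element, String name) {
        return name.equalsIgnoreCase(element.getTagName());
    }
}
```

Like the addChild proposal, this creates one list item per stray child, so several adjacent stray elements end up as separate items.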

Greetings
Joe

@swarl

swarl commented Apr 11, 2020

Or: you don't try to handle broken HTML at all and just throw a meaningful exception.
You could of course add some advice as well. I just tried some options for repairing the broken HTML and found a (barely maintained) project on GitHub: https://github.com/jtidy/jtidy

        <dependency>
            <groupId>com.github.jtidy</groupId>
            <artifactId>jtidy</artifactId>
            <version>1.0.2</version>
        </dependency>

        try (InputStream targetStream = new ByteArrayInputStream(htmlContent.getBytes(StandardCharsets.UTF_8));
             ByteArrayOutputStream destinationStream = new ByteArrayOutputStream()) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_2_A);
            builder.useFont(() -> AppTest.class.getClassLoader().getResourceAsStream("org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"),
                    "Liberation Sans");

            Tidy tidy = new Tidy();
            tidy.setDropProprietaryTags(false);
            tidy.setInputEncoding(StandardCharsets.UTF_8.name());

            tidy.parse(targetStream, destinationStream);
            String cleanedHtml = destinationStream.toString(StandardCharsets.UTF_8.name());

            builder.withW3cDocument(new W3CDom().fromJsoup(Jsoup.parse(cleanedHtml)), "");

Adding this to the documentation and referencing it in the exception message would be a valid solution for me.

So, in summary, I see four options:

  1. Throw a meaningful exception stating that the HTML is broken
  2. Option 1, plus some advice in the documentation on how to fix the HTML
  3. Fix the broken HTML by adding the missing parent tag in code (see my earlier comment in this issue)
  4. Option 3, with a switch to enable this behavior if desired
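
As an illustration of options 1 and 2, here is a sketch of a fail-fast check that a caller could run on the W3C DOM before rendering. It is not how openhtmltopdf behaves today. The class name, method name, and message wording are hypothetical, and tag names are assumed to carry no namespace prefix.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical validator, not part of openhtmltopdf.
final class ListStructureValidator {

    /** Throws IllegalArgumentException for the first ul/ol child element that is not an li. */
    static void requireValidLists(Document doc) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            Node parent = element.getParentNode();
            if (parent instanceof Element && isList((Element) parent) && !hasName(element, "li")) {
                throw new IllegalArgumentException(
                        "Invalid HTML at " + pathOf(element) + ": only <li> may be a direct child of <"
                        + ((Element) parent).getTagName() + ">. Repair the markup before rendering,"
                        + " for example by wrapping the element in <li>.");
            }
        }
    }

    // Builds a readable path such as "html > body > ul > b".
    private static String pathOf(Element element) {
        Deque<String> parts = new ArrayDeque<>();
        for (Node node = element; node instanceof Element; node = node.getParentNode()) {
            parts.addFirst(((Element) node).getTagName());
        }
        return String.join(" > ", parts);
    }

    private static boolean isList(Element element) {
        return hasName(element, "ul") || hasName(element, "ol");
    }

    private static boolean hasName(Element element, String name) {
        return name.equalsIgnoreCase(element.getTagName());
    }
}
```

Including the element path in the message is a design choice: it lets the caller locate the offending markup without having to correlate the exception with an earlier log line.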

Happy Easter :-)
Joe
