invalid output PDF for input with incomplete CIDsets embedded in PDXObjects #659

mr-mister123 · 2025-01-09T16:23:31Z

Hi,

we create PDFs in the following way: we generate invoice PDFs (PDF/A-1b) using JasperReports. After that we combine them with a background-layer PDF that contains the company logo and other information such as an imprint. This 'background PDF' is also a valid PDF/A-1b file.
Both files can be verified successfully using veraPDF. We combine them using the org.apache.pdfbox.multipdf.Overlay class of PDFBox. The result can also be verified successfully.

When adding the XML to build a valid ZUGFeRD PDF using Mustang, the output is no longer valid. It has to do with the CIDSet, as mentioned in issue #249.
The problem is that the Overlay class combines both PDFs by wrapping the layer in a PDXObject. The patch supplied for #249 only scans for fonts in the root dictionary.

Fonts embedded in PDXObjects are not scanned.
Replacing ZUGFeRDExporterFromA3.removeCidSet(PDDocument) with these two functions will fix the problem:

	private void removeCidSet(PDDocument doc) throws IOException {
		// https://github.com/ZUGFeRD/mustangproject/issues/249

		COSName cidSet = COSName.getPDFName("CIDSet");
		COSName resources = COSName.getPDFName("Resources");

		// iterate over all PDF pages
		for (PDPage page : doc.getPages()) {
			PDResources res = page.getResources();
			if (res == null) {
				// page without resources: nothing to scan
				continue;
			}

			// check for fonts in the resources of PDXObjects used by the page
			for (COSName xObjectName : res.getXObjectNames()) {
				PDXObject xObject = res.getXObject(xObjectName);
				if (xObject == null) {
					continue;
				}
				COSDictionary d = xObject.getCOSObject().getCOSDictionary(resources);
				if (d != null) {
					PDResources xr = new PDResources(d);
					removeCIDSetFromPDResources(cidSet, xr);
				}
			}

			// check for fonts in the page resources themselves
			removeCIDSetFromPDResources(cidSet, res);
		}
	}

	private void removeCIDSetFromPDResources(COSName cidSet, PDResources res) throws IOException {
		for (COSName fontName : res.getFontNames()) {
			PDFont pdFont = res.getFont(fontName);
			if (pdFont instanceof PDType0Font) {
				PDType0Font typedFont = (PDType0Font) pdFont;

				// only composite fonts with a CIDFontType2 descendant are handled here
				if (typedFont.getDescendantFont() instanceof PDCIDFontType2) {
					PDFontDescriptor fontDescriptor = pdFont.getFontDescriptor();
					if (fontDescriptor != null) {
						// drop the (possibly incomplete) CIDSet entry from the descriptor
						fontDescriptor.getCOSObject().removeItem(cidSet);
					}
				}
			}
		}
	}
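
The patch above looks one level deep: page resources, plus the resources of XObjects placed directly on the page. As a rough illustration of the traversal idea only, here is a library-free sketch that models PDF dictionaries as plain Java maps and follows nested XObject resources recursively. The class `CidSetModel`, its helper `dict`, and the map layout are hypothetical stand-ins for illustration and are not PDFBox or Mustang API. The recursion into nested XObjects goes beyond what the patch does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model: PDF dictionaries are represented as nested maps.
public class CidSetModel {

    /** Builds a mutable dictionary from alternating key/value arguments. */
    public static Map<String, Object> dict(Object... keyValues) {
        Map<String, Object> map = new HashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            map.put((String) keyValues[i], keyValues[i + 1]);
        }
        return map;
    }

    /** Removes every "CIDSet" entry reachable from the given resources; returns the count. */
    public static int removeCidSets(Map<String, Object> resources) {
        // identity-based set guards against resource dictionaries that reference each other
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        return removeCidSets(resources, visited);
    }

    @SuppressWarnings("unchecked")
    private static int removeCidSets(Map<String, Object> resources, Set<Object> visited) {
        if (resources == null || !visited.add(resources)) {
            return 0;
        }
        int removed = 0;

        // fonts declared directly in this resources dictionary
        Map<String, Object> fonts =
                (Map<String, Object>) resources.getOrDefault("Font", Collections.emptyMap());
        for (Object font : fonts.values()) {
            Map<String, Object> descriptor =
                    (Map<String, Object>) ((Map<String, Object>) font).get("FontDescriptor");
            if (descriptor != null && descriptor.containsKey("CIDSet")) {
                descriptor.remove("CIDSet");
                removed++;
            }
        }

        // fonts hidden inside the resources of XObjects, at any nesting depth
        Map<String, Object> xObjects =
                (Map<String, Object>) resources.getOrDefault("XObject", Collections.emptyMap());
        for (Object xObject : xObjects.values()) {
            Object inner = ((Map<String, Object>) xObject).get("Resources");
            removed += removeCidSets((Map<String, Object>) inner, visited);
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Object> overlayResources =
                dict("Font", dict("F2", dict("FontDescriptor", dict("CIDSet", "stream"))));
        Map<String, Object> pageResources = dict(
                "Font", dict("F1", dict("FontDescriptor", dict("CIDSet", "stream"))),
                "XObject", dict("X1", dict("Resources", overlayResources)));
        System.out.println("CIDSet entries removed: " + removeCidSets(pageResources));
    }
}
```

In this model an image XObject is simply an entry without a "Resources" key, so it is skipped, which mirrors the `d != null` check in the patch.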

It would be nice if you could integrate this patch into the Mustang project.

Thanks and greetz,
Karsten


CommanderRaiker · 2025-01-16T16:23:51Z

Hello,

I have unfortunately also encountered validation errors in the PDF part with embedded CID fonts. These errors occur when the PDF document contains certain special characters. For example, these characters could include: ÓŁĄŚ–.

Here is my process: I convert the PDF to PDF/A using Ghostscript. According to the Vera Online tool, the resulting file is still valid at this time. However, after attaching the XML with Mustang, the file is no longer valid. Mustang responds with multiple errors for the PDF part, such as:

TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.2.11.7.2, testNumber=1], status=failed, message=The Font dictionary of all fonts shall define the map of all used character codes to Unicode values, either via a ToUnicode entry, or other mechanisms as defined in ISO 19005-3, 6.2.11.7.2, location=Location [level=CosDocument, context=root/document[0]/pages[0](41 0 obj PDPage)/contentStream[0](49 0 obj PDContentStream)/operators[328]/usedGlyphs[1](RKFOLQ+Arial RKFOLQ+Arial 21 0 0 false)], locationContext=null, errorMessage=The glyph can not be mapped to Unicode]

If the code adjustment proposed by mr-mister123 works, I would be very happy to see it included in a new version.

thanks and greetings
Raik

mr-mister123 · 2025-01-31T10:54:40Z

@CommanderRaiker

I don't think my patch will fix your issue. With my patch, Mustang removes incomplete CIDSets not only from the root document but also from embedded XObjects. Your problem, however, doesn't seem to be related to incomplete CIDSets but to missing entries in the Unicode mapping of embedded fonts.

Since you are using Ghostscript, I assume that the source PDF you pass to the Mustang library is of type PDF/A-1b. What Mustang does is change the file type to PDF/A-3b, as this is required when embedding files into a PDF. This is basically done by just changing the header of the PDF file.
But PDF/A-3b is stricter with embedded fonts than PDF/A-1b. That's why Mustang removes incomplete CIDSets while converting...
If the input PDF does not conform to PDF/A-3b, it will never give you a valid output. The input only seems valid when you validate it with veraPDF because the file Ghostscript created declares itself to be PDF/A-1b, and veraPDF only checks whether it conforms to PDF/A-1b, which it does...
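
To check which PDF/A part an input file declares, one option is to look at the `pdfaid:part` property in its XMP metadata. The following is a minimal sketch, assuming the XMP packet has already been extracted as a string; the class and method names are hypothetical, and the regular expression is a simplification rather than a full XMP parser.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the declared PDF/A part from XMP text.
public class DeclaredPdfAPart {

    // Matches both serializations of the property:
    //   attribute form: pdfaid:part="1"
    //   element form:   <pdfaid:part>1</pdfaid:part>
    private static final Pattern PART = Pattern.compile(
            "pdfaid:part(?:\\s*=\\s*[\"'](\\d)[\"']|\\s*>\\s*(\\d)\\s*<)");

    /** Returns the declared PDF/A part (1, 2, 3, ...) or empty if none is declared. */
    public static OptionalInt declaredPart(String xmp) {
        Matcher m = PART.matcher(xmp);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        String digit = m.group(1) != null ? m.group(1) : m.group(2);
        return OptionalInt.of(Integer.parseInt(digit));
    }

    public static void main(String[] args) {
        String xmp = "<rdf:Description><pdfaid:part>1</pdfaid:part>"
                + "<pdfaid:conformance>B</pdfaid:conformance></rdf:Description>";
        System.out.println("Declared PDF/A part: " + declaredPart(xmp));
    }
}
```

A file that declares part 1 here is one that a validator would check against PDF/A-1 rules, so passing that check says nothing about whether it would also pass as PDF/A-3b.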

Can you provide sample-files?

mr-mister123 linked a pull request Jan 31, 2025 that will close this issue

remove CIDSets from XObjects #721

Open
