invalid output PDF for input with incomplete CIDsets embedded in PDXObjects #659

Open
mr-mister123 opened this issue Jan 9, 2025 · 2 comments · May be fixed by #721

Comments

@mr-mister123
Contributor

Hi,

we create PDFs in the following way: We create invoice PDFs (PDF/A-1b) using JasperReports. After that we combine them with a background-layer PDF that contains the company logo and other information such as an imprint. This 'background PDF' is also a valid PDF/A-1b file.
Both files can be verified successfully using veraPDF. We combine them using the org.apache.pdfbox.multipdf.Overlay class of PDFBox. The result can also be verified successfully.

When we add the XML using Mustang to build a valid ZUGFeRD PDF, the output is not valid anymore. It has to do with the CIDSet, as mentioned in issue #249.
The problem is that the Overlay class combines both PDFs by wrapping the layer in a PDXObject. The patch supplied for #249 only scans for fonts in the root dictionary.

Fonts embedded in PDXObjects are not scanned.
Replacing ZUGFeRDExporterFromA3.removeCidSet(PDDocument) with these two functions will fix the problem:

	private void removeCidSet(PDDocument doc) throws IOException {
		// https://github.com/ZUGFeRD/mustangproject/issues/249

		COSName cidSet = COSName.getPDFName("CIDSet");
		COSName resources = COSName.getPDFName("Resources");

		// iterate over all pdf pages
		for (PDPage page : doc.getPages()) {
			PDResources res = page.getResources();
			if (res == null) {
				// page without a resource dictionary: nothing to clean up
				continue;
			}

			// Check for fonts in PDXObjects (e.g. the layer wrapped by Overlay):
			for (COSName xObjectName : res.getXObjectNames()) {
				PDXObject xObject = res.getXObject(xObjectName);
				COSDictionary d = xObject.getCOSObject().getCOSDictionary(resources);
				if (d != null) {
					PDResources xr = new PDResources(d);
					removeCIDSetFromPDResources(cidSet, xr);
				}
			}

			// Check for fonts in the page resources:
			removeCIDSetFromPDResources(cidSet, res);
		}
	}

	private void removeCIDSetFromPDResources(COSName cidSet, PDResources res) throws IOException {
		for (COSName fontName : res.getFontNames()) {
			PDFont pdFont = res.getFont(fontName);
			if (pdFont instanceof PDType0Font) {
				PDType0Font typedFont = (PDType0Font) pdFont;

				// only composite fonts with a CIDFontType2 descendant are handled
				if (typedFont.getDescendantFont() instanceof PDCIDFontType2) {
					PDFontDescriptor fontDescriptor = pdFont.getFontDescriptor();
					if (fontDescriptor != null) {
						// drop the (possibly incomplete) CIDSet entry
						fontDescriptor.getCOSObject().removeItem(cidSet);
					}
				}
			}
		}
	}
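
The patch above looks one level deep: page resources plus the resources of XObjects referenced directly by the page. If an XObject itself contains further XObjects, the same idea could be applied recursively. Below is a minimal sketch of that traversal. It is an assumption-laden model, not PDFBox code: PDF dictionaries are represented as plain nested `Map`s, and the class and method names (`NestedResourcesSketch`, `removeCidSets`, `resourcesWithFont`) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for PDF dictionaries: plain nested maps, NOT the PDFBox API.
final class NestedResourcesSketch {

    /**
     * Removes the "CIDSet" entry from every font descriptor reachable from the
     * given resource dictionary, descending into the resources of nested
     * XObjects. Returns the number of entries removed.
     */
    @SuppressWarnings("unchecked")
    static int removeCidSets(Map<String, Object> resources) {
        int removed = 0;

        // fonts declared directly in this resource dictionary
        Map<String, Object> fonts = (Map<String, Object>) resources.get("Font");
        if (fonts != null) {
            for (Object font : fonts.values()) {
                Map<String, Object> descriptor =
                        (Map<String, Object>) ((Map<String, Object>) font).get("FontDescriptor");
                if (descriptor != null && descriptor.remove("CIDSet") != null) {
                    removed++;
                }
            }
        }

        // fonts hidden in the resources of nested XObjects
        Map<String, Object> xObjects = (Map<String, Object>) resources.get("XObject");
        if (xObjects != null) {
            for (Object xObject : xObjects.values()) {
                Map<String, Object> nested =
                        (Map<String, Object>) ((Map<String, Object>) xObject).get("Resources");
                if (nested != null) {
                    removed += removeCidSets(nested);
                }
            }
        }
        return removed;
    }

    // Hypothetical helper: a resource dictionary with one font that carries a CIDSet.
    static Map<String, Object> resourcesWithFont(String fontName) {
        Map<String, Object> descriptor = new HashMap<>();
        descriptor.put("CIDSet", "stream");
        Map<String, Object> font = new HashMap<>();
        font.put("FontDescriptor", descriptor);
        Map<String, Object> fonts = new HashMap<>();
        fonts.put(fontName, font);
        Map<String, Object> resources = new HashMap<>();
        resources.put("Font", fonts);
        return resources;
    }

    public static void main(String[] args) {
        // page with its own font F1 and an overlay XObject that brings font F2
        Map<String, Object> page = resourcesWithFont("F1");
        Map<String, Object> overlay = new HashMap<>();
        overlay.put("Resources", resourcesWithFont("F2"));
        Map<String, Object> xObjects = new HashMap<>();
        xObjects.put("OL0", overlay);
        page.put("XObject", xObjects);

        System.out.println(removeCidSets(page)); // prints 2
    }
}
```

A scan that only looks at the page-level `Font` entry would find one CIDSet here and miss the one inside the overlay. Note that this sketch does not guard against cyclic references, which a version working on real PDF objects would probably need.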

It would be nice if you could integrate this patch into the Mustang project.

Thanks and greetz,
Karsten

@CommanderRaiker

Hello,

I have unfortunately also encountered validation errors in the PDF part with embedded CID fonts. These errors occur when the PDF document contains certain special characters. For example, these characters could include: ÓŁĄŚ–.

Here is my process: I convert the PDF to PDF/A using Ghostscript. According to the Vera Online tool, the resulting file is still valid at this point. However, after attaching the XML with Mustang, the file is no longer valid. Mustang reports multiple errors for the PDF part, such as:

TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.2.11.7.2, testNumber=1], status=failed, message=The Font dictionary of all fonts shall define the map of all used character codes to Unicode values, either via a ToUnicode entry, or other mechanisms as defined in ISO 19005-3, 6.2.11.7.2, location=Location [level=CosDocument, context=root/document[0]/pages[0](41 0 obj PDPage)/contentStream[0](49 0 obj PDContentStream)/operators[328]/usedGlyphs[1](RKFOLQ+Arial RKFOLQ+Arial 21 0 0 false)], locationContext=null, errorMessage=The glyph can not be mapped to Unicode]

If the code adjustment proposed by mr-mister123 works, I would be very happy to see it included in a new version.

thanks and greetings
Raik

@mr-mister123 mr-mister123 linked a pull request Jan 31, 2025 that will close this issue
@mr-mister123
Contributor Author

@CommanderRaiker

I don't think that my patch will fix your issue. With my patch, Mustang removes incomplete CIDSets not only from the root document but also from embedded XObjects. Your problem, however, doesn't seem to be related to incomplete CIDSets but to missing entries in the Unicode mapping of the embedded fonts.

Since you are using Ghostscript, I assume that the source PDF you pass to the Mustang library is of type PDF/A-1b. Mustang changes the declared file type to PDF/A-3b, since this is required when embedding files into a PDF. This is basically done by just changing the header of the PDF file.
But PDF/A-3b is stricter with embedded fonts than PDF/A-1b. That's why Mustang removes incomplete CIDSets while converting...
If the input PDF does not conform to PDF/A-3b, Mustang will never give you a valid output. The input only seems to be valid when you check it with veraPDF because the file Ghostscript created declares itself to be PDF/A-1b, so veraPDF only checks whether it conforms to PDF/A-1b, which it does...

Can you provide sample-files?
