PDF/A-4 (ISO 19005-4): handling of embedded, associated files which are not PDF themselves #385

Open
u-fischer opened this issue Mar 27, 2024 · 3 comments
Labels
documentation Improvements or additions to documentation Parked Parked (eg. passed to another TWG, next ISO spec) PDF/A-3 ISO 19005-3:2012 PDF/A-4 PDF/A-4 (ISO 19005-4:202x)

Comments

@u-fischer

We are producing tagged PDF 2.0 files which attach MathML and TeX files as associated files (AF) to Formula structure elements. When we also tried to validate these files against PDF/A-4, we got failures, and we are unsure what the correct handling is according to the spec.

In our files we have AFs with the registered media type application/mathml+xml and the unregistered (but widely used; see e.g. Wikipedia) media type application/x-tex. Both types are plain text files.

Some of the AFs are currently listed in the EmbeddedFiles name tree, but we can (and also want to) produce files where none of the AFs are listed.

An example document is mathml-AF-ex1

Remark: the following quotes from ISO 19005-4 are from a draft and should be verified against the official version.

PDF/A-4 requirements

Question 1

6.9 Embedded files states

All embedded files, as part of a file specification dictionary, shall conform with ISO 19005-1, ISO 19005-2
or this international standard.

  • What does as part of a file specification dictionary mean? All files whose stream is referenced from a /Filespec dictionary? Or only files listed in the EmbeddedFiles name tree?

  • What does shall conform with mean for plain text files like our MathML and TeX files? Can they conform to these standards? And if so, how can one tell a validator that they do? Currently, when validating against PDF/A-4, veraPDF complains that none of our AFs conforms to one of these standards, regardless of whether they are in the EmbeddedFiles name tree and regardless of whether they use a registered media type.
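To make the two possible readings of "as part of a file specification dictionary" concrete, here is a minimal sketch. It models PDF dictionaries as plain Python dicts and is not a real PDF API; the file names, the helper names and the toy name tree are assumptions made only for illustration.

```python
# Simplified model: PDF dictionaries as plain Python dicts (not a real PDF API).
# File names and helper names are hypothetical.

def make_filespec(name, media_type):
    """Build a toy file specification dictionary with an embedded file stream."""
    return {
        "Type": "Filespec",
        "F": name,
        "EF": {"F": {"Type": "EmbeddedFile", "Subtype": media_type}},
    }

def reading_a(all_filespecs):
    """Reading A: every file specification dictionary that has an EF entry."""
    return [fs for fs in all_filespecs if "EF" in fs]

def reading_b(all_filespecs, embedded_files_tree):
    """Reading B: only file specifications listed in the EmbeddedFiles name tree."""
    listed = list(embedded_files_tree.values())
    return [
        fs for fs in all_filespecs
        if "EF" in fs and any(fs is entry for entry in listed)
    ]

mathml_af = make_filespec("formula1.mml", "application/mathml+xml")
tex_af = make_filespec("formula1.tex", "application/x-tex")

# Both files are attached as AF, but only the MathML file is in the name tree.
all_filespecs = [mathml_af, tex_af]
embedded_files_tree = {"formula1.mml": mathml_af}

print([fs["F"] for fs in reading_a(all_filespecs)])
print([fs["F"] for fs in reading_b(all_filespecs, embedded_files_tree)])
```

Under reading A both AFs are subject to 6.9; under reading B only the one listed in the name tree is. Which reading the standard intends is exactly the open question.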

Question 2

6.9 Embedded files continues with

Each embedded file’s file specification dictionary shall contain [...] A Subtype key whose value is a valid IANA Media Type.

Table 43 — Entries in a file specification dictionary in ISO 32000-2:2020 does not list a Subtype in the file specification dictionary. The Subtype key is instead listed in Table 44 — Additional entries in an embedded file stream dictionary. This looks like an error in the spec.
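The mismatch can be illustrated with a small sketch, again using plain Python dicts as a stand-in for PDF dictionaries. The function name and the example dictionary are hypothetical; the only thing taken from the spec is that Subtype is listed in Table 44 (embedded file stream dictionary) rather than Table 43 (file specification dictionary).

```python
# Simplified model: PDF dictionaries as plain Python dicts (not a real PDF API).

def locate_subtype(filespec):
    """Report where a Subtype key is found for a given file specification."""
    locations = []
    # Location implied by the wording quoted from 6.9 of the PDF/A-4 draft.
    if "Subtype" in filespec:
        locations.append("file specification dictionary")
    # Location defined by ISO 32000-2:2020, Table 44.
    for key, stream_dict in filespec.get("EF", {}).items():
        if "Subtype" in stream_dict:
            locations.append(f"embedded file stream dictionary (EF/{key})")
    return locations

# Hypothetical AF laid out the way ISO 32000-2 describes it.
example_filespec = {
    "Type": "Filespec",
    "F": "formula1.mml",
    "EF": {"F": {"Type": "EmbeddedFile", "Subtype": "application/mathml+xml"}},
}

print(locate_subtype(example_filespec))
```

A validator that follows the quoted sentence literally would look only at the first location and find nothing, even though the file is laid out as ISO 32000-2 describes.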

Question 3

Each embedded file’s file specification dictionary should contain the Desc key.

This relates to question 1: Does this apply to every embedded file, even to the ones not listed in the EmbeddedFiles name tree?

PDF/A-4f

Because of these failures we tried validating against PDF/A-4f, and the document passed. But it is not clear whether this is actually the correct way to handle such files. The spec says here

A PDF/A-4f conforming file shall contain an EmbeddedFiles key in the name dictionary of the document
catalog dictionary.

  • How can we then produce a conforming PDF with AF without listing them in the EmbeddedFiles entry?

All file specification dictionaries present in the value of the EmbeddedFiles key shall
conform with the requirements of 6.9, except that the embedded files may be of any type.

The exception of any type is rather vague. Does it refer only to the requirement regarding a registered media type mentioned in question 2 above, or does it also lift the requirement that the files shall conform with ISO 19005-1, ISO 19005-2 or this international standard?

Although embedded files that do not comply with any part of this document should not be rendered
by a conforming PDF/A-4f processor, a conforming interactive PDF/A-4f processor should enable the
extraction of any embedded file. The conforming interactive PDF/A-4f processor should also require an
explicit user action to initiate the process.

What does that mean for AFs meant for accessibility support, like our MathML files? Would a reader have to ask the user before passing such a MathML file to AT software?

@u-fischer u-fischer added the bug Something isn't correct label Mar 27, 2024
@petervwyatt
Member

petervwyatt commented Mar 27, 2024

See also PDF/A TWG Issue #40 - only visible to PDF Association Members who are members of the PDF/A TWG.

@u-fischer - I suggest you join the PDF/A TWG for this discussion...

@petervwyatt petervwyatt added documentation Improvements or additions to documentation PDF/A-4 PDF/A-4 (ISO 19005-4:202x) Parked Parked (eg. passed to another TWG, next ISO spec) PDF/A-3 ISO 19005-3:2012 and removed bug Something isn't correct labels Mar 27, 2024
@petervwyatt
Member

Note also that although this issue only mentions PDF/A-4, the same feature is in PDF/A-3, so some consistency would be expected between PDF 1.7 and PDF 2.0 PDF/A files.
Parking this issue so it can be handled by the PDF/A TWG.

@petervwyatt
Member

The DIS draft of the PDF/A-4 dated revision addresses all these concerns. This is currently within ISO and should appear publicly very soon...
