Why name trees for EmbeddedFiles? #502

Open
JanSlabon opened this issue Dec 9, 2024 · 3 comments
Assignees
Labels
documentation Improvements or additions to documentation Parked Parked (eg. passed to another TWG, next ISO spec)

Comments

@JanSlabon

While checking interoperability for ZUGFeRD implementations I stumbled over an implementation where the embedded file specification was registered with a randomized/unspecified name in the name tree but the file specification had the correct file name in its file specification property (F):

14 0 obj
<<
/Names [(000) 12 0 R]
>> 

Object 12 holds the correct file specification (/F (factur-x.xml)).

Most implementations, however, also use the file name as the key in the name tree:

14 0 obj
<<
/Names [(factur-x.xml) 12 0 R]
>>

In the ZUGFeRD specification I also cannot find any requirement that the file has to be registered under the name "factur-x.xml" in the name tree (e.g. here), only that the file name in the file specification has to be that name.

Until now I thought we could use the EmbeddedFiles name tree to find a file by its name (after all, a name tree is made for searching) - but keeping the keys in sync with the file specifications seems to be optional, so I am wondering why there is a name tree at all. What would I search for in this tree?
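
To make the difference between the two lookups concrete, here is a minimal Python sketch that is not tied to any PDF library. It assumes the flat /Names array from the examples above has already been parsed into a Python list and that object 12 has been dereferenced into a plain dict; `pairs`, `find_by_key` and `find_by_filename` are hypothetical helper names introduced only for this illustration.

```python
def pairs(names_array):
    """Turn a flat /Names array [key1, value1, key2, value2, ...] into (key, value) pairs."""
    return list(zip(names_array[0::2], names_array[1::2]))


def find_by_key(names_array, key):
    """Return the file specification registered under the given name tree key, or None."""
    for k, filespec in pairs(names_array):
        if k == key:
            return filespec
    return None


def find_by_filename(names_array, filename):
    """Return every file specification whose UF or F entry equals filename."""
    matches = []
    for _, filespec in pairs(names_array):
        if filespec.get("UF") == filename or filespec.get("F") == filename:
            matches.append(filespec)
    return matches


# Object 12 from the examples above, modeled as a dict (assumption: already dereferenced).
obj12 = {"Type": "Filespec", "F": "factur-x.xml"}

random_key_tree = ["000", obj12]              # /Names [(000) 12 0 R]
filename_key_tree = ["factur-x.xml", obj12]   # /Names [(factur-x.xml) 12 0 R]
```

In this model, `find_by_key(random_key_tree, "factur-x.xml")` finds nothing, while `find_by_filename` finds object 12 in both trees because it ignores the keys and inspects the file specifications themselves.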

I also cannot find anything about the naming of the keys for EmbeddedFiles name tree in the PDF specification:

A name tree mapping name strings to file specifications for embedded file streams.[...]

The only thing I can find about naming in this tree is for collections in case of folders:

  • The name tree keys are PDF text strings.
    [...]
  • The remainder of the string is a file name.

What are your thoughts?

@JanSlabon JanSlabon added the question Further information is requested label Dec 9, 2024
@DietrichSeggern

I agree that using an arbitrary name for an embedded file in the EmbeddedFiles name tree is "against the spirit" of the name tree concept that should make sure that these "objects in a PDF file can be referred to by name rather than by object reference" (as specified in the first sentence of 7.7.4 Name dictionary).
But the Names tree has value even if that is not the case. There is no provision in the PDF spec that limits File Specifications to certain locations, so the Names tree is already helpful, e.g. for identifying the possible locations of the embedded XML invoice in a ZUGFeRD invoice.

I agree that this group should decide whether there should be a recommendation in the PDF Spec that the strings in the Names tree are meaningful if possible or that the strings for EmbeddedFiles should be identical to the names of the embedded files. (It already says that that should not be the case for unencrypted wrapper documents.)

If this group decides that there should be no such recommendation and you feel that one would be helpful for interoperability, this "issue" could be brought to the ZUGFeRD committee (ferd.de).

@u-fischer

The keys in a name tree should be unique strings. As nothing prevents users from including files with names containing e.g. Unicode characters, and as it is not forbidden to include more than one file with the same name (see the attached PDF, which embeds two grüße.txt files with different content), I don't quite see how one could enforce a requirement to keep the keys in sync with the file names.
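
The uniqueness problem can be illustrated with a short Python sketch. It does not describe how any particular producer works: `make_unique_keys` is a hypothetical helper, and the numeric suffix scheme is an arbitrary choice made for this example. The point is only that as soon as two embedded files share a name, at least one key has to differ from its file name.

```python
def make_unique_keys(filenames):
    """Derive one unique name tree key per file name.

    The first occurrence keeps the file name as its key; later duplicates get
    a numeric suffix (an arbitrary scheme chosen for this illustration).
    """
    used = set()
    keys = []
    for name in filenames:
        key = name
        counter = 1
        while key in used:
            key = f"{name}#{counter}"
            counter += 1
        used.add(key)
        keys.append(key)
    return keys


# Two embedded files with the same name, as in the attached test PDF.
filenames = ["grüße.txt", "grüße.txt"]
keys = make_unique_keys(filenames)
```

Here the second key can no longer equal the file name, so a consumer that relies on key and file name being identical would miss the second file.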

(In LaTeX we also came across the problem that some implementations assume that the key factur-x.xml is used in the name tree, but I don't know whether the ZUGFeRD standard really requires that, or whether someone confused key and file name.)

test-utf8.pdf

@petervwyatt
Member

ISO 32000 never defines which string is to be used as a filename for embedded files - this includes both embedded files listed in the document catalog Names/EmbeddedFiles name tree and those NOT listed in the EmbeddedFiles name tree. Every embedded file stream must have F and/or UF entries in its EF dictionary - but there is no requirement that any of these match anything in the document catalog Names/EmbeddedFiles name tree, since embedded files are not mandated to always be listed in the DocCatalog name tree. In the case of PDF portable collections, embedded file matching and folder support are formally defined to use the document catalog Names/EmbeddedFiles name tree but only as a matching byte-string index.
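
As a rough illustration of why the name tree alone is not a complete index, the following Python sketch walks a document modeled as nested dicts and lists and collects every file specification it encounters, whether or not it is listed under Names/EmbeddedFiles. This is a simplified model, not a real PDF parser: `collect_filespecs` is a hypothetical helper, indirect references are assumed to be already resolved, and every file specification is assumed to carry /Type /Filespec.

```python
def collect_filespecs(obj, _seen=None):
    """Recursively collect every dict with Type == 'Filespec' in a nested structure."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:  # guard against cycles and count shared objects only once
        return []
    found = []
    if isinstance(obj, dict):
        _seen.add(id(obj))
        if obj.get("Type") == "Filespec":
            found.append(obj)
        for value in obj.values():
            found.extend(collect_filespecs(value, _seen))
    elif isinstance(obj, list):
        _seen.add(id(obj))
        for item in obj:
            found.extend(collect_filespecs(item, _seen))
    return found


# One file specification listed in the name tree, one reachable only via a
# file attachment annotation (illustrative data, not taken from a real file).
listed = {"Type": "Filespec", "F": "factur-x.xml"}
unlisted = {"Type": "Filespec", "F": "notes.txt"}
catalog = {
    "Names": {"EmbeddedFiles": {"Names": ["000", listed]}},
    "Pages": {"Kids": [{"Annots": [{"Subtype": "FileAttachment", "FS": unlisted}]}]},
}
all_specs = collect_filespecs(catalog)
```

In this model the walk returns both file specifications, while a lookup restricted to the EmbeddedFiles name tree would see only the first.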

To support long-term preservation requirements, PDF/A (ISO 19005) does make certain things more explicit for filename display: "A conforming interactive reader shall provide a mechanism to display the name strings from the value of the EmbeddedFiles key in the names dictionary of a conforming file." - but this should NOT be extrapolated to "general PDF" since PDF/A also imposes many other constraints. And since ZUGFeRD e-invoices build on PDF/A, this is what ZUGFeRD-capable viewers MUST do.

See also my previous PDF Association article which discusses the inverse situation where an embedded file stream is referenced multiple times and thus may or may not need de-duplication...

Thus, treating EmbeddedFiles name tree strings as always being the "correct" filename for display in "general PDF" is an assumption made by some implementations - and implementations DO vary!


There are other errata also related to handling embedded files (e.g. #481, #385) and, more recently, related updates to the latest edition of the dated revision of PDF/A-4 (ISO 19005-4:202x). During the Prague PDF Week meeting it was agreed to revise the PDF Association's "PDF 2.0 Application Note for Associated Files" (see here), so informative guidance and possible recommendations for filename display might be considered as part of that broader work.

@petervwyatt petervwyatt self-assigned this Jan 3, 2025
@petervwyatt petervwyatt added documentation Improvements or additions to documentation Parked Parked (eg. passed to another TWG, next ISO spec) and removed question Further information is requested labels Jan 3, 2025
4 participants