Content between xref section and trailer #511

Open
vasekp opened this issue Jan 6, 2025 · 2 comments
Labels
documentation Improvements or additions to documentation Parked Parked (eg. passed to another TWG, next ISO spec)


vasekp commented Jan 6, 2025

The specification is somewhat ambiguous about the relative position of a cross-reference section and the corresponding file trailer. That the latter should directly follow the former is, to the best of my knowledge, only made explicit in 7.5.1 in the form of a list (presumably calling the only cross-reference section a "table"), and then in Figures 2 and 3.

The next mention of their mutual position is in the new text of 7.5.4 saying "PDF comments shall not be included in a cross-reference table between the keywords xref and trailer". This, however, has two issues:

  1. Note that trailer is not a part of the cross-reference table, so it is a strange thing to say "in" the table "between" the keywords when the latter keyword itself is not in it.
  2. This is the first mention of the keyword trailer whatsoever, derailing a person reading the specification in a linear fashion.

Note that the other similar mention near the end of 7.5.8.4, stating that "PDF comments shall not be included in a cross-reference table or in cross-reference streams", speaks of the content of the table (section?) itself, i.e., in the case of a "traditional" table, the subsection headers and the 20-character entries. (I'm having a hard time trying to imagine where a comment could ever fit inside the binary content of a cross-reference stream.) But my question is specifically about the space between the end of a cross-reference section (after its last subsection) and the beginning of a trailer.

This opens several questions. "Comments shall not be included": why comments, specifically? What situation does this intend to prevent (within the already existing constraints)? Are other forms of whitespace acceptable? If so, are they OK in any amount, as long as they do not include a PDF comment? If a PDF reader already has to be prepared to skip an arbitrary amount of whitespace, what extra complication would it be to skip comments too?
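To make the last question concrete, here is a minimal sketch (not taken from the specification or from any particular reader) comparing a whitespace skipper with one that also skips comments. The function names and the sample bytes are hypothetical; the white-space set is the six characters the PDF specification lists as white-space (NUL, HT, LF, FF, CR, SP).

```python
# The six PDF white-space characters: NUL, HT, LF, FF, CR, SP.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_whitespace(data: bytes, pos: int) -> int:
    """Return the offset of the first non-white-space byte at or after pos."""
    while pos < len(data) and data[pos] in PDF_WHITESPACE:
        pos += 1
    return pos


def skip_whitespace_and_comments(data: bytes, pos: int) -> int:
    """Like skip_whitespace, but also skips '%' comments up to their end of line."""
    while True:
        pos = skip_whitespace(data, pos)
        if pos < len(data) and data[pos] == ord("%"):
            # A comment runs until the next CR or LF (or the end of the data).
            while pos < len(data) and data[pos] not in b"\r\n":
                pos += 1
        else:
            return pos


# Hypothetical gap between the last xref entry and the trailer keyword.
gap = b" \n% a comment\n\n  trailer"
after_ws = skip_whitespace(gap, 0)
after_ws_and_comments = skip_whitespace_and_comments(gap, 0)
```

In this sketch the comment-aware version adds only one extra loop, which is the sense in which the extra complication appears small.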

From a practical point of view, 7.5.5 specifies the location of the trailer only as preceding the startxref keyword. This is impractical for implementations, because the trailer spans an a priori unknown number of bytes or lines and could in principle happen to contain the characters "trailer[whitespace]<<" by coincidence (e.g. as part of the /ID strings if they are written as literal strings). So in order to find the trailer quickly and unambiguously, reading past the end of the cross-reference section is the only way.
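The coincidence described above can be reproduced with a small, made-up file tail. This is only a sketch: the sample bytes, the regular expression, and the function name `find_trailer_backward` are assumptions for illustration, not how any specific PDF reader works.

```python
import re

# Hypothetical file tail: the second /ID string happens to contain "trailer <<".
tail = (
    b"xref\n"
    b"0 1\n"
    b"0000000000 65535 f \n"
    b"trailer\n"
    b"<< /Size 1 /ID [(abc) (x trailer << y)] >>\n"
    b"startxref\n"
    b"0\n"
    b"%%EOF\n"
)

TRAILER_RE = re.compile(rb"trailer\s*<<")


def find_trailer_backward(data: bytes) -> int:
    """Naive search: take the last 'trailer<<' match located before startxref."""
    end = data.rindex(b"startxref")
    matches = [m.start() for m in TRAILER_RE.finditer(data, 0, end)]
    if not matches:
        raise ValueError("no trailer found before startxref")
    return matches[-1]


real_offset = tail.index(b"trailer\n<<")   # where the trailer actually starts
found_offset = find_trailer_backward(tail)  # where the naive search lands
```

Here the naive backward search lands inside the literal string of the /ID array rather than on the real trailer keyword, which is why reading forward from the cross-reference section is the more robust route.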

A possible reading of the list in 7.5.1 is that nothing can appear between the cross-reference section and its trailer, because nothing is listed between the two. Together with #112 (with the EOL directly before trailer being the final EOL of the last subsection), this would be a very useful guarantee, because the keyword would act as a sentinel for the end of the xref section, which itself has no explicit EOD marker. (The other option is to stop reading the section when a subsection is finished and is followed by a line that does not conform to the format of another subsection header. The sentinel, by contrast, would allow checking for string equality rather than for a failure to match a pattern.) Moreover, the PDF reader could immediately proceed to reading the trailer, as opposed to having to find where it starts.
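A minimal sketch of the sentinel idea, assuming the strict reading in which nothing at all separates the last 20-byte entry from the keyword trailer. The function name `parse_xref_section`, the regular expressions, and the sample data are hypothetical, and real-world files may well be less regular than this assumes.

```python
import re

EOL_RE = re.compile(rb"[ ]*(?:\r\n|\r|\n)")
HEADER_RE = re.compile(rb"(\d+) (\d+)[ ]?(?:\r\n|\r|\n)")
# 10-digit offset, 5-digit generation, type, 2-byte EOL: 20 bytes in total.
ENTRY_RE = re.compile(rb"(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n)")


def parse_xref_section(data: bytes, pos: int):
    """Parse a classic xref section at pos; return (entries, trailer offset)."""
    if not data.startswith(b"xref", pos):
        raise ValueError("expected 'xref' keyword")
    eol = EOL_RE.match(data, pos + 4)
    if eol is None:
        raise ValueError("expected EOL after 'xref'")
    pos = eol.end()
    entries = {}
    # The keyword 'trailer' is the sentinel: a plain string comparison.
    while not data.startswith(b"trailer", pos):
        header = HEADER_RE.match(data, pos)
        if header is None:
            raise ValueError(f"expected subsection header or 'trailer' at {pos}")
        first, count = int(header[1]), int(header[2])
        pos = header.end()
        for i in range(count):
            entry = ENTRY_RE.match(data, pos)
            if entry is None:
                raise ValueError(f"malformed xref entry at {pos}")
            entries[first + i] = (int(entry[1]), int(entry[2]), entry[3].decode())
            pos = entry.end()
    return entries, pos


sample = (
    b"xref\n"
    b"0 2\n"
    b"0000000000 65535 f \n"
    b"0000000015 00000 n \n"
    b"trailer\n"
    b"<< /Size 2 >>\n"
)
entries, trailer_pos = parse_xref_section(sample, 0)
```

Under this strict reading the returned offset points directly at the trailer keyword, so the reader can continue with the trailer dictionary without searching. A blank line before trailer makes this sketch raise an error, which is exactly the trade-off the strict reading implies.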

NB that the latter would exclude the possibility of a blank line preceding trailer, which, albeit uncommon, does appear in the PDF Association's own examples (see e.g. the final trailer of PDF 2.0 via incremental save.pdf in pdf-association/pdf20examples). So maybe that is too much to ask for, but the rest of the proposal stands. (Nevertheless, I have never seen that done in a PDF from any other source, but I don't mind being shown otherwise.)

Proposed solution

  1. Remove the sentence "PDF comments shall not be included in a cross-reference table between the keywords xref and trailer" either entirely, or its part "between the keywords xref and trailer".
  2. Remove the subsequent Note 2 (2020) as well.
  3. Consider removing the corresponding sentence from 7.5.8.4 entirely or its last part about comments in cross-reference streams, as the latter is impossible to achieve.
  4. Add to 7.5.5 a requirement that [nothing | only white-space characters (and no PDF comments?)] may separate the keyword trailer from the last cross-reference section, or similar.
@vasekp vasekp added the bug Something isn't correct label Jan 6, 2025
@petervwyatt petervwyatt added documentation Improvements or additions to documentation Parked Parked (eg. passed to another TWG, next ISO spec) and removed bug Something isn't correct labels Jan 7, 2025
@petervwyatt
Member

The official PDF specification has never been intended or designed to be read "in a linear fashion" (this applies to the early Adobe editions and all ISO versions). Other books exist that explain PDF in a more textbook-like manner (although I'm not sure if those books were pre- or post- the introduction of cross-reference and object streams)...

There are also far bigger issues in the early sections that have only been "patched at" since ISO took over and that cannot be properly addressed unless significantly rewritten and restructured (e.g. most historic text confuses "cross-reference table" with "cross-reference section"; cross-reference streams (and object streams) were never properly integrated when originally written; there is no formally defined list of PDF keywords; some requirements existed for many years but were buried in strange sections; etc. etc.).

As an interim approach, in recent years we have patched in loads more internal cross-referencing and informative NOTES to try and support forward and back mentions, highlight some incorrect and/or inconsistent use of terms across different sections, etc. but wholesale and holistic reorganisation of content with significantly rewritten sections is not possible (or practical) via errata.

To address some of the specifics (some of which go against what ISO/PDF experts have previously agreed): the previous Adobe specs stated that "comments" were not allowed in conventional cross-reference sections but this statement was buried in the cross-reference stream section. See also errata #237 and #273 - and others...

Parking this documentation issue as the correct solution is the heavy lift.

@vasekp
Author

vasekp commented Jan 7, 2025

Thank you for looking into this and making a decision, as well as for the additional information.
