Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utilize consistent internal nomenclature for outlines/bookmarks #1098

Closed
mtd91429 opened this issue Jul 11, 2022 · 17 comments · Fixed by #1156
Closed

Utilize consistent internal nomenclature for outlines/bookmarks #1098

mtd91429 opened this issue Jul 11, 2022 · 17 comments · Fixed by #1156
Labels
is-maintenance Anything that is just internal: Simplifying code, syntax changes, updating docs, speed improvements

Comments

@mtd91429
Copy link
Contributor

mtd91429 commented Jul 11, 2022

Explanation

The PDF Reference uses the term "Outline" but recognizes "Bookmarks" as a synonymous term. From PDF Reference version 1.6 page 554 (section 8.2.2):

The outline consists of a tree-structured hierarchy of outline items (sometimes called bookmarks), which serve as a visual table of contents to display the document’s structure to the user.

There is inconsistency within PyPDF2 regarding the nomenclature. In the PdfReader object, these objects are referred to as outlines; however, within the PdfWriter object, they are referred to as both bookmarks (add_bookmark, add_bookmark_dict, add_bookmark_destination) and outlines (get_outline_root).

Most PDF reference software refers to these objects as "Bookmarks". For example, a screenshot from the Adobe Acrobat:

image

I propose that PyPDF2 be modified such that all user-facing aspects of the code have redundant and synonymous functions using both terms (outline and bookmark), but that all internal nomenclature adopts a single and consistent term which performs the manipulations. Specifically, I think internally it should be outline.

Proposed Code Example

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")

reader.outlines
>>>[{'/Title': 'Bookmark 1', '/Page': IndirectObject(9, 0), '/Type': '/Fit'}, {'/Title': 'Bookmark 2', '/Page': IndirectObject(9, 0), '/Type': '/Fit', {'/Title': 'Bookmark 3', '/Page': IndirectObject(1, 0), '/Type': '/Fit'}]

reader.bookmarks
>>>[{'/Title': 'Bookmark 1', '/Page': IndirectObject(9, 0), '/Type': '/Fit'}, {'/Title': 'Bookmark 2', '/Page': IndirectObject(9, 0), '/Type': '/Fit', {'/Title': 'Bookmark 3', '/Page': IndirectObject(1, 0), '/Type': '/Fit'}]

writer = PdfWriter()
for page in reader.pages:
        writer.add_page(page)

writer.add_outline('Bookmark 4', 4)

writer.add_bookmark('Bookmark 5', 5)
@MartinThoma
Copy link
Member

I agree that we should use consistent names / avoid synonyms. I am also still confused about the outlines/bookmarks topic. If they really are the same I would also be in favor of dropping "outlines" and going with "bookmarks".

@MartinThoma
Copy link
Member

I am a bit confused by your code. PdfReader does not have a bookmarks attribute in the main branch:

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader("overlay.pdf")
>>> reader.bookmarks
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PdfReader' object has no attribute 'bookmarks'
>>> reader.outlines
[]

@MartinThoma MartinThoma added the is-maintenance Anything that is just internal: Simplifying code, syntax changes, updating docs, speed improvements label Jul 12, 2022
@MartinThoma
Copy link
Member

MartinThoma commented Jul 12, 2022

Here is a list what we might touch in this context:

PdfReader:

  • outlines property
  • def _get_outlines(...)
  • def _build_outline(...)

PdfWriter:

  • def add_bookmark(...): Makes use of add_bookmark_destination and utils._create_bookmark
  • def add_bookmark_dict(...): Makes use of add_bookmark_destination. I think this hsould be a private method
  • def add_bookmark_destination(...): Makes use of get_outline_root. I think this should be a private method
  • def get_outline_root(...): I think this should be a private method

There is some overlap in utils._create_bookmark and PdfWriter.add_bookmark_dict. I think we could re-write _create_bookmark so that we could remove add_bookmark_dict

@MartinThoma
Copy link
Member

Does anybody know how outlines/bookmarks and named destinations relate?

@mtd91429
Copy link
Contributor Author

mtd91429 commented Jul 12, 2022

I am a bit confused by your code. PdfReader does not have a bookmarks attribute in the main branch:

Sorry for the confusion, the code block was to serve as an example of the proposed behavior. I edited the heading above to try and reflect that.

... I would also be in favor of dropping "outlines" and going with "bookmarks".

An important reason to keep the term "outline" within the code base is that this is this is within the lexicon of the PDF specification. See PDF 1.7 specifications section 12.3.3 (page 367. index 375). For example, in the Document Catalog section of every PDF, this is called the "Outline" and is referred to via /Outline. See PDF spec 1.7 section 7.7.2 (page 72, index 80). One could also call it the "Table of Contents" as does the PDF viewer within the Edge Browser and Apple Preview application. Some software refers to the /Outline as the "Table of Contents" and implements its own method of storing what it deems "Bookmarks". See this blog post for some examples.

If we had to choose a single term, I'd argue that it must be "outline" due to the document specification. Anecdotally, I think most people refer to them as "Bookmarks" but, as noted in the above blog post, there are issues with that.

Either way, the code base really needs some sort of consistency regarding the name, which is main problem I think should be addressed.

Perhaps my initial proposal isn't very Pythonic (There should be one-- and preferably only one --obvious way to do it). Rather, the best solution would probably be to deprecate all terms using the word bookmark and re-write similar methods using the term outline. That would be simple and straightforward.

Does anybody know how outlines/bookmarks and named destinations relate?

A destination defines a view within the document and is composed of the page, location on the page, and magnification. Destinations can be associated with outlines, annotations, or actions. Named destinations provide a convenient means of referring to a specific destination. They name/destination key/value dictionary is stored in the document catalog (see PDF spec 1.7 section 7.7.2, page 72/index 80 for a visual; see section 7.7.4, page 80, index 88 for more discussion) and allows one to simply reference the name when wanting to reuse the destination.

An outline object can contain the destination information under the key /Dest. The /Dest object can contain one of two things:

  1. The destination itself. Meaning it has the view type (/Type) and relevant co-attributes. For example, '/XYZ', '/Left', '/Top, and '/Zoom'. (Look at the example code on page 369 (index page 377) of PDF 1.7 for an example of how this is encoded within the file.
  2. A name reference. This is simply a key. The destination information is stored under that key in a separate dictionary (the Name Dictionary within the document catalog). For example, 'G4.1051620'.

Alternatively, an outline object may not contain a /Dest key. Instead, it may contain an /A attribute which is an action. Under the /A attribute, the action /GoTo can specify the destination. (Look at the example code on page 419 (index page 427) for how this is implemented in a link; the formatting is similar for an outline.

Both the /A and /Dest attributes of the outline object are optional.

I hope that provides some clarification.

@MartinThoma
Copy link
Member

Thank you! That helped a lot

If we had to choose a single term, I'd argue that it must be "outline" due to the document specification.

I was thinking about this for the documentInfo attribute (now metadata). I went for metadata instead of using the PDF-coined term "document info" because ...

  1. Users expect it to be called metadata: That is the most important to me. I've validated that hypothesis by looking at StackOverflow questions / searching on Google for questions around that and looking at other Python PDF libraries.
  2. Contributors likely don't have pain points with that: It's not used that often within PyPDF2 itself and it is a reasonable name.
  3. We are either consistent with Python or with PDF: Both is not possible due to CamelCase in PDF and snake_case in Python (It's a super minor reason, I know - (1) is by far the dominant reason for me)

The users of PyPDF2 are mostly not people who know the PDF standard at all. They very likely don't care what is done in the PDF standard.

There are also libraries / business applications build on top of PyPDF2. We need to communicate changes clearly and give a good heads-up before they start to be breaking. But if we think we can improve the situation for our users, we can use whatever name we want.

@MartinThoma
Copy link
Member

However, I do value the opinion of contributors as my view might not represent the community well. Especially @pubpub-zz @MasterOdin , what do you think about renaming outline/outlines to bookmark/bookmarks consistently with proper deprecations?

@MartinThoma
Copy link
Member

Oh, I just read https://pspdfkit.com/blog/2019/understanding-pdf-outline/ .... so there is actually a difference between outlines and bookmarks?

@MasterOdin
Copy link
Member

MasterOdin commented Jul 12, 2022

I think that they're the same thing, it's just that Adobe calls them bookmarks, but the PDF spec calls them outlines, and it's just kind of on the vendor to use whatever terminology they want.

I agree with @mtd91429 that I think that internally, use the term "outline" as that's what's in the PDF spec. I like having one way of doing things, and so would vote to expose only outline named functions, with documenting that outlines and bookmarks are synonymous in the function doc.

For reference, pikepdf uses outline: https://pikepdf.readthedocs.io/en/latest/topics/outlines.html

@pubpub-zz
Copy link
Collaborator

outlines called bookmarks in Adobe as Odin is saying, but there is also the named_destination that are considered as bookmarks.
using non ambiguous (outlines and named_dest) is clearly the best option.
One point to remember about outlines is that the use of /Count to identify folded/unfolded outlines : the best would be to include a boolean to have an human readable parameter.

mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this issue Jul 15, 2022
mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this issue Jul 15, 2022
@mtd91429
Copy link
Contributor Author

Here is a list what we might touch in this context:

PdfReader:

  • outlines property
  • def _get_outlines(...)
  • def _build_outline(...)

PdfWriter:

  • def add_bookmark(...): Makes use of add_bookmark_destination and utils._create_bookmark
  • def add_bookmark_dict(...): Makes use of add_bookmark_destination. I think this hsould be a private method
  • def add_bookmark_destination(...): Makes use of get_outline_root. I think this should be a private method
  • def get_outline_root(...): I think this should be a private method

There is some overlap in utils._create_bookmark and PdfWriter.add_bookmark_dict. I think we could re-write _create_bookmark so that we could remove add_bookmark_dict

I think the only additional component would be the Bookmark class (generic.py).

As far as nomenclature goes, I propose the following terminology for clarity.

An "outline" is a collection of "outline items". There is only one outline in the document and it is composed of multiple outline items. For example:

Chapter 1
	Section 1.1
	Section 1.2
	Section 1.3
Chapter 2
	Section 2.1
	Section 2.2
...

In this, "Chapter 1" is an outline item, as is "Section 1.2". The collection as a whole is an outline. In the description of "outline" we mention that it is synonymous with a "Table of Contents" or "Bookmarks Collection". In the description of "outline item" we mention that it is synonymous with a "Bookmark" but that term can be misleading based on terminology used in other software programs (especially the MacOS preview application).

I recognize that the plurality issue adds another layer of complexity to this rectification. I think that a single letter (s) is more confusing as the distinguisher between the singular and plural forms that the above proposal.

I propose the following nomenclature changes (I currently don't have an opinion on public/private designation):

For PdfReader

  • outlines be renamed to outline
  • _build_outline() be renamed to _build_outline_item()

For PdfWriter

  • add_bookmark() be renamed to add_outline_item()
  • add_bookmark_dict() be renamed to add_outline() and specify that it takes multiple outline items stored as a dictionary
  • get_outline_root() left alone
    • Could change to get_outline_dictionary() to conform to Table 152 in PDFv1.7 specs; however, I think that would generate more confusion due to the similarity to add_bookmark_dict() function

For PdfMerger

  • _write_bookmarks() be renamed to _write_outline()
  • _write_bookmark_on_page() be renamed to _write_outline_item_on_page()
    • I'm not entirely sure what this is supposed to do, but I think it just applies an outline_item to a specific page
  • _associate_bookmarks_to_pages() be renamed to _associate_outline_to_pages()
  • find_bookmark() be renamed to find_outline_item()
  • _trim_outline() be left alone

For generic.py:

  • Bookmark class be changed to OutlineItem

@MartinThoma
Copy link
Member

Overall, I like your changes. Especially the clear nomenclature.

In addition, I like to have symmetry between PdfReader and PdfWriter. I would like to be able to do something like this:

writer.add_outline(reader.outline)

# similar to
writer.add_metadata(reader.metadata)

However, the semantics of this are not completely clear. For "add", as a user, I would expect an "append" behavior. So if I do this twice, I have double the outline items.

In contrast, we might want to allow writer.outline = reader.outline, but the setter needs to do quite a bit more than just setting an attribute.

@MartinThoma
Copy link
Member

@mtd91429 What do you think about making get_outline_root private (e.g. renaming to _get_outline_root)?

@pubpub-zz
Copy link
Collaborator

some time ago, I worked on different improvements on PyPDF4. I've introduced at that time some change on bookmark, aspecially a new parameter to specify where to add the outline. You may be interested to have a look at https://github.com/pubpub-zz/PyPPDF4.sav

@mtd91429
Copy link
Contributor Author

For "add", as a user, I would expect an "append" behavior. So if I do this twice, I have double the outline items.

a new parameter to specify where to add the outline

Yes, this becomes a much more difficult problem to consider. I'm currently not sure what the best way forward is. Because, as the two of you point out, the challenge is to specify where to add the outline item(s) onto a tree. I'll look at your work @pubpub-zz, how the writer object works in more detail and lay out some proposals.

@mtd91429 What do you think about making get_outline_root private (e.g. renaming to _get_outline_root)?

I can understand why it may make sense to have that as a private method, but I'm not sure. Privatizing a once public method should not be done lightly for its unintended consequences on the end users' experience.

@mtd91429
Copy link
Contributor Author

An update on functions (largely for my own purposes):

the function add_bookmark_dict() isn't designed to process a dictionary with nested outline items to generate an outline. Rather, it seems to only be used within the code base during the merger process to iterate through the outline items on the appended documents and write them to the output stream. I'm not sure it should be end-user facing and therefore think that this method should be privatized to _add_outline_item_dict(). I've updated my comments above.

I think the PdfWriter.add_outline() method should be reserved for a function which does what a user would expect: namely, add an outline (consisting of multiple outline objects) to the writer object. I have an idea for what that should look like and will try to implement it then issue a PR.

@mtd91429
Copy link
Contributor Author

I created PR attempting to address this issue (#1156). In there, I did not change public/private designation for methods as I felt that should be a separate commit + PR for each method (rather than bury it within this large change).

MartinThoma pushed a commit that referenced this issue Jul 30, 2022
This PR makes sure PyPDF2 uses a consistent nomenclature for the outline:

* **Outline**: A document has exactly one outline (also called "table of contents", in short toc). That outline might be empty.
* **Outline Item**: An element within an outline. This is also called a "bookmark" by some PDF viewers.

This means that some names will be deprecated to ensure consistency:

## PdfReader

* `outlines` ➔ `outline`
* `_build_outline()` ➔ `_build_outline_item()`

## PdfWriter

* Keep `get_outline_root()`
* `add_bookmark_dict()` ➔ `add_outline()` 
* `add_bookmark()` ➔ `add_outline_item()`


## PdfMerger

* `find_bookmark()` ➔ `find_outline_item()`
* `_write_bookmarks()` ➔ `_write_outline()`
* `_write_bookmark_on_page()` ➔ `_write_outline_item_on_page()`
* `_associate_bookmarks_to_pages()` ➔ `_associate_outline_items_to_pages()`
* Keep `_trim_outline()`

## generic.py

* `Bookmark` ➔ `OutlineItem`

Closes #1048
Closes #1098
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-maintenance Anything that is just internal: Simplifying code, syntax changes, updating docs, speed improvements
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants