Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: The PDF Format #810

Merged
merged 2 commits into from
Apr 24, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
51 changes: 51 additions & 0 deletions docs/dev/intro.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,57 @@ the users, but for people who want to work on PyPDF2 itself.
pip install -r requirements/dev.txt
```

## Running Tests

```
pytest .
```

## Tools: git and pre-commit

Git is a command line application for version control. If you don't know it,
you can [play ohmygit](https://ohmygit.org/) to learn it.

Github is the service where the PyPDF2 project is hosted. While git is free and
open source, Github is a paid service by Microsoft - but for free in lot of
cases.

[pre-commit](https://pypi.org/project/pre-commit/) is a command line application
that uses git hooks to automatically execute code. This allows you to avoid
style issues and other code quality issues. After you entered `pre-commit install`
once in your local copy of PyPDF2, it will automatically be executed when
you `git commit`.

## Commit Messages

Having a clean commit message helps people to quickly understand what the commit
was about, witout actually looking at the changes. The first line of the
commit message is used to [auto-generate the CHANGELOG](https://github.com/py-pdf/PyPDF2/blob/main/make_changelog.py). For this reason, the format should be:

```
PREFIX: DESCRIPTION

BODY
```

The `PREFIX` can be:

* `BUG`: A bug was fixed. Likely there is one or multiple issues. Then write in
the `BODY`: `Closes #123` where 123 is the issue number on Github.
It would be absolutely amazing if you could write a regression test in those
cases. That is a test that would fail without the fix.
* `ENH`: A new feature! Describe in the body what it can be used for.
* `DEP`: A deprecation - either marking something as "this is going to be removed"
or actually removing it.
* `ROB`: A robustness change. Dealing better with broken PDF files.
* `DOC`: A documentation change.
* `TST`: Adding / adjusting tests.
* `DEV`: Developer experience improvements - e.g. pre-commit or setting up CI
* `MAINT`: Quite a lot of different stuff. Performance improvements are for sure
the most interesting changes in here. Refactorings as well.
* `STY`: A style change. Something that makes PyPDF2 code more consistent.
Typically a small change.

## Benchmarks

We need to keep an eye on performance and thus we have a few benchmarks.
Expand Down
101 changes: 101 additions & 0 deletions docs/dev/pdf-format.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
# The PDF Format

It's recommended to look in the PDF specification for details and clarifications.
This is only intended to give a very rough overview of the format.

## Overall Structure

A PDF consists of:

1. Header: Contains the version of the PDF, e.g. `%PDF-1.7`
2. Body: Contains a sequence of indirect objects
3. Cross-reference table (xref): Contains a list of the indirect objects in the body
4. Trailer

## The xref table

A cross-reference table (xref) is a table of the indirect objects in the body.
It allows quick access to those objects by pointing to their location in the file.

It looks like this:

```text
xref 42 5
0000001000 65535 f
0000001234 00000 n
0000001987 00000 n
0000011987 00000 n
0000031987 00000 n
```

Let's go through it step-by-step:

* `xref` is justa keyword that specifies the start of the xref table.
* `42` is TODO; `6` is the number of entries in the xref table.
* Now every object has 3 entries `nnnnnnnnnn ggggg n`: The 10-digit byte offset,
a 5-digit generation number, and a literal keyword which is either `n` or `f`.
* `nnnnnnnnnn` is the byte offset of the object. It tells the reader where
the object is in the file.
* `ggggg` is the generation number. It tells the reader how old the object is.
* `n` means that the object is a normal in-use object, `f` means that the object
is a free object.
* The first free object always has a generation number of 65535. It forms
the head of a linked-list of all free objects.
* The generation number of a normal object is always 0. The generation
number allows the PDF format to contain multiple versions of the same
object. This is a version history mechanism.

## The body

The body is a sequence of indirect objects:

`counter generationnumber << the_object >> endobj`

* `counter` (integer) is a unique identifier for the object.
* `generationnumber` (integer) is the generation number of the object.
* `the_object` is the object itself. It can be empty. Starts with `/Keyword` to
specify which kind of object it is.
* `endobj` marks the end of the object.

A concrete example can be found in `test_reader.py::test_get_images_raw`:

```text
1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj
2 0 obj << >> endobj
3 0 obj << >> endobj
4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0]
/MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R
/Resources << /Font << >> >>
/Rotate 0 /Type /Page >> endobj
5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj
```

## The trailer

The trailer looks like this:

```text
trailer << /Root 5 0 R
/Size 6
>>
startxref 1234
%%EOF
```

Let's go through it:

* `trailer <<` indicates that the *trailer dictionary` starts. It ends with `>>`.
* `startxref` is a keyword followed by the byte-location of the `xref` keyword.
As the trailer is always at the bottom of the file, this allows readers to
quickly find the xref table.
* `%%EOF` is the end-of-file marker.

The trailer dictionary is a key-value list. The keys are specified in
Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).

* `/Root` (dictionary) contains the document catalog.
* The `5` is the object number of the catalog dictionary
* `0` is the generation number of the catalog dictionary
* `R` is the keyword that indicates that the object is a reference to the
catalog dictionary.
* `/Size` (integer) contains the total number of entries in the files xref table.
3 changes: 2 additions & 1 deletion docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -49,10 +49,11 @@ You can contribute to `PyPDF2 on Github <https://github.com/py-pdf/PyPDF2>`_.
modules/PageRange

.. toctree::
:caption: PyPDF Developers
:caption: Developer Guide
:maxdepth: 1

dev/intro
dev/pdf-format

.. toctree::
:caption: About PyPDF2
Expand Down