
1.1.1 Hello World PDF

Felix Schütt edited this page Jul 5, 2017 · 7 revisions

Almost every programming course starts with a "hello world" program. Since the structure of a PDF file is largely plain text, why not start by reading a minimal PDF file? See if you can figure out what's going on:

%PDF-1.4
1 0 obj
<< /Title (Hallo Welt) >>
endobj
2 0 obj
<< /Type /Catalog
   /Pages 3 0 R
>>
endobj
3 0 obj
<< /Type /Pages
   /MediaBox [0 0 595 842]
   /Resources
   << /Font << /F1 4 0 R >>
      /ProcSet [/PDF /Text]
   >>
   /Kids [5 0 R]
   /Count 1
>>
endobj
4 0 obj
<< /Type /Font
   /Subtype /Type1
   /BaseFont /Helvetica
   /Encoding /WinAnsiEncoding
>>
endobj
5 0 obj
<< /Type /Page
   /Parent 3 0 R
   /Contents 6 0 R
>>
endobj
6 0 obj
<< /Length 41
>>
stream
/F1 48 Tf
BT
72 746 Td
(Hallo Welt) Tj
ET
endstream
endobj
xref
0 7
0000000000 65535 f 
0000000009 00000 n 
0000000050 00000 n 
0000000102 00000 n 
0000000268 00000 n 
0000000374 00000 n 
0000000443 00000 n 
trailer
<< /Size 7
   /Info 1 0 R
   /Root 2 0 R
>>
startxref
534
%%EOF

This is a valid PDF document with exactly one DIN A4 page and the text "Hallo Welt" ("Hello World") near the top left corner. The file shown here uses Unix line endings (LF), as on macOS and Linux. With Windows line endings (CRLF), every line break takes two bytes instead of one, so the stream length and all byte offsets change. Object 6 becomes:

6 0 obj
<< /Length 45
>>

... and the cross-reference table and the startxref offset become:

xref
0 7
0000000000 65535 f
0000000010 00000 n
0000000054 00000 n
0000000111 00000 n
0000000288 00000 n
0000000401 00000 n
0000000476 00000 n
trailer
<< /Size 7
   /Info 1 0 R
   /Root 2 0 R
>>
startxref
578
%%EOF
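The Unix and Windows numbers above can be reproduced with a short script. The following is a minimal Python sketch, not part of the original page: the names `HEADER`, `OBJECTS`, `CONTENT` and `layout` are made up for illustration, and it assumes the file consists of exactly the lines shown in the listing, with no trailing spaces before the cross-reference table.

```python
HEADER = "%PDF-1.4"

# Bodies of objects 1 to 5, copied from the listing above, one string per line.
OBJECTS = [
    ["<< /Title (Hallo Welt) >>"],
    ["<< /Type /Catalog", "   /Pages 3 0 R", ">>"],
    ["<< /Type /Pages", "   /MediaBox [0 0 595 842]", "   /Resources",
     "   << /Font << /F1 4 0 R >>", "      /ProcSet [/PDF /Text]", "   >>",
     "   /Kids [5 0 R]", "   /Count 1", ">>"],
    ["<< /Type /Font", "   /Subtype /Type1", "   /BaseFont /Helvetica",
     "   /Encoding /WinAnsiEncoding", ">>"],
    ["<< /Type /Page", "   /Parent 3 0 R", "   /Contents 6 0 R", ">>"],
]

# Lines of the content stream (object 6).
CONTENT = ["/F1 48 Tf", "BT", "72 746 Td", "(Hallo Welt) Tj", "ET"]


def layout(eol):
    """Return (stream_length, object_offsets, xref_offset) for a line ending."""
    # The line break right before "endstream" is not counted in /Length.
    stream = eol.join(CONTENT)
    bodies = [list(lines) for lines in OBJECTS]
    bodies.append(
        [f"<< /Length {len(stream)}", ">>", "stream", *CONTENT, "endstream"]
    )

    position = len(HEADER + eol)  # the first object starts right after the header
    offsets = []
    for number, lines in enumerate(bodies, start=1):
        offsets.append(position)
        text = eol.join([f"{number} 0 obj", *lines, "endobj"]) + eol
        position += len(text)  # everything is ASCII, so characters == bytes
    return len(stream), offsets, position


print(layout("\n"))    # Unix line endings
print(layout("\r\n"))  # Windows line endings
```

With `"\n"` this yields a stream length of 41, the object offsets 9, 50, 102, 268, 374, 443 and an xref offset of 534; with `"\r\n"` it yields 45, the offsets 10, 54, 111, 288, 401, 476 and 578, matching both listings.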

Explanation

Alright, this is fine and all, but what does everything mean? We'll cover everything in detail later, but here's an overview:

%PDF-1.4

We declare this PDF file as version 1.4

1 0 obj
<< /Title (Hallo Welt) >>
endobj

The metadata of this document - in this case it's just the title.

2 0 obj
<< /Type /Catalog
   /Pages 3 0 R
>>
endobj

The "Catalog" object contains various references to objects (marked with the R) to aid the interpreter. In this case it's only a reference to the "Pages" dictionary.

3 0 obj
<< /Type /Pages
   /MediaBox [0 0 595 842]
   /Resources
   << /Font << /F1 4 0 R >>
      /ProcSet [/PDF /Text]
   >>
   /Kids [5 0 R]
   /Count 1
>>
endobj

The "Pages" dictionary holds the settings shared by all pages as well as a list of the individual pages. In this case, we define the page size to be 595 x 842 points. Since 1 pt = 1/72 inch, the page is about 21 x 29.7 cm large (A4 size).
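The unit conversion can be checked with a few lines of Python. This is only an illustrative sketch; the constant and function names are made up here and do not come from any PDF library.

```python
POINTS_PER_INCH = 72   # 1 pt = 1/72 inch
CM_PER_INCH = 2.54


def points_to_cm(points):
    """Convert a length in PDF points to centimetres."""
    return points / POINTS_PER_INCH * CM_PER_INCH


width_cm = points_to_cm(595)
height_cm = points_to_cm(842)
print(f"{width_cm:.2f} x {height_cm:.2f} cm")
```

The result is very close to the nominal A4 size of 21.0 x 29.7 cm; the small difference comes from rounding the page size to whole points.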

Additionally, we add a "Resources" dictionary. It contains the resources needed to draw the page: a font called "F1" (an arbitrary name) that is defined in object number 4, as well as a declaration that this PDF only uses basic structures and text drawing. Last but not least, we can see a list and a count of the individual page objects. In this case we only have one page.

4 0 obj
<< /Type /Font
   /Subtype /Type1
   /BaseFont /Helvetica
   /Encoding /WinAnsiEncoding
>>
endobj

This object contains the definition of the font (which was referenced using 4 0 R). Here we use a built-in font, in which case this information (encoding and font name) suffices.

5 0 obj
<< /Type /Page
   /Parent 3 0 R
   /Contents 6 0 R
>>
endobj

On the "Page" object, we can override the inheritable settings of the "Pages" dictionary, such as /MediaBox and /Resources. But this wouldn't make any sense in this case. Instead, we only reference the object that holds the actual page contents. Note that we have to use a reference here: the page contents are a stream, and streams can only be referenced, never embedded directly in the parent object (as is possible with other kinds of objects).
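How a reader might resolve a setting that the page itself does not define can be sketched as follows. This is a simplified, hypothetical model: objects are plain Python dictionaries keyed by object number, references are plain integers, and `inherited` is a made-up helper name, not the API of a real PDF library.

```python
def inherited(objects, number, key):
    """Look up `key` on an object, falling back to its /Parent chain."""
    node = objects[number]
    while True:
        if key in node:
            return node[key]
        parent = node.get("Parent")
        if parent is None:
            return None  # reached the root of the page tree without a match
        node = objects[parent]


# Objects 3 and 5 of the example file, reduced to the relevant keys.
objects = {
    3: {"Type": "Pages", "MediaBox": [0, 0, 595, 842], "Kids": [5], "Count": 1},
    5: {"Type": "Page", "Parent": 3, "Contents": 6},
}

print(inherited(objects, 5, "MediaBox"))
```

The page (object 5) has no /MediaBox of its own, so the lookup falls through to its parent (object 3). If the page did define one, that value would win.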

6 0 obj
<< /Length 41
>>
stream
/F1 48 Tf
BT
72 746 Td
(Hallo Welt) Tj
ET
endstream
endobj

The page contents object is a so-called stream object. Streams are generally used to embed long (often binary) strings or external files. Here we use them for describing the page content of our "Hello World" page. Let's look at the individual lines:

/F1 48 Tf        % We want to use the font called "F1" using the font size 48
BT               % Marks the beginning of a text object
72 746 Td        % Sets the text position, measured from the bottom left corner -
                 % 72 pt from the left, 746 pt from the bottom edge
(Hallo Welt) Tj  % Our actual text
ET               % Ends the text object

xref
0 7
0000000000 65535 f 
0000000009 00000 n 
0000000050 00000 n 
0000000102 00000 n 
0000000268 00000 n 
0000000374 00000 n 
0000000443 00000 n 

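The cross-reference table lists, for every object, its byte offset measured from the beginning of the file. The line "0 7" says that the table describes 7 objects, starting with object number 0. Every entry is exactly 20 bytes long, including its line ending: a 10-digit offset, a 5-digit generation number and the letter n (in use) or f (free). The first entry is special: it belongs to object 0, which is always free and must carry the generation number 65535.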
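The fixed 20-byte layout of an entry can be illustrated with a small formatting helper. This is a sketch in Python; `xref_entry` and its parameters are names chosen for this example only.

```python
def xref_entry(offset, generation=0, in_use=True, eol=b" \n"):
    """Format one cross-reference entry; `eol` must be exactly two bytes."""
    if len(eol) != 2:
        raise ValueError("the end-of-line marker of an entry must be 2 bytes")
    flag = "n" if in_use else "f"
    # 10-digit offset, space, 5-digit generation, space, flag = 18 bytes
    entry = f"{offset:010d} {generation:05d} {flag}".encode("ascii") + eol
    assert len(entry) == 20
    return entry


print(xref_entry(0, generation=65535, in_use=False))  # entry for object 0
print(xref_entry(9))                                  # entry for object 1
print(xref_entry(10, eol=b"\r\n"))                    # Windows variant
```

The two-byte end-of-line marker explains why the Unix entries end in a space followed by a line feed, while entries written with CRLF need no trailing space.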

trailer
<< /Size 7
   /Info 1 0 R
   /Root 2 0 R
>>

The trailer states how many entries the cross-reference table has (/Size, here 7 for objects 0 to 6), as well as which object contains the metadata of the file (/Info, in this case the first object) and which object is the "Catalog" of our file (/Root).

startxref
534

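We give the byte offset of the cross-reference table (the position of the xref keyword), measured from the beginning of the file. A reader starts at the end of the file, reads this number and then jumps straight to the table.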
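The lookup a reader performs can be sketched like this. It is a minimal Python illustration that assumes a file with a single, classic cross-reference table like the one on this page; `find_xref_offset` is a hypothetical helper name, and the sample bytes are a stripped-down stand-in rather than a complete PDF.

```python
def find_xref_offset(data):
    """Return the byte offset stored after the last startxref keyword."""
    keyword = b"startxref"
    marker = data.rfind(keyword)  # search from the end of the file
    if marker == -1:
        raise ValueError("no startxref keyword found")
    tokens = data[marker + len(keyword):].split()
    if not tokens:
        raise ValueError("startxref is not followed by an offset")
    return int(tokens[0])


# Stripped-down sample: a header directly followed by an xref table.
sample = (
    b"%PDF-1.4\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n9\n%%EOF\n"
)

offset = find_xref_offset(sample)
print(offset, sample[offset:offset + 4])
```

In the sample, the header takes 9 bytes, so the stored offset 9 points exactly at the xref keyword.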

%%EOF

Finally, we conclude our first PDF document with an end-of-file marker.

Next up: File structure
