Update README.md
sl2c authored Aug 3, 2023
1 parent d86b01b commit cb35249
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion README.md
@@ -67,7 +67,9 @@ A few other things have to be noted as well. First, note that in order to acompl
```
So, each leaf (element) of the tree list is itself a list of 2 or 3 elements: the first two elements are the command and the list of its arguments (an empty list if the command has no arguments), while the optional third element is present only when the leaf is a block of commands, in which case it is the tree list of the commands that make up the block. By design, there's just one type of block: the BT/ET text block, which uses the name 'BT' in the parsed tree. Note that the original PDF stream contains just a sequence of commands; it is the PdfStream parser that creates these blocks in its output for convenience. To further familiarize yourself with the structure of the stream parser's output, try inserting a command like pprint(treeIn) right after the call to the parser.
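
To make the leaf layout described above concrete, here is a minimal sketch that walks such a tree recursively. The `walk` helper and the `sample_tree` literal are hypothetical and written by hand for illustration; in particular, how the real parser represents individual arguments (strings, numbers, names) is an assumption here, so compare against actual pprint output before relying on it.

```python
def walk(tree, depth=0):
    """Yield (depth, command, args) for every leaf, descending into blocks."""
    for leaf in tree:
        command, args = leaf[0], leaf[1]
        yield depth, command, args
        # A third element, when present, is the tree list of a block (e.g. 'BT').
        if len(leaf) == 3:
            yield from walk(leaf[2], depth + 1)


# Hand-written stand-in for parser output; argument types are assumed.
sample_tree = [
    ['q', []],
    ['BT', [], [
        ['Tf', ['/F1', 12]],
        ['Tj', ['(Hello)']],
    ]],
    ['Q', []],
]

for depth, command, args in walk(sample_tree):
    print('  ' * depth + command, args)
```

Commands inside the 'BT' block come out one level deeper than the commands around it, which mirrors the nesting the parser adds on top of the flat stream.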

Note also that the _toArray_ function, however useful, has _not_ been implemented in the module, so you have to code it yourself every time you do the parsing. This may sound strange, but it follows from the same design principles: the module just parses the stream; it's up to the developer to code everything else.

The PdfStream parser class is implemented in pure Python in just about 300 lines of code with the help of the popular (and pure Python!) [SLY](https://github.com/dabeaz/sly) parser generator library by [David Beazley](https://github.com/dabeaz/sly); check out his lectures on YouTube, he is terrific! For the curious: the parser uses two parser states (i.e. two different parsers with different grammars that hand control to each other as they operate), one for parsing PDF literal strings (i.e. strings enclosed in parentheses), and the other for everything else. Yep, in order for literal strings to support parentheses as part of the string, the format of PDF literal strings has been made so intricate that it requires a separate parser just to parse them. So, if for some reason (speed?) you ever want to implement the stream parser using another parser generator library, make sure it supports a stack of parser states.
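
To illustrate why literal strings need a state of their own, here is a small hand-written sketch that finds the end of a PDF literal string. It is not the module's SLY-based implementation, and `read_literal_string` is a hypothetical helper: it only tracks nesting depth and skips backslash escapes, and it leaves the string body undecoded.

```python
def read_literal_string(data, start):
    """Return (body, end) for the PDF literal string that opens at data[start].

    body excludes the outer parentheses and is left undecoded;
    end is the index just past the closing parenthesis.
    """
    if data[start] != '(':
        raise ValueError("expected '(' at position %d" % start)
    depth = 0
    i = start
    while i < len(data):
        ch = data[i]
        if ch == '\\':
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == '(':
            depth += 1  # unescaped parentheses may nest if balanced
        elif ch == ')':
            depth -= 1
            if depth == 0:
                return data[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated literal string")


stream = r'(a (nested) \) b) Tj'
body, end = read_literal_string(stream, 0)
print(repr(body), repr(stream[end:]))
```

The depth counter plays the role that a separate parser state plays in a generated parser: while it is non-zero, a closing parenthesis does not end the string, which is something a single flat regular expression cannot express.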

# More text stuff: PdfFont
