Skip to content

PhilterPaper/Text-KnuthPlass

Repository files navigation

Text-KnuthPlass

A line-splitting (paragraph shaping) library for Perl.

What is it?

Text::KnuthPlass uses the famed Knuth-Plass line-splitting algorithm (used in TeX and LaTeX) to break up a text string into a properly "shaped" paragraph. Certain rules are followed to not only efficiently pack the paragraph into a minimal number of lines, but also to minimize hyphenation, keep line density fairly constant, and take other measures to ensure that the output is typographically "nice looking". It works with both fixed-width fonts and with proportional (variable-width) fonts, where you supply the font library that calculates word lengths (e.g., PDF::Builder's advancewidth() method). Text::KnuthPlass permits varying line lengths, to allow text to flow around other objects, such as illustrations. It also makes use of (by default) Text::Hyphen, a library to indicate where words can be split (for hyphenation purposes).

See also this blog on Paragraph Shaping for a deeper dive into the subject.

Home Page, including Documentation and Examples.

Open Issues PRs Welcome Maintenance

Text::KnuthPlass is a Perl and XS (C) implementation of the well-known Knuth-Plass TeX paragraph-shaping (a.k.a. line-breaking) algorithm, as created by Donald E. Knuth and Michael F. Plass in 1981.

Given a long string containing the text of a paragraph, this module decides where to split a line (possibly hyphenating a word in the process), while attempting to:

  • maintain text "tightness" within a reasonable, comfortably readable, range (neither jammed together nor excessively loose).
  • maintain fairly consistent text "tightness" (limited change from line to line).
  • minimize the amount of hyphenation overall (words split at a line end).

What is a stated objective of Knuth-Plass but I don't think this implementation directly does:

  • minimize the number of lines resulting.
  • not have two or more hyphenated lines in a row.
  • not have entire words "floating" over the next line (particularly when not fully justified, e.g., "ragged right").
  • not hyphenate the paragraph's penultimate line.

What it definitely doesn't do:

  • attempt to avoid widows and orphans. This is the job of the calling routine, as Text::KnuthPlass doesn't know how much of the paragraph fits on this page (or column) and how much has to be spilled to the next page or column. It simply splits up the entire paragraph, and leaves it to your code to render.
  • attempt to avoid hyphenating the last word of the last line of a split paragraph on a page or column (as before, it doesn't know where you're going to be splitting the paragraph between columns or pages).
  • attempt to optimize over an entire page (it handles one paragraph at a time).
  • avoid having the same word (or fragment) starting or ending two lines in a row (a "stack"). This is undesirable because it makes it easier to mistrack while reading, and accidentally skip up or down a line.
  • avoid near-vertical "rivers" of whitespace.
  • avoid a very short or single word last line (a "cub").

In spite of these limitations, the Knuth-Plass ("TeX line splitting") algorithm is still pretty much the gold standard for paragraph shaping.

The Knuth-Plass algorithm does this by defining "boxes", "glue", and "penalties" for the paragraph text, and fiddling with line break points to minimize the overall sum of demerits (a penalty value for various "bad typesetting" gaffes). This can result in the "breaking" of one or more of the listed rules, if it results in an overall better scoring ("better looking") layout.

Text::KnuthPlass handles word widths by either character count, or a user-supplied width function (such as based on the current font and font size). It can also handle varying-length lines, if your column is not a perfect rectangle (see examples).

Installation

perl Build.PL
./Build
./Build test
./Build install

Note that if the XS (C) code fails to build and install for some reason, or you enjoy watching paint dry, you can still run "pure Perl" code -- it's much slower, but will always run. In lib/Text/KnuthPlass.pm, look for the flag setting use constant purePerl => 0; and change it to a value of 1.

Documentation

After installation, documentation can be found via

perldoc Text::KnuthPlass

or

pod2html lib/Text/KnuthPlass.pm > KnuthPlass.html

Support

Bug tracking is via

"https://github.com/PhilterPaper/Text-KnuthPlass/issues?q=is%3Aissue+sort%3Aupdated-desc+is%3Aopen"

(you will need a GitHub account to create a ticket, or contribute to a discussion, but anyone can read tickets.) The old RT ticket system is closed.

Do NOT under ANY circumstances open a PR (Pull Request) to report a bug. It is a waste of both your and our time and effort. A PR is an offering of code that you think belongs permanently in the product. Instead, simply open a regular ticket (issue) in GitHub, and attach a Perl (.pl) program illustrating the problem, if possible. If you believe that you have a good program patch, and offer to share it as a PR, we may give the go-ahead. Unsolicited PRs may be closed without further action.

License

This product is licensed under the Perl license. You may redistribute under the GPL license, if desired, but you will have to add a copy of that license to your distribution, per its terms.

(c)copyright 2020-2023 by Phil M Perry; earlier copyrights held by Simon Cozens

History

Around 2009, Bram Stein wrote a Javascript implementation of the Knuth-Plass paragraph fitting algorithm named typeset (not to be confused with the language typescript, nor the publishing system Typeset). It may be found on GitHub in bramstein/typeset, and does not appear to be maintained (last update 2017). In 2011, Simon Cozens ported typeset to Perl, and called it Text::KnuthPlass, maintaining it for only a short time. In 2020, Phil Perry took over maintenance of this package.

Note: gitpan/Text-KnuthPlass (on GitHub) appears to be a Read-Only archive of Text::KnuthPlass from before Perry took over maintenance. It is old, and thus not very useful. See PhilterPaper/Text-KnuthPlass for the latest code.

There are many copies of the Knuth-Plass paper/thesis, as well as discussions and explanations of the algorithm, floating around on the Web, so I will leave it to you to find some examples. Just the keywords Knuth and Plass should get you there. Rather than my listing everything here, pay a visit to my discussion on the subject. There is also a list of criteria of what makes good paragraph shaping.

There is also a refactored (still Javascript) version of typeset, intended for use as a library, in frobnitzem/typeset. Finally, there are a number of Knuth-Plass implementations in other languages, such as Python (akuchling/texlib) and typescript (avery-laird/breaker) that could be studied. And of course, there is the original Knuth-Plass paper and the annotated listing in TeX: The Program. It's just a matter of finding the time to go through all these sources and extend Text::KnuthPlass in various ways.

An Example

Find an example of using Text::KnuthPlass in examples/PDF/Flatland.pl, derived from the example in typeset. It assumes that Text::Hyphen and PDF::Builder are installed. You can easily substitute PDF::API2 and change the PDF::Builder references in the code. You can change many settings, such as the font, font size, indentation amount, leading, line length (in Points), and whether output is flush right or ragged right. The output file is Flatland.pdf.

There are more examples, including KP.pl and Triangle.pl, both giving some usage examples to get various effects, for a variety of input texts. Both PDF and text file outputs are produced.