[RT 98548] hooks for line-splitting #19
This one is going to take some additional thought about how much hyphenation should be handled by the basic PDF::Builder code, and how much should be left to higher-level typesetting packages (i.e., paragraph shaping). I'll mark it as 'stalled' for now, but leave it open.
I am looking at either Text::Hyphen or TeX::Hyphen (or maybe a new package using the best of both), and Text::KnuthPlass to do paragraph shaping. This will improve hyphenation to do true language-sensitive word splitting. I may go ahead and put one of the hyphenation routines in (optional) to expand upon the current hyphenation, and use it later in paragraph shaping for full-blown document layout. One thing to keep in mind with hyphenation is that it is language-dependent, and needs a way to switch on the fly between languages within text.

Added note: the current line-breaking is a simple "greedy" algorithm that fits as much as possible on a line; it is not Knuth-Plass paragraph shaping (which produces the typographically best-looking paragraph). It also doesn't use Knuth-Liang or any other language-sensitive word splitting, just some language-independent obvious places. At some point I plan to offer Knuth-Plass shaping (with Knuth-Liang word splitting); I may or may not be able to backfit Knuth-Liang into the existing line-breaking code (such as the ...)
#183 (CTS 20, Hyphenation) is probably the correct place to discuss word splitting (hyphenation) for existing line-fitting routines (e.g., ...)
Subject: hooks for line-splitting
Date: Tue, 02 Sep 2014 12:13:47 -0400
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry
PDF::API2 v2.022, Perl 5.16.3, Windows 7; severity: Wishlist
Content.pm's text_fill_*() methods can currently only split a line at a space (x20) character. It would be good to be able to properly hyphenate words, to better fill a line. It's easy enough to split at camelCase, internal non-letters (hard hyphens, digits, punctuation), and at soft hyphens (SHY, xAD). It's fairly involved to properly split complete words, and different languages have different rules. I think that the first three cases could be implemented in the text_fill_*() methods, but we might have to pass control to a user-supplied routine for splitting of complete words.

Mon May 04 00:09:59 2015 steve [...] deefs.net - Correspondence added
econtrario contributed a patch to implement part of this a few months ago:
https://bitbucket.org/ssimms/pdfapi2/pull-request/2/_text_fill_line-with-space-hyphen-and-soft/diff
[This repository has disappeared. Perhaps it's somewhere on GitHub now? -- Mod.]
It needs some tests to be added.
Mon May 04 00:10:00 2015 The RT System itself - Status changed from 'new' to 'open'
Subject: Re: [rt.cpan.org #98548] hooks for line-splitting
Date: Mon, 4 May 2015 16:10:44 -0400
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry
Steve,
I'm a bit concerned that the new code is using hard-coded single byte encoding for SHY and (?) xC2. xAD is SHY in Latin-1, but xC2 appears to be Â, so I'm not sure what encoding this is. At any rate, before committing any new non-ASCII character handling code, I think we should decide how we want to handle various encodings. Splitting words will require knowing if we're in the middle of a single multibyte character.
Subject: [rt.cpan.org #98548]
Date: Sun, 24 Jan 2016 16:40:58 -0500
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry
Ah, I see I misread the code. It's apparently not two Latin-1 characters xC2 and xAD (one of which is SHY), but a UTF-8 representation of a SHY. A few comment lines in the code would have helped. Anyway, we still have the issue of what character encoding we're working in -- can we count on one particular encoding, or should we be able to handle a variety of encodings? Many, if not all, of the font sets supplied for PDF appear to be in something close to Windows-1252 (more or less Latin-1), so can we even work with UTF-8 text? Before we embark on changes hard-coded for one encoding or another, let's be clear what character encodings are even possible to use.
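For reference, a small self-contained Perl sketch (independent of PDF::API2) showing why the xC2 xAD byte pair is one character, not two, once the string is decoded from UTF-8:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as read from a UTF-8 source: "co", then 0xC2 0xAD, then "op"
my $bytes = "co\xC2\xADop";

# Without decoding, Perl sees six separate bytes -- xC2 and xAD look
# like two Latin-1 characters (A-circumflex and SHY).
print length($bytes), "\n";    # 6 (bytes)

# After decoding, the xC2 xAD pair collapses into the single character
# U+00AD (SOFT HYPHEN), which splitting code can test for directly.
my $chars = decode('UTF-8', $bytes);
print length($chars), "\n";    # 5 (characters)
print "found SHY\n" if $chars =~ /\x{00AD}/;
```

This is why splitting code that scans raw bytes for xAD alone risks landing in the middle of a multibyte character.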
Wed Feb 17 16:49:59 2016 steve [...] deefs.net - Correspondence added
Given encoding issues and the complications of implementing hyphenation rules for multiple languages, this is something that's better left to an add-on module.
Wed Feb 17 16:50:00 2016 steve [...] deefs.net - Status changed from 'open' to 'rejected'
Subject: [rt.cpan.org #98548]
Date: Thu, 18 Feb 2016 15:34:39 -0500
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry
True, but should we think about building in some simple word-splitting scenarios? I would really like to split words after hyphens, but beyond that it could get messy with non-ASCII characters. You don't want arbitrary (non-language-sensitive) word splitting between accented Latin characters and ASCII letters without being fully aware of the encoding used. You also don't want to end up accidentally splitting within a UTF-8 multibyte character. Em and en dashes, non-breaking spaces, soft hyphens, and various-thickness space characters will depend on the encoding.

ASCII characters and text are easy enough, but what to do about anything not ASCII? Perhaps allow splitting only between ASCII characters (0xxxxxxx bytes) for now? That should be safe for multibyte UTF-8 characters, as every byte of a non-ASCII UTF-8 character starts with a 1 bit (1xxxxxxx). I think we could safely break between ASCII characters at hyphens and other non-letters in the range x21..x7E, and at letter transitions (letter to non-letter, or non-letter to letter, as well as lower-to-upper and upper-to-lower camelCase). Would that be useful?

A dummy hook might be put in for future calling of user-supplied hyphenation routines for various encodings and languages, or the spot could just be marked in the code for now. For English, at least, a minimum of two characters must be left on each line, and care is needed not to split something like O'Mallory into O'-Mallory, or to treat "Ma" as an upper-to-lower camelCase transition and split it O'M-allory.
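A minimal sketch of a subset of the rules proposed above (split after a hard hyphen and at lower-to-upper camelCase, only between plain ASCII bytes, leaving at least two characters on each side). The function name `split_points` is hypothetical, not part of PDF::API2:

```perl
use strict;
use warnings;

# Hypothetical sketch: return candidate split indices for one word.
# Only hyphen and lower-to-upper camelCase rules are implemented here;
# the full proposal also covers other letter/non-letter transitions.
sub split_points {
    my ($word) = @_;
    my @points;
    for my $i (1 .. length($word) - 1) {
        my $prev = substr($word, $i - 1, 1);
        my $next = substr($word, $i,     1);
        # only consider splits between plain ASCII bytes (0xxxxxxx),
        # so we can never land inside a UTF-8 multibyte character
        next unless $prev =~ /[\x00-\x7F]/ && $next =~ /[\x00-\x7F]/;
        push @points, $i if $prev eq '-';                         # after a hyphen
        push @points, $i if $prev =~ /[a-z]/ && $next =~ /[A-Z]/; # camelCase
    }
    # leave at least two characters on each side of any split
    return grep { $_ >= 2 && $_ <= length($word) - 2 } @points;
}
```

Note that O'Mallory yields no split points under these two rules: the apostrophe rule is not implemented, and "Ma" is an upper-to-lower transition, which this sketch deliberately ignores.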
I'd sure like to get some other people participating in this discussion, to get some more viewpoints and algorithm experience. Perhaps we should just go ahead with starting the add-on module with the above simple algorithm, and flesh it out over time?
April 03, 2017, 11:05:30 AM by Phil
Status report: I have started the implementation of a hyphenation (word splitting) routine so that text-fill methods work better. Right now, it splits at soft and hard hyphens, in camelCase, after runs of digits, after runs of ASCII letters, and after specific ASCII punctuation. Currently, it does not recognize non-ASCII letters or punctuation. It is split out into an independent module, Hyphenate_en, for English hyphenation rules (other languages will get their own modules). Currently, I don't have any code to split normal English words (all letters), and am looking at sources for hyphenation algorithms.
Note that this looks only at one line at a time. It does not implement fancier algorithms, such as Knuth-Plass, that attempt to balance hyphenation over multiple lines to avoid multiple consecutive lines ending in a hyphen, and "rivers of whitespace" flowing down a paragraph.
This code is still preliminary, and could be significantly revised. Among the issues I'm looking at:
It is possible that additional parameters or options will be added to permit the fine-tuning of behavior without having to edit the code. For instance, turning on and off camelCase splitting.
I'm looking at ways to suppress splitting at a given point in an input string, via a flag (option) or added parameter. This would be for cases where the automatic hyphenation is splitting a word at an inappropriate point, and you want to suppress that split on a case-by-case basis, without disabling other instances of such a split via a flag/option or code change.
Should a lot of common (non-language-specific) material be pulled out into another (common) routine, or leave it in Hyphenate_$lang.pm? Should things like hard and soft hyphens, camelCase, punctuation, runs of letters and digits, etc. be under language-specific routines, or put into one common routine? I'm not familiar with word-splitting rules in other languages (except that German may require doubling of the last letter at a split in some circumstances).
Non-ASCII letters and punctuation are not yet supported. I am updating the documentation to remind users of PDF::API2 to convert strings containing non-ASCII characters ($string = Encode::decode(SOURCE_ENCODING, $source)). This will affect camelCase, runs-of-letters, and punctuation splitting. It is important to get en- and em-dashes supported as split points. (They should also not be split before the dash.)
Clarification on priority levels for splitting. Currently all split points are treated equally (build up large @splitLoc), but should it be configurable to look for highest-priority split points (e.g., hard and soft hyphens first) and split there if any found, and only if none are found, move down to the next split priority? Flags would be added to set this behavior.
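The tiered-priority idea could be sketched as follows (a hypothetical helper, not existing code; each tier is an arrayref of split points already found for that rule class):

```perl
use strict;
use warnings;

# Hypothetical sketch of tiered split priorities: try the highest
# tier first (e.g., hard and soft hyphens), and only fall back to
# lower tiers (camelCase, punctuation, ...) when it is empty.
sub prioritized_splits {
    my @tiers = @_;               # each tier is an arrayref of points
    for my $tier (@tiers) {
        return @$tier if @$tier;  # first non-empty tier wins
    }
    return ();                    # no split points at all
}
```

With a flag, the current behavior (one flat @splitLoc) would simply pass all points as a single tier.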
Right now, once the word has been examined for potential split points, it searches from right to left for the first split that fits the specified width. I'm thinking about speeding up the trial-and-error splitting by estimating the start point based on $width/$em, and finding the closest @splitLoc entry to start at. In real life, most words are reasonably short, and it may not be worth the extra code to speed up the rare extra-long word.
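The estimated-start idea might look like the sketch below, under the simplifying assumption that every character is roughly $em wide (real code would still measure the actual substring width; the function name `best_split` is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sketch: given a target width, an average glyph width,
# and split points sorted ascending, estimate how many characters fit
# and take the largest split point at or below that estimate, instead
# of trial-and-error from the right end of the word every time.
sub best_split {
    my ($width, $em, @splitLoc) = @_;
    my $guess = int($width / $em);      # chars that roughly fit
    for my $loc (reverse grep { $_ <= $guess } @splitLoc) {
        return $loc;                    # largest point that fits
    }
    return undef;                       # nothing fits on this line
}
```

As noted, most words are short enough that the flat right-to-left scan may be just as fast in practice.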
Are soft hyphens always 173 (xAD)? For many single-byte encodings, they may not be. If we require internal wide encoding (UTF-8) for any string containing non-ASCII characters, this would likely be a non-issue.
The text-fill utility first splits up lines on ASCII blanks (0x20). Are there any other forms of "spaces" that preliminary splitting should also be done on? Naturally, required blanks (non-breaking spaces) should not be treated this way (as a point to split text into words). Should runs of spaces be condensed into single spaces (like HTML does), or honored in the PDF output?
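The distinction between breakable and required blanks can be shown in a few lines: splitting on x20 runs condenses ordinary spaces, while a non-breaking space (U+00A0) stays glued inside its "word":

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sketch: preliminary word-splitting on ASCII blanks (x20) only.
# "10<NBSP>kg" contains a non-breaking space (U+00A0, xC2 xA0 in
# UTF-8), so it must survive as a single unbreakable word, while
# the doubled blank before "flour" condenses to one separator.
my $text  = decode('UTF-8', "10\xC2\xA0kg of  flour");
my @words = split /\x20+/, $text;
# @words is now ("10\x{00A0}kg", "of", "flour")
```

Whether runs of ordinary spaces should instead be honored in the PDF output is the open question above; `split /\x20/` (no `+`) would preserve them as empty fields.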
This "greedy" algorithm looks only at one line at a time, and can hyphenate the last word on a page, which is undesirable. It can also leave a single line ("widow") at the end of a paragraph, to go to the next page, and can leave an undesirably short final line on a paragraph. Finally, it will split on hard hyphens ("-" within a word), which may be undesirable (a flag could control this). It may be a good idea to revisit the whole concept of paragraph formation (per Knuth-Plass?) to look at the paragraph and its line breaking as a whole. This would be quite a change to the existing code, so perhaps it should be put off to a new module.
If Hyphenate_$lang.pm is not installed, should we fall back to English (en), the current behavior, or turn off hyphenation altogether? A lot of non-language-specific splitting could still be done.

I'm probably going to lay off this for a few weeks or more, as tax time is approaching and I also have some repairs and cleanup around the house I need to make after a snowy winter. I'll keep thinking about it, but I won't promise that this will be out in release 3.003. Also, even when it is released, I want developers and users of PDF::API2 to be aware that the final hyphenation capabilities may not be written in stone for some time to come! I'd like to have people playing with this, and giving feedback on what could be improved, before calling it a final release (because some changes may not be backwards compatible).
April 04, 2017, 05:50:44 PM by Phil
Re: #9. Code has been revised to not split on a hard (explicit) hyphen (-), by default. There is a switch to control it. It also will not split at the end of a digit run, letter run, or after punctuation, if the next character is a hard hyphen (-).