[CTS 20] Hyphenation #183
Comments
Comment 2017 May 02 at 12:33:38 by PhilterPaper If I understand your post, you are advising against including hyphenation code (or paragraph-shaping code), and should instead provide an interface to external modules. Is this correct? That's fine by me, and I'm open to simply providing interfaces to good hyphenation/shaping code. Suggestions are more than welcome. Right now, I just have some very simple hyphenation (soft hyphens, camelCase, punctuation, runs of letters or digits, etc., but no word splitting). Hopefully any external modules will cover those, too. Can I presume that you've taken a quick look at 3.003, just released last night? |
Comment 2017 May 03 at 05:29:45 by sciurius Personally I would draw a line between the PDF technical aspects (document structure, graphics, fonts, ...) and typesetting. And paragraph shaping is typesetting (just like changing contrast on images belongs to the realm of image manipulation). It is fine to provide some basic facilities for paragraph shaping but be very careful about adding more, since it won't stop until you have reimplemented LibreOffice. It is fine to provide very basic (and language agnostic) word breaking (i.e. on soft and hard hyphens) but anything else will be great for some and a nuisance for others. Please bear in mind that many users are using PDF::API to produce native language (or mixed language) documents. So do not hyphenate by default but let the user explicitly ask for it. And don't fall back to "en" if support for the user-designated language is not available. FWIW, I would not put Hyphenate_en.pm under PDF::API2::Content, probably better under PDF::API2::Utils or something similar. Also, "_en" is too simple. There's en_US, en_CA, en_UK, and so on. The code in Hyphenate_en.pm is talking about encodings again. Remember, you do not need to deal with encodings in Perl. Just replace the literal 173 by "\x{ad}" and it will just work. If you insist on splitting on punctuation, you may consider using the builtin character class patterns like [:punct:]. |
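The "very basic (and language agnostic) word breaking" described in this comment can be sketched roughly as follows. This is an illustration in Python, not Perl and not the code in Hyphenate_en.pm; the function name `basic_fragments` is hypothetical. The soft hyphen is written as a character escape rather than as the number 173, in the same spirit as the "\x{ad}" suggestion.

```python
SOFT_HYPHEN = "\u00ad"   # written as a character, not as the byte value 173
HARD_HYPHEN = "-"        # U+002D


def basic_fragments(word):
    """Split a word after each hard hyphen and at each soft hyphen.

    Hard hyphens stay attached to the preceding fragment; soft hyphens are
    dropped, since they are only invisible break markers.
    """
    fragments = []
    current = ""
    for ch in word:
        if ch == SOFT_HYPHEN:
            if current:
                fragments.append(current)
            current = ""
        elif ch == HARD_HYPHEN:
            current += ch
            fragments.append(current)
            current = ""
        else:
            current += ch
    if current:
        fragments.append(current)
    return fragments
```

No language rules are involved here: a word without either kind of hyphen comes back as a single fragment, so nothing is ever split unless the author marked a break point.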
Comment 2017 May 03 at 08:53:31 by PhilterPaper All fine points! I welcome critical discussion of what direction this should go in.
So, you would recommend that real paragraph shaping and other typesetting functions be kept out of PDF::API2 and in a separate package (that might call PDF::API2)? That's reasonable. I've been wondering where a good place is to draw the line. Some very, very basic calls like paragraph() and section() were already there and possibly being used, so I'll leave them (unless you can prove that no one is using them). I won't add anything to do markup within a paragraph (bold, italic, etc.) within PDF::API2.
I realize that different languages will have different hyphenation rules, and there may even be different rules for different applications (publishers, etc.). As you say, it would probably be better not to hyphenate by default. If a user does request hyphenation, but does not have their language hyphenation support installed, do you think it would be better to not fall back to 'en' (simply refuse to hyphenate)? This brings up a point that I've long been curious about. In bidirectional (RTL) Middle Eastern languages, what is "left justified" (and thus defining "right justified")? Is it the same side as the "beginning of the line" margin, or is it "left is left"? In other words, to "left justify" Hebrew, would the lines align on the physical right? That is, does justification use a logical left and right, rather than a physical left and right? I suppose the same question arises with Chinese and other East Asian languages when written top-to-bottom... where is justification?
_en was intended to be a basic fallback (at least for English). en_US, etc. should override it (_en would be ignored if en_US was installed and that was the language request). Do you have a specific reason for installing hyphenation support in some other place than Content? Is some other place better?
I'll look at that again.
I didn't want to split on all punctuation, just places where it would make sense to a reader in the flow of the text. For example, you wouldn't want to split at quotation marks or opening brackets. Also, normally a hard hyphen is not a split point. |
Comment 2017 May 04 at 08:16:43 by sciurius
Yes, that's fine. I'd expect the extension package to have similar (and even improved) functions.
Definitely. It is better to have non-hyphenated results than wrongly hyphenated.
I'm sorry, but I'm not familiar with this.
Hyphenate_en.pm doesn't have a relation to PDF. It is a general module providing general functions.
Human texts do not contain punctuation inside words. I think it's a computer-originated idiom to use things like long_variable_names and CamelCaseWords. And I'm not sure whether I'd want these to be split.
Think again. Does the name 'hyphen' ring a bell? |
Comment 2017 May 04 at 09:00:06 by PhilterPaper
How about "computer-originated" (or, "re-educated" or "co-operative")? Other than hard hyphens and apostrophes, you're right that splitting should normally be only within words. However, as a practical matter, when you have long URLs, variable names, and other computer stuff, they're going to need to be split up to fit on lines. I could easily see a long URL that won't fit within an entire line -- are there typesetting conventions for how to deal with that? E.g., split after a / or _, and do/do not hyphenate? Perhaps a long computer word should preferably be given its own line if necessary, and only if it's too long for even that, split it at some point?
Initially I had it always split on a hard hyphen. Then in checking on some English grammatical rules, I read (a number of sources) that a hard hyphen should not be a split point. So I changed it to user-selectable. Maybe it's a language-specific rule? |
Comment 2017 May 05 at 15:30:13 by sciurius
An old typesetter once taught me that if the text doesn't fit nicely, rewrite it. Trying to stretch or squeeze more than a small amount makes the end result ugly. I think the main problem is mixing two things that should be distinct: text paragraphs and arbitrary content. While it is (almost always) possible to automatically format a text paragraph (where 'text' is human prose), arbitrary content cannot. Hence arbitrary content should be typeset 'as is', unformatted, possibly in the form of an example, figure, quote or something appropriate. URLs do not normally occur in formatted text paragraphs, only in badly written articles. Remember, we're producing PDF documents. Why print a long and ugly URL when it can be stashed away as a link? A good example is (not surprisingly) the PDF Reference documentation. It is formatted very well, there are many, many 'computer words' and yet none of them are broken. Probably the most ugly paragraph is at the end of page 420 (ref. version 1.7) where they decided (and, IMHO, correctly) to not break the matrix. Personally, if URLs are needed in the text, I have made a habit of turning them into footnotes. See e.g. http://johan.vromans.org/articles/wxglade.pdf, page 3.
AFAIK, the purpose of a hyphen (hard U+2010, discretionary U+00AD) is to split on. If this is not desired, use non-breaking hyphen (U+2011, yes, the name is confusing). The problem is whether U+002D (ambiguous hyphen) should be treated as U+2010 or as U+2011. Word processor manuals explicitly advise to use non-breaking hyphens where appropriate (e.g. in telephone numbers) so it is safe to consider U+002D to be a split point. It may, however, be wise to add an option to change this default behaviour. |
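The code points named in this comment, together with the suggested option for U+002D, could be modelled along these lines. This is a sketch in Python under assumed names (`break_offsets`, `hyphen_minus_breaks`); it is not an existing PDF::API2 or PDF::Builder interface.

```python
HYPHEN = "\u2010"               # hard hyphen: a break is allowed after it
NON_BREAKING_HYPHEN = "\u2011"  # never a break point
SOFT_HYPHEN = "\u00ad"          # discretionary: a break is allowed here
HYPHEN_MINUS = "\u002d"         # ambiguous; treated as breaking by default


def break_offsets(word, hyphen_minus_breaks=True):
    """Return the offsets after which a line break is allowed."""
    breakable = {HYPHEN, SOFT_HYPHEN}
    if hyphen_minus_breaks:
        breakable.add(HYPHEN_MINUS)
    # A "break" after the final character is just the end of the word, so skip it.
    return [i + 1 for i, ch in enumerate(word[:-1]) if ch in breakable]
```

With the default setting, "well-known" may break after the hyphen; a telephone number written with U+2011 offers no break point under either setting.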
Comment 2017 June 09 at 23:26:15 by PhilterPaper I finally had some time to get back to thinking about this issue, and here's where things stand. First, there are 4 different kinds of hyphens to worry about:
Regarding the non-appearance on most keyboards of the last three hyphens, perhaps user input handling could include some sort of preprocessor, such as escape sequences (e.g., \- is a SHY, \= is a non-breaking hyphen) to turn them into Latin-1 or UTF-8 characters. For now, this is beyond the scope of PDF::Builder, although it might be added later. We will assume that these four hyphens arrive as native (binary) Latin-1 or UTF-8 sequences, and leave it at that.
We also need to consider whether the Unicode hyphen U+2010 and non-breaking hyphen U+2011 should be replaced by normal hyphens (U+002D) for consistent appearance (assuming that possibly the font either doesn't have U+2010 or U+2011, or they look different). Soft hyphens (U+00AD) all need to be removed anyway, so if one ends up being used as a split point, it will be replaced by a normal hyphen.
There is a U+2012 "figure dash", which may look like an en-dash (U+2013), as well as an em-dash and a quotation dash, but I have no plans to deal with these (should we?). It is usually permissible to split a line after an em-dash (without adding a hyphen, of course), but not a figure- or en-dash. A quotation dash, which apparently looks much like an em-dash, is used unpaired before the attribution of a quote, so you probably would never break after it, although possibly before (if it's an inline attribution).
I will make the default not to hyphenate, and let the user explicitly choose to hyphenate. There are a few calls in base PDF::Builder (text fill, paragraph, section, etc.) which should probably implement some level of line (word) splitting, but full-fledged paragraph shaping will not be built into the base PDF::Builder. Paragraph shaping involves getting all the possible word splits in the paragraph, and deciding when and where to hyphenate to get the best appearance.
This can mean minimizing the sum of numeric "penalties" for too many consecutive lines ending in hyphens, "rivers" of white space, splitting of proper names and titles (language- and culture-dependent), widows and orphans (which means you need to find out if the following paragraph will have at least 4 lines of output), hyphenation on the last word of a column (or worse, a page), too-short last lines in a paragraph, and probably other considerations. Such items may be language- and even publisher-dependent, and (at least) the settings would have to be made specific to language and publisher, but an actual paragraph shaping routine might be itself language-independent.
Where words can be split is language-dependent, and may also depend on typesetting standards of a given publisher. English is straightforward in the sense that you simply find a split point (per rules and exceptions list), stick a hyphen on the end of the first fragment, and start the next line with the remainder (which in turn may need to be split again!). Some languages, such as German, may require doubling of one or more letters at the split, complicating calculations for line lengths.
Anyway, if PDF::Builder itself is going to make use of language/publisher specific splitting libraries, there could be a PDF::Builder::WordSplit::Hyphenate_xx_xx module for each flavor, where xx_xx could be just a language code (like "en") or it could be language+country (e.g., en_GB). I will have to look and see if there is some sort of locale information in PDF::Builder, or if it needs to be added. There could even be publisher-specific extensions (e.g., de_DE_SV for Springer-Verlag German-language texts). Hyphenation would not be done if the requested language support module is not found (no fallback to, say, English), but we could consider allowing "en" as a fallback for any English en_XX request, or simply require an exact match to avoid unexpected results (could be a setting).
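The lookup policy just described (exact match by default, no cross-language fallback, and an optional fallback from en_XX to plain "en") might look something like this. It is a Python sketch only; `pick_hyphenator`, `installed` and `allow_base_fallback` are hypothetical names, not part of PDF::Builder.

```python
def pick_hyphenator(requested, installed, allow_base_fallback=False):
    """Return the tag of the hyphenation rule set to use, or None.

    requested: a tag such as "en_GB" or "de".
    installed: set of tags for which rule sets are available (hypothetical).
    """
    if requested in installed:
        return requested
    if allow_base_fallback and "_" in requested:
        base = requested.split("_", 1)[0]   # "en_GB" -> "en"
        if base in installed:
            return base
    return None  # the caller should then leave words unhyphenated
```

Returning None rather than some default language keeps the "better unhyphenated than wrongly hyphenated" rule explicit at the call site.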
Hyphenate_xx_xx() would be fed either a single word (just doing "greedy" line splitting) or an array of the entire paragraph's words, and return in some form both the word fragments and the source of each split: hyphen or non-breaking hyphen (both of which need to be restored), soft hyphen, or by language algorithm. The paragraph shape routine might use different priorities, such as preferring to split on a soft hyphen or a hard hyphen if available (of equal or different priorities), and then try other splits. There might even be some sort of priority value built into the returned data, indicating where the preferred splits are.
Now, besides normal human prose, there can also be "computer" words, such as camelCase and underscore_separated_words, as well as long URLs with /'s and the like. In technical documents it may not always be possible to avoid typesetting such things (although the result may not be all that elegant). The current code splits camelCase between a lowercase and an Uppercase ASCII letter (note that names such as MacDonald could end up being split Mac- Donald, which is undesirable), as well as after runs of letters (ASCII only) or numbers or after certain punctuation. You don't want to split just after opening brackets [ ( { etc., nor opening (left) quotation marks of various kinds, nor just before ] ) } or closing/right quotation marks. To extend these to non-ASCII letters would be difficult enough for Latin-based alphabets, never mind non-Latin! The current code has hard coded switches, and could be extended to make these a hash in the argument list.
We also need to consider whether adding hyphens to a (split) URL or other technical term is risking introducing errors and confusion if the reader thinks the hyphen is actually part of the word! However, it is very easy for URLs etc. to exceed the line length (even in a footnote), and thus require splitting.
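The camelCase rule mentioned above (a split between an ASCII lowercase and an ASCII uppercase letter) can be illustrated with a short Python sketch. It is a rough model with a hypothetical function name, not the actual PDF::Builder code, and it reproduces the MacDonald problem noted in the text.

```python
import re

# Zero-width position between an ASCII lowercase and an ASCII uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def camel_fragments(word):
    """Return the pieces of a word cut at each lowercase/uppercase boundary."""
    cuts = [m.start() for m in _CAMEL_BOUNDARY.finditer(word)]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]
```

Because the rule looks only at letter case, a proper name is cut exactly like an identifier; telling the two apart would need an exceptions list or a user-supplied switch.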
The current hyphenation looks only at the last word in a line (that is too long to fit, unsplit). This is known as "greedy" line splitting, and while it makes a paragraph most compact, it takes no action to prevent orphans and widows, nor other undesirable effects (e.g., hyphenated last word on a page). I'm really not sure whether there is a point to doing full splitting (according to language and publisher rules) for the little that paragraph() and section() will be used in full-bore quality typesetting. It would be nice to allow folding of long URLs and other computerese, but it might be better to do proper line splitting in another package. Thoughts and comments? |
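For reference, "greedy" line splitting as described here can be sketched as below. This is Python for illustration, with assumptions stated plainly: width is counted in characters rather than measured glyph widths, and `split` is a hypothetical hook that returns non-empty fragments without hyphens.

```python
def greedy_lines(words, width, split=None):
    """Fill lines greedily; width is measured in characters for simplicity.

    split, if given, maps a word to a list of non-empty fragments
    (hypothetical hook); a visible hyphen is added where a word is broken.
    """
    lines, current = [], ""
    queue = list(words)
    while queue:
        word = queue.pop(0)
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
            continue
        # The word does not fit: try the longest leading run of fragments.
        placed = False
        if split is not None:
            frags = split(word)
            for n in range(len(frags) - 1, 0, -1):
                head = "".join(frags[:n]) + "-"
                candidate = head if not current else current + " " + head
                if len(candidate) <= width:
                    lines.append(candidate)
                    current = ""
                    queue.insert(0, "".join(frags[n:]))
                    placed = True
                    break
        if not placed:
            if current:
                lines.append(current)
                current = ""
                queue.insert(0, word)   # retry the word on a fresh line
            else:
                lines.append(word)      # overlong word gets its own line
    if current:
        lines.append(current)
    return lines
```

Only the word that overflows is ever considered for splitting, which is why this approach cannot look ahead to avoid widows, orphans, or a hyphen at the end of a page.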
Comment 2017 June 10 at 08:50:52 by sciurius The two major points are: language-neutral basic splitting, and it being turned off by default. To which I wholeheartedly agree. |
Comment 2017 July 25 at 16:22:13 by PhilterPaper I came across something interesting in the PDF-1.7 specification. It suggests that when words are split (at other than a hard hyphen), that a soft hyphen be used (which a reader should display like a hard hyphen). When a screen reader or other text scraper sees the soft hyphen at the end of a line, it knows it can simply discard it when gluing the line back together into a long string. Also, resizable PDF reader displays can then reflow text into longer or shorter lines without introducing spurious hard hyphens in the middle of words. |
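What a text scraper might do with such soft hyphens can be sketched in a few lines of Python. The function name `rejoin` is hypothetical, and this simplified model assumes that every other line end is an ordinary word boundary.

```python
SOFT_HYPHEN = "\u00ad"


def rejoin(lines):
    """Glue extracted lines back into one string."""
    out = ""
    for line in lines:
        if line.endswith(SOFT_HYPHEN):
            out += line[:-1]      # split word: drop the marker, add no space
        else:
            out += line + " "     # ordinary line end: words are separated
    return out.rstrip()
```

A line ending in a hard hyphen gives the scraper no such signal, which is the ambiguity the soft hyphen is meant to remove.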
Comment 2017 December 17 at 17:01:47 by PhilterPaper I just installed Text::Reflow and hope to have some time to play with it soon. I already see two major problems with it:
I haven't even run Text::Reflow yet, so I don't know how it's splitting within words (or if it even does), and what spelling rules it's using to split. Non-English languages and orthographies will have different rules, including repeating a letter on the next line, which could greatly complicate an algorithm that thinks it can have a split point and that's that. Also, ligatures and other glyph substitutions and positioning probably need to be disposed of first, before words are split. At this point, I don't see prereq'ing Text::Reflow for PDF::Builder, but perhaps mining it for algorithms and ideas, to extend into my code. |
See #95 for some additional thoughts on splitting document production from low-level PDF stuff. |
See PhilterPaper/Text-KnuthPlass#9 for some possible problems with Text::Hyphen package word splitting. |
Let's keep this ticket for word-splitting (hyphenation) only. It could be fitted to existing line-splitting routines (e.g.,
This may require a new hyphenation package, as it doesn't look like Text::Hyphen or TeX::Hyphen will do the job. I may open a new ticket to discuss paragraph shaping (line splitting), although there is already Text::KnuthPlass and a number of other non-Perl packages on GitHub. |
See also sciurius/perl-Text-Layout#10 (for a while, it wandered off into a discussion about hyphenation). Some interesting information from Johan from the Dutch perspective. |
Another feature that would be good in a new Text::KnuthLiang package would be to have multiple encodings available. By default, there would be a utf8/ directory with US English patterns (it looks like the US English patterns are pure ASCII, and thus could be copied to any other encoding). The default encoding expected would be UTF-8, but other encoding directories could be added as desired. There might be a Perl tool or utility provided to 1) update any existing pattern/exception file if it is refreshed in the central repository, 2) bring in a new language/encoding from the repository, as desired. I think these files change slowly enough that checking them at each use of Text::KnuthLiang would be excessive. Depending on whether single byte encodings are easily available, a conversion utility to create non-UTF-8 encodings (from UTF-8) might be useful. Regarding the US English patterns being in ASCII, I'll have to check how well they handle words with a dieresis ("umlaut"), such as coöperate, etc. Perhaps that is best handled by an extended exceptions list (such as a supplementary list you could provide alongside the standard list). I don't know if there is a need for additional patterns, but it shouldn't be hard to extend it to look for and read in supplemental pattern files. I don't know if it would be useful to have "fallback" languages and encodings to try, if a word can't be split up with the primary patterns. For example, someone writes primarily in US English, but other authors are spelling in UK English in the same document -- would it be possible, and useful, to have en_UK as a fallback language for words which are UK spelling? Would this be at all useful for the occasional German word, for instance? Ideally, a non-English word would be marked as de_DE and the appropriate pattern and exception files looked for. 
Add: Note that TeX::Hyphen includes a "style" parameter apparently to permit specification of non-ASCII characters in patterns and exceptions by means of (sort of) TeX-like escape sequences, e.g., \'a to create á. This might make an interesting package in its own right, but I'm not sure if it should be part of a hyphenation package. Anyway, keep this in mind somewhere. Add: Perhaps use only UTF-8 patterns (and exceptions), and on-the-fly convert other encodings to UTF-8? I think there are Perl packages to do such encoding to and from UTF-8 (as you would have to return the result in the original encoding, after the UTF-8 spelling is split). |
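The second "Add" above (keep patterns in one encoding, convert input on the fly, and hand the result back in the caller's encoding) amounts to a decode/split/encode round trip. A small Python sketch, in which `split_encoded` and the `splitter` callback are hypothetical stand-ins for whatever hyphenation routine is eventually used:

```python
def split_encoded(word_bytes, encoding, splitter):
    """Split a word supplied as bytes in some single- or multi-byte encoding.

    The word is decoded to Unicode text, split by `splitter` (a hypothetical
    callback taking text and returning a list of text fragments), and each
    fragment is re-encoded into the caller's original encoding.
    """
    text = word_bytes.decode(encoding)
    return [fragment.encode(encoding) for fragment in splitter(text)]
```

The pattern files then only ever need to exist in one form, at the cost of a conversion on each call.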
Hyphenation is a subconcern of line breaking (to break up a string to fit within paragraphs of a given width). See also the UniWrap.pm for line breaking (though it's quite out of date). A better choice might be the Unicode::LineBreak package, for showing mandatory, prohibited, and allowed line break points in any language. I have not yet tried out this package, to see if it works as promised! See also http://unicode.org/reports/tr14/ for the official specification. It still may be necessary to split a word (hyphenate it) to best fit a line (see Text::KnuthPlass, or a simple "greedy" algorithm), which is used by line breaking, but not the concern of the Unicode standard. |
It would be good to give some assistance in splitting words, not only where splits are permitted, but any changed/deleted/added letters. For example, I understand in Dutch (Netherlands) that Drucker would be hyphenated Druck-ker. This can change as nations officially change their orthography rules. We need to be careful, if there is more than one split point, not to give all the modified letters -- for example, say there is a word "Farfhendrucker" (I just made that up), split at "Far-fhen-druck-er". If using the last split point, it would be "Farfhendruck-ker", but an earlier split point would be "Far-fhendrucker". Thus, you could not return the split list as "Far-fhen-druck-ker", because simply gluing together pieces would give you "Far-fhendruckker", which is incorrect. It might be necessary to return a list of all possible split points: (Far-fhendrucker, Farfhen-drucker, and Farfhendruck-ker). If a word is long enough that it might potentially be split over multiple lines -- well, that's complicated! Even in English, there might be issues with diacritics, particularly the use of a dieresis (e.g., naïve). Possibly, if split between a and ï (na-ïve), the "i" might lose its dieresis (na-ive). That will have to be investigated. Finally, what is to be done when ligatures come into the picture? Would it be best to hyphenate first, and then allow something like HarfBuzz::Shaper to place ligatures on the fragments? I would expect a ligature to slightly reduce the length of a word [fragment], usually not enough to warrant avoiding (or moving) a split, although that's certainly possible. I just want to avoid the overhead of figuring ligatures multiple times, as we try to figure out where to split a word. |
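One way to return "a list of all possible split points" without the gluing problem described above is to give each candidate its own complete pair of fragments. The Python sketch below uses the made-up word from this comment; `SplitChoice` and `render` are hypothetical names.

```python
from typing import NamedTuple


class SplitChoice(NamedTuple):
    before: str   # text ending the first line, without the added hyphen
    after: str    # text starting the next line


def render(choice):
    """Return the two line fragments, with the hyphen added to the first."""
    return choice.before + "-", choice.after


# Each candidate is self-contained, so a spelling change at one split point
# never leaks into the others.
choices = [
    SplitChoice("Far", "fhendrucker"),
    SplitChoice("Farfhen", "drucker"),
    SplitChoice("Farfhendruck", "ker"),   # spelling changes only at this split
]
```

A line breaker would pick at most one entry per line break; a word long enough to be split twice would need the same treatment applied again to the chosen `after` fragment.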
See Alex Holkner's thesis (https://citeseerx.ist.psu.edu/pdf/ee95750a9dd047b52901efda59819864bb9ede4a) page 11, for some interesting thoughts on how to represent splitable words, including those with German/Dutch orthography. |
Opened 2017 May 02 at 05:27:14 by sciurius
I strongly advise against adding hyphenating code. You'll find yourself in a terrible mess before you know it.
Note that this applies to the hyphenating code. Support for hyphenation is greatly appreciated but should be handled via external libraries/tools.