[CTS 20] Hyphenation #183

Open
PhilterPaper opened this issue Mar 2, 2022 · 18 comments
Labels
general discussion roadmaps, etc., discuss direction

Comments

@PhilterPaper
Owner

Opened 2017 May 02 at 05:27:14 by sciurius

I strongly advise against adding hyphenating code. You'll find yourself in a terrible mess before you know it.

Note that this applies to the hyphenating code. Support for hyphenation is greatly appreciated but should be handled via external libraries/tools.

@PhilterPaper
Owner Author

Comment 2017 May 02 at 12:33:38 by PhilterPaper

If I understand your post, you are advising against including hyphenation code (or paragraph-shaping code), and saying that I should instead provide an interface to external modules. Is this correct? That's fine by me, and I'm open to simply providing interfaces to good hyphenation/shaping code. Suggestions are more than welcome. Right now, I just have some very simple hyphenation (soft hyphens, camelCase, punctuation, runs of letters or digits, etc., but no word splitting). Hopefully any external modules will cover those, too.

Can I presume that you've taken a quick look at 3.003, just released last night?

@PhilterPaper
Owner Author

Comment 2017 May 03 at 05:29:45 by sciurius

Personally I would draw a line between the PDF technical aspects (document structure, graphics, fonts, ...) and typesetting. And paragraph shaping is typesetting (just like changing contrast on images belongs to the realm of image manipulation). It is fine to provide some basic facilities for paragraph shaping but be very careful to add more, since it won't stop until you have reimplemented LibreOffice. It is fine to provide very basic (and language agnostic) word breaking (i.e. on soft and hard hyphens) but anything else will be great for some and a nuisance for others. Please bear in mind that many users are using PDF::API to produce native language (or mixed language) documents. So do not hyphenate by default but let the user explicitly ask for it. And don't fall back to "en" if support for the user-designated language is not available.

FWIW, I would not put Hyphenate_en.pm under PDF::API2::Content, probably better under PDF::API2::Utils or something similar. Also, "_en" is too simple. There's en_US, en_CA, en_UK, and so on.

The code in Hyphenate_en.pm is talking about encodings again. Remember, you do not need to deal with encodings in Perl. Just replace the literal 173 by "\x{ad}" and it will just work. If you insist on splitting on punctuation, you may consider using the builtin character class patterns like [:punct:] .

@PhilterPaper
Owner Author

Comment 2017 May 03 at 08:53:31 by PhilterPaper

All fine points! I welcome critical discussion of what direction this should go in.

> Personally I would draw a line between the PDF technical aspects (document structure, graphics, fonts, ...) and typesetting. And paragraph shaping is typesetting (just like changing contrast on images belongs to the realm of image manipulation). It is fine to provide some basic facilities for paragraph shaping but be very careful to add more, since it won't stop until you have reimplemented LibreOffice. It is fine to provide very basic (and language agnostic) word breaking (i.e. on soft and hard hyphens) but anything else will be great for some and a nuisance for others.

So, you would recommend that real paragraph shaping and other typesetting functions be kept out of PDF::API2 and in a separate package (that might call PDF::API2)? That's reasonable. I've been wondering where a good place is to draw the line. Some very, very basic calls like paragraph() and section() were already there and possibly being used, so I'll leave them (unless you can prove that no one is using them). I won't add anything to do markup within a paragraph (bold, italic, etc.) within PDF::API2.

> Please bear in mind that many users are using PDF::API to produce native language (or mixed language) documents. So do not hyphenate by default but let the user explicitly ask for it. And don't fall back to "en" if support for the user-designated language is not available.

I realize that different languages will have different hyphenation rules, and there may even be different rules for different applications (publishers, etc.). As you say, it would probably be better not to hyphenate by default. If a user does request hyphenation, but does not have their language hyphenation support installed, do you think it would be better to not fall back to 'en' (simply refuse to hyphenate)?

This brings up a point that I've long been curious about. In bidirectional (RTL) Middle Eastern languages, what is "left justified" (and thus defining "right justified")? Is it the same side as the "beginning of the line" margin, or is it "left is left"? In other words, to "left justify" Hebrew, would the lines align on the physical right? That is, does justification use a logical left and right, rather than a physical left and right? I suppose the same question arises with Chinese and other East Asian languages when written top-to-bottom... where is justification?

> FWIW, I would not put Hyphenate_en.pm under PDF::API2::Content, probably better under PDF::API2::Utils or something similar. Also, "_en" is too simple. There's en_US, en_CA, en_UK, and so on.

_en was intended to be a basic fallback (at least for English). en_US, etc. should override it (_en would be ignored if en_US was installed and that was the language request). Do you have a specific reason for installing hyphenation support in some other place than Content? Is some other place better?

> The code in Hyphenate_en.pm is talking about encodings again. Remember, you do not need to deal with encodings in Perl. Just replace the literal 173 by "\x{ad}" and it will just work.

I'll look at that again.

> If you insist on splitting on punctuation, you may consider using the builtin character class patterns like [:punct:] .

I didn't want to split on all punctuation, just places where it would make sense to a reader in the flow of the text. For example, you wouldn't want to split at quotation marks or opening brackets. Also, normally a hard hyphen is not a split point.

@PhilterPaper
Owner Author

Comment 2017 May 04 at 08:16:43 by sciurius

> Some very, very basic calls like paragraph() and section() were already there and possibly being used, so I'll leave them

Yes, that's fine. I'd expect the extension package to have similar (and even improved) functions.

> If a user does request hyphenation, but does not have their language hyphenation support installed, do you think it would be better to not fall back to 'en' (simply refuse to hyphenate)?

Definitely. It is better to have non-hyphenated results than wrongly hyphenated.

> In bidirectional (RTL) Middle Eastern languages, what is "left justified"...

I'm sorry, but I'm not familiar with this.

> Do you have a specific reason for installing hyphenation support in some other place than Content? Is some other place better?

Hyphenate_en.pm doesn't have a relation to PDF. It is a general module providing general functions.

> I didn't want to split on all punctuation, just places where it would make sense to a reader in the flow of the text. For example, you wouldn't want to split at quotation marks or opening brackets.

Human texts do not contain punctuation inside words. I think it's a computer-originated idiom to use things like long_variable_names and CamelCaseWords. And I'm not sure whether I'd want these to be split.

> Also, normally a hard hyphen is not a split point.

Think again. Does the name 'hyphen' ring a bell?

@PhilterPaper
Owner Author

Comment 2017 May 04 at 09:00:06 by PhilterPaper

> > I didn't want to split on all punctuation, just places where it would make sense to a reader in the flow of the text. For example, you wouldn't want to split at quotation marks or opening brackets.

> Human texts do not contain punctuation inside words. I think it's a computer-originated idiom to use things like long_variable_names and CamelCaseWords. And I'm not sure whether I'd want these to be split.

How about "computer-originated" (or, "re-educated" or "co-operative")? Other than hard hyphens and apostrophes, you're right that splitting should normally be only within words. However, as a practical matter, when you have long URLs, variable names, and other computer stuff, they're going to need to be split up to fit on lines. I could easily see a long URL that won't fit within an entire line -- are there typesetting conventions for how to deal with that? E.g., split after a / or _, and do/do not hyphenate? Perhaps a long computer word should preferably be given its own line if necessary, and only if it's too long for even that, split it at some point?

> > Also, normally a hard hyphen is not a split point.

> Think again. Does the name 'hyphen' ring a bell?

Initially I had it always split on a hard hyphen. Then in checking on some English grammatical rules, I read (a number of sources) that a hard hyphen should not be a split point. So I changed it to user-selectable. Maybe it's a language-specific rule?

@PhilterPaper
Owner Author

Comment 2017 May 05 at 15:30:13 by sciurius

> Are there typesetting conventions for how to deal with that? E.g., split after a / or _, and do/do not hyphenate? Perhaps a long computer word should preferably be given its own line if necessary, and only if it's too long for even that, split it at some point?

An old typesetter once taught me that if the text doesn't fit nicely, rewrite it. Trying to stretch or squeeze more than a small amount makes the end result ugly. I think the main problem is mixing two things that should be distinct: text paragraphs and arbitrary content. While it is (almost always) possible to automatically format a text paragraph (where 'text' is human prose), arbitrary content cannot be formatted that way. Hence arbitrary content should be typeset 'as is', unformatted, possibly in the form of an example, figure, quote or something appropriate.

URLs do not normally occur in formatted text paragraphs, only in badly written articles. Remember, we're producing PDF documents. Why print a long and ugly URL when it can be stashed away as a link? A good example is (not quite surprisingly) the PDF Reference documentation. It is formatted very well, there are many, many 'computer words' and yet none of them are broken. Probably the ugliest paragraph is at the end of page 420 (ref. version 1.7) where they decided (and, IMHO correctly) to not break the matrix. Personally, if URLs are needed in the text, I have made a habit of turning them into footnotes. See e.g. http://johan.vromans.org/articles/wxglade.pdf, page 3.

> Then in checking on some English grammatical rules, I read (a number of sources) that a hard hyphen should not be a split point.

AFAIK, the purpose of a hyphen (hard U+2010, discretionary U+00AD) is to split on. If this is not desired, use non-breaking hyphen (U+2011, yes, the name is confusing). The problem is whether U+002D (ambiguous hyphen) should be treated as U+2010 or as U+2011. Word processor manuals explicitly advise to use non-breaking hyphens where appropriate (e.g. in telephone numbers) so it is safe to consider U+002D to be a split point. It may, however, be wise to add an option to change this default behaviour.

@PhilterPaper
Owner Author

Comment 2017 June 09 at 23:26:15 by PhilterPaper

I finally had some time to get back to thinking about this issue, and here's where things stand. First, there are 4 different kinds of hyphens to worry about:

  1. U+002D ASCII hyphen/minus sign -- what almost everyone is going to type in when preparing text to be fed to PDF::Builder. Under most circumstances, it is apparently OK to split a line immediately after a hyphen. I don't know if it would be appropriate to allow splitting to be suppressed here. Some sources say that a hyphen used to attach a prefix or suffix (e.g., "co-operation") should not be a split point (effectively a non-breaking hyphen). Other sources claim that compound words (e.g., "third-year medical student", between "third" and "year") should not be split. It's confusing. This hyphen, which looks the same as the other three, is available on every keyboard, whereas the other three are usually not easy to type in (at least, not without knowing their UTF-8 byte sequence or Unicode value).
  2. U+2010 Unicode hyphen -- from what I can tell, it should be treated the same as the U+002D hyphen.
  3. U+00AD soft hyphen (optional hyphen, discretionary hyphen), HTML entity &shy; -- although there have been a lot of conflicting definitions, in the end it seems to mark a place where a word may be split. Note that soft hyphens need to be removed from text before being passed to PDF::Builder's output routines, or they will show up as hyphens in the reader.
  4. U+2011 non-breaking hyphen -- this looks like a normal hyphen (U+002D), but if at all possible, avoid breaking the word (line) after it. It is used for things such as telephone numbers, some date formats, and Social Security numbers, and ideally they should be kept as one unit, but if the text is longer than the line length (column width), you're gonna have to split it somewhere, and it might as well be after a non-breaking hyphen. If the text has to be split, you might as well first fill up as much as possible of the current line, rather than leaving a huge gap and starting the long text on the next line (and still having to split it).

Regarding the non-appearance on most keyboards of the last three hyphens, perhaps user input handling could include some sort of preprocessor, such as escape sequences (e.g., \- is a SHY, \= is a non-breaking hyphen) to turn them into Latin-1 or UTF-8 characters. For now, this is beyond the scope of PDF::Builder, although it might be added later. We will assume that all four of these hyphens arrive as native (binary) Latin-1 or UTF-8 sequences, and leave it at that.

We also need to consider whether the Unicode hyphen U+2010 and non-breaking hyphen U+2011 should be replaced by normal hyphens (U+002D) for consistent appearance (assuming that possibly the font either doesn't have U+2010 or U+2011, or they look different). Soft hyphens (U+00AD) all need to be removed anyway, so if one ends up being used as a split point, it will be replaced by a normal hyphen anyway.
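As a rough illustration of the four-hyphen scheme above, here is a minimal sketch in Python (not Perl, and not PDF::Builder code). The function name `break_opportunities` and the `split_on_ascii_hyphen` switch are made up for this example; the behaviour simply follows the list: U+002D and U+2010 allow a break after them, U+00AD marks a break and is dropped from the output, and U+2011 is kept but offers no break.

```python
# Code points from the list above.
ASCII_HYPHEN = "\u002d"
UNICODE_HYPHEN = "\u2010"
SOFT_HYPHEN = "\u00ad"
NONBREAKING_HYPHEN = "\u2011"


def break_opportunities(word, split_on_ascii_hyphen=True):
    """Return (visible_text, breaks) for one word.

    visible_text is the word with soft hyphens removed. breaks is a list of
    (index, needs_hyphen) pairs: index is a position in visible_text where a
    line may end, and needs_hyphen says whether a hyphen has to be added if
    that break is actually used.
    """
    visible = []
    breaks = []
    for ch in word:
        if ch == SOFT_HYPHEN:
            # Never output; it only marks a possible split point.
            breaks.append((len(visible), True))
        elif ch == UNICODE_HYPHEN or (ch == ASCII_HYPHEN and split_on_ascii_hyphen):
            visible.append(ch)
            # The hyphen is already in the text, so nothing needs adding.
            breaks.append((len(visible), False))
        else:
            # Includes U+2011: kept in the text, but not a break opportunity.
            visible.append(ch)
    return "".join(visible), breaks
```

The `split_on_ascii_hyphen` flag is the "option to change this default behaviour" discussed earlier in the thread; whether it should default to true is exactly the open question.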

There is a U+2012 "figure dash", which may look like an en-dash (U+2013), as well as an em-dash and a quotation dash, but I have no plans to deal with these (should we?). It is usually permissible to split a line after an em-dash (without adding a hyphen, of course), but not a figure- or en-dash. A quotation dash, which apparently looks much like an em-dash, is used unpaired before the attribution of a quote, so you probably would never break after it, although possibly before (if it's an inline attribution).

I will make the default not to hyphenate, and let the user explicitly choose to hyphenate. There are a few calls in base PDF::Builder (text fill, paragraph, section, etc.) which should probably implement some level of line (word) splitting, but full-fledged paragraph shaping will not be built into the base PDF::Builder. Paragraph shaping involves getting all the possible word splits in the paragraph, and deciding when and where to hyphenate to get the best appearance. This can mean minimizing the sum of numeric "penalties" for too many consecutive lines ending in hyphens, "rivers" of white space, splitting of proper names and titles (language- and culture-dependent), widows and orphans (which means you need to find out if the following paragraph will have at least 4 lines of output), hyphenation on the last word of a column (or worse, a page), too-short last lines in a paragraph, and probably other considerations. Such items may be language- and even publisher-dependent, and (at least) the settings would have to be made specific to language and publisher, but an actual paragraph shaping routine might be itself language-independent.

Where words can be split is language-dependent, and may also depend on typesetting standards of a given publisher. English is straightforward in the sense that you simply find a split point (per rules and exceptions list), stick a hyphen on the end of the first fragment, and start the next line with the remainder (which in turn may need to be split again!). Some languages, such as German, may require doubling of one or more letters at the split, complicating calculations for line lengths. Anyway, if PDF::Builder itself is going to make use of language/publisher specific splitting libraries, there could be a PDF::Builder::WordSplit::Hyphenate_xx_xx module for each flavor, where xx_xx could be just a language code (like "en") or it could be language+country (e.g., en_GB). I will have to look and see if there is some sort of locale information in PDF::Builder, or if it needs to be added. There could even be publisher-specific extensions (e.g., de_DE_SV for Springer-Verlag German-language texts). Hyphenation would not be done if the requested language support module is not found (no fallback to, say, English), but we could consider allowing "en" as a fallback for any English en_XX request, or simply require an exact match to avoid unexpected results (could be a setting).
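The lookup rule described above (no hyphenation unless asked for, no cross-language fallback, and an optional "en" fallback for en_XX) can be sketched as follows. This is Python rather than Perl, and `choose_hyphenator`, `installed` and `allow_base_fallback` are hypothetical names used only for illustration; nothing here reflects an existing PDF::Builder interface.

```python
def choose_hyphenator(requested, installed, allow_base_fallback=False):
    """Pick a hyphenation rule set by name, or None to leave text unsplit.

    requested: language tag such as "en" or "en_GB", or None if the user
               did not ask for hyphenation.
    installed: collection of rule-set names that are available.
    allow_base_fallback: if true, "en_GB" may fall back to "en", but never
               to a different language.
    """
    if requested is None:
        return None  # hyphenation is off unless explicitly requested
    if requested in installed:
        return requested
    if allow_base_fallback and "_" in requested:
        base = requested.split("_", 1)[0]
        if base in installed:
            return base
    return None  # better unhyphenated than wrongly hyphenated
```

With `installed = {"en", "de_DE"}`, a request for "en_GB" returns None by default and "en" only when the fallback setting is switched on; a request for "nl_NL" returns None either way.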

Hyphenate_xx_xx() would be fed either a single word (just doing "greedy" line splitting) or an array of the entire paragraph's words, and return in some form both the word fragments and the source of each split: hyphen or non-breaking hyphen (both of which need to be restored), soft hyphen, or by language algorithm. The paragraph shape routine might use different priorities, such as preferring to split on a soft hyphen or a hard hyphen if available (of equal or different priorities), and then try other splits. There might even be some sort of priority value built into the returned data, indicating where the preferred splits are.
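One possible shape for the returned data, sketched in Python with invented names (`SplitPoint`, `PRIORITY`, `best_split`): each split records where it is and where it came from, and the caller ranks the sources. The priority numbers are arbitrary placeholders, and the width limit is counted in characters only to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitPoint:
    index: int   # position in the visible word where the line may end
    source: str  # "hard", "nonbreaking", "soft" or "pattern"


# Hypothetical ranking; a lower number is preferred.
PRIORITY = {"soft": 0, "hard": 0, "pattern": 1, "nonbreaking": 2}


def best_split(points, max_index):
    """Choose a split whose first fragment is at most max_index characters.

    Prefers the better-ranked source, then the longest first fragment.
    Returns None if nothing fits.
    """
    candidates = [p for p in points if 0 < p.index <= max_index]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (PRIORITY[p.source], -p.index))
```

A paragraph shaper would presumably use the same records with its own penalties rather than this fixed ranking.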

Now, besides normal human prose, there can also be "computer" words, such as camelCase and underscore_separated_words, as well as long URLs with /'s and the like. In technical documents it may not always be possible to avoid typesetting such things (although the result may not be all that elegant). The current code splits camelCase between a lowercase and an Uppercase ASCII letter (note that names such as MacDonald could end up being split Mac- Donald, which is undesirable), as well as after runs of letters (ASCII only) or numbers or after certain punctuation. You don't want to split just after opening brackets [ ( { etc., nor opening (left) quotation marks of various kinds, nor just before ] ) } or closing/right quotation marks. To extend these to non-ASCII letters would be difficult enough for Latin-based alphabets, never mind non-Latin! The current code has hard coded switches, and could be extended to make these a hash in the argument list. We also need to consider whether adding hyphens to a (split) URL or other technical term is risking introducing errors and confusion if the reader thinks the hyphen is actually part of the word! However, it is very easy for URLs etc. to exceed the line length (even in a footnote), and thus require splitting.
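To make the "computer word" rules concrete, here is a small Python sketch (ASCII only, like the current code is said to be). It is not the actual Hyphenate_en.pm logic; the regular expression and the name `computer_word_fragments` are assumptions for illustration. It breaks between a lowercase and an uppercase letter, at letter/digit transitions, and just after "_" or "/".

```python
import re

# Zero-width boundaries: camelCase (lower then upper), letter/digit
# transitions, and the position right after "_" or "/".
_BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])"
    r"|(?<=[A-Za-z])(?=[0-9])"
    r"|(?<=[0-9])(?=[A-Za-z])"
    r"|(?<=[_/])(?=[^_/])"
)


def computer_word_fragments(word):
    """Split a technical token into fragments at the boundaries above."""
    fragments = []
    last = 0
    for match in _BOUNDARY.finditer(word):
        if match.start() > last:
            fragments.append(word[last:match.start()])
            last = match.start()
    fragments.append(word[last:])
    return fragments
```

Note that `computer_word_fragments("MacDonald")` gives `["Mac", "Donald"]`, which is exactly the undesirable case mentioned above; a real implementation would need an exceptions list or a user switch.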

The current hyphenation looks only at the last word in a line (that is too long to fit, unsplit). This is known as "greedy" line splitting, and while it makes a paragraph most compact, it takes no action to prevent orphans and widows, nor other undesirable effects (e.g., hyphenated last word on a page). I'm really not sure whether there is a point to doing full splitting (according to language and publisher rules) for the little that paragraph() and section() will be used in full-bore quality typesetting. It would be nice to allow folding of long URLs and other computerese, but it might be better to do proper line splitting in another package.

So, my proposal is to rename Hyphenate_en.pm to Hyphenate_basic.pm (language-independent), make all hyphenation optional (off by default), use only the current forms of hyphenation, no support for U+2010 and U+2011 hyphen variants, and leave all language-specific word splitting to another package. I may make some of the currently hard coded switches accessible in the call as hash elements. The base PDF::Builder has only rudimentary formatting and paragraph formation capabilities (text fill, paragraph, section) and they probably won't get any more enhancement than this level of hyphenation. If someone wants to use them for production, they can supply their text with SHY's already inserted, but will have to put up with no control over widows and orphans and column-break hyphens. Real typesetting (using PDF::Builder as its base) will have to do a much better job of paragraph shaping, and I agree that it's better left to a separate package.
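For reference, "greedy" line filling can be sketched in a few lines of Python. This is an illustration only, not the paragraph() code: `greedy_fill` and its `width_of` callback are hypothetical, and the default `width_of=len` counts characters, whereas real use would need a callback returning glyph widths in points for the current font. The sketch never splits a word; an over-long word simply gets its own line, which is where a word splitter would be consulted.

```python
def greedy_fill(words, max_width, width_of=len, space_width=1):
    """Pack words into lines left to right, never looking ahead."""
    lines = []
    current = []
    current_width = 0
    for word in words:
        w = width_of(word)
        extra = w if not current else space_width + w
        if current and current_width + extra > max_width:
            # The word does not fit: close this line and start a new one.
            lines.append(" ".join(current))
            current = [word]
            current_width = w
        else:
            current.append(word)
            current_width += extra
    if current:
        lines.append(" ".join(current))
    return lines
```

Because each line is closed as soon as the next word fails to fit, nothing here can prevent widows, orphans, or a hyphen at a page break; that needs a paragraph-level method such as Knuth-Plass.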

Thoughts and comments?

@PhilterPaper
Owner Author

Comment 2017 June 10 at 08:50:52 by sciurius

The two major points are: language-neutral basic splitting, and it being turned off by default. To which I wholeheartedly agree.

@PhilterPaper
Owner Author

Comment 2017 July 25 at 16:22:13 by PhilterPaper

I came across something interesting in the PDF-1.7 specification. It suggests that when words are split (at other than a hard hyphen), a soft hyphen be used (which a reader should display like a hard hyphen). When a screen reader or other text scraper sees the soft hyphen at the end of a line, it knows it can simply discard it when gluing the line back together into a long string. Also, resizable PDF reader displays can then reflow text into longer or shorter lines without introducing spurious hard hyphens in the middle of words.
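The text-extraction side of this idea is easy to picture. The following Python sketch (the name `rejoin_lines` is made up, and it is not taken from any PDF reader) glues extracted lines back into one string, discarding a soft hyphen at a line end and otherwise joining lines with a space.

```python
SOFT_HYPHEN = "\u00ad"


def rejoin_lines(lines):
    """Join extracted lines into one string.

    A line ending in a soft hyphen continues the same word on the next
    line, so the soft hyphen is dropped and no space is inserted.
    """
    parts = []
    for i, line in enumerate(lines):
        is_last = i == len(lines) - 1
        if line.endswith(SOFT_HYPHEN):
            parts.append(line[:-1])
        elif is_last:
            parts.append(line)
        else:
            parts.append(line + " ")
    return "".join(parts)
```

A line ending in a hard hyphen is left alone here, since the extractor cannot tell whether that hyphen belongs to the word.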

@PhilterPaper
Owner Author

Comment 2017 December 17 at 17:01:47 by PhilterPaper

I just installed Text::Reflow and hope to have some time to play with it soon. I already see two major problems with it:

  1. Line size is in characters, rather than points (or other dimensions). This is OK for fixed-pitch fonts, but will be a major problem with proportional fonts. We can't just count characters, but have to see what width each glyph takes up.
  2. The rules are for English, or at least, the "no break here" words lists (titles, conjunctives) are English. At the least, a way will have to be provided to allow other languages' titles and conjunctive words to be handled.

I haven't even run Text::Reflow yet, so I don't know how it's splitting within words (or if it even does), and what spelling rules it's using to split. Non-English languages and orthographies will have different rules, including repeating a letter on the next line, which could greatly complicate an algorithm that thinks it can have a split point and that's that. Also, ligatures and other glyph substitutions and positioning probably need to be disposed of first, before words are split.

At this point, I don't see prereq'ing Text::Reflow for PDF::Builder, but perhaps mining it for algorithms and ideas, to extend into my code.

@PhilterPaper
Owner Author

  1. Knuth-Liang word splitting (for hyphenation) is available in several packages for Perl.
  2. Knuth-Plass paragraph shaping is in Text::KnuthPlass. It needs some work, but no point in re-inventing the wheel! At any rate, it should probably be outside of PDF::Builder itself.
  3. PDF::Builder has a routine, UniWrap.pm, that lists where a line can be split, based on the Unicode characters. It appears to be an implementation of rules published for the Unicode Standard, but is a bit out of date.
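For readers unfamiliar with the Knuth-Liang method mentioned in item 1, here is a compact Python sketch of the pattern-matching step. It is an illustration, not any of the Perl packages: `parse_pattern`, `hyphenate` and the tiny `PATTERNS` table are assumptions for this example (the patterns are written in TeX style but are nowhere near a complete language file, and exception lists are omitted). Digits in a pattern score the gaps between letters; the highest score at each gap wins, and an odd score permits a split.

```python
def parse_pattern(pattern):
    """Turn 'hy3ph' into ('hyph', [0, 0, 3, 0, 0])."""
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    points = [0] * (len(letters) + 1)
    i = 0
    for ch in pattern:
        if ch.isdigit():
            points[i] = int(ch)
        else:
            i += 1
    return letters, points


# A handful of illustrative patterns only.
PATTERNS = dict(parse_pattern(p) for p in
                "hy3ph he2n hena4 hen5at 1na n2at 1tio 2io o2n".split())


def hyphenate(word, patterns=PATTERNS, left_min=2, right_min=2):
    """Return the word cut into fragments at the permitted split points."""
    s = "." + word.lower() + "."
    scores = [0] * (len(s) + 1)  # scores[k] is the gap before s[k]
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            points = patterns.get(s[start:end])
            if points:
                for k, value in enumerate(points):
                    scores[start + k] = max(scores[start + k], value)
    pieces = []
    last = 0
    # A split before word[i] is the gap before s[i + 1].
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces
```

With this pattern table, `hyphenate("hyphenation")` yields `["hy", "phen", "ation"]`. Note that the result is just a list of split positions, which is the limitation raised elsewhere in this thread for orthographies that change letters at a split.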

See #95 for some additional thoughts on splitting document production from low-level PDF stuff.

@PhilterPaper PhilterPaper added the general discussion roadmaps, etc., discuss direction label Mar 2, 2022
@PhilterPaper
Owner Author

See PhilterPaper/Text-KnuthPlass#9 for some possible problems with Text::Hyphen package word splitting.

@PhilterPaper
Owner Author

PhilterPaper commented Apr 29, 2022

Let's keep this ticket for word-splitting (hyphenation) only. It could be fitted to existing line-splitting routines (e.g., paragraph()) and possibly used in a Knuth-Plass (or similar) routine for true paragraph shaping. There are a few things I would add to any word-splitting package:

  1. Ability to choose the (human) language on the fly, so (if appropriately marked up) French text and phrases within an English document could be properly hyphenated for any purpose.
  2. Be able to update (and add) hyphenation dictionaries easily from a central repository (such as https://mirrors.rit.edu/CTAN/language/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/). Add: include various CTAN-sourced supplemental libraries of words that don't hyphenate properly, including proper names and chemical names. And don't forget to add "coworker[s]" to the list, if it gets split as "cow-orker[s]"! Ship with American English (plus supplemental libraries) and standard Latin (for Lorem Ipsum use), and instructions for how to easily add other languages.
  3. Force splitting of excessively long text that a hyphenator can't handle. For instance, a password or hash may be very long, but Knuth-Liang and the like can't handle them. Force a split if a fragment is more than, say, 7 or 8 characters long, after splitting at dashes/hyphens, camelCase, punctuation, letter-digit boundaries, etc. The current PDF::Builder hyphenator attempts to do this already, although it can't do anything about true words (could use Text::Hyphen, etc.).
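The forced split in item 3 amounts to chopping any leftover fragment into pieces of bounded length once the other rules have run out. A minimal Python sketch, with the hypothetical name `force_split` and the "7 or 8 characters" figure taken from the text as a default:

```python
def force_split(fragment, max_len=8):
    """Chop a fragment into pieces of at most max_len characters."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    pieces = [fragment[i:i + max_len] for i in range(0, len(fragment), max_len)]
    return pieces or [fragment]  # an empty fragment stays a single piece
```

This would be applied only to fragments that survive splitting at hyphens, camelCase, punctuation and letter-digit boundaries, and it says nothing about whether a hyphen should be shown at a forced split.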

This may require a new hyphenation package, as it doesn't look like Text::Hyphen or TeX::Hyphen will do the job.

I may open a new ticket to discuss paragraph shaping (line splitting), although there is already Text::KnuthPlass and a number of other non-Perl packages on GitHub.

@PhilterPaper
Owner Author

See also sciurius/perl-Text-Layout#10 (for a while, it wandered off into a discussion about hyphenation). Some interesting information from Johan from the Dutch perspective.

@PhilterPaper
Owner Author

PhilterPaper commented May 15, 2022

Another feature that would be good in a new Text::KnuthLiang package would be to have multiple encodings available. By default, there would be a utf8/ directory with US English patterns (it looks like the US English patterns are pure ASCII, and thus could be copied to any other encoding). The default encoding expected would be UTF-8, but other encoding directories could be added as desired. There might be a Perl tool or utility provided to 1) update any existing pattern/exception file if it is refreshed in the central repository, 2) bring in a new language/encoding from the repository, as desired. I think these files change slowly enough that checking them at each use of Text::KnuthLiang would be excessive. Depending on whether single byte encodings are easily available, a conversion utility to create non-UTF-8 encodings (from UTF-8) might be useful.

Regarding the US English patterns being in ASCII, I'll have to check how well they handle words with a dieresis ("umlaut"), such as coöperate, etc. Perhaps that is best handled by an extended exceptions list (such as a supplementary list you could provide alongside the standard list). I don't know if there is a need for additional patterns, but it shouldn't be hard to extend it to look for and read in supplemental pattern files.

I don't know if it would be useful to have "fallback" languages and encodings to try, if a word can't be split up with the primary patterns. For example, someone writes primarily in US English, but other authors are spelling in UK English in the same document -- would it be possible, and useful, to have en_UK as a fallback language for words which are UK spelling? Would this be at all useful for the occasional German word, for instance? Ideally, a non-English word would be marked as de_DE and the appropriate pattern and exception files looked for.

Add: Note that TeX::Hyphen includes a "style" parameter apparently to permit specification of non-ASCII characters in patterns and exceptions by means of (sort of) TeX-like escape sequences, e.g., \'a to create á. This might make an interesting package in its own right, but I'm not sure if it should be part of a hyphenation package. Anyway, keep this in mind somewhere.

Add: Perhaps use only UTF-8 patterns (and exceptions), and on-the-fly convert other encodings to UTF-8? I think there are Perl packages to do such encoding to and from UTF-8 (as you would have to return the result in the original encoding, after the UTF-8 spelling is split).
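The convert-split-convert-back idea in the last note looks roughly like this in Python (in Perl the Encode module plays the same role). The function name `split_in_encoding` and the `split_word` callback are hypothetical; `split_word` stands in for whatever hyphenator works on decoded Unicode text.

```python
def split_in_encoding(raw, encoding, split_word):
    """Split a word supplied as bytes in some single-byte encoding.

    raw:        the word as bytes, e.g. in "latin-1"
    encoding:   name of the caller's encoding
    split_word: callable taking a decoded string and returning fragments
    Returns the fragments re-encoded in the caller's encoding.
    """
    text = raw.decode(encoding)
    fragments = split_word(text)
    return [fragment.encode(encoding) for fragment in fragments]
```

This only works if every character in the fragments exists in the target encoding, which is one more reason to be careful with hyphenators that add or change letters at a split.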

@PhilterPaper
Owner Author

Hyphenation is a subconcern of line breaking (to break up a string to fit within paragraphs of a given width). See also the UniWrap.pm for line breaking (though it's quite out of date). A better choice might be the Unicode::LineBreak package, for showing mandatory, prohibited, and allowed line break points in any language. I have not yet tried out this package, to see if it works as promised! See also http://unicode.org/reports/tr14/ for the official specification. It still may be necessary to split a word (hyphenate it) to best fit a line (see Text::KnuthPlass, or a simple "greedy" algorithm), which is used by line breaking, but not the concern of the Unicode standard.

@PhilterPaper
Owner Author

It would be good to give some assistance in splitting words, not only where splits are permitted, but any changed/deleted/added letters. For example, I understand in Dutch (Netherlands) that Drucker would be hyphenated Druck-ker. This can change as nations officially change their orthography rules. We need to be careful, if there is more than one split point, not to give all the modified letters -- for example, say there is a word "Farfhendrucker" (I just made that up), split at "Far-fhen-druck-er". If using the last split point, it would be "Farfhendruck-ker", but an earlier split point would be "Far-fhendrucker". Thus, you could not return the split list as "Far-fhen-druck-ker", because simply gluing together pieces would give you "Far-fhendruckker", which is incorrect. It might be necessary to return a list of all possible split points: (Far-fhendrucker, Farfhen-drucker, and Farfhendruck-ker). If a word is long enough that it might potentially be split over multiple lines -- well, that's complicated!
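One way to return "a list of all possible split points" is to give each option as a complete pair of pieces, so that any letter changes belong to that option alone. The Python sketch below uses invented names (`SplitOption`, `split_option`) and the made-up word from the comment; the extra "k" is the comment's own example, not a verified orthography rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitOption:
    before: str  # text left on the current line, without the added hyphen
    after: str   # text that starts the next line


def split_option(word, index, after_prefix=""):
    """Build one option; after_prefix holds letters added at this split."""
    return SplitOption(word[:index], after_prefix + word[index:])


word = "Farfhendrucker"
options = [
    split_option(word, 3),                     # Far-fhendrucker
    split_option(word, 7),                     # Farfhen-drucker
    split_option(word, 12, after_prefix="k"),  # Farfhendruck-ker
]
```

Because each option carries the whole remainder of the word, using the first split can never pick up the doubled letter that belongs only to the last one. Splitting one word across three or more lines would need the remainder to be split again, as the comment says.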

Even in English, there might be issues with diacritics, particularly the use of a dieresis (e.g., naïve). Possibly, if split between a and ï (na-ïve), the "i" might lose its dieresis (na-ive). That will have to be investigated.

Finally, what is to be done when ligatures come into the picture? Would it be best to hyphenate first, and then allow something like HarfBuzz::Shaper to place ligatures on the fragments? I would expect a ligature to slightly reduce the length of a word [fragment], usually not enough to warrant avoiding (or moving) a split, although that's certainly possible. I just want to avoid the overhead of figuring ligatures multiple times, as we try to figure out where to split a word.

@PhilterPaper
Owner Author

See Alex Holkner's thesis (https://citeseerx.ist.psu.edu/pdf/ee95750a9dd047b52901efda59819864bb9ede4a) page 11, for some interesting thoughts on how to represent splittable words, including those with German/Dutch orthography.
