Word splitting in non-English languages #2

Open
PhilterPaper opened this issue Dec 10, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@PhilterPaper
Owner

I am aware that some other languages, such as Dutch and German, have some specific rules about changing or repeating letters when a word is split. These rules will need to be built into either Text::KnuthPlass itself (which in turn needs to be made aware of the human language being used), or possibly into a code layer involved with paragraph shaping and such. It might even be an extension to Text::Hyphen or other hyphenation code. Currently, you need to invoke the appropriate Text::Hyphen::XX (XX is the language code) to get the right place to split a word, but I don't think it goes beyond that.
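
To make the kind of rule concrete: under the pre-1996 German orthography, "Zucker" splits as "Zuk-ker" (the "ck" becomes "k-k"), and in Dutch a diaeresis is dropped when the break falls just before it ("reëel" becomes "re-eel"). A minimal sketch of a rule layer sitting on top of plain break positions might look like the following. It is written in Python purely for illustration; `split_word`, `TREMA`, and the language labels are hypothetical names, not part of Text::Hyphen or Text::KnuthPlass, and only these two rules are covered.

```python
import unicodedata

# Hypothetical helper table: vowels with a diaeresis and their plain forms.
TREMA = {"ä": "a", "ë": "e", "ï": "i", "ö": "o", "ü": "u"}

def split_word(word: str, pos: int, lang: str) -> tuple[str, str]:
    """Return (text before the break incl. hyphen, text after the break).

    `pos` is a break position as a plain hyphenator might report it:
    an index into the NFC-normalized word, with the break before word[pos].
    Only two illustrative rules are implemented (an assumption of this
    sketch, not a complete rule set for either language).
    """
    word = unicodedata.normalize("NFC", word)
    left, right = word[:pos], word[pos:]

    if lang == "de-1901" and left.endswith("c") and right.startswith("k"):
        # Traditional German spelling: "ck" is written "k-k" when split.
        left = left[:-1] + "k"
    elif lang == "nl" and right[:1] in TREMA:
        # Dutch: the diaeresis is dropped when the break falls before it.
        right = TREMA[right[0]] + right[1:]

    return left + "-", right
```

For example, `split_word("Zucker", 3, "de-1901")` gives `("Zuk-", "ker")`, while a language with no matching rule just gets the hyphen inserted at the reported position. Other cases (such as restoring a third consonant in some German compounds) depend on the specific word, so they would more likely have to come from the hyphenation data than from a rule keyed only on the neighboring letters.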

@PhilterPaper added the enhancement label on Dec 10, 2020
@PhilterPaper
Owner Author

See PDF::Builder's /UniWrap.pm for code which claims to follow the Unicode rules for breaking lines (and words?) according to the script (alphabet) in use. It might be useful for Text::KnuthPlass in dividing up lines in places other than within a word, and/or for non-Latin text.

@PhilterPaper
Owner Author

UniWrap.pm does not appear to be used anywhere in PDF::Builder, and may be quite obsolete (when compared against the classes table in https://unicode.org/reports/tr14/). This Unicode page does mention quite a few cases of how to handle line splitting, and could be a good starting point (such as for updating UniWrap).

@PhilterPaper
Owner Author

See PhilterPaper/Perl-PDF-Builder#183 for further thoughts on hyphenation for non-English languages (both Latin alphabet and not).

@PhilterPaper
Owner Author

See Alex Holkner's thesis (https://citeseerx.ist.psu.edu/pdf/ee95750a9dd047b52901efda59819864bb9ede4a) on page 11, for some interesting thoughts on how to represent splittable words, including those with German/Dutch orthography. In any case, you can't simply break the word into syllables -- you need to indicate if there's any "funny business" where the word is split or is put together, which has an effect on counting lengths of fragments.
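
One possible way to record that "funny business" is to store, at each break point, three strings: what ends the line if the break is taken, what starts the next line, and what appears if the word stays whole. This is loosely in the spirit of TeX's `\discretionary{pre}{post}{nobreak}` and is not taken from the thesis; the Python below is only a sketch, `Break`, `unbroken`, and `broken_at` are hypothetical names, and character counts stand in for real font-metric widths.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Break:
    pre: str      # text ending the first line if the break is taken
    post: str     # text starting the next line if the break is taken
    nobreak: str  # text used when the word is not split here

def unbroken(parts) -> str:
    """Spell out the fragments with no break taken."""
    return "".join(p.nobreak if isinstance(p, Break) else p for p in parts)

def broken_at(parts, index) -> tuple[str, str]:
    """Split at parts[index]; every other Break is left unbroken."""
    brk = parts[index]
    if not isinstance(brk, Break):
        raise ValueError("parts[index] is not a break point")
    head = unbroken(parts[:index]) + brk.pre
    tail = brk.post + unbroken(parts[index + 1:])
    return head, tail

# Traditional German spelling: "Schiffahrt" regains its third "f" when split.
SCHIFFAHRT = ["Schif", Break("f-", "", ""), "fahrt"]
# Traditional German spelling: "ck" becomes "k-k" when split.
ZUCKER = ["Zu", Break("k-", "k", "ck"), "er"]

whole = unbroken(SCHIFFAHRT)
head, tail = broken_at(SCHIFFAHRT, 1)
# The split form is two characters longer than the whole word, not one,
# which is what a line breaker measuring fragments would need to know.
extra = len(head) + len(tail) - len(whole)
```

With this layout an ordinary break is just `Break("-", "", "")`, so the same fragment-measuring code can handle both plain and spelling-changing splits.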
