Investigate how textwrap works for East Asian languages #80

mgeisler · 2017-08-26T15:09:45Z

I wonder if it would be possible to tweak the wrapping so as to follow some of the rules mentioned here:
https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages

This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](https://www.unicode.org/reports/tr14/). We can use this to find words in non-ASCII text. The new dependency is enabled by default since these line breaks are more correct than what you get by splitting on ASCII space. This should help address #220 and #80, though I’m no expert on non-Western languages. More feedback from the community would be needed here.

mgeisler · 2021-05-02T17:55:21Z

I'll close the issue now: the new WordSeparator trait makes it possible to customize exactly how a line of text is sliced up into "words" or "syllables" or "ideographs".

The UnicodeBreakProperties word separator merged in #313 uses the Unicode line breaking algorithm to find words and will handle CJK characters according to the Unicode rules. This means that we can find words without intervening whitespace:

assert_eq!(UnicodeBreakProperties.find_words("CJK: 你好").collect::<Vec<_>>(),
           vec![Word::from("CJK: "),
                Word::from("你"),
                Word::from("好")]);

These words are then wrapped by measuring their width and putting them into lines as usual.
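To make the two steps concrete (find words, then fill lines by width), here is a minimal, self-contained sketch. It is not textwrap's implementation and does not use the unicode-linebreak crate. The helpers `is_cjk`, `display_width`, `find_words` and `wrap_greedy` are hypothetical names invented for illustration, the break rule is a crude stand-in for the full UAX #14 algorithm, and the width rule assumes that only characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF) are two columns wide.

```rust
/// Assumption: only the CJK Unified Ideographs block counts as "CJK" here.
fn is_cjk(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Rough display width: two columns per CJK ideograph, one for anything else.
fn display_width(s: &str) -> usize {
    s.chars().map(|c| if is_cjk(c) { 2 } else { 1 }).sum()
}

/// Split text into words. A break is allowed before a non-space character
/// when the previous character is a space or either neighbour is a CJK
/// ideograph. Trailing spaces stay attached to the word they follow.
fn find_words(text: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (idx, c) in text.char_indices() {
        if let Some(p) = prev {
            if c != ' ' && (p == ' ' || is_cjk(p) || is_cjk(c)) {
                words.push(&text[start..idx]);
                start = idx;
            }
        }
        prev = Some(c);
    }
    if start < text.len() {
        words.push(&text[start..]);
    }
    words
}

/// Greedy wrapping: keep adding words to the current line until the next
/// word (ignoring its trailing spaces) would exceed `line_width` columns.
fn wrap_greedy(text: &str, line_width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut line = String::new();
    for word in find_words(text) {
        let content = word.trim_end_matches(' ');
        if !line.is_empty() && display_width(&line) + display_width(content) > line_width {
            // Trailing spaces at the end of a finished line are dropped.
            lines.push(line.trim_end_matches(' ').to_string());
            line = String::new();
        }
        line.push_str(word);
    }
    if !line.is_empty() {
        lines.push(line.trim_end_matches(' ').to_string());
    }
    lines
}
```

Under these assumptions, `find_words("CJK: 你好")` yields the same three pieces as the assertion above, and `wrap_greedy("CJK: 你好", 6)` puts `CJK:` on the first line and `你好` on the second, because each ideograph is its own word and the line can break between or before them.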

If someone with experience in these languages finds problems, then I would love to hear about them since I'm not an expert at all in these things.

mgeisler mentioned this issue Nov 8, 2020

Does not work for languages without word separators #220

Closed

mgeisler mentioned this issue Apr 8, 2021

Use Unicode line breaking algorithm to find words #313

Merged

mgeisler mentioned this issue Apr 19, 2021

Add a trait for finding words #328

Closed

mgeisler closed this as completed May 2, 2021
