Investigate how textwrap works for East Asian languages #80

Closed
mgeisler opened this issue Aug 26, 2017 · 1 comment

Comments

@mgeisler
Owner

I wonder if it would be possible to tweak the wrapping so as to follow some of the rules mentioned here:
https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages

mgeisler added a commit that referenced this issue Apr 8, 2021
This adds a new optional dependency on the unicode-linebreak crate,
which implements the line breaking algorithm from [Unicode Standard
Annex #14](https://www.unicode.org/reports/tr14/).

The new dependency is enabled by default since these line breaks are
more correct than what you get by splitting on whitespace.

This should help address #220 and #80, though I’m no expert on
non-Western languages. More feedback from the community would be
needed here.
mgeisler added a commit that referenced this issue May 2, 2021
This adds a new optional dependency on the unicode-linebreak crate,
which implements the line breaking algorithm from [Unicode Standard
Annex #14](https://www.unicode.org/reports/tr14/). We can use this to
find words in non-ASCII text.

The new dependency is enabled by default since these line breaks are
more correct than what you get by splitting on ASCII space.

This should help address #220 and #80, though I’m no expert on
non-Western languages. More feedback from the community would be
needed here.
@mgeisler
Owner Author

mgeisler commented May 2, 2021

I'll close the issue now: the new WordSeparator trait makes it possible to customize exactly how a line of text is sliced up into "words", "syllables", or "ideographs".

The UnicodeBreakProperties word separator merged in #313 uses the Unicode line breaking algorithm to find words and will handle CJK characters according to the Unicode rules. This means that we can find words without intervening whitespace:

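// Import paths assume textwrap 0.14, where WordSeparator is a trait.
use textwrap::core::Word;
use textwrap::word_separators::{UnicodeBreakProperties, WordSeparator};

assert_eq!(UnicodeBreakProperties.find_words("CJK: 你好").collect::<Vec<_>>(),
           vec![Word::from("CJK: "),
                Word::from("你"),
                Word::from("好")]);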

These words are then wrapped by measuring their width and putting them into lines as usual.
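
To make the two steps concrete, here is a minimal, dependency-free sketch of the idea. It is not textwrap's implementation and not the full Unicode line breaking algorithm. The helper names (`is_cjk`, `char_width`, `find_words`, `wrap`) are hypothetical, the code point ranges are a rough approximation, and the width rule (2 columns for CJK, 1 otherwise) is a simplifying assumption.

```rust
/// Rough test for characters that allow a break on either side.
/// Assumption: only a few common ranges (kana, CJK ideographs, Hangul).
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF | 0x3400..=0x4DBF | 0x4E00..=0x9FFF | 0xAC00..=0xD7A3
    )
}

/// Approximate display width in terminal columns (simplified).
fn char_width(c: char) -> usize {
    let wide_punctuation = matches!(c as u32, 0x3000..=0x303F | 0xFF01..=0xFF60);
    if is_cjk(c) || wide_punctuation { 2 } else { 1 }
}

fn str_width(s: &str) -> usize {
    s.chars().map(char_width).sum()
}

/// Split text into "words". Trailing spaces stay attached to the word
/// before them, and every CJK character becomes its own word.
fn find_words(text: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (idx, c) in text.char_indices() {
        if let Some(p) = prev {
            let after_space = p == ' ' && c != ' ';
            let around_cjk = (is_cjk(p) || is_cjk(c)) && c != ' ';
            if after_space || around_cjk {
                words.push(&text[start..idx]);
                start = idx;
            }
        }
        prev = Some(c);
    }
    if start < text.len() {
        words.push(&text[start..]);
    }
    words
}

/// Greedy wrapping: add words to a line until the next one would not fit.
/// Trailing spaces do not count against the width and are trimmed.
fn wrap(text: &str, width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut line = String::new();
    let mut line_width = 0;
    for word in find_words(text) {
        let visible = str_width(word.trim_end_matches(' '));
        if !line.is_empty() && line_width + visible > width {
            lines.push(line.trim_end_matches(' ').to_string());
            line.clear();
            line_width = 0;
        }
        line.push_str(word);
        line_width += str_width(word);
    }
    if !line.is_empty() {
        lines.push(line.trim_end_matches(' ').to_string());
    }
    lines
}
```

With these definitions, `find_words("CJK: 你好")` splits the text into `"CJK: "`, `"你"` and `"好"`, and `wrap("CJK: 你好", 7)` produces the two lines `"CJK: 你"` and `"好"`, because the second ideograph would push the line to 9 columns.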

If someone with experience in these languages finds problems, I would love to hear about them, since I'm not at all an expert in this area.

@mgeisler mgeisler closed this as completed May 2, 2021