Does not work for languages without word separators #220

Closed
Kestrer opened this issue Nov 8, 2020 · 15 comments
Kestrer commented Nov 8, 2020

For example, fill("កើតមកមានសេរីភាព", 6) outputs កើតមកម, ានសេរភ and ាព when it should output កើតមក, មាន and សេរីភាព. See also the W3C's Approaches to Line Breaking document, which describes the correct ways to break lines in scripts without word separators; implementing support for this would require storing a dictionary and matching words against it.
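
To illustrate what dictionary-based word matching could look like, here is a minimal greedy longest-match sketch in Rust. The three-entry dictionary is hypothetical, and real segmenters (such as ICU's) combine large dictionaries with statistical disambiguation; this only sketches the core idea:

```rust
/// Greedy longest-match segmentation: at each position, consume the
/// longest dictionary entry that prefixes the remaining text; unknown
/// characters are emitted on their own. A sketch, not a production
/// segmenter.
fn segment<'a>(text: &'a str, dict: &[&str]) -> Vec<&'a str> {
    let mut words = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        let best = dict
            .iter()
            .copied()
            .filter(|w| rest.starts_with(w))
            .max_by_key(|w| w.len());
        let len = match best {
            Some(w) => w.len(),
            // No dictionary entry matches: take a single character.
            None => rest.chars().next().unwrap().len_utf8(),
        };
        words.push(&rest[..len]);
        rest = &rest[len..];
    }
    words
}

fn main() {
    // Hypothetical mini-dictionary covering the example above.
    let dict = ["កើតមក", "មាន", "សេរីភាព"];
    println!("{:?}", segment("កើតមកមានសេរីភាព", &dict));
}
```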

mgeisler commented Nov 8, 2020

Hi @Koxiaet, thank you very much for linking me to the W3C article. I had not read that before, and it's very interesting to get some background information on this.

You're completely right that textwrap only breaks on whitespace right now — I would like to extend this, as mentioned in #80 about how East-Asian languages could be handled.

I'm currently working on reimplementing the basic line breaking algorithm. The goal is to make it more robust and flexible. Right now, the algorithm naively goes through the string and breaks the string into words based on whitespace. The processing is trying to do everything at once and it's not easy to adjust or extend it to handle more complicated cases like you describe.

Do you know if there are freely available dictionaries we can refer to here? I know Danish, English and German myself — all written the same way with Latin script... So I'm very much out of my depth here and would need all the help I can get!

Kestrer commented Nov 8, 2020

I'm also out of my depth here, I don't know any non-European languages either. I think the best option is to use rust_icu_ubrk which provides Rust wrappers for Unicode's boundary analysis algorithms in its ICU library.

@mgeisler

When I was looking into this previously, I was considering using unicode-linebreak, but I think that crate only does a subset of what you're talking about: it only finds breakpoints around spaces and hyphens, and it forbids some breakpoints around certain kinds of punctuation.

As an example, it will find no breakpoints in "Hello !" since the space before the ! doesn't count, for a reason specified in Unicode Standard Annex #14.

I guess the breaks computed by rust_icu_ubrk are quite different if they can also occur inside words?

I'll probably not make a lot of progress on this issue by myself since I'm not an expert in this field. I'll leave it open in case such an expert comes by :-)

Kestrer commented Nov 12, 2020

Yes, rust_icu_ubrk works very differently. But I don't think we need an expert - ICU seems to be the state of the art when it comes to this sort of thing. Chrome uses it for its line breaking algorithm. When you've finished rewriting the line breaking algorithm I can try to integrate it if you like.

@mgeisler

Okay, please take a look at the state of things after #221. It's merged now and changes the inner algorithm quite a bit — and it should open up the possibility of more custom splitting. Basically, the wrapping algorithm works with a sequence of Word elements, and I can imagine that you could take the words found by the naive split-on-space algorithm and further refine them using rust_icu_ubrk.

I would be very interested to see what you can come up with here.

@tavianator

As an example, it will find no breakpoints in "Hello !" since the space before the ! doesn't count, for a reason specified in Unicode Standard Annex #14.

That makes sense to me, it would be quite strange to see the ! on a line by itself after all. Some languages like French commonly put a space before certain punctuation marks, e.g. «bonjour le monde !», which would look silly as

bonjour le monde
!

I can imagine that you could take the words found by the naive split-on-space algorithm and further refine them using rust_icu_ubrk.

It would be better to let ICU do all the splitting, for the reason above, and possibly others.

@mgeisler

That makes sense to me, it would be quite strange to see the ! on a line by itself after all.

Yeah, sorry, I didn't mean to say that it is strange to skip this break point — I think it's really cool to get such fine-grained adjustments to the breaks.

It would be better to let ICU do all the splitting, for the reason above, and possibly others.

I still haven't looked more into the ICU library or the Rust bindings. But I would be happy to have textwrap handle more languages and writing systems correctly, possibly via optional Cargo features which people can enable or disable depending on their use cases.

mainrs commented Mar 18, 2021

Maybe an approach where splitting is abstracted away behind a trait would be better? That way the current behaviour can be implemented using that trait, and if someone wants to use the ICU library they could implement the trait for it. Maybe a function that takes the input str and returns a list of Words.
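
A rough sketch of what such a trait could look like (all names here are hypothetical, not textwrap's actual API, though textwrap later gained a trait along these lines):

```rust
/// Hypothetical trait: a pluggable strategy for splitting a line into
/// words. The current split-on-whitespace behaviour becomes one
/// implementation; an ICU-backed one could be another.
trait WordSeparator {
    fn find_words<'a>(&self, line: &'a str) -> Vec<&'a str>;
}

/// The current behaviour: split on whitespace.
struct WhitespaceSeparator;

impl WordSeparator for WhitespaceSeparator {
    fn find_words<'a>(&self, line: &'a str) -> Vec<&'a str> {
        line.split_whitespace().collect()
    }
}

fn main() {
    let sep = WhitespaceSeparator;
    println!("{:?}", sep.find_words("Hello brave new world"));
}
```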

mgeisler commented Apr 5, 2021

Hi @sirwindfield, yeah that's certainly a possibility!

Right now the way to do this is to use your own Fragment implementation: do the splitting any way you want and use the core wrapping functionality to wrap your custom fragments. This way you can wrap things with non-integer widths (see #312) and you can decide precisely where to split the text.

mainrs commented Apr 5, 2021

Thanks for letting me know! I've seen that quite a bit of work (and discussion) has happened around this library in #257 and #244. Have you two reached any consensus?

I am far from familiar with the code base, but I think the ideas mentioned in both APIs are good approaches, especially #257. It's a good foundation to build upon. I do get your argument about "what can these changes do that we can't currently?", but it's hard for me to argue from that perspective, as the changes needed to make the API more performant will be gradual. Putting wrapping behind iterators is just the first step towards lowering the number of allocations.

I am not a huge fan of larger discussions as I often find that, in the end, they lead to exhaustion and ultimately a loss of interest. Maybe as a compromise for the time being (and to see how these changes can (positively!) impact the library), we could merge the work by @Kestrer into a new branch (something like experimental) where they would then open PRs against that branch. I'd help out with reviews as well and can give my opinion and thoughts about those PRs. But I think a more practical solution might be the way to go. Implementing new features inside that branch allows us to compare them to the current implementation as well through some benchmarks.

It's also up to @Kestrer as they are the driving force for these changes right now. I'd be willing to contribute some of the easier issues/implementations. I am not familiar with the Unicode specifications at all, so I definitely need to read up on some things for the more complex problems.

What are your thoughts?

mgeisler commented Apr 6, 2021

What are your thoughts?

Hey @sirwindfield, in short, I would like to see the inner logic working in a streaming fashion, i.e., with iterators. That would also be useful for #224.

The reasons Textwrap is not doing so today are basically:

  1. I used to have an iterator approach and it was a complicated and inflexible mess. Further, there were some subtle bugs and inconsistencies which finally got fixed when I rewrote the whole thing in "Reformulate wrapping in terms of words with whitespace and penalties" (#221). Moving to a more layered design has helped a lot (I think) and it allowed me to easily do things like fill_inplace. This was all part of the 0.13 release late last year.

    Now, iterators can certainly also be layered and organized in a nice way. However, I do think they add complexity simply by adding code. I removed code last year and I'm reluctant to add it back.

  2. With the addition of the wrap_optimal_fit function (which requires O(1) access to the width of all words), the appeal of iterators diminished a lot. You'll see that I used iterators in the find_words and split_words functions, but when I came to the wrapping functions, I simply used the simplest common interface.

    Yes, one can of course let wrap_optimal_fit do a collect() followed by returning something which implements Iterator. However, you don't get any of the expected advantages: you cannot feed in words one-by-one and get nicely wrapped lines out the other end.

  3. Consumers of Textwrap, such as Clap, seem happy to call textwrap::fill and call it a day. So they transform String to String and that's it. Now, this is likely very biased since that's the only option I left them with after version 0.13.

On a higher level, I find that iterators combine really poorly in Rust. Basically, you need to push all branching decisions down into the iterators, i.e., you cannot do this:

let result = if some_condition() {
        it.filter(|&n| n > 10)
    } else {
        it.filter(|&n| n <= 10)
    }; // error[E0308]: `if` and `else` have incompatible types

since the two filter calls result in different iterators. We see this in the changes in #257 where we have code like this in textwrap::wrap:

        let wrapped_words = match options.wrap_algorithm {
            core::WrapAlgorithm::OptimalFit => core::wrap_optimal_fit(broken_words, line_lengths)
                .terminate_eol()
                .into_vec(),
            core::WrapAlgorithm::FirstFit => core::wrap_first_fit(broken_words, line_lengths)
                .terminate_eol()
                .collect(),
        };

So while the inner logic uses iterators all the way up to the wrapping logic, the driver of this logic still converts everything into an iterator at this point. I think we also see the consequence of this in #244 where I believe the top-level function becomes a textwrap::wrap_greedy and presumably there will also be a textwrap::wrap_optimal_fit function later.

Maybe I'm missing a trick here, but this makes for a poor API in my opinion. We can then try to push the complexity down the stack and @Kestrer mentioned doing that with an enum-based iterator. I'm not sure how nice that will look or feel in the code.
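
For reference, the usual workaround for the branching example above is to erase the two concrete iterator types behind a Box<dyn Iterator>, at the cost of a heap allocation and dynamic dispatch. A minimal standalone sketch (not textwrap code):

```rust
/// Both branches produce different concrete `Filter` types, but both
/// can be boxed into the same trait-object type, so the `if`/`else`
/// now type-checks.
fn filtered(nums: Vec<i32>, keep_large: bool) -> Box<dyn Iterator<Item = i32>> {
    let it = nums.into_iter();
    if keep_large {
        Box::new(it.filter(|&n| n > 10))
    } else {
        Box::new(it.filter(|&n| n <= 10))
    }
}

fn main() {
    let result: Vec<i32> = filtered(vec![5, 15, 25], true).collect();
    println!("{:?}", result); // [15, 25]
}
```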

mgeisler added a commit that referenced this issue Apr 8, 2021
This adds a new optional dependency on the unicode-linebreak crate,
which implements the line breaking algorithm from [Unicode Standard
Annex #14](https://www.unicode.org/reports/tr14/).

The new dependency is enabled by default since these line breaks are
more correct than what you get by splitting on whitespace.

This should help address #220 and #80, though I’m no expert on
non-Western languages. More feedback from the community would be
needed here.
mgeisler commented Apr 9, 2021

Hi @Kestrer and @tavianator, I've put up #313 which uses the unicode-linebreak crate to find line breaks. We talked about it above and you both suggested that using the Rust ICU bindings might be better. I would like support for rust_icu too — for now I simply went with unicode-linebreak since it's a pure-Rust crate.

Why does it matter that it's pure Rust? Well, simply adding rust_icu = "*" to the dependencies and running cargo build gives me this:

   Compiling rust_icu_sys v0.4.1
error: failed to run custom build command for `rust_icu_sys v0.4.1`

Caused by:
  process didn't exit successfully: `/home/mg/src/textwrap/target/debug/build/rust_icu_sys-e810eefa177b5c20/build-script-build` (exit code: 101)
  --- stdout
  cargo:rustc-cfg=feature="renaming"
  cargo:rustc-cfg=feature="use-bindgen"
  cargo:rustc-cfg=feature="icu_config"

  --- stderr
  thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: Empty }', /home/mg/.cargo/registry/src/github.com-1ecc6299db9ec823/rust_icu_sys-0.4.1/build.rs:224:36
  stack backtrace:
     0:     0x55d82481f2f0 - std::backtrace_rs::backtrace::libunwind::trace::hfe3b1cace85e87d8
                                 at /rustc/acca818928654807ed3bc1ce0e97df118f8716c8/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
     1:     0x55d82481f2f0 - std::backtrace_rs::backtrace::trace_unsynchronized::h542330af06479043
                                 at /rustc/acca818928654807ed3bc1ce0e97df118f8716c8/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
     2:     0x55d82481f2f0 - std::sys_common::backtrace::_print_fmt::h6b88726367858985
                                 at /rustc/acca818928654807ed3bc1ce0e97df118f8716c8/library/std/src/sys_common/backtrace.rs:67:5

That's not a super impressive out-of-the-box experience 😄 I'm sure I could install a few development packages to make it work, but I'll start with the much simpler unicode-linebreak crate.

Maybe an approach where splitting is abstracted away behind a trait would be better? That way the current behaviour can be implemented using that trait, and if someone wants to use the ICU library they could implement the trait for it. Maybe a function that takes the input str and returns a list of Words.

@sirwindfield, yes, you're quite right! I've been playing with removing the existing WordSplitter trait and replacing it with an enum — since I cannot find any third-party implementations of WordSplitter in the wild, I figured it would be easier for everybody to present the three known implementations as enum variants.

However, I realize now that this is bad because it prevents us from dealing with my other worry:

On a higher level, I find that iterators combine really poorly in Rust. Basically, you need to push all branching decisions down into the iterators,

Basically, the way to make everything work smoothly with iterators is to only build one big iterator chain. At no point can you make choices in such a chain — so things like this from #313 are forbidden:

    match line_break_algorithm {
        LineBreakAlgorithm::Whitespace => find_ascii_words(line),
        LineBreakAlgorithm::UnicodeLineBreaks => find_unicode_words(line),
    }

In the PR, I had to collect() the results of the two iterators into a Vec because of the type mismatch. I'll now go and rework this to use a trait instead, so that the choice is made only once at runtime — and thus there are no branches with conflicting types.
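
That rework could look roughly like this (all names hypothetical): the match picks a trait object once, and each implementation hides its own iterator type behind a boxed return, so the caller never juggles two conflicting types:

```rust
/// Hypothetical sketch of the trait-based fix: the runtime choice
/// selects a trait object once; each implementation can use whatever
/// concrete iterator it likes internally.
trait WordFinder {
    fn find_words<'a>(&self, line: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a>;
}

struct AsciiWords;
struct DashWords;

impl WordFinder for AsciiWords {
    fn find_words<'a>(&self, line: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        Box::new(line.split_ascii_whitespace())
    }
}

impl WordFinder for DashWords {
    fn find_words<'a>(&self, line: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        Box::new(line.split('-'))
    }
}

fn find(line: &str, ascii: bool) -> Vec<&str> {
    let a = AsciiWords;
    let d = DashWords;
    // One runtime choice; no branches with conflicting iterator types.
    let finder: &dyn WordFinder = if ascii { &a } else { &d };
    finder.find_words(line).collect()
}

fn main() {
    println!("{:?}", find("foo bar", true));
    println!("{:?}", find("foo-bar", false));
}
```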

mgeisler commented Apr 9, 2021

If someone wants to turn the wrap algorithms into a trait too, then that would be great! We'll end up with several generic parameters on Options, but I hope they'll mostly stay out of sight because of the defaults you get when you use Options::new.

mgeisler added a commit that referenced this issue Apr 14, 2021
This adds a new optional dependency on the unicode-linebreak crate,
which implements the line breaking algorithm from [Unicode Standard
Annex #14](https://www.unicode.org/reports/tr14/).

The new dependency is enabled by default since these line breaks are
more correct than what you get by splitting on whitespace.

This should help address #220 and #80, though I’m no expert on
non-Western languages. More feedback from the community would be
needed here.
mgeisler added a commit that referenced this issue May 2, 2021
This adds a new optional dependency on the unicode-linebreak crate,
which implements the line breaking algorithm from [Unicode Standard
Annex #14](https://www.unicode.org/reports/tr14/). We can use this to
find words in non-ASCII text.

The new dependency is enabled by default since these line breaks are
more correct than what you get by splitting on ASCII space.

This should help address #220 and #80, though I’m no expert on
non-Western languages. More feedback from the community would be
needed here.
mgeisler commented May 2, 2021

Hi all,

I've just merged #332 which adds a WordSeparator trait for specifying how words are found in a line of text. I've also just merged #313 which implements the trait using the unicode-linebreak crate. With this word separator, we can now wrap some CJK text, even if there is no whitespace between characters:

assert_eq!(UnicodeBreakProperties.find_words("CJK: 你好").collect::<Vec<_>>(),
           vec![Word::from("CJK: "),
                Word::from("你"),
                Word::from("好")]);

It would now be exciting to see an implementation of WordSeparator for the rust_icu crate. I would be happy to accept such a pull request.

@Kestrer you should also be able to implement the wrapping you mentioned in the comment that started this issue: the word separator should be able to own a dictionary so it can compute the correct breaks for the text you used in your example.

mgeisler commented May 2, 2021

I created #334 to track adding support for rust_icu. I'll close this issue now since I believe we now have (some) support for languages without word separators 🎉 Please reopen or file a new issue if you see problems!

mgeisler closed this as completed May 2, 2021