Does not work for languages without word separators #220
Hi @Koxiaet, thank you very much for linking me to the w3 article. I had not read that before and it's very interesting to get some background information on this. You're completely right that I'm currently working on reimplementing the basic line breaking algorithm. The goal is to make it more robust and flexible. Right now, the algorithm naively goes through the string and breaks it into words based on whitespace. The processing tries to do everything at once and it's not easy to adjust or extend it to handle more complicated cases like the one you describe. Do you know if there are freely available dictionaries we can refer to here? I know Danish, English and German myself, all written the same way with Latin script... So I'm very much out of my depth here and would need all the help I can get! |
I'm also out of my depth here, I don't know any non-European languages either. I think the best option is to use |
When I was looking into this previously, I was considering using As an example, it will find no breakpoints in I guess the breaks computed by I'll probably not make a lot of progress on this issue by myself since I'm not an expert in this field. I'll leave it open in case such an expert comes by :-) |
Yes, |
Okay, please take a look at the state of things after #221. It's merged now and changes the inner algorithm quite a bit — and it should open up the possibility of more custom splitting. Basically, the wrapping algorithm works with a sequence of I would be very interested to see what you can come up with here. |
That makes sense to me, it would be quite strange to see the
It would be better to let ICU do all the splitting, for the reason above, and possibly others. |
Yeah, sorry, I didn't mean to say that it is strange to skip this break point — I think it's really cool to get such finer adjustments to the breaks.
I still haven't looked more into the ICU library or the Rust bindings. But I would be happy to have textwrap handle more languages and writing systems correctly, possibly via optional Cargo features which people can enable or disable depending on their use cases. |
Maybe an approach where splitting is abstracted away behind a trait would be better? That way the current behaviour can be implemented using that trait but if someone wants to use the ICU library they could implement the trait for it. Maybe a function that takes the input str and returns a list of |
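A minimal sketch of what such a trait could look like, using only the standard library. The names `WordFinder`, `AsciiSpace` and `EveryChar` are hypothetical and are not textwrap's API; `EveryChar` is only a stand-in for a smarter, ICU-backed implementation.

```rust
/// Hypothetical trait: turns a line of text into breakable fragments.
trait WordFinder {
    fn find_words<'a>(&self, text: &'a str) -> Vec<&'a str>;
}

/// Current behaviour in spirit: break after every ASCII space.
struct AsciiSpace;

impl WordFinder for AsciiSpace {
    fn find_words<'a>(&self, text: &'a str) -> Vec<&'a str> {
        // Each fragment keeps its trailing space.
        text.split_inclusive(' ').collect()
    }
}

/// Stand-in for a smarter splitter: allow a break after every character.
struct EveryChar;

impl WordFinder for EveryChar {
    fn find_words<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.char_indices()
            .map(|(i, c)| &text[i..i + c.len_utf8()])
            .collect()
    }
}

/// Code that consumes fragments only needs to know about the trait.
fn count_fragments(finder: &dyn WordFinder, text: &str) -> usize {
    finder.find_words(text).len()
}
```

A caller would pick an implementation once and pass it along, for example `count_fragments(&AsciiSpace, "foo bar baz")`.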
Hi @sirwindfield, yeah that's certainly a possibility! Right now the way to do this is to use your own |
Thanks for letting me know! I've seen that quite a lot of work (and discussion) has happened around this library in #257 and #244. Have you two reached any consensus? I am not that familiar with the code base, but I think the ideas in both PRs are good approaches, especially #257. It's a good foundation to build upon. I do get your argument about "what can these changes do that we can't currently", but it's hard for me to argue from that perspective, as the changes needed to make the API more performant will be gradual. Putting wrapping behind iterators is just the first step towards fewer allocations. I am not a huge fan of larger discussions as I often find that, in the end, they lead to exhaustion and ultimately a loss of interest. Maybe as a compromise for the time being (and to see how these changes can (positively!) impact the library), we could merge the work by @Kestrer into a new branch (something like It's also up to @Kestrer as they are the driving force for these changes right now. I'd be willing to contribute with some easier issues/implementations. I am not familiar with the Unicode specifications at all, so I definitely need to read up on some things for the more complex problems. What are your thoughts? |
Hey @sirwindfield, in short, I would like to see the inner logic working in a streaming fashion, i.e., with iterators. That would also be useful for #224. The reason Textwrap is not doing so today is basically
On a higher level, I find that iterators combine really poorly in Rust. Basically, you need to push all branching decisions down into the iterators, i.e., you cannot do this:

let result = if some_condition() {
    it.filter(|&n| n > 10)
} else {
    it.filter(|&n| n <= 10)
};

since the two

let wrapped_words = match options.wrap_algorithm {
    core::WrapAlgorithm::OptimalFit => core::wrap_optimal_fit(broken_words, line_lengths)
        .terminate_eol()
        .into_vec(),
    core::WrapAlgorithm::FirstFit => core::wrap_first_fit(broken_words, line_lengths)
        .terminate_eol()
        .collect(),
};

So while the inner logic uses iterators all the way up to the wrapping logic, the driver of this logic still collects everything into a vector at this point. I think we also see the consequence of this in #244 where I believe the top-level function becomes a Maybe I'm missing a trick here, but this makes for a poor API in my opinion. We can then try to push the complexity down the stack and @Kestrer mentioned doing that with an |
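For illustration, one common way around the mismatched branch types is to box each branch as a trait object. This is a sketch under assumptions, not what textwrap does: the function name `select` and the `u32` item type are made up, and the price is one heap allocation plus dynamic dispatch on every `next()` call.

```rust
/// Returns one of two differently-typed filters behind a single boxed type.
fn select<'a>(
    it: impl Iterator<Item = u32> + 'a,
    keep_large: bool,
) -> Box<dyn Iterator<Item = u32> + 'a> {
    if keep_large {
        // Each closure has its own anonymous type, so the two `Filter`
        // values differ; boxing erases that difference.
        Box::new(it.filter(|&n| n > 10))
    } else {
        Box::new(it.filter(|&n| n <= 10))
    }
}
```

The caller can keep chaining adapters on the returned box as on any other iterator.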
This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](https://www.unicode.org/reports/tr14/). The new dependency is enabled by default since these line breaks are more correct than what you get by splitting on whitespace. This should help address #220 and #80, though I’m no expert on non-Western languages. More feedback from the community would be needed here.
Hi @Kestrer and @tavianator, I've put up #313 which uses the unicode-linebreak crate to find line breaks. We talked about it above and you both suggested that using the Rust ICU bindings might be better. I would like support for rust_icu too; for now I simply went with unicode-linebreak since it's a pure-Rust crate. Why does it matter that it's pure Rust? Well, simply adding
That's not a super impressive out-of-the-box experience 😄 I'm sure I could install a few development packages to make it work, but I'll start with the much simpler unicode-linebreak crate.
@sirwindfield, yes, you're quite right! I've been playing with removing the existing However, I realize now that this is bad because it prevents us from dealing with my other worry:
Basically, the way to make everything work smoothly with iterators is to only build one big iterator chain. At no point can you make choices in such a chain, so things like this from #313 are forbidden:

match line_break_algorithm {
    LineBreakAlgorithm::Whitespace => find_ascii_words(line),
    LineBreakAlgorithm::UnicodeLineBreaks => find_unicode_words(line),
}

In the PR, I had to |
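A statically dispatched way to let a `match` return a single iterator type is a small two-variant enum that forwards `next()`. The sketch below is an assumption-laden illustration rather than the code from #313: `Either` is defined locally, and the `EveryChar` variant is a made-up stand-in for a real Unicode-aware finder.

```rust
/// Wraps one of two iterator types that yield the same item type.
enum Either<L, R> {
    Left(L),
    Right(R),
}

impl<L: Iterator, R: Iterator<Item = L::Item>> Iterator for Either<L, R> {
    type Item = L::Item;

    fn next(&mut self) -> Option<Self::Item> {
        match self {
            Either::Left(left) => left.next(),
            Either::Right(right) => right.next(),
        }
    }
}

enum LineBreakAlgorithm {
    Whitespace,
    /// Hypothetical: allow a break after every character.
    EveryChar,
}

fn find_words<'a>(
    line: &'a str,
    algorithm: LineBreakAlgorithm,
) -> impl Iterator<Item = &'a str> {
    match algorithm {
        // Both arms now have the same type: Either<SplitInclusive, Map<..>>.
        LineBreakAlgorithm::Whitespace => Either::Left(line.split_inclusive(' ')),
        LineBreakAlgorithm::EveryChar => Either::Right(
            line.char_indices()
                .map(move |(i, c)| &line[i..i + c.len_utf8()]),
        ),
    }
}
```

This avoids allocation, but every extra algorithm needs another variant (or nested `Either`s), which is the kind of complexity being pushed down the stack.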
If someone wants to turn the wrap algorithms into a trait too, then that would be great! We'll end up with several generic parameters on |
This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](https://www.unicode.org/reports/tr14/). We can use this to find words in non-ASCII text. The new dependency is enabled by default since these line breaks are more correct than what you get by splitting on ASCII space. This should help address #220 and #80, though I’m no expert on non-Western languages. More feedback from the community would be needed here.
Hi all, I've just merged #332 which adds a

assert_eq!(UnicodeBreakProperties.find_words("CJK: 你好").collect::<Vec<_>>(),
           vec![Word::from("CJK: "),
                Word::from("你"),
                Word::from("好")]);

It would now be exciting to see an implementation of

@Kestrer you should also be able to implement the wrapping you mentioned in the comment that started this issue: the word separator should be able to own a dictionary so it can compute the correct breaks for the text you used in your example. |
I created #334 to track adding support for rust_icu. I'll close this issue now since I believe we now have (some) support for languages without word separators 🎉 Please reopen or file a new issue if you see problems! |
For example, fill("កើតមកមានសេរីភាព", 6) outputs កើតមកម, ានសេរភ and ាព when it should output កើតមក, មាន and សេរីភាព. See also w3's Approaches to line breaking document which has the correct ways to line break words; implementing support for this would require storing a dictionary and matching words in it.
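To make the dictionary idea concrete, here is a rough sketch of greedy longest-match segmentation followed by a simple first-fit wrap. Everything in it is an assumption for illustration: the function names are made up, real segmenters (ICU, for instance) are far more sophisticated, and `chars().count()` is only a crude stand-in for display width.

```rust
use std::collections::HashSet;

/// Splits `text` by repeatedly taking the longest dictionary word at the
/// current position, falling back to a single character if nothing matches.
fn segment<'a>(text: &'a str, dictionary: &HashSet<&str>) -> Vec<&'a str> {
    // Byte offsets of every character boundary, including the end of the text.
    let boundaries: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();

    let mut words = Vec::new();
    let mut start = 0; // index into `boundaries`
    while start + 1 < boundaries.len() {
        // Try the longest candidate first and shrink it one character at a time.
        let mut end = boundaries.len() - 1;
        while end > start + 1
            && !dictionary.contains(&text[boundaries[start]..boundaries[end]])
        {
            end -= 1;
        }
        words.push(&text[boundaries[start]..boundaries[end]]);
        start = end;
    }
    words
}

/// First-fit wrapping of unseparated words; width is counted in `char`s.
fn wrap_words(words: &[&str], width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    let mut current_len = 0;
    for word in words {
        let len = word.chars().count();
        if current_len > 0 && current_len + len > width {
            // The word does not fit: finish the current line first.
            lines.push(std::mem::take(&mut current));
            current_len = 0;
        }
        current.push_str(word);
        current_len += len;
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```

With a dictionary that contains the individual words of the example, `segment` followed by `wrap_words(&words, 6)` only ever breaks between dictionary words instead of in the middle of one. A word longer than the width is still placed on a line of its own.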