&str.words should be using the unicode word boundary algorithm #15628
Note that this gives some possibly unintuitive results.
It occurs to me that we could take a bool for this.
I think taking a bool for this is bad design. It should be a separate method.
This has to be decided before 1.0.
Here is the implementation I've just written for a program I'm working on:

```rust
fn word_at(&self, idx: usize) -> &str {
    let s = &self[idx..];
    let mut take_prev = true;
    let mut take_cur = true;
    let mut state = WState::Start;
    let mut prev_idx = 0;
    let mut idx = 0;
    for (curr, c) in s.char_indices() {
        prev_idx = idx;
        idx = curr;
        let cat = word::category(c);
        // Extend and Format characters never break a word (WB4).
        if cat == WCat::Extend || cat == WCat::Format {
            if state == WState::Start {
                break;
            } else {
                continue;
            }
        }
        state = match state {
            // Treat CRLF as a single unit (WB3).
            WState::Start if c == '\r' => {
                if idx + 1 != s.len() && s.char_at(idx + 1) == '\n' {
                    idx += 1;
                }
                break;
            },
            WState::Start => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::Numeric => WState::Numeric,
                WCat::Katakana => WState::Katakana,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                WCat::RegionalIndicator => WState::RegionalIndicator,
                _ => break,
            },
            WState::ALetter => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::MidLetter | WCat::MidNumLet |
                WCat::SingleQuote => WState::ALetterExt,
                WCat::Numeric => WState::Numeric,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::ALetterExt => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                _ => {
                    take_cur = false;
                    take_prev = false;
                    break;
                }
            },
            WState::HLetter => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::MidLetter | WCat::MidNumLet => WState::HLetterExt(0),
                WCat::SingleQuote => WState::HLetterExt(1),
                WCat::DoubleQuote => WState::HLetterExt(2),
                WCat::Numeric => WState::Numeric,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::HLetterExt(n) => match cat {
                WCat::ALetter if n < 2 => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                _ => {
                    take_cur = false;
                    take_prev = n == 1;
                    break;
                }
            },
            WState::Numeric => match cat {
                WCat::Numeric => WState::Numeric,
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::MidNum | WCat::MidNumLet |
                WCat::SingleQuote => WState::NumericExt,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::NumericExt => match cat {
                WCat::Numeric => WState::Numeric,
                _ => {
                    take_cur = false;
                    take_prev = false;
                    break;
                }
            },
            WState::Katakana => match cat {
                WCat::Katakana => WState::Katakana,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::ExtendNumLet => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::Numeric => WState::Numeric,
                WCat::Katakana => WState::Katakana,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::RegionalIndicator => match cat {
                WCat::RegionalIndicator => WState::RegionalIndicator,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            // SingleQuote is only entered by the reverse scan.
            WState::SingleQuote => unreachable!(),
        };
    }
    if take_cur {
        idx += s.char_at(idx).len_utf8();
    } else if !take_prev {
        idx = prev_idx;
    }
    &s[..idx]
}
```
```rust
fn word_at_reverse(&self, idx: usize) -> &str {
    let s = &self[..idx];
    let mut take_prev = true;
    let mut take_cur = true;
    let mut state = WState::Start;
    let mut prev_prev_idx = s.len();
    let mut prev_idx = s.len();
    let mut idx = s.len();
    for (curr, c) in s.char_indices().rev() {
        let cat = word::category(c);
        // Skip Extend and Format characters entirely (WB4).
        if cat == WCat::Extend || cat == WCat::Format {
            continue;
        }
        prev_prev_idx = prev_idx;
        prev_idx = idx;
        idx = curr;
        state = match state {
            // Treat CRLF as a single unit (WB3).
            WState::Start if c == '\n' => {
                if idx > 0 && s.char_at_reverse(idx) == '\r' {
                    idx -= 1;
                }
                break;
            },
            WState::Start => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::SingleQuote => WState::SingleQuote,
                WCat::Numeric => WState::Numeric,
                WCat::Katakana => WState::Katakana,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                WCat::RegionalIndicator => WState::RegionalIndicator,
                _ => break,
            },
            WState::ALetter => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::MidLetter | WCat::MidNumLet |
                WCat::SingleQuote => WState::ALetterExt,
                WCat::Numeric => WState::Numeric,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::ALetterExt => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                _ => {
                    take_cur = false;
                    take_prev = false;
                    break;
                }
            },
            WState::HLetter => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::MidLetter | WCat::MidNumLet => WState::HLetterExt(0),
                WCat::DoubleQuote => WState::HLetterExt(2),
                WCat::Numeric => WState::Numeric,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::HLetterExt(n) => match cat {
                WCat::ALetter if n < 2 => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                _ => {
                    take_cur = false;
                    take_prev = false;
                    break;
                }
            },
            WState::SingleQuote => match cat {
                WCat::HebrewLetter => WState::HLetter,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::Numeric => match cat {
                WCat::Numeric => WState::Numeric,
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::MidNum | WCat::MidNumLet |
                WCat::SingleQuote => WState::NumericExt,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::NumericExt => match cat {
                WCat::Numeric => WState::Numeric,
                _ => {
                    take_cur = false;
                    take_prev = false;
                    break;
                }
            },
            WState::Katakana => match cat {
                WCat::Katakana => WState::Katakana,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::ExtendNumLet => match cat {
                WCat::ALetter => WState::ALetter,
                WCat::HebrewLetter => WState::HLetter,
                WCat::Numeric => WState::Numeric,
                WCat::Katakana => WState::Katakana,
                WCat::ExtendNumLet => WState::ExtendNumLet,
                _ => {
                    take_cur = false;
                    break;
                }
            },
            WState::RegionalIndicator => match cat {
                WCat::RegionalIndicator => WState::RegionalIndicator,
                _ => {
                    take_cur = false;
                    break;
                }
            },
        };
    }
    if idx == s.len() {
        idx = 0;
    } else if !take_prev {
        idx = prev_prev_idx;
    } else if !take_cur {
        idx = prev_idx;
    }
    &s[idx..]
}
```
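The pair above depends on the commenter's own `WCat`/`WState` types and `word::category` tables, so it cannot run standalone. As a rough illustration of the intended contract only (the function names are reused, but the ASCII-only break rule here is my simplification, not the UAX#29 behaviour): `word_at(s, i)` returns the word starting at byte `i`, and `word_at_reverse(s, i)` the word ending just before byte `i`.

```rust
// Simplified stand-in for the word_at / word_at_reverse contract:
// break on any non-alphanumeric character instead of the full UAX#29 rules.
fn word_at(s: &str, idx: usize) -> &str {
    let s = &s[idx..];
    let end = s.find(|c: char| !c.is_alphanumeric()).unwrap_or(s.len());
    &s[..end]
}

fn word_at_reverse(s: &str, idx: usize) -> &str {
    let s = &s[..idx];
    // rfind yields the byte index of the breaking (ASCII) char;
    // the word starts one byte after it.
    let start = s.rfind(|c: char| !c.is_alphanumeric()).map_or(0, |i| i + 1);
    &s[start..]
}

fn main() {
    assert_eq!(word_at("foo bar", 4), "bar");
    assert_eq!(word_at_reverse("foo bar", 3), "foo");
}
```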
It's still not clear to me that we want "words" to be UAX#29 vs "normal" splitting. Most people expect to break on whitespace based on their experiences in other languages. That's not a reason to do or not do it; I'm just pointing out that some thought would be good here. I'm convinced that it's worthwhile to have a UAX#29 implementation available in the library, since there is good reason to have strong Unicode support. I'm just unclear on how the naming should be handled.
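For a concrete sense of the difference under discussion: plain whitespace splitting keeps punctuation glued to adjacent words, while UAX#29 segmentation separates it. A std-only sketch (the UAX#29 output is shown in a comment for comparison, not computed):

```rust
// Tokenize on Unicode whitespace only; punctuation stays attached.
fn ws_tokens(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

fn main() {
    let s = "The brown fox can't jump 32.3 feet, right?";
    assert_eq!(
        ws_tokens(s),
        ["The", "brown", "fox", "can't", "jump", "32.3", "feet,", "right?"]
    );
    // A UAX#29 word iterator would instead yield
    // ["The", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]:
    // punctuation is dropped, while "can't" and "32.3" stay single words.
}
```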
I think … This seems like a nice function though. I suppose it's quite specific, so it could have a longer name. I can give some suggestions: … And if you wanted to be ultra precise/unambiguous/mean you could remove …
One might want to read the documentation then.
It's clear that no thought went into the current implementation. A function that uses a degenerate algorithm which barely finds words in ASCII does not belong in a Unicode library. I'm not sure what purpose this algorithm serves at all. Chrome certainly doesn't use it to find word boundaries, and neither does vim. If someone actually needs this behavior then they can get it with the one-liner in the first post.
This is a fair observation. In that case, probably we should
This works for me if it works for you. From a practical standpoint: are you interested in putting together a pull request to integrate the above into libunicode? If not, I can probably find some time to do it over the weekend.
I don't know Python so I won't do it. Besides, the current Python implementation seems bad because it doesn't use the normative file for the grapheme classes. Also, the names of the functions should reflect what they are actually doing.
We don't need to do it as described; we just need to decide on something.
FWIW, it seems totally reasonable to mark this unstable.
I can find exactly one and a half other languages that implement similar functionality (rather than just using a plain split). Haskell has … PR for unstabilisation: #22253.
It is not totally clear if we should just use whitespace, or if the full unicode word-breaking algorithm is more correct. If there is demand we can reconsider this decision (and consider the precise algorithm to use in detail). cc rust-lang#15628.
The two algorithms are different enough that they should use different method names IMO. Whitespace split is popular and useful (as a poor man's lex for command lines, for example?).
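The "poor man's lex" use is easy to demonstrate with the replacement one-liner from the issue description: argv-style tokens fall out directly (std-only sketch; the helper name is mine):

```rust
// The one-liner the issue gives as a replacement for the old words()
// behaviour: split on whitespace and drop empty pieces.
fn words(s: &str) -> Vec<&str> {
    s.split(|c: char| c.is_whitespace())
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    // Poor man's lexer for a command line: runs of whitespace collapse.
    assert_eq!(words("  cp   -r  src   dest "), ["cp", "-r", "src", "dest"]);
}
```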
@mahkoh points out in rust-lang#15628 that unicode.py does not use normative data for Grapheme classes. This PR fixes that issue, and does some minor cleanup of the unicode.py script. In addition, GC_RegionalIndicator is renamed GC_Regional_Indicator in order to stay in line with the Unicode class name definitions. I have updated refs in u_str.rs, and verified that there are no refs elsewhere in the codebase. However, in principle someone using the unicode tables for their own purposes might see breakage from this.
RFC’d: rust-lang/rfcs#1054
I've become convinced that, while it might be useful to offer UAX#29 breaking as part of libunicode, it's definitely not what most people will want most of the time, so it should probably not be the default. I'm almost ready to submit a PR to add a UAX#29 word-breaking implementation to libunicode, but a short function name that performs the standard "split on whitespace" is probably something that people kind of expect from the stdlib. Perhaps renaming the current function per @SimonSapin's suggestion is the right way to go here. By the way, @mahkoh, I think your reverse iterator above does not correctly handle Format and Extend characters. For example, I think your implementation might group Format or Extend characters with the wrong word, and I'm pretty sure it doesn't do the right thing in the pathological case, namely, when there are Format or Extend characters at the beginning of a string.
@kwantam that’s great! However I’d encourage you to publish this code on crates.io rather than try to get it into libunicode. That way it can evolve independently of the compiler, and the API does not need to make as strong compatibility promises as the standard library.
That's a not unreasonable opinion, but I think it's probably worth having the discussion in the PR rather than here. If it's ultimately rejected in favor of moving to a separate library, that's fine. A larger question is whether more of the functionality currently in libunicode should be moved elsewhere, e.g., into a standalone crate. For example, it seems slightly arbitrary to have grapheme clustering, normalizations, etc. in the stdlib (as is currently the case) while keeping other functionality elsewhere.
Yes, definitely. I kinda volunteered to do this a while back, but haven’t gotten to it yet. |
I worry about that. Unicode is hard and complicated, so I fear many people will just treat strings like ASCII if a lot of Unicode functionality is moved out of the stdlib. The best way to encourage people to handle non-ASCII data correctly is, in my opinion, to make it very easy to do so, and that can be accomplished effectively by keeping all Unicode functionality in-tree.
“More correct” really depends on what you’re doing with it. While I agree with the general sentiment that good Unicode support is good, just because it says “Unicode” in the name doesn’t mean that UAX#29 is always the right thing. Let’s pretend for a minute that … The Python equivalent is … The only code I ever wrote that was actually manipulating words in prose text was line breaking in a text layout engine, but 1. that’s UAX#14, not #29, and 2. it’s specialized enough that I want lots of control: I don’t expect to just use a single standard library method.
Note: I'm out of my depth here and late for bed, so hopefully I'm not just adding noise. Personally I'd love to have a words iterator that works along the same lines as the "Counting the Words in a UTF-8 String" example in this ICU guide: http://userguide.icu-project.org/strings/utext So: use UAX#29 word breaking, but only return words and numbers, not whitespace or punctuation. I don't think that this behaviour would be surprising. The only downsides I see are complexity of implementation and the fact that it would be less speedy than just splitting by whitespace. As far as complexity goes, I think that this type of complexity argues for it being implemented in the standard library, where it will get a well-scrutinized, high-quality implementation that is available to anyone using the standard library. As for performance, it would be easy for a programmer to opt out: if they know what they're doing, and what they're doing doesn't need the more complex word-splitting algorithm, then splitting by whitespace is only a one-liner away. It appears to me to be a suitable default. This seems like another place where the standard library can help programmers avoid tripping up. It would be more complex for the implementor, not the user. For the user it would just work, even in cases the programmer never thought about.
I disagree with the claim that putting something in the standard library …
@SimonSapin I agree with your concerns about stability and domain knowledge, but I fear you might be overselling them slightly. My understanding is that marking it … As far as domain knowledge goes, in general I agree, but in this case it's mostly just a matter of implementing the slightly gross rules (the backward iterator is the gross one) in a more-or-less linear-time automaton with reasonable constant factors and acceptable code quality. The deep domain knowledge is required for writing UAX#29, surely, but not so much for implementing it. @blankname I think what you're after is effectively …
This patch does the following:

1. Adds three new structs in libunicode/str.rs:
   a. UnicodeWords: a filter on the UWordBounds iterator that yields only the "words" of a string as defined in Section 4 of Unicode Standard Annex #29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries
   b. UWordBounds: an iterator that segments a string on its word boundaries as defined in UAX#29. Note that this *only* segments the string, and does *not* drop whitespace and other non-word pieces of the text (that's what UnicodeWords does). Note that UWordBounds has both a forward and backward iterator that have total running time (that is, to segment the entire string) linear in the size of the string. It should be noted that with pathological inputs the reverse iterator could be about 2x less efficient than the forward iterator, but on reasonable inputs their costs are similar.
   c. UWordBoundIndices: the above iterator, but returning tuples of (offset, &str).
2. Adds three new functions in the `UnicodeStr` trait:
   a. words_unicode(): returns a UnicodeWords iterator.
   b. split_words_uax29(): returns a UWordBounds iterator.
   c. split_words_uax29_indices(): returns a UWordBoundIndices iterator.
3. Updates the `src/etc/unicode.py` script to generate tables necessary for running the UWordBounds iterators.
4. Adds a new script, `src/etc/unicode_gen_breaktests.py`, which processes the grapheme and word break tests published by the Unicode consortium into a format for inclusion in libcollectionstest.
5. Adds new impls in libcollections's `str` corresponding to the `UnicodeStr` functions of (2). Note that this new functionality is gated with `feature(unicode)`.
6. Adds tests in libcollectionstest to exercise this new functionality. In addition, updates the test data for the graphemes test to correspond to the output from the script of (4). (Note that at the moment this change is primarily cosmetic.)
This patch does not settle the question raised by @huonw in rust-lang#15628; rather, it introduces a new function alongside `words()` that follows UAX#29. In addition, it does not address the concerns that @SimonSapin raises in rust-lang/rfcs#1054 since it leaves `words()` alone.
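The UnicodeWords-as-filter design described in the patch can be sketched independently of the real UWordBounds iterator: given any segmentation into boundary pieces (hand-written below), keeping only the segments that contain an alphanumeric character yields the "words". The helper name here is mine, not the patch's:

```rust
// Sketch of the "filter on word bounds" idea: drop whitespace and
// punctuation segments, keeping only those with alphanumeric content.
fn words_only<'a>(bounds: &[&'a str]) -> Vec<&'a str> {
    bounds
        .iter()
        .copied()
        .filter(|seg| seg.chars().any(|c| c.is_alphanumeric()))
        .collect()
}

fn main() {
    // Hand-split stand-in for UWordBounds output on "The quick, 32.3 feet!"
    let bounds = ["The", " ", "quick", ",", " ", "32.3", " ", "feet", "!"];
    assert_eq!(words_only(&bounds), ["The", "quick", "32.3", "feet"]);
}
```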
For now, words() is left in (but deprecated), and Words is a type alias for struct SplitWhitespace. Also cleaned up references to s.words() throughout codebase. Closes rust-lang#15628
minor: Sync from downstream
Currently it's just splitting on whitespace (i.e. it is equivalent to splitting with `regex!(r"\s+")`). Preferably it should be using http://www.unicode.org/reports/tr29/#Word_Boundaries

(The old behaviour is easy to replicate with the above regex, or with `s.split(|c: char| c.is_whitespace()).filter(|s| !s.is_empty())`.)