&str.words should be using the unicode word boundary algorithm #15628

Closed
huonw opened this issue Jul 12, 2014 · 28 comments · Fixed by #24563
Assignees
Labels
P-medium Medium priority

Comments

@huonw
Member

huonw commented Jul 12, 2014

Currently it's just splitting on whitespace (i.e. it is equivalent to splitting with regex!(r"\s+")). Preferably it should use the Unicode word boundary algorithm: http://www.unicode.org/reports/tr29/#Word_Boundaries

(The old behaviour is easy to replicate with the above regex, or with s.split(|c: char| c.is_whitespace()).filter(|s| !s.is_empty()).)
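
For reference, a self-contained sketch of that replacement on modern Rust, where split_whitespace() is the name this whitespace-only behaviour eventually stabilised under (see the PRs at the end of this thread):

    // A minimal sketch of the whitespace-only behaviour and the replacement
    // one-liner from the comment above.
    fn main() {
        let s = "Hello  there,\tworld";

        let manual: Vec<&str> = s
            .split(|c: char| c.is_whitespace())
            .filter(|s| !s.is_empty())
            .collect();

        assert_eq!(manual, ["Hello", "there,", "world"]);
        // `split_whitespace()` (stable since Rust 1.1) does the same thing.
        assert_eq!(s.split_whitespace().collect::<Vec<_>>(), manual);
    }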

@kwantam
Contributor

kwantam commented Jul 12, 2014

Note that this gives some possibly unintuitive results.

("Hello there") gives ( " Hello there " ), which might not be what one expects.

@kwantam
Contributor

kwantam commented Jul 14, 2014

It occurs to me that we could take a bool indicating whether the user wants the full UAX#29 algorithm or the simple one. A lot of other languages say words() when they mean "split on whitespace", and it's a trivial thing to continue offering the present behavior even after implementing the UAX#29 algorithm.

@lucy
Contributor

lucy commented Jul 15, 2014

I think taking a bool for this is bad design. It should be a separate method.
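
For illustration, a sketch of the two API shapes being debated; the trait and method names below are purely hypothetical, not an actual std API:

    // Hypothetical signatures only, to contrast the two designs.
    trait Words {
        // Option A: one method with a flag selecting the algorithm.
        fn words(&self, unicode_boundaries: bool) -> Vec<&str>;

        // Option B: two separate methods, one per algorithm.
        fn words_by_whitespace(&self) -> Vec<&str>;
        fn words_by_boundary(&self) -> Vec<&str>;
    }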

@mahkoh
Contributor

mahkoh commented Jan 11, 2015

This has to be decided before 1.0.

@mahkoh
Contributor

mahkoh commented Jan 12, 2015

Here is the implementation I've just written for a program I'm working on:

    fn word_at(&self, idx: usize) -> &str {
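        // Forward scan: return the UAX#29 word beginning at byte offset `idx`.
        // Relies on external `WState`/`WCat` enums and a `word::category` lookup
        // (not shown here), and on the pre-1.0 unstable `str::char_at` API.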
        let s = &self[idx..];
        let mut take_prev = true;
        let mut take_cur = true;
        let mut state = WState::Start;
        let mut prev_idx = 0;
        let mut idx = 0;

        for (curr, c) in s.char_indices() {
            prev_idx = idx;
            idx = curr;

            let cat = word::category(c);
            if cat == WCat::Extend || cat == WCat::Format {
                if state == WState::Start {
                    break;
                } else {
                    continue;
                }
            }

            state = match state {
                WState::Start if c == '\r' => {
                    if idx + 1 != s.len() && s.char_at(idx + 1) == '\n' {
                        idx += 1;
                    }
                    break;
                },
                WState::Start => match cat {
                    WCat::ALetter           => WState::ALetter,
                    WCat::HebrewLetter      => WState::HLetter,
                    WCat::Numeric           => WState::Numeric,
                    WCat::Katakana          => WState::Katakana,
                    WCat::ExtendNumLet      => WState::ExtendNumLet,
                    WCat::RegionalIndicator => WState::RegionalIndicator,
                    _ => break,
                },
                WState::ALetter => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidLetter | WCat::MidNumLet |
                        WCat::SingleQuote => WState::ALetterExt,
                    WCat::Numeric => WState::Numeric,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::ALetterExt => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::HLetter => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidLetter | WCat::MidNumLet => WState::HLetterExt(0),
                    WCat::SingleQuote => WState::HLetterExt(1),
                    WCat::DoubleQuote => WState::HLetterExt(2),
                    WCat::Numeric => WState::Numeric,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::HLetterExt(n) => match cat {
                    WCat::ALetter if n < 2 => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        take_prev = n == 1;
                        break;
                    }
                },
                WState::Numeric => match cat {
                    WCat::Numeric => WState::Numeric,
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidNum | WCat::MidNumLet |
                        WCat::SingleQuote => WState::NumericExt,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::NumericExt => match cat {
                    WCat::Numeric => WState::Numeric,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::Katakana => match cat {
                    WCat::Katakana => WState::Katakana,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::ExtendNumLet => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::Numeric => WState::Numeric,
                    WCat::Katakana => WState::Katakana,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::RegionalIndicator => match cat {
                    WCat::RegionalIndicator => WState::RegionalIndicator,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::SingleQuote => unreachable!(),
            }
        }

        if take_cur {
            idx += s.char_at(idx).len_utf8();
        } else if !take_prev {
            idx = prev_idx;
        }

        &s[..idx]
    }

    fn word_at_reverse(&self, idx: usize) -> &str {
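        // Backward scan: return the UAX#29 word ending at byte offset `idx`.
        // Mirrors `word_at` over `char_indices().rev()`; same external
        // `WState`/`WCat`/`word::category` assumptions, plus `char_at_reverse`.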
        let s = &self[..idx];

        let mut take_prev = true;
        let mut take_cur = true;
        let mut state = WState::Start;
        let mut prev_prev_idx = s.len();
        let mut prev_idx = s.len();
        let mut idx = s.len();

        for (curr, c) in s.char_indices().rev() {
            let cat = word::category(c);
            if cat == WCat::Extend || cat == WCat::Format {
                continue;
            }

            prev_prev_idx = prev_idx;
            prev_idx = idx;
            idx = curr;

            state = match state {
                WState::Start if c == '\n' => {
                    if idx > 0 && s.char_at_reverse(idx) == '\r' {
                        idx -= 1;
                    }
                    break;
                },
                WState::Start => match cat {
                    WCat::ALetter           => WState::ALetter,
                    WCat::HebrewLetter      => WState::HLetter,
                    WCat::SingleQuote       => WState::SingleQuote,
                    WCat::Numeric           => WState::Numeric,
                    WCat::Katakana          => WState::Katakana,
                    WCat::ExtendNumLet      => WState::ExtendNumLet,
                    WCat::RegionalIndicator => WState::RegionalIndicator,
                    _ => break,
                },
                WState::ALetter => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidLetter | WCat::MidNumLet |
                        WCat::SingleQuote => WState::ALetterExt,
                    WCat::Numeric => WState::Numeric,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::ALetterExt => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::HLetter => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidLetter | WCat::MidNumLet => WState::HLetterExt(0),
                    WCat::DoubleQuote => WState::HLetterExt(2),
                    WCat::Numeric => WState::Numeric,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::HLetterExt(n) => match cat {
                    WCat::ALetter if n < 2 => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::SingleQuote => match cat {
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::Numeric => match cat {
                    WCat::Numeric => WState::Numeric,
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidNum | WCat::MidNumLet |
                        WCat::SingleQuote => WState::NumericExt,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::NumericExt => match cat {
                    WCat::Numeric => WState::Numeric,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::Katakana => match cat {
                    WCat::Katakana => WState::Katakana,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::ExtendNumLet => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::Numeric => WState::Numeric,
                    WCat::Katakana => WState::Katakana,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::RegionalIndicator => match cat {
                    WCat::RegionalIndicator => WState::RegionalIndicator,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
            }
        }

        if idx == s.len() {
            idx = 0;
        } else if !take_prev {
            idx = prev_prev_idx;
        } else if !take_cur {
            idx = prev_idx;
        }

        &s[idx..]
    }
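
The snippet above assumes a few definitions that are not shown: the word-break category and scanner-state enums, plus a word::category lookup (and the methods presumably sit on an extension trait for str). A rough, hypothetical skeleton of those assumptions, with a toy lookup standing in for the real Unicode WordBreakProperty tables:

    // Hypothetical supporting definitions for the snippet above; they are not
    // part of the original comment.
    #[derive(Clone, Copy, PartialEq)]
    enum WCat {
        ALetter, HebrewLetter, Numeric, Katakana, ExtendNumLet,
        RegionalIndicator, MidLetter, MidNum, MidNumLet,
        SingleQuote, DoubleQuote, Extend, Format, Other,
    }

    #[derive(Clone, Copy, PartialEq)]
    enum WState {
        Start, ALetter, ALetterExt, HLetter, HLetterExt(u8),
        SingleQuote, Numeric, NumericExt, Katakana, ExtendNumLet,
        RegionalIndicator,
    }

    mod word {
        use super::WCat;

        // Toy stand-in: a real implementation would look the character up in
        // tables generated from the Unicode WordBreakProperty data file.
        pub fn category(c: char) -> WCat {
            match c {
                '0'..='9' => WCat::Numeric,
                '\'' => WCat::SingleQuote,
                '"' => WCat::DoubleQuote,
                c if c.is_alphabetic() => WCat::ALetter,
                _ => WCat::Other,
            }
        }
    }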

@kwantam
Contributor

kwantam commented Jan 12, 2015

It's still not clear to me whether we want "words" to mean UAX#29 segmentation or "normal" whitespace splitting. Most people expect to break on whitespace, based on their experience in other languages. That's not a reason to do or not do it; I'm just pointing out that some thought would be good here.

I'm convinced that it's worthwhile to have a UAX#29 implementation available in the library, since there is good reason to have strong Unicode support. I'm just unclear on how the naming should be handled.

@ColonelJ
Contributor

I think words should stay as it is. In a language with overloading or optional arguments, having an extra argument to select the alternate strategy would be nice, but we do not have that luxury.

This seems like a nice function though. I suppose it's quite specific so it could have a longer name... I can give some suggestions: words_by_boundary (my preference), unicode_words, words_unicode, words_proper, word_extract, sel_words, extract_words, word_break, delimited_words, bounded_words, word_divide, word_divisions, word_intervals, word_segments, just segments or word_segs, word_sep, unicode_word_split, uax29_words, words_precise...

And if you wanted to be ultra precise/unambiguous/mean you could remove words and force people to use words_by_space or words_by_boundary!

@mahkoh
Contributor

mahkoh commented Jan 12, 2015

which might not be what one expects

One might want to read the documentation then.

That's not a reason to do or not do it; I'm just pointing out that some thought would be good here.

It's clear that no thought went into the current implementation. A function that uses a degenerate algorithm which barely finds words in ASCII does not belong in a Unicode library. I'm not sure what purpose this algorithm serves at all. Chrome certainly doesn't use it to find word boundaries, and neither does vim. If someone actually needs this behavior then they can get it with the one-liner in the first post.

@kwantam
Contributor

kwantam commented Jan 12, 2015

This is a fair observation.

In that case, probably we should

  • use your UAX#29 implementation for words()
  • include something like the one-liner @huonw posted above in the words() documentation (along with some examples and a pointer to the UAX#29 document, of course)

This works for me if it works for you.

From a practical standpoint: are you interested in putting together a pull request to integrate the above into libunicode? If not, I can probably find some time to do it over the weekend.

@mahkoh
Contributor

mahkoh commented Jan 13, 2015

I don't know Python so I won't do it. Besides, the current Python implementation seems bad because it doesn't use the normative file for the grapheme classes. Also, the names of the functions should reflect what they are actually doing.

@brson brson added this to the 1.0 beta milestone Jan 15, 2015
@brson
Contributor

brson commented Jan 15, 2015

We don't need to do exactly what's described here; we just need to decide something.

@pnkfelix pnkfelix self-assigned this Feb 12, 2015
@huonw huonw assigned huonw and unassigned pnkfelix Feb 12, 2015
@aturon
Member

aturon commented Feb 13, 2015

FWIW, it seems totally reasonable to mark this #[unstable] for now and stabilize a particular semantics by demand.

@huonw
Member Author

huonw commented Feb 13, 2015

I can find exactly one and a half other languages that implement similar functionality (rather than just using plain split). Haskell has words, and C/C++ have *scanf and various iostream facilities (>>, istream_iterator) (these aren't as unambiguously for word breaking, hence the half). Both just use whitespace rather than the true algorithm.

PR for unstabilisation: #22253.

huonw added a commit to huonw/rust that referenced this issue Feb 13, 2015
It is not totally clear if we should just use whitespace, or if the full
unicode word-breaking algorithm is more correct. If there is demand we
can reconsider this decision (and consider the precise algorithm to use
in detail).

cc rust-lang#15628.
@bluss
Member

bluss commented Feb 13, 2015

The two algorithms are different enough that they should use different method names IMO. Whitespace split is popular and useful (as a poor man's lex for command lines, for example?).

steveklabnik added a commit to steveklabnik/rust that referenced this issue Feb 13, 2015
It is not totally clear if we should just use whitespace, or if the full
unicode word-breaking algorithm is more correct. If there is demand we
can reconsider this decision (and consider the precise algorithm to use
in detail).

cc rust-lang#15628.
steveklabnik added a commit to steveklabnik/rust that referenced this issue Feb 14, 2015
It is not totally clear if we should just use whitespace, or if the full
unicode word-breaking algorithm is more correct. If there is demand we
can reconsider this decision (and consider the precise algorithm to use
in detail).

cc rust-lang#15628.
Manishearth added a commit to Manishearth/rust that referenced this issue Apr 7, 2015
@mahkoh points out in rust-lang#15628 that unicode.py does not use
normative data for Grapheme classes. This PR fixes that issue and
does some minor cleanup of the unicode.py script.

In addition, GC_RegionalIndicator is renamed GC_Regional_Indicator
in order to stay in line with the Unicode class name definitions.
I have updated refs in u_str.rs, and verified that there are no
refs elsewhere in the codebase. However, in principle someone
using the unicode tables for their own purposes might see breakage
from this.
@SimonSapin
Contributor

RFC’d rust-lang/rfcs#1054

@kwantam
Contributor

kwantam commented Apr 10, 2015

I've become convinced that, while it might be useful to offer UAX#29 breaking as part of libunicode, it's definitely not what most people will want most of the time, so it should probably not be the default words() implementation.

I'm almost ready to submit a PR to add a UAX#29 word-breaking implementation to libunicode, but a short function name that performs the standard "split on whitespace" is probably something people expect from the stdlib. Perhaps renaming the current function per @SimonSapin's suggestion is the right way to go here.

By the way, @mahkoh, I think your reverse iterator above does not correctly handle Format and Extend characters. For example, I think your implementation might group Format or Extend characters with the wrong word, and I'm pretty sure it doesn't do the right thing in the pathological case, namely, when there are Format or Extend characters at the beginning of a string.

@SimonSapin
Contributor

@kwantam that’s great! However I’d encourage you to publish this code on crates.io rather than try to get it into libunicode. That way it can evolve independently of the compiler, and the API does not need to make as strong compatibility promises as std does: when there’s a breaking change they don’t want to update for yet, users can keep using old versions of a crates.io library and still update their compiler.

@kwantam
Contributor

kwantam commented Apr 10, 2015

That's a not unreasonable opinion, but I think it's probably worth having the discussion in the PR rather than here. If it's ultimately rejected in favor of moving to a separate library that's fine.

A larger question is whether more of the functionality currently in libunicode should be moved elsewhere, e.g., into a standalone crate. For example, it seems slightly arbitrary to have grapheme clustering, normalizations, etc. in the stdlib (as is currently the case) while keeping other functionality elsewhere.

@SimonSapin
Contributor

A larger question is whether more of the functionality currently in libunicode should be moved elsewhere, e.g., into a standalone crate. For example, it seems slightly arbitrary to have grapheme clustering, normalizations, etc. in the stdlib (as is currently the case) while keeping other functionality elsewhere.

Yes, definitely. I kinda volunteered to do this a while back, but haven’t gotten to it yet.

@ftxqxd
Contributor

ftxqxd commented Apr 11, 2015

I worry that having a split_whitespace method in the stdlib but not having a proper words method would encourage people too much to use the split_whitespace method, whose algorithm is nearly useless and is not normally what you want, instead of the more correct words method which would be buried somewhere in crates.io. The documentation for split_whitespace could link to a unicode crate on crates.io, of course, but a lot of people would probably not bother adding a new Cargo dependency and just opt for the ‘good enough’ method of using split_whitespace instead.

Unicode is hard and complicated, so I fear many people will just treat strings like ASCII if a lot of Unicode functionality is moved out of the stdlib. The best way to encourage people to handle non-ASCII data correctly is, in my opinion, to make it very easy to do so, and that can be accomplished effectively by keeping all Unicode functionality in-tree.

@SimonSapin
Contributor

“More correct” really depends on what you’re doing with it. While I agree with the general sentiment that good Unicode support is good, just because it says “Unicode” in the name doesn’t mean that UAX#29 is always the right thing.

Let’s pretend for a minute that #[unstable] items don’t exist. (Which they effectively don’t for beta/stable users.) We’re discussing adding new functionality to std, with at least two alternatives for the behavior. To guide this discussion: what are the use cases for this functionality?

The Python equivalent is str.split without arguments. I’ve used this mostly for parsing files containing space-separated data.

The only code I ever wrote that was actually manipulating words in prose text was line breaking in a text layout engine, but 1. that’s UAX#14, not #29, 2. it’s specialized enough that I want lots of control: I don’t expect to just use a single standard library method.
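
A minimal sketch of that space-separated-data use case in Rust, using the whitespace behaviour under its eventual stable name, split_whitespace():

    // Parsing whitespace-separated numeric fields, the use case described above.
    fn main() {
        let data = "1.0 2.5   3.75\n4.0\t5.5 6.25";

        let rows: Vec<Vec<f64>> = data
            .lines()
            .map(|line| {
                line.split_whitespace()
                    .map(|field| field.parse::<f64>().expect("numeric field"))
                    .collect()
            })
            .collect();

        assert_eq!(rows, vec![vec![1.0, 2.5, 3.75], vec![4.0, 5.5, 6.25]]);
    }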

@blankname

Note: I'm out of my depth here and late for bed, so hopefully I'm not just adding noise.

Personally I'd love to have a words iterator that works along the same lines as the "Counting the Words in a UTF-8 String" example in this ICU guide: http://userguide.icu-project.org/strings/utext

So: use UAX#29 word breaking, but only return words and numbers, not whitespace or punctuation. I don't think that this behaviour would be surprising.

The only downsides I see are complexity of implementation and the fact that it would be less speedy than just splitting by whitespace.

As far as complexity goes, I think this kind of complexity argues for implementing it in the standard library, where it will get a well-scrutinized, high-quality implementation that is available to anyone using the standard library.

As for performance, it would be easy for a programmer to opt out. If they know what they're doing and what they're doing doesn't need the more complex word splitting algorithm then splitting by whitespace is only a one-liner away.

It appears to me to be a suitable default.
It would cause programmers to write programs that are more "unicode aware" without even having to think about it.

This seems like another place that the standard library can help programmers avoid tripping up.

It would be more complex for the implementor, not the user. For the user it would just work, even in cases the programmer never thought about.

@SimonSapin
Contributor

in the standard library, where it will get a well scrutinized and high quality implementation

I disagree with the claim that putting something in std will automatically make it high quality. The people with domain-specific knowledge, those who you would want to do reviews, are not necessarily the same people as the std maintainers. On the contrary, being forced to provide the same stability guarantees as the rest of std would make it harder to fix issues that might be found later: you have to get everything right before you mark it #[stable].

@kwantam
Contributor

kwantam commented Apr 11, 2015

@SimonSapin I agree with your concerns about stability and domain knowledge, but I fear you might be overselling them slightly.

My understanding is that marking it #[stable] doesn't mean bugs can't be fixed, though obviously it does mean we need a sensible decision about what to call things so that the interfaces don't change. One other mitigating factor with regard to stability is that UAX#29 comes with an extensive set of test strings that exercise all letter-class combinations (and I have a script that mechanically translates them to Rust so we can test against them automatically). The same is true for the graphemes iterator, and libcollectionstest already includes these tests.

As far as domain knowledge goes, in general I agree but in this case it's mostly just a matter of implementing the slightly gross rules (the backward iterator is the gross one) in a more-or-less linear time automaton with reasonable constant factors and acceptable code quality. The deep domain knowledge is required for writing UAX#29, surely, but not so much for implementing it.

@blankname I think what you're after is effectively mystring.split_by_uax29().filter(|s| s.chars().any(|c| c.is_alphanumeric())). Note that the UnicodeStr trait has an is_alphanumeric() function as well, but that one won't work because it requires that every character is alphanumeric.
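
Translating the hypothetical split_by_uax29() above onto the unicode-segmentation crate's split_word_bounds(), the filter looks roughly like this; a sketch only, and note that the crate's own unicode_words() applies essentially the same alphanumeric filter internally:

    // Sketch: keep only UAX#29 segments that contain an alphanumeric character.
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "The quick (\"brown\") fox can't jump 32.3 feet, right?";

        let filtered: Vec<&str> = s
            .split_word_bounds()
            .filter(|seg| seg.chars().any(|c| c.is_alphanumeric()))
            .collect();

        assert_eq!(
            filtered,
            ["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
        );
    }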

kwantam added a commit to kwantam/rust that referenced this issue Apr 12, 2015
This patch does the following:

1. Adds three new structs in libunicode/str.rs:

   a. UnicodeWords: a filter on the UWordBounds iterator that yields only
      the "words" of a string as defined in Section 4 of Unicode Standard
      Annex rust-lang#29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries

   b. UWordBounds: an iterator that segments a string on its word
      boundaries as defined in UAX#29. Note that this *only* segments
      the string, and does *not* drop whitespace and other non-word
      pieces of the text (that's what UnicodeWords does).

      Note that UWordBounds has both a forward and backward iterator
      that have total running time (that is, to segment the entire
      string) linear in the size of the string. It should be noted that
      with pathological inputs the reverse iterator could be about 2x less
      efficient than the forward iterator, but on reasonable inputs
      their costs are similar.

   c. UWordBoundIndices: the above iterator, but returning tuples of
      (offset, &str).

2. Adds three new functions in the `UnicodeStr` trait:

   a. words_unicode(): returns a UnicodeWords iterator.

   b. split_words_uax29(): returns a UWordBounds iterator.

   c. split_words_uax29_indices(): returns a UWordBoundIndices iterator.

3. Updates the `src/etc/unicode.py` script to generate tables necessary
   for running the UWordBounds iterators.

4. Adds a new script, `src/etc/unicode_gen_breaktests.py`,
   which processes the grapheme and word break tests published
   by the Unicode consortium into a format for inclusion in
   libcollectionstest.

5. Adds new impls in libcollections's `str` corresponding to the
   `UnicodeStr` functions of (2).

   Note that this new functionality is gated with `feature(unicode)`.

6. Adds tests in libcollectionstest to exercise this new functionality.

   In addition, updates the test data for the graphemes test to
   correspond to the output from the script of (4). (Note that at the
   moment this change is primarily cosmetic.)

This patch does not settle the question raised by @huonw in rust-lang#15628;
rather, it introduces a new function alongside `words()` that follows
UAX#29.

In addition, it does not address the concerns that @SimonSapin raises in
rust-lang/rfcs#1054 since it leaves `words()`
alone.
kwantam added a commit to kwantam/rust that referenced this issue Apr 18, 2015
For now, words() is left in (but deprecated), and Words is a type alias for
struct SplitWhitespace.

Also cleaned up references to s.words() throughout codebase.

Closes rust-lang#15628
alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 21, 2015
For now, words() is left in (but deprecated), and Words is a type alias for
struct SplitWhitespace.

Also cleaned up references to str.words() throughout codebase.

Closes rust-lang#15628
bors added a commit to rust-lang-ci/rust that referenced this issue Sep 18, 2023