&str.words should be using the unicode word boundary algorithm #15628

Closed
huonw opened this issue Jul 12, 2014 · 28 comments · Fixed by #24563
Assignees
Labels
P-medium Medium priority

Comments

@huonw
Member

huonw commented Jul 12, 2014

Currently it's just splitting on whitespace (i.e. it is equivalent to splitting with regex!(r"\s+")). Preferably it should use the Unicode word boundary algorithm: http://www.unicode.org/reports/tr29/#Word_Boundaries

(The old behaviour is easy to replicate with the above regex, or with s.split(|c: char| c.is_whitespace()).filter(|s| !s.is_empty()).)
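
For reference, a self-contained sketch of that replacement on modern Rust, where split_whitespace() is the name this whitespace-only behaviour eventually stabilised under (see the PRs at the end of this thread):

    // A minimal sketch of the whitespace-only behaviour and the replacement
    // one-liner from the comment above.
    fn main() {
        let s = "Hello  there,\tworld";

        let manual: Vec<&str> = s
            .split(|c: char| c.is_whitespace())
            .filter(|s| !s.is_empty())
            .collect();

        assert_eq!(manual, ["Hello", "there,", "world"]);
        // `split_whitespace()` (stable since Rust 1.1) does the same thing.
        assert_eq!(s.split_whitespace().collect::<Vec<_>>(), manual);
    }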

@kwantam
Contributor

kwantam commented Jul 12, 2014

Note that this gives some possibly unintuitive results.

("Hello there") gives ( " Hello there " ), which might not be what one expects.

@kwantam
Contributor

kwantam commented Jul 14, 2014

It occurs to me that we could take a bool indicating whether the user wants the full UAX#29 algorithm or the simple one. A lot of other languages say words() when they mean "split on whitespace", and it's a trivial thing to continue offering the present behavior even after implementing the UAX#29 algorithm.

@lucy
Contributor

lucy commented Jul 15, 2014

I think taking a bool for this is bad design. It should be a separate method.
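
For illustration, a sketch of the two API shapes being debated; the trait and method names below are purely hypothetical, not an actual std API:

    // Hypothetical signatures only, to contrast the two designs.
    trait Words {
        // Option A: one method with a flag selecting the algorithm.
        fn words(&self, unicode_boundaries: bool) -> Vec<&str>;

        // Option B: two separate methods, one per algorithm.
        fn words_by_whitespace(&self) -> Vec<&str>;
        fn words_by_boundary(&self) -> Vec<&str>;
    }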

@mahkoh
Contributor

mahkoh commented Jan 11, 2015

This has to be decided before 1.0.

@mahkoh
Contributor

mahkoh commented Jan 12, 2015

Here is the implementation I've just written for a program I'm working on:

    fn word_at(&self, idx: usize) -> &str {
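        // Forward scan: return the UAX#29 word beginning at byte offset `idx`.
        // Relies on external `WState`/`WCat` enums and a `word::category` lookup
        // (not shown here), and on the pre-1.0 unstable `str::char_at` API.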
        let s = &self[idx..];
        let mut take_prev = true;
        let mut take_cur = true;
        let mut state = WState::Start;
        let mut prev_idx = 0;
        let mut idx = 0;

        for (curr, c) in s.char_indices() {
            prev_idx = idx;
            idx = curr;

            let cat = word::category(c);
            if cat == WCat::Extend || cat == WCat::Format {
                if state == WState::Start {
                    break;
                } else {
                    continue;
                }
            }

            state = match state {
                WState::Start if c == '\r' => {
                    if idx + 1 != s.len() && s.char_at(idx + 1) == '\n' {
                        idx += 1;
                    }
                    break;
                },
                WState::Start => match cat {
                    WCat::ALetter           => WState::ALetter,
                    WCat::HebrewLetter      => WState::HLetter,
                    WCat::Numeric           => WState::Numeric,
                    WCat::Katakana          => WState::Katakana,
                    WCat::ExtendNumLet      => WState::ExtendNumLet,
                    WCat::RegionalIndicator => WState::RegionalIndicator,
                    _ => break,
                },
                WState::ALetter => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidLetter | WCat::MidNumLet |
                        WCat::SingleQuote => WState::ALetterExt,
                    WCat::Numeric => WState::Numeric,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::ALetterExt => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::HLetter => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidLetter | WCat::MidNumLet => WState::HLetterExt(0),
                    WCat::SingleQuote => WState::HLetterExt(1),
                    WCat::DoubleQuote => WState::HLetterExt(2),
                    WCat::Numeric => WState::Numeric,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::HLetterExt(n) => match cat {
                    WCat::ALetter if n < 2 => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        take_prev = n == 1;
                        break;
                    }
                },
                WState::Numeric => match cat {
                    WCat::Numeric => WState::Numeric,
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidNum | WCat::MidNumLet |
                        WCat::SingleQuote => WState::NumericExt,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::NumericExt => match cat {
                    WCat::Numeric => WState::Numeric,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::Katakana => match cat {
                    WCat::Katakana => WState::Katakana,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::ExtendNumLet => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::Numeric => WState::Numeric,
                    WCat::Katakana => WState::Katakana,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::RegionalIndicator => match cat {
                    WCat::RegionalIndicator => WState::RegionalIndicator,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::SingleQuote => unreachable!(),
            }
        }

        if take_cur {
            idx += s.char_at(idx).len_utf8();
        } else if !take_prev {
            idx = prev_idx;
        }

        &s[..idx]
    }

    fn word_at_reverse(&self, idx: usize) -> &str {
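        // Backward scan: return the UAX#29 word ending at byte offset `idx`.
        // Mirrors `word_at` over `char_indices().rev()`; same external
        // `WState`/`WCat`/`word::category` assumptions, plus `char_at_reverse`.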
        let s = &self[..idx];

        let mut take_prev = true;
        let mut take_cur = true;
        let mut state = WState::Start;
        let mut prev_prev_idx = s.len();
        let mut prev_idx = s.len();
        let mut idx = s.len();

        for (curr, c) in s.char_indices().rev() {
            let cat = word::category(c);
            if cat == WCat::Extend || cat == WCat::Format {
                continue;
            }

            prev_prev_idx = prev_idx;
            prev_idx = idx;
            idx = curr;

            state = match state {
                WState::Start if c == '\n' => {
                    if idx > 0 && s.char_at_reverse(idx) == '\r' {
                        idx -= 1;
                    }
                    break;
                },
                WState::Start => match cat {
                    WCat::ALetter           => WState::ALetter,
                    WCat::HebrewLetter      => WState::HLetter,
                    WCat::SingleQuote       => WState::SingleQuote,
                    WCat::Numeric           => WState::Numeric,
                    WCat::Katakana          => WState::Katakana,
                    WCat::ExtendNumLet      => WState::ExtendNumLet,
                    WCat::RegionalIndicator => WState::RegionalIndicator,
                    _ => break,
                },
                WState::ALetter => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidLetter | WCat::MidNumLet |
                        WCat::SingleQuote => WState::ALetterExt,
                    WCat::Numeric => WState::Numeric,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::ALetterExt => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::HLetter => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidLetter | WCat::MidNumLet => WState::HLetterExt(0),
                    WCat::DoubleQuote => WState::HLetterExt(2),
                    WCat::Numeric => WState::Numeric,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::HLetterExt(n) => match cat {
                    WCat::ALetter if n < 2 => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::SingleQuote => match cat {
                    WCat::HebrewLetter => WState::HLetter,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::Numeric => match cat {
                    WCat::Numeric => WState::Numeric,
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::MidNum | WCat::MidNumLet |
                        WCat::SingleQuote => WState::NumericExt,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::NumericExt => match cat {
                    WCat::Numeric => WState::Numeric,
                    _ => {
                        take_cur = false;
                        take_prev = false;
                        break;
                    }
                },
                WState::Katakana => match cat {
                    WCat::Katakana => WState::Katakana,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::ExtendNumLet => match cat {
                    WCat::ALetter => WState::ALetter,
                    WCat::HebrewLetter => WState::HLetter,
                    WCat::Numeric => WState::Numeric,
                    WCat::Katakana => WState::Katakana,
                    WCat::ExtendNumLet => WState::ExtendNumLet,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
                WState::RegionalIndicator => match cat {
                    WCat::RegionalIndicator => WState::RegionalIndicator,
                    _ => {
                        take_cur = false;
                        break;
                    }
                },
            }
        }

        if idx == s.len() {
            idx = 0;
        } else if !take_prev {
            idx = prev_prev_idx;
        } else if !take_cur {
            idx = prev_idx;
        }

        &s[idx..]
    }
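
The snippet above assumes a few definitions that are not shown: the word-break category and scanner-state enums, plus a word::category lookup (and the methods presumably sit on an extension trait for str). A rough, hypothetical skeleton of those assumptions, with a toy lookup standing in for the real Unicode WordBreakProperty tables:

    // Hypothetical supporting definitions for the snippet above; they are not
    // part of the original comment.
    #[derive(Clone, Copy, PartialEq)]
    enum WCat {
        ALetter, HebrewLetter, Numeric, Katakana, ExtendNumLet,
        RegionalIndicator, MidLetter, MidNum, MidNumLet,
        SingleQuote, DoubleQuote, Extend, Format, Other,
    }

    #[derive(Clone, Copy, PartialEq)]
    enum WState {
        Start, ALetter, ALetterExt, HLetter, HLetterExt(u8),
        SingleQuote, Numeric, NumericExt, Katakana, ExtendNumLet,
        RegionalIndicator,
    }

    mod word {
        use super::WCat;

        // Toy stand-in: a real implementation would look the character up in
        // tables generated from the Unicode WordBreakProperty data file.
        pub fn category(c: char) -> WCat {
            match c {
                '0'..='9' => WCat::Numeric,
                '\'' => WCat::SingleQuote,
                '"' => WCat::DoubleQuote,
                c if c.is_alphabetic() => WCat::ALetter,
                _ => WCat::Other,
            }
        }
    }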

@kwantam
Contributor

kwantam commented Jan 12, 2015

It's still not clear to me whether we want "words" to mean UAX#29 segmentation or "normal" whitespace splitting. Most people expect to break on whitespace, based on their experience in other languages. That's not a reason to do or not do it; I'm just pointing out that some thought would be good here.

I'm convinced that it's worthwhile to have a UAX#29 implementation available in the library, since there is good reason to have strong Unicode support. I'm just unclear on how the naming should be handled.

@ColonelJ
Contributor

I think words should stay as it is. In a language with overloading or optional arguments, having an extra argument to select the alternate strategy would be nice, but we do not have that luxury.

This seems like a nice function though. I suppose it's quite specific so it could have a longer name... I can give some suggestions: words_by_boundary (my preference), unicode_words, words_unicode, words_proper, word_extract, sel_words, extract_words, word_break, delimited_words, bounded_words, word_divide, word_divisions, word_intervals, word_segments, just segments or word_segs, word_sep, unicode_word_split, uax29_words, words_precise...

And if you wanted to be ultra precise/unambiguous/mean you could remove words and force people to use words_by_space or words_by_boundary!

@mahkoh
Contributor

mahkoh commented Jan 12, 2015

which might not be what one expects

One might want to read the documentation then.

That's not a reason to do or not do it; I'm just pointing out that some thought would be good here.

It's clear that no thought went into the current implementation. A function that uses a degenerate algorithm which barely finds words in ASCII does not belong in a Unicode library. I'm not sure what purpose this algorithm serves at all. Chrome certainly doesn't use it to find word boundaries, and neither does vim. If someone actually needs this behavior then they can get it with the one-liner in the first post.

@kwantam
Contributor

kwantam commented Jan 12, 2015

This is a fair observation.

In that case, probably we should

  • use your UAX#29 implementation for words()
  • include something like the one-liner @huonw posted above in the words() documentation (along with some examples and a pointer to the UAX#29 document, of course)

This works for me if it works for you.

From a practical standpoint: are you interested in putting together a pull request to integrate the above into libunicode? If not, I can probably find some time to do it over the weekend.

@mahkoh
Contributor

mahkoh commented Jan 13, 2015

I don't know Python so I won't do it. Besides, the current Python implementation seems bad because it doesn't use the normative file for the grapheme classes. Also, the names of the functions should reflect what they are actually doing.

@brson brson added this to the 1.0 beta milestone Jan 15, 2015
@brson
Contributor

brson commented Jan 15, 2015

We don't need to do exactly what's described here; we just need to decide something.

@pnkfelix pnkfelix self-assigned this Feb 12, 2015
@huonw huonw assigned huonw and unassigned pnkfelix Feb 12, 2015
@aturon
Member

aturon commented Feb 13, 2015

FWIW, it seems totally reasonable to mark this #[unstable] for now and stabilize a particular semantics by demand.

@huonw
Member Author

huonw commented Feb 13, 2015

I can find exactly one and a half other languages that implement similar functionality (rather than just using plain split). Haskell has words, and C/C++ have *scanf and various iostream facilities (>>, istream_iterator) (these aren't as unambiguously for word breaking, hence the half). Both just use whitespace rather than the true algorithm.

PR for unstabilisation: #22253.

huonw added a commit to huonw/rust that referenced this issue Feb 13, 2015
It is not totally clear if we should just use whitespace, or if the full
unicode word-breaking algorithm is more correct. If there is demand we
can reconsider this decision (and consider the precise algorithm to use
in detail).

cc rust-lang#15628.
@bluss
Member

bluss commented Feb 13, 2015

The two algorithms are different enough that they should use different method names IMO. Whitespace split is popular and useful (as a poor man's lex for command lines, for example?).

steveklabnik added a commit to steveklabnik/rust that referenced this issue Feb 13, 2015
It is not totally clear if we should just use whitespace, or if the full
unicode word-breaking algorithm is more correct. If there is demand we
can reconsider this decision (and consider the precise algorithm to use
in detail).

cc rust-lang#15628.
steveklabnik added a commit to steveklabnik/rust that referenced this issue Feb 14, 2015
It is not totally clear if we should just use whitespace, or if the full
unicode word-breaking algorithm is more correct. If there is demand we
can reconsider this decision (and consider the precise algorithm to use
in detail).

cc rust-lang#15628.
Manishearth added a commit to Manishearth/rust that referenced this issue Apr 7, 2015
@mahkoh points out in rust-lang#15628 that unicode.py does not use
normative data for Grapheme classes. This PR fixes that issue and
does some minor cleanup of the unicode.py script.

In addition, GC_RegionalIndicator is renamed GC_Regional_Indicator
in order to stay in line with the Unicode class name definitions.
I have updated refs in u_str.rs, and verified that there are no
refs elsewhere in the codebase. However, in principle someone
using the unicode tables for their own purposes might see breakage
from this.
@SimonSapin
Contributor

RFC’d rust-lang/rfcs#1054

@kwantam
Contributor

kwantam commented Apr 10, 2015

I've become convinced that, while it might be useful to offer UAX#29 breaking as part of libunicode, it's definitely not what most people will want most of the time, so it should probably not be the default words() implementation.

I'm almost ready to submit a PR to add a UAX#29 word-breaking implementation to libunicode, but a short function name that performs the standard "split on whitespace" is probably something people expect from the stdlib. Perhaps renaming the current function per @SimonSapin's suggestion is the right way to go here.

By the way, @mahkoh, I think your reverse iterator above does not correctly handle Format and Extend characters. For example, I think your implementation might group Format or Extend characters with the wrong word, and I'm pretty sure it doesn't do the right thing in the pathological case, namely, when there are Format or Extend characters at the beginning of a string.

@SimonSapin
Contributor

@kwantam that’s great! However I’d encourage you to publish this code on crates.io rather than try to get it into libunicode. That way it can evolve independently of the compiler, and the API does not need to make as strong compatibility promises as std does: when there’s a breaking change they don’t want to update for yet, users can keep using old versions of a crates.io library and still update their compiler.

@kwantam
Contributor

kwantam commented Apr 10, 2015

That's a not unreasonable opinion, but I think it's probably worth having the discussion in the PR rather than here. If it's ultimately rejected in favor of moving to a separate library that's fine.

A larger question is whether more of the functionality currently in libunicode should be moved elsewhere, e.g., into a standalone crate. For example, it seems slightly arbitrary to have grapheme clustering, normalizations, etc. in the stdlib (as is currently the case) while keeping other functionality elsewhere.

@SimonSapin
Contributor

A larger question is whether more of the functionality currently in libunicode should be moved elsewhere, e.g., into a standalone crate. For example, it seems slightly arbitrary to have grapheme clustering, normalizations, etc. in the stdlib (as is currently the case) while keeping other functionality elsewhere.

Yes, definitely. I kinda volunteered to do this a while back, but haven’t gotten to it yet.

@ftxqxd
Contributor

ftxqxd commented Apr 11, 2015

I worry that having a split_whitespace method in the stdlib but not having a proper words method would encourage people too much to use the split_whitespace method, whose algorithm is nearly useless and is not normally what you want, instead of the more correct words method which would be buried somewhere in crates.io. The documentation for split_whitespace could link to a unicode crate on crates.io, of course, but a lot of people would probably not bother adding a new Cargo dependency and just opt for the ‘good enough’ method of using split_whitespace instead.

Unicode is hard and complicated, so I fear many people will just treat strings like ASCII if a lot of Unicode functionality is moved out of the stdlib. The best way to encourage people to handle non-ASCII data correctly is, in my opinion, to make it very easy to do so, and that can be accomplished effectively by keeping all Unicode functionality in-tree.

@SimonSapin
Contributor

“More correct” really depends on what you’re doing with it. While I agree with the general sentiment that good Unicode support is good, just because it says “Unicode” in the name doesn’t mean that UAX#29 is always the right thing.

Let’s pretend for a minute that #[unstable] items don’t exist. (Which they effectively don’t for beta/stable users.) We’re discussing adding new functionality to std, with at least two alternatives for the behavior. To guide this discussion: what are the use cases for this functionality?

The Python equivalent is str.split without arguments. I’ve used this mostly for parsing files containing space-separated data.

The only code I ever wrote that was actually manipulating words in prose text was line breaking in a text layout engine, but 1. that’s UAX#14, not #29, 2. it’s specialized enough that I want lots of control: I don’t expect to just use a single standard library method.
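
A minimal sketch of that space-separated-data use case in Rust, using the whitespace behaviour under its eventual stable name, split_whitespace():

    // Parsing whitespace-separated numeric fields, the use case described above.
    fn main() {
        let data = "1.0 2.5   3.75\n4.0\t5.5 6.25";

        let rows: Vec<Vec<f64>> = data
            .lines()
            .map(|line| {
                line.split_whitespace()
                    .map(|field| field.parse::<f64>().expect("numeric field"))
                    .collect()
            })
            .collect();

        assert_eq!(rows, vec![vec![1.0, 2.5, 3.75], vec![4.0, 5.5, 6.25]]);
    }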

@blankname

Note: I'm out of my depth here and late for bed, so hopefully I'm not just adding noise.

Personally I'd love to have a words iterator that works along the same lines as the "Counting the Words in a UTF-8 String" example in this ICU guide: http://userguide.icu-project.org/strings/utext

So: use UAX#29 word breaking, but only return words and numbers, not whitespace or punctuation. I don't think that this behaviour would be surprising.

The only downsides I see are complexity of implementation and the fact that it would be less speedy than just splitting by whitespace.

As far as complexity goes, I think this kind of complexity argues for implementing it in the standard library, where it will get a well-scrutinized, high-quality implementation that is available to anyone using the standard library.

As for performance, it would be easy for a programmer to opt out. If they know what they're doing and what they're doing doesn't need the more complex word splitting algorithm then splitting by whitespace is only a one-liner away.

It appears to me to be a suitable default.
It would cause programmers to write programs that are more "unicode aware" without even having to think about it.

This seems like another place that the standard library can help programmers avoid tripping up.

It would be more complex for the implementor, not the user. For the user it would just work, even in cases the programmer never thought about.

@SimonSapin
Contributor

in the standard library, where it will get a well scrutinized and high quality implementation

I disagree with the claim that putting something in std will automatically make it high quality. The people with domain-specific knowledge, those who you would want to do reviews, are not necessarily the same people as the std maintainers. On the contrary, being forced to provide the same stability guarantees as the rest of std would make it harder to fix issues that might be found later: you have to get everything right before you mark it #[stable].

@kwantam
Contributor

kwantam commented Apr 11, 2015

@SimonSapin I agree with your concerns about stability and domain knowledge, but I fear you might be overselling them slightly.

My understanding is that marking it #[stable] doesn't mean bugs can't be fixed, though obviously it does mean we need a sensible decision about what to call things so that the interfaces don't change. One other mitigating factor with regard to stability is that UAX#29 comes with an extensive set of test strings that exercise all letter-class combinations (and I have a script that mechanically translates them to Rust so we can test against them automatically). The same is true for the graphemes iterator, and libcollectionstest already includes these tests.

As far as domain knowledge goes, in general I agree but in this case it's mostly just a matter of implementing the slightly gross rules (the backward iterator is the gross one) in a more-or-less linear time automaton with reasonable constant factors and acceptable code quality. The deep domain knowledge is required for writing UAX#29, surely, but not so much for implementing it.

@blankname I think what you're after is effectively mystring.split_by_uax29().filter(|s| s.chars().any(|c| c.is_alphanumeric())). Note that the UnicodeStr trait has an is_alphanumeric() function as well, but that one won't work because it requires that every character is alphanumeric.
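
Translating the hypothetical split_by_uax29() above onto the unicode-segmentation crate's split_word_bounds(), the filter looks roughly like this; a sketch only, and note that the crate's own unicode_words() applies essentially the same alphanumeric filter internally:

    // Sketch: keep only UAX#29 segments that contain an alphanumeric character.
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "The quick (\"brown\") fox can't jump 32.3 feet, right?";

        let filtered: Vec<&str> = s
            .split_word_bounds()
            .filter(|seg| seg.chars().any(|c| c.is_alphanumeric()))
            .collect();

        assert_eq!(
            filtered,
            ["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
        );
    }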

kwantam added a commit to kwantam/rust that referenced this issue Apr 12, 2015
This patch does the following:

1. Adds three new structs in libunicode/str.rs:

   a. UnicodeWords: a filter on the UWordBounds iterator that yields only
      the "words" of a string as defined in Section 4 of Unicode Standard
      Annex rust-lang#29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries

   b. UWordBounds: an iterator that segments a string on its word
      boundaries as defined in UAX#29. Note that this *only* segments
      the string, and does *not* drop whitespace and other non-word
      pieces of the text (that's what UnicodeWords does).

      Note that UWordBounds has both a forward and backward iterator
      that have total running time (that is, to segment the entire
      string) linear in the size of the string. It should be noted that
      with pathological inputs the reverse iterator could be about 2x less
      efficient than the forward iterator, but on reasonable inputs
      their costs are similar.

   c. UWordBoundIndices: the above iterator, but returning tuples of
      (offset, &str).

2. Adds three new functions in the `UnicodeStr` trait:

   a. words_unicode(): returns a UnicodeWords iterator.

   b. split_words_uax29(): returns a UWordBounds iterator.

   c. split_words_uax29_indices(): returns a UWordBoundIndices iterator.

3. Updates the `src/etc/unicode.py` script to generate tables necessary
   for running the UWordBounds iterators.

4. Adds a new script, `src/etc/unicode_gen_breaktests.py`,
   which processes the grapheme and word break tests published
   by the Unicode consortium into a format for inclusion in
   libcollectionstest.

5. Adds new impls in libcollections's `str` corresponding to the
   `UnicodeStr` functions of (2).

   Note that this new functionality is gated with `feature(unicode)`.

6. Adds tests in libcollectionstest to exercise this new functionality.

   In addition, updates the test data for the graphemes test to
   correspond to the output from the script of (4). (Note that at the
   moment this change is primarily cosmetic.)

This patch does not settle the question raised by @huonw in rust-lang#15628;
rather, it introduces a new function alongside `words()` that follows
UAX#29.

In addition, it does not address the concerns that @SimonSapin raises in
rust-lang/rfcs#1054 since it leaves `words()`
alone.
kwantam added a commit to kwantam/rust that referenced this issue Apr 18, 2015
For now, words() is left in (but deprecated), and Words is a type alias for
struct SplitWhitespace.

Also cleaned up references to s.words() throughout codebase.

Closes rust-lang#15628
alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 21, 2015
For now, words() is left in (but deprecated), and Words is a type alias for
struct SplitWhitespace.

Also cleaned up references to str.words() throughout codebase.

Closes rust-lang#15628
bors added a commit to rust-lang-ci/rust that referenced this issue Sep 18, 2023