add unicode case folding for char/str #9084

Closed
thestinger opened this issue Sep 9, 2013 · 12 comments
Labels
A-Unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one.

Comments

@thestinger
Contributor

No description provided.

@kud1ing

kud1ing commented Feb 26, 2014

For reference:

"Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is primarily used for caseless comparison of text [...] As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user."

http://www.unicode.org/faq/casemap_charprop.html

@pzol
Contributor

pzol commented Feb 26, 2014

Char upper/lower is done in #12561

For str, which would be more appropriate: an iterator, or a method that returns the converted string?
An iterator would also allow case-insensitive comparisons, something like this:

struct LowerChars<'a> {
  chars: std::str::Chars<'a>
}

fn lower<'a>(s: &'a str) -> LowerChars<'a> {
  LowerChars { chars: s.chars() }
}

impl<'a> Iterator<char> for LowerChars<'a> {
  fn next(&mut self) -> Option<char> {
    self.chars.next().map(|c| c.to_lowercase())
  }
}

#[test]
fn test_to_uppercase(){
  let sl = "foobär";
  let su = "FOOBÄR";

  let mut z = lower(sl).zip(lower(su));
  assert!(z.all(|(x, y)| x == y ));

  let greek = "στιγμας".chars().map(|c| c.to_uppercase()).collect::<~str>();
  // fail!(greek);
  assert_eq!(greek.as_slice(), "ΣΤΙΓΜΑΣ");
}
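The snippet above predates Rust 1.0 (`Iterator<char>`, `~str`). As a rough sketch of the same idea in current Rust, assuming only the standard library: `char::to_lowercase` now returns an iterator, because one code point may lowercase to several, so the adapter uses `flat_map`. The names `lower_chars` and `eq_lowercased` are hypothetical and exist only for this illustration.

```rust
/// Lazily lowercases `s`, one `char` at a time.
/// `char::to_lowercase` yields an iterator because a single code point
/// may lowercase to several, hence `flat_map` rather than `map`.
fn lower_chars(s: &str) -> impl Iterator<Item = char> + '_ {
    s.chars().flat_map(char::to_lowercase)
}

/// Compares two strings by their lowercased `char` streams, without
/// allocating an intermediate `String`. This is lowercasing, not
/// Unicode case folding.
fn eq_lowercased(a: &str, b: &str) -> bool {
    lower_chars(a).eq(lower_chars(b))
}
```

With this sketch, `eq_lowercased("foobär", "FOOBÄR")` holds, but `eq_lowercased("στιγμας", "ΣΤΙΓΜΑΣ")` does not: per-char lowercasing maps `Σ` to `σ` and leaves the final sigma `ς` untouched. That gap is one reason the issue asks for case folding rather than lowercasing.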

@Kimundi
Member

Kimundi commented Feb 26, 2014

@pzol: The normalization functions already return iterators, so I'd choose them for case folding too.
Collecting into an actual string is easy then.
You could also provide non-one-to-one case folding that way.

@pzol
Contributor

pzol commented Feb 26, 2014

@Kimundi: so I'd suggest

pub trait StrSlice<'a> {
...
    // Methods on the slice itself; 'a is the trait's lifetime, so it is not redeclared here.
    fn lower_chars(&self) -> LowerChars<'a>;
    fn upper_chars(&self) -> UpperChars<'a>;
}

@Kimundi
Member

Kimundi commented Feb 26, 2014

Yeah, that would fit in nicely

@pzol
Contributor

pzol commented Feb 26, 2014

Now, for char I only implemented common and simple case folding. That is, one char always maps to one char. For str, however, multi-codepoint mappings could be done, i.e. one code point can become two code points.

However, for case-insensitive comparison, as far as I know only ICU gets it right.
ICU's CaseInsensitiveString.equals gets the right answers where regular Java fails; for example:

  • "flour and water" & "FLOUR AND WATER"
  • "efficient" & "EFFICIENT"
  • "poſt" & "post"
  • "weiß" & "WEIẞ"
  • "tschüß" & "TSCHÜSS"
  • "ᾲ στο διάολο" & "Ὰͅ Στο Διάολο"
  • "ᾲ στο διάολο" & "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ"

In other words, for str a method like case_insensitive_compare would be appropriate.
The question is: should Equiv for str consider case folding?
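To make the pairs above concrete, here is a minimal sketch of a folding-based comparison. It is not ICU's algorithm and not a complete fold table: it hard-codes a handful of mappings taken from Unicode's CaseFolding.txt (ß and ẞ to "ss", ſ to "s", ς to σ, the fl and ffi ligatures) and falls back to `char::to_lowercase` for everything else, which is only an approximation. The names `fold_char` and `eq_folded` are hypothetical.

```rust
/// Folds one `char` using a tiny, hand-picked subset of CaseFolding.txt.
/// Anything not listed falls back to lowercasing (an approximation).
fn fold_char(c: char) -> Vec<char> {
    match c {
        '\u{00DF}' | '\u{1E9E}' => vec!['s', 's'], // ß, ẞ -> "ss"
        '\u{017F}' => vec!['s'],                   // ſ (long s) -> "s"
        '\u{03C2}' => vec!['\u{03C3}'],            // ς (final sigma) -> σ
        '\u{FB02}' => vec!['f', 'l'],              // fl ligature -> "fl"
        '\u{FB03}' => vec!['f', 'f', 'i'],         // ffi ligature -> "ffi"
        _ => c.to_lowercase().collect(),
    }
}

/// Caseless equality over the folded `char` streams of both inputs.
fn eq_folded(a: &str, b: &str) -> bool {
    a.chars().flat_map(fold_char).eq(b.chars().flat_map(fold_char))
}
```

Under these assumptions, `eq_folded("tschüß", "TSCHÜSS")`, `eq_folded("poſt", "post")` and `eq_folded("weiß", "WEIẞ")` all hold, whereas comparing `to_lowercase()` results for "tschüß" and "TSCHÜSS" does not. The Greek pairs with iota subscript additionally need normalization, which this sketch does not attempt.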

@Kimundi
Member

Kimundi commented Feb 26, 2014

The Equiv trait only exists to allow uniform comparison between ~str, &str, etc., so I don't think it should ignore case. That is, it exists to compare values of different types for identical content.

@pzol
Contributor

pzol commented Feb 26, 2014

What about this:

pub trait StrSlice<'a> {
...
    fn equals_ignore_case(&self, needle: &str) -> bool;
    fn starts_with_ignore_case(&self, needle: &str) -> bool;
    fn ends_with_ignore_case(&self, needle: &str) -> bool;
    // or
    fn equals_ignore_case(&self, needle: &str) -> bool;
...
    // or
    fn compare(&self, needle: &str, ignore_case: bool) -> bool;
...
}

I like the last one best.

pzol self-assigned this Feb 26, 2014
@Kimundi
Member

Kimundi commented Feb 26, 2014

Not sure... Adding a bunch of method combinations smells like a missed abstraction that could be integrated somehow. I'd add at most an eq_ignore_case function for now.
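One possible reading of the "missed abstraction", sketched in current Rust: if a folded `char` iterator exists, both equality and prefix checks fall out of it without a method per combination. This is only an illustration; the trait `IgnoreCase` and the helper `folded` are hypothetical names, and `folded` uses per-char lowercasing as a stand-in for real case folding.

```rust
/// Hypothetical extension trait; the method names follow the thread,
/// not any existing standard-library API.
trait IgnoreCase {
    fn eq_ignore_case(&self, other: &str) -> bool;
    fn starts_with_ignore_case(&self, needle: &str) -> bool;
}

/// Stand-in for real case folding: per-char lowercasing.
fn folded(s: &str) -> impl Iterator<Item = char> + '_ {
    s.chars().flat_map(char::to_lowercase)
}

impl IgnoreCase for str {
    fn eq_ignore_case(&self, other: &str) -> bool {
        folded(self).eq(folded(other))
    }

    fn starts_with_ignore_case(&self, needle: &str) -> bool {
        // Every folded char of `needle` must match the next folded
        // char of `self`; running out of `self` early is a mismatch.
        let mut hay = folded(self);
        folded(needle).all(|n| hay.next() == Some(n))
    }
}
```

For example, `"Hello World".starts_with_ignore_case("HELLO")` holds under this sketch. A suffix check would need a reversible folded iterator, which is left out here.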

@pzol
Contributor

pzol commented Feb 26, 2014

This is also relevant: http://www.w3.org/International/wiki/Case_folding

If my understanding is right, proper folding might require a language context in order to be correct. My suggestion would be to start with plain, language-independent case folding using this table:

http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
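For orientation, each data line of CaseFolding.txt has the form `<code>; <status>; <mapping>; # <name>`, where the status is `C` (common), `F` (full), `S` (simple) or `T` (Turkic special case, the part that depends on language context). A minimal parsing sketch follows; the `FoldEntry` type and the function names are hypothetical, and a real implementation would more likely generate a static table at build time.

```rust
/// One parsed data line of CaseFolding.txt (hypothetical type).
#[derive(Debug, PartialEq)]
struct FoldEntry {
    code: char,
    status: char, // 'C', 'F', 'S' or 'T'
    mapping: Vec<char>,
}

/// Parses a hexadecimal code point such as "00DF".
fn parse_hex_char(s: &str) -> Option<char> {
    u32::from_str_radix(s, 16).ok().and_then(char::from_u32)
}

/// Parses one line; returns `None` for blank lines, comment-only
/// lines, and malformed input.
fn parse_fold_line(line: &str) -> Option<FoldEntry> {
    // Drop the trailing "# NAME" comment, if any.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut fields = data.split(';').map(str::trim);
    let code = parse_hex_char(fields.next()?)?;
    let status = fields.next()?.chars().next()?;
    let mapping = fields
        .next()?
        .split_whitespace()
        .map(parse_hex_char)
        .collect::<Option<Vec<char>>>()?;
    Some(FoldEntry { code, status, mapping })
}
```

Full case folding uses the `C` and `F` entries, simple (one-to-one) folding uses `C` and `S`, and the `T` entries are applied only for Turkic languages.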

@lifthrasiir
Contributor

cc me

@steveklabnik
Member

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#791

Jarcho pushed a commit to Jarcho/rust that referenced this issue Aug 29, 2022
Fix false positives of needless_match

closes: rust-lang#9084
made needless_match take into account arm in the form of `_ if => ...`

changelog: none