Add support for NF(K)C #12792
#[test]
fn test_nfkc_chars() {
    assert_eq!("abc".nfkc_chars().collect::<~str>(), ~"abc");
You could write a macro to factor out the common code (here and above):
macro_rules! t {
($input: expr, $expected: expr) => {
assert_eq!($input.nfkc_chars().collect::<~str>(), $expected.to_owned());
}
}
t!("abc", "abc");
t!("\u1e0b\u01c4", "\u1e0bD\u017d");
// ...
(Could also be a function, but the macro will print more useful information on failure.)
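For readers on current Rust: the snippet above uses pre-1.0 syntax (`~str`, `\u1e0b`). A rough modern equivalent of the same test-macro idea might look like the sketch below. `toy_nfkc` is a hypothetical stand-in for the PR's `nfkc_chars()`; it knows only the single compatibility mapping needed by the example and is not a real normalizer.

```rust
/// Hypothetical stand-in for the PR's `nfkc_chars()`.
/// It handles exactly one mapping and passes everything else through.
fn toy_nfkc(input: &str) -> String {
    input
        .chars()
        .flat_map(|c| match c {
            // U+01C4 normalizes to "D" followed by U+017D under NFKC.
            '\u{01c4}' => vec!['D', '\u{017d}'],
            other => vec![other],
        })
        .collect()
}

/// Factors out the repeated "normalize, then compare" assertion.
/// Because it is a macro, a failing assertion reports the call site.
macro_rules! t {
    ($input:expr, $expected:expr) => {
        assert_eq!(toy_nfkc($input), $expected.to_owned());
    };
}
```

Each test case then becomes a single line such as `t!("abc", "abc");` inside the test function.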
I'm a little skeptical to continue to add large amounts of unicode support to cc @brson
I also do not want to continue rolling our own unicode support in std, and would rather std contain the minimum necessary understanding of unicode. Can we instead think about how to make proper ICU bindings?
@brson Do you have a clear definition of what the "minimum necessary understanding of unicode" means? Any equality comparison between Unicode strings is pretty much meaningless without normalization. Though I have to admit for that use-case you can get away supporting only NFD and NFKD. NFC and NFKC are more interesting for saving storage space and implementing protocols that require them. Personally I'd like Rust to support at least some basic Unicode operations, without pulling in ICU. Support for this need not necessarily be within
Some argument against using ICU for everything: it uses UTF16 internally, so every interaction requires allocating & encoding/decoding; it's a C library, and presumably has a variety of security vulnerabilities (for comparison, our
Closing due to inactivity, but it would be nice to improve our current unicode situation outside of libstd.
This adds a new Recompositions iterator, which performs canonical composition on the result of the Normalizations iterator (which is canonical or compatibility decomposition). In effect this implements Unicode normalization forms C and KC.
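To illustrate the "decompose, then recompose" layering described above, here is a minimal sketch in current Rust. It is not the PR's implementation: `ToyRecompositions`, `toy_recompose`, and `compose_pair` are hypothetical names, the composition table is a hand-written handful of pairs rather than data from the Unicode Character Database, and the adapter only merges a mark that directly follows its base. It ignores combining-class blocking and Hangul composition, both of which a real NFC/NFKC implementation must handle.

```rust
use std::iter::Peekable;

/// Toy composition table: (base, combining mark) -> precomposed char.
/// Assumption: a real implementation would generate this from Unicode data.
fn compose_pair(base: char, mark: char) -> Option<char> {
    match (base, mark) {
        ('a', '\u{0301}') => Some('\u{00e1}'), // a + acute accent
        ('e', '\u{0301}') => Some('\u{00e9}'), // e + acute accent
        ('d', '\u{0307}') => Some('\u{1e0b}'), // d + dot above
        ('Z', '\u{030c}') => Some('\u{017d}'), // Z + caron
        _ => None,
    }
}

/// Iterator adapter over an already-decomposed stream of chars.
struct ToyRecompositions<I: Iterator<Item = char>> {
    inner: Peekable<I>,
}

impl<I: Iterator<Item = char>> Iterator for ToyRecompositions<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let mut current = self.inner.next()?;
        // Fold each immediately following char into `current`
        // for as long as the table has a composition for the pair.
        while let Some(&next) = self.inner.peek() {
            match compose_pair(current, next) {
                Some(composed) => {
                    current = composed;
                    self.inner.next(); // consume the merged mark
                }
                None => break,
            }
        }
        Some(current)
    }
}

/// Wraps a decomposed char iterator in the toy recomposition adapter.
fn toy_recompose<I: Iterator<Item = char>>(decomposed: I) -> ToyRecompositions<I> {
    ToyRecompositions { inner: decomposed.peekable() }
}
```

Under these assumptions, `toy_recompose("d\u{0307}DZ\u{030c}".chars()).collect::<String>()` yields `"\u{1e0b}D\u{017d}"`, mirroring the expected value in the test case from the review thread.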