Add support for NF(K)C #12792

Closed
wants to merge 5 commits into from

Florob
Contributor

Florob commented Mar 9, 2014

This adds a new Recompositions iterator, which performs canonical composition on the output of the Normalizations iterator (which performs canonical or compatibility decomposition). In effect, this implements Unicode normalization forms C and KC.
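As an illustration of the composition step described above, here is a minimal sketch in current Rust syntax. It is not the code from this pull request: the function names `combining_class`, `compose_pair` and `recompose` are hypothetical, and the two lookup tables are deliberately tiny stand-ins for data that a real implementation would generate from the Unicode Character Database. The input is assumed to be already decomposed and canonically ordered.

```rust
/// Toy canonical combining class lookup (assumption: only four marks are known).
fn combining_class(c: char) -> u8 {
    match c {
        '\u{0323}' => 220,                           // COMBINING DOT BELOW
        '\u{0301}' | '\u{0307}' | '\u{030A}' => 230, // acute, dot above, ring above
        _ => 0,                                      // everything else is treated as a starter
    }
}

/// Toy primary-composite lookup (assumption: deliberately incomplete).
fn compose_pair(first: char, second: char) -> Option<char> {
    match (first, second) {
        ('e', '\u{0301}') => Some('\u{00E9}'), // e + acute      -> é
        ('a', '\u{030A}') => Some('\u{00E5}'), // a + ring above -> å
        ('d', '\u{0307}') => Some('\u{1E0B}'), // d + dot above  -> ḋ
        ('d', '\u{0323}') => Some('\u{1E0D}'), // d + dot below  -> ḍ
        _ => None,
    }
}

/// Canonically composes a sequence that is already decomposed and
/// canonically ordered.
fn recompose(decomposed: &str) -> String {
    let mut out: Vec<char> = Vec::new();
    // Index in `out` of the most recent starter, if any.
    let mut starter: Option<usize> = None;
    // Combining class of the last mark kept after that starter.
    let mut last_class: Option<u8> = None;

    for c in decomposed.chars() {
        let class = combining_class(c);
        if let Some(i) = starter {
            // `c` is blocked when a kept mark between the starter and `c`
            // has a combining class greater than or equal to that of `c`.
            // Because the input is canonically ordered, checking the last
            // kept mark is sufficient.
            let blocked = match last_class {
                Some(prev) => prev >= class,
                None => false,
            };
            if !blocked {
                if let Some(composite) = compose_pair(out[i], c) {
                    out[i] = composite; // fold `c` into the starter
                    continue;
                }
            }
        }
        if class == 0 {
            starter = Some(out.len());
            last_class = None;
        } else {
            last_class = Some(class);
        }
        out.push(c);
    }
    out.into_iter().collect()
}
```

With these toy tables, `recompose("e\u{0301}")` yields the single character U+00E9, while `recompose("e\u{030A}\u{0301}")` is returned unchanged: the ring above does not compose with `e` and has the same combining class as the acute, so it blocks the acute from reaching the starter.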


#[test]
fn test_nfkc_chars() {
assert_eq!("abc".nfkc_chars().collect::<~str>(), ~"abc");
Member

You could write a macro to factor out the common code (here and above):

macro_rules! t {
    ($input: expr, $expected: expr) => {
        assert_eq!($input.nfkc_chars().collect::<~str>(), $expected.to_owned());
    }
}

t!("abc", "abc");
t!("\u1e0b\u01c4", "\u1e0bD\u017d");
// ...

(Could also be a function, but the macro will print more useful information on failure.)

@alexcrichton
Member

I'm a little skeptical about continuing to add large amounts of unicode support to libstd. I would be more comfortable with a libunicode crate that provides a Unicode trait for dealing with these corners of unicode (perhaps the crate would be called libencoding?).

cc @brson

@brson
Contributor

brson commented Mar 10, 2014

I also do not want to continue rolling our own unicode support in std, and would rather std contain the minimum necessary understanding of unicode.

Can we instead think about how to make proper ICU bindings?

@Florob
Contributor Author

Florob commented Mar 10, 2014

@brson Do you have a clear definition of what the "minimum necessary understanding of unicode" means? Any equality comparison between Unicode strings is pretty much meaningless without normalization. Though I have to admit that for this use case you can get away with supporting only NFD and NFKD. NFC and NFKC are more interesting for saving storage space and for implementing protocols that require them.
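To make the equality point concrete, here is a small sketch in current Rust using only the standard library. The names `PRECOMPOSED`, `DECOMPOSED` and `scalar_values` are illustrative, not part of this pull request. Both constants render as "é", yet they compare unequal because string comparison looks at the encoded code points, not at canonical equivalence.

```rust
/// "é" as a single precomposed code point (the NFC form).
const PRECOMPOSED: &str = "\u{00E9}";
/// "é" as `e` followed by COMBINING ACUTE ACCENT (the NFD form).
const DECOMPOSED: &str = "e\u{0301}";

/// Lists the Unicode scalar values of `s` in `U+XXXX` notation.
fn scalar_values(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Plain `==` on strings compares the underlying bytes, so canonically
/// equivalent strings in different normalization forms are not equal.
fn naive_equal(a: &str, b: &str) -> bool {
    a == b
}
```

Here `naive_equal(PRECOMPOSED, DECOMPOSED)` is `false`. The precomposed form also takes 2 bytes in UTF-8 where the decomposed form takes 3, which is the storage argument for the composed forms.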

Personally I'd like Rust to support at least some basic Unicode operations, without pulling in ICU. Support for this need not necessarily be within libstd, though it might be worthwhile having a separate discussion concerning which operations are the bare minimum to support on a Unicode string type, without requiring additional crates.

@huonw
Copy link
Member

huonw commented Mar 10, 2014

Some arguments against using ICU for everything: it uses UTF-16 internally, so every interaction requires allocating and encoding/decoding; it's a C library, and presumably has a variety of security vulnerabilities (for comparison, our std::unicode module has no unsafe in it at all).
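The transcoding cost mentioned above can be sketched as follows, in current Rust. This does not call ICU: `utf16_library_call` is a hypothetical stand-in for any routine that only accepts UTF-16, and `through_utf16` is an assumed wrapper name. Only `str::encode_utf16` and `String::from_utf16` are real standard-library calls.

```rust
/// Hypothetical stand-in for a foreign routine that works on UTF-16 only.
/// It simply copies its input so the round trip can be observed.
fn utf16_library_call(input: &[u16]) -> Vec<u16> {
    input.to_vec()
}

/// What a wrapper around such a routine has to do for a UTF-8 string.
fn through_utf16(s: &str) -> String {
    // First allocation: transcode UTF-8 to UTF-16 before the call.
    let wide: Vec<u16> = s.encode_utf16().collect();
    let result = utf16_library_call(&wide);
    // Second allocation: transcode the UTF-16 result back to UTF-8.
    String::from_utf16(&result).expect("library returned invalid UTF-16")
}
```

Every call through such a boundary pays for both conversions, even when the operation itself leaves the text unchanged.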

@alexcrichton
Member

Closing due to inactivity, but it would be nice to improve our current unicode situation outside of libstd.
