Add support for NF(K)C #12792

Florob · 2014-03-09T21:31:37Z

This adds a new Recompositions iterator, which performs canonical composition on the result of the Normalizations iterator (which performs canonical or compatibility decomposition). In effect, this implements Unicode normalization forms C and KC.
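To illustrate the decompose-then-recompose pipeline described above, here is a minimal sketch in current Rust (not the 2014 dialect of this PR, and not the PR's code). It covers only Hangul syllables, the one part of canonical (de)composition that the Unicode Standard defines arithmetically rather than through data tables. The names `decompose_hangul`, `compose_hangul` and `recompose` are hypothetical. A full NFC implementation would also need the composition tables and canonical combining-class handling.

```rust
// Constants from the Unicode Standard's Hangul syllable (de)composition algorithm.
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7;
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172

/// Hypothetical helper: push the jamo of a precomposed Hangul syllable,
/// or the character itself if it is not such a syllable.
fn decompose_hangul(c: char, out: &mut Vec<char>) {
    let s = c as u32;
    if s < S_BASE || s >= S_BASE + S_COUNT {
        out.push(c);
        return;
    }
    let s_index = s - S_BASE;
    let l = L_BASE + s_index / N_COUNT;
    let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
    let t = T_BASE + s_index % T_COUNT;
    out.push(char::from_u32(l).unwrap());
    out.push(char::from_u32(v).unwrap());
    // A trailing consonant exists only when the T index is non-zero.
    if t != T_BASE {
        out.push(char::from_u32(t).unwrap());
    }
}

/// Hypothetical helper: compose an adjacent pair if it is L+V or LV+T.
fn compose_hangul(a: char, b: char) -> Option<char> {
    let (a, b) = (a as u32, b as u32);
    // Leading consonant + vowel -> LV syllable.
    if (L_BASE..L_BASE + L_COUNT).contains(&a) && (V_BASE..V_BASE + V_COUNT).contains(&b) {
        let lv = S_BASE + ((a - L_BASE) * V_COUNT + (b - V_BASE)) * T_COUNT;
        return char::from_u32(lv);
    }
    // LV syllable + trailing consonant -> LVT syllable.
    if (S_BASE..S_BASE + S_COUNT).contains(&a)
        && (a - S_BASE) % T_COUNT == 0
        && (T_BASE + 1..T_BASE + T_COUNT).contains(&b)
    {
        return char::from_u32(a + (b - T_BASE));
    }
    None
}

/// Recompose a decomposed sequence by repeatedly merging the last
/// emitted character with the next input character where possible.
fn recompose(input: &[char]) -> String {
    let mut out: Vec<char> = Vec::new();
    for &c in input {
        let last = out.last().copied();
        if let Some(composed) = last.and_then(|l| compose_hangul(l, c)) {
            *out.last_mut().unwrap() = composed;
        } else {
            out.push(c);
        }
    }
    out.into_iter().collect()
}
```

Feeding U+D55C through `decompose_hangul` yields U+1112 U+1161 U+11AB, and `recompose` on that sequence gives back U+D55C.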

huonw · 2014-03-10T00:34:25Z

src/libstd/str.rs

+
+    #[test]
+    fn test_nfkc_chars() {
+        assert_eq!("abc".nfkc_chars().collect::<~str>(), ~"abc");


You could write a macro to factor out the common code (here and above):

    macro_rules! t {
        ($input: expr, $expected: expr) => {
            assert_eq!($input.nfkc_chars().collect::<~str>(), $expected.to_owned());
        }
    }
    t!("abc", "abc");
    t!("\u1e0b\u01c4", "\u1e0bD\u017d");
    // ...

(Could also be a function, but the macro will print more useful information on failure.)

alexcrichton · 2014-03-10T06:50:03Z

I'm a little skeptical of continuing to add large amounts of unicode support to libstd. I would be more comfortable with a libunicode crate that provides a Unicode trait for dealing with these corners of unicode (perhaps the crate would be called libencoding?).

cc @brson

brson · 2014-03-10T18:00:39Z

I also do not want to continue rolling our own unicode support in std, and would rather std contain the minimum necessary understanding of unicode.

Can we instead think about how to make proper ICU bindings?

Florob · 2014-03-10T23:17:02Z

@brson Do you have a clear definition of what the "minimum necessary understanding of unicode" means? Any equality comparison between Unicode strings is pretty much meaningless without normalization. Though I have to admit that for that use case you can get away with supporting only NFD and NFKD. NFC and NFKC are more interesting for saving storage space and implementing protocols that require them.

Personally I'd like Rust to support at least some basic Unicode operations, without pulling in ICU. Support for this need not necessarily be within libstd, though it might be worthwhile having a separate discussion concerning which operations are the bare minimum to support on a Unicode string type, without requiring additional crates.
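The equality and storage points in this comment can be seen with a single character. The sketch below uses current Rust syntax and only the standard library. The constant and function names are hypothetical, chosen for illustration. It compares the precomposed (NFC) and decomposed (NFD) spellings of "é" without normalizing either.

```rust
// U+00E9 LATIN SMALL LETTER E WITH ACUTE: the NFC spelling.
const PRECOMPOSED: &str = "\u{e9}";
// U+0065 followed by U+0301 COMBINING ACUTE ACCENT: the NFD spelling.
const DECOMPOSED: &str = "e\u{301}";

/// Hypothetical helper: (number of code points, number of UTF-8 bytes).
fn describe(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

/// `==` on `&str` compares the encoded bytes, so canonically equivalent
/// strings in different normalization forms compare as unequal.
fn naive_equal(a: &str, b: &str) -> bool {
    a == b
}
```

Here `naive_equal(PRECOMPOSED, DECOMPOSED)` is false even though both render as the same text. The composed form takes one code point and two UTF-8 bytes, the decomposed form two code points and three bytes.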

huonw · 2014-03-10T23:53:15Z

Some arguments against using ICU for everything: it uses UTF-16 internally, so every interaction requires allocating & encoding/decoding; it's a C library, and presumably has a variety of security vulnerabilities (for comparison, our std::unicode module has no unsafe in it at all).
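As a rough illustration of the conversion cost mentioned above, the following sketch (current Rust, standard library only) shows what a UTF-8 caller has to do at a UTF-16 boundary. It does not call ICU, and the function names are hypothetical. It only demonstrates that each direction allocates a fresh buffer and re-encodes every character.

```rust
/// Hypothetical helper: re-encode a UTF-8 string as UTF-16 code units.
/// Allocates a new Vec for every call.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Hypothetical helper: decode UTF-16 code units back into a UTF-8 String.
/// Allocates again, and fails on unpaired surrogates.
fn from_utf16(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

A round trip such as `from_utf16(&to_utf16(s))` therefore costs two allocations and two transcoding passes before any actual Unicode work happens.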

alexcrichton · 2014-03-20T16:34:59Z

Closing due to inactivity, but it would be nice to improve our current unicode situation outside of libstd.

Florob added 5 commits March 9, 2014 14:22

Update unicode.py to actually generate unicode.rs

713af75

std: Use appropriately sized integers for codepoints and bytes

ab05d1a

std: Move Hangul decomposition into unicode.rs

bb587d9

std: Rename str::Normalizations to str::Decompositions

f41ecef

std: Add support for NFC and NFKC

8e444f2

huonw reviewed Mar 10, 2014

alexcrichton closed this Mar 20, 2014

Florob mentioned this pull request Apr 25, 2014

Clean up unicode code in libstd #13770

Closed

Florob mentioned this pull request Jun 5, 2014

Add robust unicode support (probably via ICU bindings) #14656

Closed
