Add robust unicode support (probably via ICU bindings) #14656

brson · 2014-06-04T22:09:39Z

There's been lots of talk about unicode over the years. We have little support in the core libs, but need to provide something better for serious use. Best idea now is to wrestle libicu into a Rust crate. Start out-of-tree.

emberian · 2014-06-04T23:15:27Z

libicu has had security vulnerabilities in the past, is UTF-16, and provides a lot more than what we would need. https://github.com/lifthrasiir/rust-encoding is pretty mature at this point, and all of the data it uses is publicly available. It might be a better idea to just have all unicode support outside of std in a libunicode, and implement the algorithms from the Unicode spec in pure Rust.

emberian · 2014-06-04T23:17:18Z

Some vulnerabilities:

http://www.redhat.com/archives/rhsa-announce/2011-December/msg00037.html
http://www.redhat.com/archives/rhsa-announce/2009-June/msg00016.html
http://www.debian.org/security/2008/dsa-1511

Needless to say, the exact sort of thing Rust is great at preventing! This seems like a good place where we could even provide the same API as libicu and get Rust into the wider world.

Florob · 2014-06-05T15:23:58Z

I think this may end up depending on a) how much Unicode support we want to include (initially) and b) how much interest there is in implementing it.
I'd like to note that lib{core, std} already contain some interesting things that are written almost completely in safe Rust code, e.g. case-folding, NF(K)D.
Personally if a non-libicu libunicode comes into existence I'd be more than willing to rewrite PR #12792 to be included in that, and time permitting implement some other algorithms.

soc · 2014-06-05T19:55:24Z

As a Rust-outsider, the level of Unicode support really depends on

how much space you want to spend on Unicode support (some of the tables needed to do it properly are fairly large)?
how much pain you want to endure while implementing it (some of the algorithms are pretty painful to write, at least without using existing code; I haven't checked the ICU license lately, though)?

So whatever you do, please ...

use UTF-8. I'm sick of seeing other encodings leaking all over the place (especially in cases where people claimed "yes, we use encoding X internally, but we will make sure that we don't leak it").
remember that the "length" of a string, in terms of human perception (not in terms of "amount of storage required"), depends on many things, including the locale. This also applies to most other Unicode operations as well. Design the API with that in mind, not as an afterthought.
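To illustrate the point about "length", here is a minimal sketch that measures the same visible character three ways. It uses only the standard library as it exists in current stable Rust (which postdates this thread); `lengths` is a hypothetical helper name. Counting what a reader perceives (grapheme clusters) needs segmentation tables that std does not ship, so that count is not shown.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values) for `s`.
/// Hypothetical helper for illustration only.
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: a reader sees one character.
    let decomposed = "e\u{301}";
    // The same text as the single precomposed scalar value U+00E9.
    let precomposed = "\u{e9}";

    // None of the three counts agrees between the two spellings,
    // even though both render as the same single character.
    println!("decomposed:  {:?}", lengths(decomposed));
    println!("precomposed: {:?}", lengths(precomposed));
}
```

None of these numbers is "the" length; which one is appropriate depends on whether the caller cares about storage, interop with UTF-16 APIs, or code points.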

brson · 2014-06-05T22:26:28Z

Re 'how much unicode to support', we've long taken a stance that std should provide "some", and a separate crate provides "a lot" (whatever ICU does), and can be opted into. Where the line is drawn of what exactly goes into std is a matter of continual debate, but I want this issue to focus on the Unicode kitchen sink crate, and how to integrate it into the distribution.

Florob · 2014-06-07T05:27:46Z

To get an idea of what ICU actually provides I went over the documentation and made some notes.
Here they are in case anyone else finds them useful:

Character Properties

look up character properties
lots of them
useful mostly to build other algorithms
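As a rough idea of what a property lookup looks like, the sketch below queries the handful of properties that current stable Rust exposes on `char`. This is not ICU's API, and `describe` is a hypothetical helper; ICU offers far more properties than these.

```rust
/// Collects the names of a few std-provided properties that hold for `c`.
/// Hypothetical helper for illustration only.
fn describe(c: char) -> Vec<&'static str> {
    let mut props = Vec::new();
    if c.is_alphabetic() {
        props.push("alphabetic");
    }
    if c.is_numeric() {
        props.push("numeric");
    }
    if c.is_uppercase() {
        props.push("uppercase");
    }
    if c.is_lowercase() {
        props.push("lowercase");
    }
    if c.is_whitespace() {
        props.push("whitespace");
    }
    props
}

fn main() {
    // U+0663 is ARABIC-INDIC DIGIT THREE, U+2003 is EM SPACE.
    for c in ['A', '\u{663}', '\u{2003}'] {
        println!("U+{:04X}: {:?}", c as u32, describe(c));
    }
}
```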

StringPrep

RFC 3454 https://tools.ietf.org/html/rfc3454
locked to Unicode 3.2 (possibly requires separate tables)
about to be replaced by PRECIS (PREparation and Comparison of Internationalized Strings) http://tools.ietf.org/wg/precis/
framework with different profiles
used in:
- XMPP (switching to PRECIS)
- IDNA2003 (replaced by IDNA2008)
- NFS4 (current bis-draft instead describes behaviour of implementations, no mention of StringPrep)
still useful (for now), but I'd rather have it in a separate crate

Conversion

converts between different encodings
many non-Unicode
Unicode: UTF-8, UTF-16{,LE,BE}, UTF-32{,LE,BE}, SCSU, BOCU-1, UTF-7, UTF-EBCDIC, CESU-8
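For the UTF-8/UTF-16 pair specifically, a conversion can be sketched with current stable Rust's standard library alone. The function names below are hypothetical; the other encodings in the list (SCSU, BOCU-1, UTF-7, and so on) are not covered by std and would need a dedicated crate.

```rust
/// Encodes a UTF-8 string slice as UTF-16 code units (native endianness).
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Decodes UTF-16 code units, returning None on invalid input
/// such as an unpaired surrogate.
fn from_utf16(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
    let text = "a\u{1F600}";
    let units = to_utf16(text);
    println!("{:04X?}", units);

    // Round-tripping valid input gives back the original string.
    println!("{:?}", from_utf16(&units));

    // A lone high surrogate is not valid UTF-16.
    println!("{:?}", from_utf16(&[0xD83D]));
}
```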

Locales & Resources

built-in concept of locales and locale-specific resources
partially better suited for a separate crate, but to some extent needed for proper case-folding and collation

Date/Time

IMHO belongs in a different crate

Formatting

locale specific formatting (currency, dates, time, …)
IMHO belongs in a different crate

Transforms

case-mapping: lower-, upper-, title-case
Full/Halfwidth conversion
normalization (NFD, NFC, NFKD, NFKC)
BiDi Algorithm
custom
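A small sketch of why case-mapping is more than a per-character table lookup, using `str::to_uppercase` from current stable Rust (not ICU). `upper` is a hypothetical wrapper. The std mapping is locale-independent, so it cannot apply language-specific rules such as the Turkish dotted capital İ.

```rust
/// Locale-independent uppercase mapping via the standard library.
/// Hypothetical wrapper for illustration only.
fn upper(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    // One character can map to several: 'ß' uppercases to "SS",
    // so the result has more characters than the input.
    println!("{}", upper("straße"));

    // Without a locale, 'i' always becomes 'I'; a Turkish-aware
    // mapping would produce U+0130 instead.
    println!("{}", upper("i"));
}
```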

Collation

sorting according to locale rules
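To show what collation has to improve on, this sketch sorts with the default `str` ordering in current stable Rust, which compares bytes and therefore orders by code point. `sort_by_code_point` is a hypothetical name; no locale rules are applied.

```rust
/// Sorts using the default `str` ordering, i.e. by code point.
/// Hypothetical helper for illustration only; this is not collation.
fn sort_by_code_point(mut words: Vec<&str>) -> Vec<&str> {
    words.sort();
    words
}

fn main() {
    // Uppercase ASCII sorts before lowercase ASCII, and 'Ä' (U+00C4)
    // sorts after every ASCII letter, which is rarely what a reader expects.
    let sorted = sort_by_code_point(vec!["apple", "Zebra", "Äpfel"]);
    println!("{:?}", sorted);
}
```

A locale-aware collator would typically place "Äpfel" next to "apple" and ignore case at the first comparison level, which is the behaviour the ICU collation service provides.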

Boundary Analysis

find character (actually grapheme cluster), word, line-break, sentence boundaries

Layout Engine

support for text rendering
can read fonts
does not actually render since that is platform specific, but provides a base class
IMHO very specific and should be a separate crate/its own project

huonw · 2014-06-07T05:40:31Z

cc @lifthrasiir

ghost · 2014-10-03T17:48:46Z

Is anyone working on ICU bindings?

ArtemGr · 2015-01-25T21:10:02Z

@Jurily, https://gist.github.com/ArtemGr/91e88de7e17fbc571926

steveklabnik · 2015-02-02T10:58:02Z

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#797

brson added A-libs labels Jun 4, 2014

steveklabnik mentioned this issue Feb 2, 2015

Add robust unicode support (probably via ICU bindings) rust-lang/rfcs#797

Open

steveklabnik closed this as completed Feb 2, 2015

filmil mentioned this issue Oct 21, 2019

ICU functionality support in Rust unicode-org/rust-discuss#8

Closed
