Add robust unicode support (probably via ICU bindings) #14656

Closed
brson opened this issue Jun 4, 2014 · 10 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one.

Comments

@brson
Contributor

brson commented Jun 4, 2014

There's been lots of talk about unicode over the years. We have little support in the core libs, but need to provide something better for serious use. Best idea now is to wrestle libicu into a Rust crate. Start out-of-tree.

@emberian
Member

emberian commented Jun 4, 2014

libicu has had security vulnerabilities in the past, is UTF-16, and provides a lot more than what we would need. https://github.com/lifthrasiir/rust-encoding is pretty mature at this point, and all of the data it uses is publicly available. It might be a better idea to just have all unicode support outside of std in a libunicode, and implement the algorithms from the Unicode spec in pure Rust.

@emberian
Member

emberian commented Jun 4, 2014

Some vulnerabilities:

http://www.redhat.com/archives/rhsa-announce/2011-December/msg00037.html
http://www.redhat.com/archives/rhsa-announce/2009-June/msg00016.html
http://www.debian.org/security/2008/dsa-1511

Needless to say, the exact sort of thing Rust is great at preventing! This seems like a good place where we could even provide the same API as libicu and get Rust into the wider world.

@Florob
Contributor

Florob commented Jun 5, 2014

I think this may end up depending on a) how much Unicode support we want to include (initially) b) how much interest there is in implementing it.
I'd like to note that lib{core, std} already contain some interesting things that are written almost completely in safe Rust code, e.g. case-folding, NF(K)D.
Personally if a non-libicu libunicode comes into existence I'd be more than willing to rewrite PR #12792 to be included in that, and time permitting implement some other algorithms.

@soc
Contributor

soc commented Jun 5, 2014

As a Rust-outsider, the level of Unicode support really depends on

  • how much space you want to spend on Unicode support (some of the tables needed to do it properly are fairly large)
  • how much pain you want to endure while implementing it (some of the algorithms are pretty painful to write, at least without using existing code; I haven't checked the ICU license lately, though)

So whatever you do, please ...

  • use UTF-8. I'm sick of seeing other encodings leaking all over the place (especially in cases where people claimed "yes, we use encoding X internally, but we will make sure that we don't leak it").
  • remember that the "length" of a string, in terms of human perception (not in terms of "amount of storage required"), depends on many things, including the locale. This also applies to most other Unicode operations as well. Design the API with that in mind, not as an afterthought.
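
To make the "length" point concrete, here is a minimal sketch using only today's Rust standard library. The helper name `storage_and_scalar_len` is made up for illustration. It compares two encodings of what a reader perceives as the single character "é": the count differs depending on whether you measure bytes or Unicode scalar values, and neither count is the perceived length (that needs grapheme cluster segmentation, which std does not provide).

```rust
/// Hypothetical helper: returns (UTF-8 byte length, number of Unicode scalar values).
fn storage_and_scalar_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "e\u{301}";
    // The same perceived character as the single precomposed scalar U+00E9.
    let precomposed = "\u{e9}";

    println!("{:?}", storage_and_scalar_len(decomposed)); // (3, 2)
    println!("{:?}", storage_and_scalar_len(precomposed)); // (2, 1)

    // The two strings look identical to a reader but are not equal byte-wise;
    // comparing them meaningfully requires normalization first.
    println!("{}", decomposed == precomposed); // false
}
```

An API that exposes only one "length" forces callers to guess which of these it means, which is the argument for naming the unit explicitly.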

@brson
Contributor Author

brson commented Jun 5, 2014

Re 'how much unicode to support', we've long taken a stance that std should provide "some", and a separate crate provides "a lot" (whatever ICU does), and can be opted into. Where the line is drawn of what exactly goes into std is a matter of continual debate, but I want this issue to focus on the Unicode kitchen sink crate, and how to integrate it into the distribution.

@Florob
Contributor

Florob commented Jun 7, 2014

To get an idea of what ICU actually provides I went over the documentation and made some notes.
Here they are in case anyone else finds them useful:

Character Properties

  • look up character properties
  • lots of them
  • useful mostly to build other algorithms

StringPrep

  • RFC 3454 https://tools.ietf.org/html/rfc3454
  • locked to Unicode 3.2 (possibly requires separate tables)
  • about to be replaced by PRECIS (PREparation and Comparison of Internationalized Strings) http://tools.ietf.org/wg/precis/
  • framework with different profiles
  • used in:
    • XMPP (switching to PRECIS)
    • IDNA2003 (replaced by IDNA2008)
    • NFS4 (current bis-draft instead describes behaviour of implementations, no mention of StringPrep)
  • still useful (for now), but I'd rather have it in a separate crate

Conversion

  • converts between different encodings
  • many non-Unicode
  • Unicode: UTF-8, UTF-16{,LE,BE}, UTF-32{,LE,BE}, SCSU, BOCU-1, UTF-7, UTF-EBCDIC, CESU-8
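
For the two most common Unicode encodings in that list, conversion is already possible with the current Rust standard library alone. The following is a small sketch, not a proposal for the crate's API; the helper name `utf16_round_trip` is invented for illustration. It converts a UTF-8 `&str` to UTF-16 code units and back.

```rust
/// Hypothetical helper: encode a string as UTF-16 code units, then decode it again.
fn utf16_round_trip(s: &str) -> (Vec<u16>, String) {
    // `encode_utf16` yields native-endian u16 code units (no BOM).
    let units: Vec<u16> = s.encode_utf16().collect();
    // `from_utf16` fails on unpaired surrogates; these units come from a
    // valid &str, so decoding cannot fail here.
    let back = String::from_utf16(&units).expect("units came from valid UTF-8");
    (units, back)
}

fn main() {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
    let clef = "\u{1D11E}";
    let (units, back) = utf16_round_trip(clef);

    println!("{}", clef.len()); // 4 bytes in UTF-8
    println!("{:04X?}", units); // one surrogate pair: [D834, DD1E]
    println!("{}", back == clef); // true
}
```

The remaining encodings in the list (SCSU, BOCU-1, UTF-7, UTF-EBCDIC, CESU-8 and the non-Unicode ones) have no std support and would need a dedicated crate.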

Locales & Resources

  • built-in concept of locales and locale-specific resources
  • partially better suited for a separate crate, but to some extent needed for proper case-folding and collation

Date/Time

  • IMHO belongs in a different crate

Formatting

  • locale specific formatting (currency, dates, time, …)
  • IMHO belongs in a different crate

Transforms

  • case-mapping: lower-, upper-, title-case
  • Full/Halfwidth conversion
  • normalization (NFD, NFC, NFKD, NFKC)
  • BiDi Algorithm
  • custom
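
As a rough illustration of why case-mapping ties into locales, here is a sketch using the case-mapping methods in the current Rust standard library. Those methods are locale-independent. The wrapper names `upper` and `lower` are made up for this example.

```rust
/// Hypothetical wrapper around std's locale-independent upper-casing.
fn upper(s: &str) -> String {
    s.to_uppercase()
}

/// Hypothetical wrapper around std's locale-independent lower-casing.
fn lower(s: &str) -> String {
    s.to_lowercase()
}

fn main() {
    // Case mapping is not one-to-one: 'ß' upper-cases to the two letters "SS",
    // so the result can be longer than the input.
    println!("{}", upper("straße")); // STRASSE

    // std always maps 'I' to 'i'. Turkish and Azerbaijani expect a dotless 'ı'
    // here, which is the kind of rule that needs locale data.
    println!("{}", lower("I")); // i
}
```

Title-casing, width conversion, normalization and BiDi have no counterpart in std and are left out of this sketch.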

Collation

  • sorting according to locale rules

Boundary Analysis

  • find character (actually grapheme cluster), word, line-break, sentence boundaries

Layout Engine

  • support for text rendering
  • can read fonts
  • does not actually render since that is platform specific, but provides a base class
  • IMHO very specific and should be a separate crate/its own project

@huonw
Member

huonw commented Jun 7, 2014

cc @lifthrasiir

@ghost

ghost commented Oct 3, 2014

Is anyone working on ICU bindings?

@ArtemGr
Contributor

ArtemGr commented Jan 25, 2015

@steveklabnik
Member

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#797

bors added a commit to rust-lang-ci/rust that referenced this issue Dec 4, 2023
Implement completion for the callable fields.

Fixes rust-lang#14656

PR is opened with basic changes. It could be improved by having a new `SymbolKind` for the callable fields and implementing a separate render function similar to the `render_method` for the new `SymbolKind`.
It could also be done without any changes to the `SymbolKind` of course, have the new function called based on the type of field.
I prefer the former method.

Please share any thoughts or changes you think are appropriate for this method. I could start working on that in this same PR.
7 participants