A type representing an owned C-compatible wide string #1773

bozaro · 2016-10-20T09:51:52Z

This RFC born from issue: rust-lang/rust#36671

Add CWideString/CWideStr for more simple interaction with not well-formed UTF-16
external API (for example: with Windows API).

bozaro · 2016-10-20T10:05:56Z

A preliminary implementation can be found at: https://github.com/bozaro/rust-cwstring

cristicbz · 2016-10-20T10:21:57Z

Rendered

BurntSushi · 2016-10-20T11:06:51Z

Can you please write out the detailed design as part of the RFC? Linking to an existing implementation is fine, but I'd like to see the API in the RFC text.

From looking at the implementation, it appears its internal representation is u8. I think that means @retep998's concern is pertinent:

Would this CWideString type actually wrap [u16], so that you could pass them directly to winapi functions without having to convert to and from WTF-8? It would likely have to be Windows specific though, it doesn't make sense on any other platforms, aside from UEFI.

Could you also please say more about what "wide" strings are in general? What problems do they solve? How do people use them? How are they represented in other languages/ecosystems? Perhaps some code examples showing usage of this new type would help, with comparisons to how you might achieve it today without this wide type.

retep998 · 2016-10-20T11:19:14Z

@BurntSushi The actual implementation is on the u16 branch. The master branch has an old version where the CStr impl was copied over but not adjusted to use u16 yet.

bozaro · 2016-10-20T11:58:47Z

I'm sorry. I forgotten to merge u16 branch to master :(
Current CWideString is looks like working, but also need:

Get better name to get_bytes methods;
Fix method descriptions.

alexcrichton · 2016-10-20T18:30:11Z

I think it's definitely worth considering the relationship to wchar_t on C platforms. A name like CWideString would certainly lead many to believe that it's equivalent to that! Are "wide strings" used much at all on Unix? If so, are they wchar_t-based or are they all explicitly 16-byte wide elements?

I agree with @BurntSushi that we'll likely want to include some APIs here in the RFC itself, especially methods like into_bytes which will need to be renamed as well for the wider element type.

Finally, I'm wary of using c_ushort and would be open to considering wchar_t to mirror FFI usage. That depends on whether 16-bit strings are all that common on Unix, though.

bozaro · 2016-10-20T19:15:41Z

As I know, sizeof(wchar_t) is 2-bytes on Windows and 4 bytes on Linux. But I never seen real use of 4-byte wchar_t on Linux (but C++ is not my primary language). Windows, Java, Qt, Xalan-C++ uses 2-bytes strings.

I seen encoding chaos only on Windows (for example, in my case cp1251 used as ANSI, cp866 as OEM and utf-16 charset used in my Windows at the same time). In other platforms utf-8 is used for multibyte strings.

I'm sure, Rust should not implement 2-bytes and 4-bytes strings with same name.
This C++ pain, which should remain in C++. I doubt the usefulness of 4-byte strings.

I really can't generate good name for into_bytes method (sorry, but English is not my native language).

About c_ushort: I really can't see any c_ushort usage profit and prefer to replace c_ushort by u16 in this code.

retep998 · 2016-10-21T04:39:16Z

I propose bytes -> wchars.

codyps · 2016-10-21T04:47:14Z

As we're considering this, it might make sense to resolve the case of char32_t and char16_t strings for c++11/c11 compatibility as a whole. Or just explicitly restrict this to utf-16 (ie: char16_t, if that's what is needed to resolve the bug).

Edit: to be clear, if this only handles utf-16-like strings, 16 should be somewhere in the name. W is fairly unspecified (as we can see from the unix vs windows discussions here).

retep998 · 2016-10-21T05:38:22Z

@jmesmon Just remember that this type isn't for strict UTF-16. It will allow arbitrary u16 including lone surrogates, similar to how CString allows arbitrary u8 regardless of whether it is valid UTF-8.

est31 · 2016-10-22T14:02:09Z

But I never seen real use of 4-byte wchar_t on Linux

I do know some C++ libraries which use wchar_t, but I have to admit that don't know enough C++ libraries to determine whether wchar_t is popular or not.

Windows, Java, Qt, Xalan-C++ uses 2-bytes strings.

But only windows calls it wchar ("WCHAR" to be precise).

Java calls it jchar
Qt calls it QChar, and explicitly refers to C++'s wchar_t type as Wchar.

and about Xalan-C++, its not a major library like Java or Qt, and it doesn't call its char type wchar either, but XalanDOMChar. And optionally, the XalanDOMChar can be set to wchar_t at build time.

Calling the proposed 2-byte c-string type a wstring will just cause confusion among all users not targeting windows only platforms, and among former C++ developers who think it would be something similar to wchar_t.

👍 to the idea of such a string type but 👎 to the name.

est31 · 2016-10-22T14:04:34Z

What about Cu16String/CU16String instead? Its not mentioning UTF-16 itself, but the u16 type on which it bases, and it also has the "C" in the name which carries the meaning that it could contain invalid UTF-16 like lone surrogates.

petrochenkov · 2016-10-22T14:12:58Z

Can't CString get a type parameter instead of introducing a new string type?

struct CString<CharT = u8> { ... }

This solves the ambiguous naming problem and prevents further proliferation of string types as well.
(Sorry, I didn't read the previous discussion.)

SimonSapin · 2016-10-22T15:50:07Z

In line with WTF-8 naming, I’ve sometimes used WTF-16 to call the character encoding used on Windows since it’s not quite UTF-16 (at least not "well formed") and not quite UCS-2 (which only supports U+0000 to U+FFFF).

But it’s probably ok to ignore Unicode hair-splitting and go with a name that describes "null-terminated u16s" instead (memory representation rather that meaning relative to Unicode). I think C describes zero-terminated well in the context of Rust (CString vs String). But as other have said W for "wide" is more ambiguous, the name should include 16 (or maybe 2) instead, for the number of bits or bytes per code units.

bozaro · 2016-10-24T12:03:23Z

struct CString<CharT = u8> { ... }

This sounds great. But unfortunately CString we need two parameters: for as_bytes() method (u8) and for as_ptr() method (c_char == i8). Also I'm not sure about backward compatibility.

So, as I understand, we got code like:

trait Char: Sized {
    fn memchr(x: Self, text: &[Self]) -> Option<usize>;
}

impl Char for u8 {
    fn memchr(x: Self, text: &[Self]) -> Option<usize> {
        memchr::memchr(x, text)
    }
}

impl Char for u16 {
    fn memchr(x: Self, text: &[Self]) -> Option<usize> {
        for i in 0..text.len() {
            if text[i] == x {
                return Some(i);
            }
        }
        None
    }
}

struct CGenericString<A: Char = u8, B = i8> {
    inner: Vec<A>,
    phantom: PhantomData<B>,
}

impl<A: Char, B> CGenericString<A, B> {
    fn new() -> Self {
        CGenericString {
            inner: Vec::new(),
            phantom: PhantomData,
        }
    }

    fn as_ptr(&self) -> *const B {        
        unsafe { mem::transmute(self.inner.as_ptr()) }
    }
}

fn foo() {
    let raw: CGenericString = CGenericString::new();
    let raw8: CGenericString<u8, c_char> = CGenericString::new();
    let raw16: CGenericString<u16, c_short> = CGenericString::new();
}

Unfortunatelly I dont known, how to check mem::size_of::<A>() == mem::size_of::<B>() at compile time.

joshtriplett · 2016-10-25T19:53:08Z

I've seen wchar_t used as both 2-byte and 4-byte in various environments. CPython supports both. The UEFI environment uses the same 2-byte approach that Windows does.

CWChar makes sense as a type, if it means "whatever the C compiler does on this platform", though that might necessitate a compiler option to change that (analogous to GCC's -fshort-wchar option, which I've used to target UEFI from a Linux compiler).

codyps · 2016-10-25T20:04:02Z

@joshtriplett sounds like wchar_t width should be a target-spec cfg, if we do need it. Unclear that this is needed for this rfc's goal to resolve rust-lang/rust#36671, though (might just need a 16-bit specific version, with a 16-bit specific name).

jan-hudec · 2016-11-07T20:52:00Z

@bozaro,

Unfortunatelly I dont known, how to check mem::size_of::<A>() == mem::size_of::<B>() at compile time.

I would suggest making one of the types a type member of the Char trait. Probably the unsigned one, because it seems to me it makes more sense to

impl Char for c_char { type Unsigned = u8; ... }

than the other way around.

bozaro · 2016-11-17T15:12:32Z

I create generic CStr<A: CChar = u8> and CString<A: CChar = u8>: https://github.com/bozaro/rust-cwstring/tree/generic

It mostly compatible with old CStr/CString classes.
I currentry known only one exception:

assert!(CString::new(vec![0]).is_err());

should be replaced by:

assert!(CString::new(vec![0 as u8]).is_err());

Also I implement CChar only for u8 and u16 types (I think adding u32 should be trivial).

strega-nil · 2016-11-17T21:28:28Z

Adding c_wchar to libc/std::os::raw, in any case, seems good.

alexcrichton · 2016-11-18T16:29:30Z

I'd be pretty wary of changing std::ffi::CString to take a type parameter, primarily because of the inference issues mentioned above. We could perhaps explore that route, but I think we'd want to get a crater run to evaluate the impact, if any, first.

For the name it sounds like the term "wide" is somewhat ambiguous across platforms and also not always widely used. In that sense I'd prefer to avoid that term in the name of this type and instead focus on solving the 16-bit problem which is a very strong motivation for this (how to interact with Windows APIs). Perhaps C16String?

@bozaro also any update on including the API in the RFC itself? It's often difficult to discuss a "hypothetical RFC" as if the text were written, so it's always helpful to have everything spelled out!

strega-nil · 2016-11-18T17:38:22Z

@alexcrichton But, in those cases, they take wchar_t*. They don't take uint16_t*. Similar to how CString papers over char*, I'd argue the wide equivalent should paper over wchar_t*. wchar_t is not a 16-bit type; it's dependent on platform, and if an API uses wide strings, then we should be able to interact with it reasonably.

petrochenkov · 2016-11-18T17:46:17Z

Note that generic CString avoids these fruitless arguments about wchar_t vs u16 by allowing CString to paper over whatever character type is appropriate for a certain situation.

jan-hudec · 2016-11-18T23:07:30Z

@alexcrichton,

I'd be pretty wary of changing std::ffi::CString to take a type parameter, primarily because of the inference issues mentioned above.

What about going the C++ route: std::ffi::CBasicString (or something) would be the generic and std::ffi::CString would be type CString = CBasicString<std::os::raw::c_char>?

For the name it sounds like the term "wide" is somewhat ambiguous across platforms and also not always widely used.

I agree calling 16-bit strings “wide” is wrong. The C++ standard defines wchar_t as “large enough to represent any supported character code unit” and that means 32-bit. So I agree 16-bit should have 16 in the name, 32-bit should have 32 and ‘Wide’ might be alias to whichever corresponds to that platform's default wchar_t.

alexcrichton · 2016-11-19T19:03:47Z

@petrochenkov yes it's true that a generic CString would largely allow sidestepping these issues. It's not clear to me, though, that it's unambiguously the best solution even if it worked (which it doesn't seem to right now).

@jan-hudec perhaps, yeah. We'd likely have to add a new type to be fully backwards compatible, and we'd just want to make sure we've got solid ergonomics across the board.

petrochenkov · 2016-11-19T19:29:53Z

Without compatibility issues it is the unambiguously best solution, I haven't seen arguments against it in this thread so far, while fixed new types have at least three problems 1) proliferation of string types (there are too many already), 2) wchar_t vs u16 problem (either choice is suboptimal), 3) violation of DRY (types are fully identical except for the character type).

Addition of type parameters with defaults was explicitly listed as an acceptable minor change in the API evolution RFC. Crater run is the first thing that needs to be done here, if the results are good then there's basically nothing more to argue about.

petrochenkov · 2016-11-19T19:37:45Z

Oh, and I include the type CString = BasicCString<u8>; variant into "generic CString", it's good too (probably even better if crater finds regressions).

As a small bonus generic CString has nice correspondence with generic C++ basic_string/basic_string_view with which it's going to interact through FFI. I don't think it gives any practical benefits right now, but there's at least a familiarity aspect.

aturon · 2017-03-29T18:12:41Z

The libs team discussed this RFC in triage, and while we remain very sympathetic to the goals here, for an RFC to land it needs to have substantially more detailed text. I'm going to close this PR for the time being, but you should feel free to reopen with a revised RFC, which also addresses the commentary in this thread.

A type representing an owned C-compatible wide string

4f087ac

bozaro force-pushed the cwstring branch from 342f4a6 to 4f087ac Compare October 20, 2016 09:53

bozaro mentioned this pull request Oct 20, 2016

Add OsStringExt::from_wide_ptr for windows rust-lang/rust#36671

Closed

BurntSushi added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Oct 20, 2016

aturon closed this Mar 29, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A type representing an owned C-compatible wide string #1773

A type representing an owned C-compatible wide string #1773

bozaro commented Oct 20, 2016 •

edited

Loading

bozaro commented Oct 20, 2016

cristicbz commented Oct 20, 2016 •

edited

Loading

BurntSushi commented Oct 20, 2016 •

edited

Loading

retep998 commented Oct 20, 2016

bozaro commented Oct 20, 2016

alexcrichton commented Oct 20, 2016

bozaro commented Oct 20, 2016

retep998 commented Oct 21, 2016

codyps commented Oct 21, 2016 •

edited

Loading

retep998 commented Oct 21, 2016

est31 commented Oct 22, 2016

est31 commented Oct 22, 2016

petrochenkov commented Oct 22, 2016 •

edited

Loading

SimonSapin commented Oct 22, 2016

bozaro commented Oct 24, 2016

joshtriplett commented Oct 25, 2016

codyps commented Oct 25, 2016

jan-hudec commented Nov 7, 2016

bozaro commented Nov 17, 2016

strega-nil commented Nov 17, 2016

alexcrichton commented Nov 18, 2016

strega-nil commented Nov 18, 2016 •

edited

Loading

petrochenkov commented Nov 18, 2016

jan-hudec commented Nov 18, 2016

alexcrichton commented Nov 19, 2016

petrochenkov commented Nov 19, 2016

petrochenkov commented Nov 19, 2016 •

edited

Loading

aturon commented Mar 29, 2017

A type representing an owned C-compatible wide string #1773

A type representing an owned C-compatible wide string #1773

Conversation

bozaro commented Oct 20, 2016 • edited Loading

bozaro commented Oct 20, 2016

cristicbz commented Oct 20, 2016 • edited Loading

BurntSushi commented Oct 20, 2016 • edited Loading

retep998 commented Oct 20, 2016

bozaro commented Oct 20, 2016

alexcrichton commented Oct 20, 2016

bozaro commented Oct 20, 2016

retep998 commented Oct 21, 2016

codyps commented Oct 21, 2016 • edited Loading

retep998 commented Oct 21, 2016

est31 commented Oct 22, 2016

est31 commented Oct 22, 2016

petrochenkov commented Oct 22, 2016 • edited Loading

SimonSapin commented Oct 22, 2016

bozaro commented Oct 24, 2016

joshtriplett commented Oct 25, 2016

codyps commented Oct 25, 2016

jan-hudec commented Nov 7, 2016

bozaro commented Nov 17, 2016

strega-nil commented Nov 17, 2016

alexcrichton commented Nov 18, 2016

strega-nil commented Nov 18, 2016 • edited Loading

petrochenkov commented Nov 18, 2016

jan-hudec commented Nov 18, 2016

alexcrichton commented Nov 19, 2016

petrochenkov commented Nov 19, 2016

petrochenkov commented Nov 19, 2016 • edited Loading

aturon commented Mar 29, 2017

bozaro commented Oct 20, 2016 •

edited

Loading

cristicbz commented Oct 20, 2016 •

edited

Loading

BurntSushi commented Oct 20, 2016 •

edited

Loading

codyps commented Oct 21, 2016 •

edited

Loading

petrochenkov commented Oct 22, 2016 •

edited

Loading

strega-nil commented Nov 18, 2016 •

edited

Loading

petrochenkov commented Nov 19, 2016 •

edited

Loading