Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A type representing an owned C-compatible wide string #1773

Closed
wants to merge 1 commit into from

Conversation

bozaro
Copy link

@bozaro bozaro commented Oct 20, 2016

This RFC born from issue: rust-lang/rust#36671

Add CWideString/CWideStr for more simple interaction with not well-formed UTF-16
external API (for example: with Windows API).

@bozaro
Copy link
Author

bozaro commented Oct 20, 2016

A preliminary implementation can be found at: https://github.com/bozaro/rust-cwstring

@cristicbz
Copy link

cristicbz commented Oct 20, 2016

Rendered

@BurntSushi BurntSushi added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Oct 20, 2016
@BurntSushi
Copy link
Member

BurntSushi commented Oct 20, 2016

Can you please write out the detailed design as part of the RFC? Linking to an existing implementation is fine, but I'd like to see the API in the RFC text.

From looking at the implementation, it appears its internal representation is u8. I think that means @retep998's concern is pertinent:

Would this CWideString type actually wrap [u16], so that you could pass them directly to winapi functions without having to convert to and from WTF-8? It would likely have to be Windows specific though, it doesn't make sense on any other platforms, aside from UEFI.

Could you also please say more about what "wide" strings are in general? What problems do they solve? How do people use them? How are they represented in other languages/ecosystems? Perhaps some code examples showing usage of this new type would help, with comparisons to how you might achieve it today without this wide type.

@retep998
Copy link
Member

@BurntSushi The actual implementation is on the u16 branch. The master branch has an old version where the CStr impl was copied over but not adjusted to use u16 yet.

@bozaro
Copy link
Author

bozaro commented Oct 20, 2016

I'm sorry. I forgotten to merge u16 branch to master :(
Current CWideString is looks like working, but also need:

  • Get better name to get_bytes methods;
  • Fix method descriptions.

@alexcrichton
Copy link
Member

I think it's definitely worth considering the relationship to wchar_t on C platforms. A name like CWideString would certainly lead many to believe that it's equivalent to that! Are "wide strings" used much at all on Unix? If so, are they wchar_t-based or are they all explicitly 16-byte wide elements?

I agree with @BurntSushi that we'll likely want to include some APIs here in the RFC itself, especially methods like into_bytes which will need to be renamed as well for the wider element type.

Finally, I'm wary of using c_ushort and would be open to considering wchar_t to mirror FFI usage. That depends on whether 16-bit strings are all that common on Unix, though.

@bozaro
Copy link
Author

bozaro commented Oct 20, 2016

As I know, sizeof(wchar_t) is 2-bytes on Windows and 4 bytes on Linux. But I never seen real use of 4-byte wchar_t on Linux (but C++ is not my primary language). Windows, Java, Qt, Xalan-C++ uses 2-bytes strings.

I seen encoding chaos only on Windows (for example, in my case cp1251 used as ANSI, cp866 as OEM and utf-16 charset used in my Windows at the same time). In other platforms utf-8 is used for multibyte strings.

I'm sure, Rust should not implement 2-bytes and 4-bytes strings with same name.
This C++ pain, which should remain in C++. I doubt the usefulness of 4-byte strings.

I really can't generate good name for into_bytes method (sorry, but English is not my native language).

About c_ushort: I really can't see any c_ushort usage profit and prefer to replace c_ushort by u16 in this code.

@retep998
Copy link
Member

I propose bytes -> wchars.

@codyps
Copy link

codyps commented Oct 21, 2016

As we're considering this, it might make sense to resolve the case of char32_t and char16_t strings for c++11/c11 compatibility as a whole. Or just explicitly restrict this to utf-16 (ie: char16_t, if that's what is needed to resolve the bug).

Edit: to be clear, if this only handles utf-16-like strings, 16 should be somewhere in the name. W is fairly unspecified (as we can see from the unix vs windows discussions here).

@retep998
Copy link
Member

@jmesmon Just remember that this type isn't for strict UTF-16. It will allow arbitrary u16 including lone surrogates, similar to how CString allows arbitrary u8 regardless of whether it is valid UTF-8.

@est31
Copy link
Member

est31 commented Oct 22, 2016

But I never seen real use of 4-byte wchar_t on Linux

I do know some C++ libraries which use wchar_t, but I have to admit that don't know enough C++ libraries to determine whether wchar_t is popular or not.

Windows, Java, Qt, Xalan-C++ uses 2-bytes strings.

But only windows calls it wchar ("WCHAR" to be precise).

and about Xalan-C++, its not a major library like Java or Qt, and it doesn't call its char type wchar either, but XalanDOMChar. And optionally, the XalanDOMChar can be set to wchar_t at build time.

Calling the proposed 2-byte c-string type a wstring will just cause confusion among all users not targeting windows only platforms, and among former C++ developers who think it would be something similar to wchar_t.

👍 to the idea of such a string type but 👎 to the name.

@est31
Copy link
Member

est31 commented Oct 22, 2016

What about Cu16String/CU16String instead? Its not mentioning UTF-16 itself, but the u16 type on which it bases, and it also has the "C" in the name which carries the meaning that it could contain invalid UTF-16 like lone surrogates.

@petrochenkov
Copy link
Contributor

petrochenkov commented Oct 22, 2016

Can't CString get a type parameter instead of introducing a new string type?

struct CString<CharT = u8> { ... }

This solves the ambiguous naming problem and prevents further proliferation of string types as well.
(Sorry, I didn't read the previous discussion.)

@SimonSapin
Copy link
Contributor

In line with WTF-8 naming, I’ve sometimes used WTF-16 to call the character encoding used on Windows since it’s not quite UTF-16 (at least not "well formed") and not quite UCS-2 (which only supports U+0000 to U+FFFF).

But it’s probably ok to ignore Unicode hair-splitting and go with a name that describes "null-terminated u16s" instead (memory representation rather that meaning relative to Unicode). I think C describes zero-terminated well in the context of Rust (CString vs String). But as other have said W for "wide" is more ambiguous, the name should include 16 (or maybe 2) instead, for the number of bits or bytes per code units.

@bozaro
Copy link
Author

bozaro commented Oct 24, 2016

struct CString<CharT = u8> { ... }

This sounds great. But unfortunately CString we need two parameters: for as_bytes() method (u8) and for as_ptr() method (c_char == i8). Also I'm not sure about backward compatibility.

So, as I understand, we got code like:

trait Char: Sized {
    fn memchr(x: Self, text: &[Self]) -> Option<usize>;
}

impl Char for u8 {
    fn memchr(x: Self, text: &[Self]) -> Option<usize> {
        memchr::memchr(x, text)
    }
}

impl Char for u16 {
    fn memchr(x: Self, text: &[Self]) -> Option<usize> {
        for i in 0..text.len() {
            if text[i] == x {
                return Some(i);
            }
        }
        None
    }
}

struct CGenericString<A: Char = u8, B = i8> {
    inner: Vec<A>,
    phantom: PhantomData<B>,
}

impl<A: Char, B> CGenericString<A, B> {
    fn new() -> Self {
        CGenericString {
            inner: Vec::new(),
            phantom: PhantomData,
        }
    }

    fn as_ptr(&self) -> *const B {        
        unsafe { mem::transmute(self.inner.as_ptr()) }
    }
}

fn foo() {
    let raw: CGenericString = CGenericString::new();
    let raw8: CGenericString<u8, c_char> = CGenericString::new();
    let raw16: CGenericString<u16, c_short> = CGenericString::new();
}

Unfortunatelly I dont known, how to check mem::size_of::<A>() == mem::size_of::<B>() at compile time.

@joshtriplett
Copy link
Member

I've seen wchar_t used as both 2-byte and 4-byte in various environments. CPython supports both. The UEFI environment uses the same 2-byte approach that Windows does.

CWChar makes sense as a type, if it means "whatever the C compiler does on this platform", though that might necessitate a compiler option to change that (analogous to GCC's -fshort-wchar option, which I've used to target UEFI from a Linux compiler).

@codyps
Copy link

codyps commented Oct 25, 2016

@joshtriplett sounds like wchar_t width should be a target-spec cfg, if we do need it. Unclear that this is needed for this rfc's goal to resolve rust-lang/rust#36671, though (might just need a 16-bit specific version, with a 16-bit specific name).

@jan-hudec
Copy link

@bozaro,

Unfortunatelly I dont known, how to check mem::size_of::<A>() == mem::size_of::<B>() at compile time.

I would suggest making one of the types a type member of the Char trait. Probably the unsigned one, because it seems to me it makes more sense to

impl Char for c_char { type Unsigned = u8; ... }

than the other way around.

@bozaro
Copy link
Author

bozaro commented Nov 17, 2016

I create generic CStr<A: CChar = u8> and CString<A: CChar = u8>: https://github.com/bozaro/rust-cwstring/tree/generic

It mostly compatible with old CStr/CString classes.
I currentry known only one exception:

assert!(CString::new(vec![0]).is_err());

should be replaced by:

assert!(CString::new(vec![0 as u8]).is_err());

Also I implement CChar only for u8 and u16 types (I think adding u32 should be trivial).

@strega-nil
Copy link

Adding c_wchar to libc/std::os::raw, in any case, seems good.

@alexcrichton
Copy link
Member

I'd be pretty wary of changing std::ffi::CString to take a type parameter, primarily because of the inference issues mentioned above. We could perhaps explore that route, but I think we'd want to get a crater run to evaluate the impact, if any, first.

For the name it sounds like the term "wide" is somewhat ambiguous across platforms and also not always widely used. In that sense I'd prefer to avoid that term in the name of this type and instead focus on solving the 16-bit problem which is a very strong motivation for this (how to interact with Windows APIs). Perhaps C16String?

@bozaro also any update on including the API in the RFC itself? It's often difficult to discuss a "hypothetical RFC" as if the text were written, so it's always helpful to have everything spelled out!

@strega-nil
Copy link

strega-nil commented Nov 18, 2016

@alexcrichton But, in those cases, they take wchar_t*. They don't take uint16_t*. Similar to how CString papers over char*, I'd argue the wide equivalent should paper over wchar_t*. wchar_t is not a 16-bit type; it's dependent on platform, and if an API uses wide strings, then we should be able to interact with it reasonably.

@petrochenkov
Copy link
Contributor

Note that generic CString avoids these fruitless arguments about wchar_t vs u16 by allowing CString to paper over whatever character type is appropriate for a certain situation.

@jan-hudec
Copy link

@alexcrichton,

I'd be pretty wary of changing std::ffi::CString to take a type parameter, primarily because of the inference issues mentioned above.

What about going the C++ route: std::ffi::CBasicString (or something) would be the generic and std::ffi::CString would be type CString = CBasicString<std::os::raw::c_char>?

For the name it sounds like the term "wide" is somewhat ambiguous across platforms and also not always widely used.

I agree calling 16-bit strings “wide” is wrong. The C++ standard defines wchar_t as “large enough to represent any supported character code unit” and that means 32-bit. So I agree 16-bit should have 16 in the name, 32-bit should have 32 and ‘Wide’ might be alias to whichever corresponds to that platform's default wchar_t.

@alexcrichton
Copy link
Member

@petrochenkov yes it's true that a generic CString would largely allow sidestepping these issues. It's not clear to me, though, that it's unambiguously the best solution even if it worked (which it doesn't seem to right now).

@jan-hudec perhaps, yeah. We'd likely have to add a new type to be fully backwards compatible, and we'd just want to make sure we've got solid ergonomics across the board.

@petrochenkov
Copy link
Contributor

Without compatibility issues it is the unambiguously best solution, I haven't seen arguments against it in this thread so far, while fixed new types have at least three problems 1) proliferation of string types (there are too many already), 2) wchar_t vs u16 problem (either choice is suboptimal), 3) violation of DRY (types are fully identical except for the character type).

Addition of type parameters with defaults was explicitly listed as an acceptable minor change in the API evolution RFC. Crater run is the first thing that needs to be done here, if the results are good then there's basically nothing more to argue about.

@petrochenkov
Copy link
Contributor

petrochenkov commented Nov 19, 2016

Oh, and I include the type CString = BasicCString<u8>; variant into "generic CString", it's good too (probably even better if crater finds regressions).

As a small bonus generic CString has nice correspondence with generic C++ basic_string/basic_string_view with which it's going to interact through FFI. I don't think it gives any practical benefits right now, but there's at least a familiarity aspect.

@aturon
Copy link
Member

aturon commented Mar 29, 2017

The libs team discussed this RFC in triage, and while we remain very sympathetic to the goals here, for an RFC to land it needs to have substantially more detailed text. I'm going to close this PR for the time being, but you should feel free to reopen with a revised RFC, which also addresses the commentary in this thread.

@aturon aturon closed this Mar 29, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.