A type representing an owned C-compatible wide string #1773
A preliminary implementation can be found at: https://github.com/bozaro/rust-cwstring |
Can you please write out the detailed design as part of the RFC? Linking to an existing implementation is fine, but I'd like to see the API in the RFC text. From looking at the implementation, it appears its internal representation is
Could you also please say more about what "wide" strings are in general? What problems do they solve? How do people use them? How are they represented in other languages/ecosystems? Perhaps some code examples showing usage of this new type would help, with comparisons to how you might achieve it today without this wide type. |
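As a rough illustration of the "today" side of that comparison, the sketch below builds a null-terminated buffer of 16-bit units using only the standard library. The helper name `to_wide_null` is hypothetical and is not part of the proposed API; it also does not reject interior zeros, which a dedicated type presumably would.

```rust
use std::iter::once;

/// Hypothetical helper: encode a Rust string as UTF-16 and append a
/// terminating zero, as C APIs taking 16-bit strings usually expect.
/// Note: an interior U+0000 in `s` is NOT detected here.
pub fn to_wide_null(s: &str) -> Vec<u16> {
    s.encode_utf16().chain(once(0)).collect()
}
```

The pointer from `to_wide_null("hi").as_ptr()` is only valid while the `Vec` is alive, so the buffer has to be bound to a variable that outlives the foreign call.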
@BurntSushi The actual implementation is on the |
I'm sorry. I forgot to merge the u16 branch into master :(
|
I think it's definitely worth considering the relationship to I agree with @BurntSushi that we'll likely want to include some APIs here in the RFC itself, especially methods like Finally, I'm wary of using |
As far as I know, I have seen encoding chaos only on Windows (for example, in my case cp1251 is used as ANSI, cp866 as OEM, and UTF-16 is used in my Windows at the same time). On other platforms UTF-8 is used for multibyte strings. I'm sure Rust should not implement 2-byte and 4-byte strings under the same name. I really can't come up with a good name for About |
I propose |
As we're considering this, it might make sense to resolve the case of Edit: to be clear, if this only handles utf-16-like strings, |
@jmesmon Just remember that this type isn't for strict UTF-16. It will allow arbitrary |
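To make the "not strict UTF-16" point concrete, here is a small sketch of how well-formedness can be checked with the standard library. The function name `is_well_formed_utf16` is made up for illustration; the only real APIs used are `std::char::decode_utf16` and `String::from_utf16`.

```rust
/// Hypothetical helper: true if `units` contains no unpaired surrogates,
/// i.e. it is well-formed UTF-16.
pub fn is_well_formed_utf16(units: &[u16]) -> bool {
    std::char::decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}

/// Strict conversion: fails on an unpaired surrogate, which is exactly the
/// kind of data a 16-bit C string type would still have to carry.
pub fn strict_decode(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

A sequence such as `[0x0041, 0xD800]` is a perfectly valid slice of `u16` values, and an API on Windows may return it, but it cannot be converted to a `String` without loss.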
I do know some C++ libraries which use wchar_t, but I have to admit that I don't know enough C++ libraries to determine whether wchar_t is popular or not.
But only Windows calls it wchar ("WCHAR" to be precise).
As for Xalan-C++, it's not a major library like Java or Qt, and it doesn't call its char type wchar either, but Calling the proposed 2-byte c-string type a wstring will just cause confusion among all users not targeting Windows-only platforms, and among former C++ developers who think it would be something similar to 👍 to the idea of such a string type but 👎 to the name. |
What about |
Can't
This solves the ambiguous naming problem and prevents further proliferation of string types as well. |
In line with WTF-8 naming, I’ve sometimes used WTF-16 to call the character encoding used on Windows since it’s not quite UTF-16 (at least not "well formed") and not quite UCS-2 (which only supports U+0000 to U+FFFF). But it’s probably ok to ignore Unicode hair-splitting and go with a name that describes "null-terminated |
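This sounds great. But unfortunately So, as I understand, we would get code like:
// Assumes the external `memchr` crate for the u8 implementation.
use std::marker::PhantomData;
use std::mem;
use std::os::raw::{c_char, c_short};
trait Char: Sized {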
fn memchr(x: Self, text: &[Self]) -> Option<usize>;
}
impl Char for u8 {
fn memchr(x: Self, text: &[Self]) -> Option<usize> {
memchr::memchr(x, text)
}
}
impl Char for u16 {
fn memchr(x: Self, text: &[Self]) -> Option<usize> {
// Plain linear search; returns the index of the first match, if any.
text.iter().position(|&c| c == x)
}
}
struct CGenericString<A: Char = u8, B = i8> {
inner: Vec<A>,
phantom: PhantomData<B>,
}
impl<A: Char, B> CGenericString<A, B> {
fn new() -> Self {
CGenericString {
inner: Vec::new(),
phantom: PhantomData,
}
}
fn as_ptr(&self) -> *const B {
unsafe { mem::transmute(self.inner.as_ptr()) }
}
}
fn foo() {
let raw: CGenericString = CGenericString::new();
let raw8: CGenericString<u8, c_char> = CGenericString::new();
let raw16: CGenericString<u16, c_short> = CGenericString::new();
}
Unfortunately, I don't know how to check |
I've seen CWChar makes sense as a type, if it means "whatever the C compiler does on this platform", though that might necessitate a compiler option to change that (analogous to GCC's |
@joshtriplett sounds like wchar_t width should be a target-spec |
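One way to picture "whatever the C compiler does on this platform" is a type alias selected per target. The sketch below is only an approximation under a stated assumption: 16 bits on Windows and 32 bits everywhere else. Real targets can differ from this, and compiler options can change the width, which is why a target-spec setting was suggested. The names `WChar` and `wchar_bits` are hypothetical.

```rust
// Assumption: wchar_t is 16 bits wide on Windows and 32 bits wide on other
// targets. This is a simplification; only the width is modelled, not the
// signedness, and some targets do not follow this rule.
#[cfg(windows)]
pub type WChar = u16;
#[cfg(not(windows))]
pub type WChar = u32;

/// Width in bits of the hypothetical platform wide character.
pub fn wchar_bits() -> usize {
    std::mem::size_of::<WChar>() * 8
}
```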
I would suggest making one of the types a type member of the
than the other way around. |
I created a generic It is mostly compatible with the old CStr/CString types.
should be replaced by:
Also, I implemented CChar only for the u8 and u16 types (I think adding u32 should be trivial). |
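For readers following along, a generic null-terminated string over a code-unit trait might look roughly like the sketch below. This is not the code from the linked repository: the names `Unit` and `NulString` are invented here, and the real `CChar` trait may have a different shape.

```rust
/// Hypothetical code-unit trait (stand-in for the `CChar` idea above).
pub trait Unit: Copy + PartialEq {
    /// The terminator value for this unit type.
    const NUL: Self;
}
impl Unit for u8 {
    const NUL: Self = 0;
}
impl Unit for u16 {
    const NUL: Self = 0;
}

/// Hypothetical owned buffer of code units that always ends with one NUL.
pub struct NulString<T: Unit> {
    inner: Vec<T>,
}

impl<T: Unit> NulString<T> {
    /// Takes ownership of `units` and appends the terminator.
    /// Fails with the index of the first interior NUL, if there is one.
    pub fn new(mut units: Vec<T>) -> Result<Self, usize> {
        if let Some(pos) = units.iter().position(|&u| u == T::NUL) {
            return Err(pos);
        }
        units.push(T::NUL);
        Ok(NulString { inner: units })
    }

    /// Pointer suitable for passing to C; valid while `self` is alive.
    pub fn as_ptr(&self) -> *const T {
        self.inner.as_ptr()
    }

    /// All units including the trailing NUL.
    pub fn as_units_with_nul(&self) -> &[T] {
        &self.inner
    }
}
```

Adding a `u32` implementation in this sketch would be one more three-line `impl Unit` block, which matches the "should be trivial" remark.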
Adding |
I'd be pretty wary of changing For the name it sounds like the term "wide" is somewhat ambiguous across platforms and also not always widely used. In that sense I'd prefer to avoid that term in the name of this type and instead focus on solving the 16-bit problem which is a very strong motivation for this (how to interact with Windows APIs). Perhaps @bozaro also any update on including the API in the RFC itself? It's often difficult to discuss a "hypothetical RFC" as if the text were written, so it's always helpful to have everything spelled out! |
@alexcrichton But, in those cases, they take |
Note that generic |
What about going the C++ route:
I agree calling 16-bit strings “wide” is wrong. The C++ standard defines |
@petrochenkov yes it's true that a generic CString would largely allow sidestepping these issues. It's not clear to me, though, that it's unambiguously the best solution even if it worked (which it doesn't seem to right now). @jan-hudec perhaps, yeah. We'd likely have to add a new type to be fully backwards compatible, and we'd just want to make sure we've got solid ergonomics across the board. |
Without compatibility issues it is the unambiguously best solution, I haven't seen arguments against it in this thread so far, while fixed new types have at least three problems 1) proliferation of string types (there are too many already), 2) Addition of type parameters with defaults was explicitly listed as an acceptable minor change in the API evolution RFC. Crater run is the first thing that needs to be done here, if the results are good then there's basically nothing more to argue about. |
Oh, and I include the As a small bonus generic CString has nice correspondence with generic C++ basic_string/basic_string_view with which it's going to interact through FFI. I don't think it gives any practical benefits right now, but there's at least a familiarity aspect. |
The libs team discussed this RFC in triage, and while we remain very sympathetic to the goals here, for an RFC to land it needs to have substantially more detailed text. I'm going to close this PR for the time being, but you should feel free to reopen with a revised RFC, which also addresses the commentary in this thread. |
This RFC was born from issue rust-lang/rust#36671.
Add CWideString/CWideStr for simpler interaction with external APIs that use
not well-formed UTF-16 (for example, the Windows API).
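The receiving direction of such an interaction can be sketched with the standard library alone. The helper below is hypothetical (it is not the proposed CWideStr API): it reads 16-bit units up to the first zero and converts them lossily, since the data is not guaranteed to be well-formed UTF-16.

```rust
/// Hypothetical helper: decode the units before the first NUL (or the whole
/// slice if there is none), replacing unpaired surrogates with U+FFFD.
pub fn from_wide_lossy(buf: &[u16]) -> String {
    let len = buf.iter().position(|&u| u == 0).unwrap_or(buf.len());
    String::from_utf16_lossy(&buf[..len])
}
```

The lossy step is the part a dedicated type would avoid: keeping the raw units lets the original value be passed back to the external API unchanged.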