automatically NULL-terminate UTF16 data for passing to Windows APIs #7016
This is going to break ICU.
@nolta: Why? Neither Why would any API look beyond the end of a string in memory, except to look for a NULL terminator? (Note that we already internally 0-terminate all byte arrays, precisely for passing NULL-terminated UTF-8 strings, and I've never heard of this causing a problem.)
@nolta, I see, you meant the ICU.jl package, not the ICU library, because you are using (And of course many functions will ignore the NULL code point anyway even if it is present at the end of the string.)
Hmmm......it was a while ago that I added that usage, I'll have to look into it again to see if ODBC would be affected. Though this may actually be a case where "zillions of small UTF16Strings" are being played with (returning a query from a database), but I still don't think it would be much of a problem allocation wise.
@karbarcca, as the patch is currently written, you should now use
It seems I would want
@karbarcca, yes, if you have valid null-terminated
This seems like a good solution.
So the plan is to have
C interfaces for strings are haphazard... is there much real need for NULL-terminated But if we do want Note also that
@stevengj thanks for putting this together -- it'll be really helpful in improving the Windows support. Sure, UTF32Strings should probably get this too (ex. just a thought: you could rename
Renaming I'm not saying that UTF-32 strings are not used, just that I suspect they are rarely needed for passing to C API functions that expect a NULL terminator. (Such C API functions exist, but I'm doubtful that they are often useful in Julia. I do use a NULL-terminated UTF-32 string in one place in PyCall for Python 3.x on Unix, however.) It looks like
Note also that our old
In my case, at least, the
Okay, I'm working on modifying In doing so, I've become convinced that
Nice. I've already rebased my PR to use this, and it works great (fwiw, my usage of utf16/UTF16String is consistent with your updated proposal for null termination)
I also think we should have:
Also, it seems like it would be nice if
…g(c::Char...) in favor of utf32(c...); add conversion to UTF32String from binary data that looks at the BOM
Okay, updated to NULL-terminate I also deprecated the old
* `CharString` is renamed to `UTF32String` ([#4943]).
* New string type, `UTF16String` ([#4930]), constructed by `utf16(s)`
from another string, a `Uint32` array, or a byte array (possibly
Do you mean a `Uint16` array?
Whoops, will fix.
Why would anyone want to preserve a NULL final character in a string?
@nalimilan, NUL is a valid Unicode codepoint. I feel like we shouldn't be in the business of (effectively) deleting it from strings unless the user calls a lower-level function. We don't delete it from anywhere else in the string, so it would be odd to specifically delete it from the end only. This is really obvious for the
…ng and wstring aliases to make wchar_t* APIs usable
@stevengj Yeah, it's a valid character, but in practice it's not very useful, and strings are already complicated enough that playing with NUL characters inside them is rarely a good idea. But I agree that the fact that we're not removing them if they appear in the middle of the string is an argument for keeping them at the end too. What's more worrying to me is the subtle difference between
I've updated it to add the
@nalimilan, if you call
@pygy, yes, it looks like several of the packages use
Instead of converting first to Vector{Char} and then to UTF-8, convert directly to UTF-8. Also fixes future breakage by JuliaLang/julia#7016
@stevengj OK, fine, as long as
automatically NULL-terminate UTF16 data for passing to Windows APIs
As discussed in #7008, this changes the `UTF16String` type to automatically append a NULL word at the end of its data (upon `utf16(x)` conversions, or to require it in the constructor). This is not treated as part of the string for iteration etcetera, but allows the string to be passed directly to Windows (and similar) API functions expecting NULL-terminated UTF-16 data.

Rationale: the main reason for the `UTF16String` type is for calling Windows-like APIs, which typically require NULL-terminated data. A secondary usage is reading UTF-16 encoded files or other data, in which case the overhead of the NULL code unit is almost certainly negligible. Normal Julia programs will never pass around zillions of small UTF-16 strings.
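
The design described above can be sketched roughly as follows. This is only an illustration written against current Julia syntax, not the code from this PR: the type `NulTerminatedUTF16` and the helpers `to_utf16`, `codeunits16`, and `to_string` are hypothetical names invented for the example, and only `transcode` is an existing Base function. The idea shown is that the stored buffer always ends in a `0x0000` code unit, the constructor requires it, and everything that looks at the string's contents skips it.

```julia
# Hypothetical stand-in for the idea behind UTF16String in this PR.
# The buffer always carries a trailing 0x0000 code unit.
struct NulTerminatedUTF16
    data::Vector{UInt16}
    function NulTerminatedUTF16(data::Vector{UInt16})
        # The constructor requires the terminator rather than adding it.
        (!isempty(data) && data[end] == 0x0000) ||
            throw(ArgumentError("UTF-16 data must be NULL-terminated"))
        new(data)
    end
end

# Conversion from another string appends the terminator automatically.
# `transcode(UInt16, s)` yields the UTF-16 code units without a terminator.
to_utf16(s::AbstractString) =
    NulTerminatedUTF16(push!(transcode(UInt16, String(s)), 0x0000))

# The code units that count as string contents: everything but the terminator.
codeunits16(s::NulTerminatedUTF16) = s.data[1:end-1]

# Round-trip back to a regular String, ignoring the terminator.
to_string(s::NulTerminatedUTF16) = transcode(String, codeunits16(s))

s = to_utf16("hi")
println(s.data)            # raw buffer, including the trailing NULL word
println(codeunits16(s))    # contents only
println(to_string(s))
```

Because the terminator lives in `s.data` itself, a pointer to that buffer could be handed to a C function expecting NULL-terminated UTF-16 without making a copy, which is the point of the change. A NUL that is genuinely part of the contents is left alone; only the final storage word is treated as the terminator.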