Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

automatically NULL-terminate UTF16 data for passing to Windows APIs #7016

Merged
merged 8 commits into from
Jun 4, 2014

Conversation

stevengj
Copy link
Member

As discussed in #7008, this changes the UTF16String type to automatically append a NULL word at the end of its data (upon utf16(x) conversions, or to require it in the constructor). This is not treated as part of the string for iteration etcetera, but allows the string to be passed directly to Windows (and similar) API functions expecting NULL-terminated UTF-16 data.

Rationale: the main reason for the UTF16String type is for calling Windows-like APIs, which typically require NULL-terminated data. A secondary usage is reading UTF-16 encoded files or other data, in which case the overhead of the NULL code unit is almost certainly negligible. Normal Julia progams will never pass around zillions of small UTF-16 strings.

@nolta
Copy link
Member

nolta commented May 28, 2014

This is going to break ICU.

@stevengj
Copy link
Member Author

@nolta: Why? Neither sizeof(s) nor length(s) include the NULL terminator, so APIs that aren't looking for the terminator won't see it.

Why would any API look beyond the end of a string in memory, except to look for a NULL terminator? (Note that we already internally 0-terminate all byte arrays, precisely for passing NULL-terminated UTF-8 strings, and I've never heard of this causing a problem.)

@stevengj
Copy link
Member Author

@nolta, I see, you meant the ICU.jl package, not the ICU library, because you are using length(s.data) to get the number of code units in the string. Well, it's an easy fix: either use length(s.data)-1 or (better) sizeof(s) >> 1. Backwards compatibility should not be a big issue here since UTF16String didn't exist in Julia 0.2, but the sizeof(s) >> 1 version should be backwards compatible with other Julia 0.3 prereleases.

(And of course many functions will ignore the NULL code point anyway even if it is present at the end of the string.)

@stevengj
Copy link
Member Author

Currently, the only packages using UTF16String appear to be ICU (@nolta), ODBC (@karbarcca), PyCall, Thrift (@tanmaykm), and UTF16 (@nolta).

@quinnj
Copy link
Member

quinnj commented May 28, 2014

Hmmm......it was a while ago that I added that usage, I'll have to look into it again to see if ODBC would be affected. Though this may actually be a case where "zillions of small UTF16Strings" are being played with (returning a query from a database), but I still don't think it would be much of a problem allocation wise.

@stevengj
Copy link
Member Author

@karbarcca, as the patch is currently written, you should now use utf16(data) rather than UTF16String(data) to construct the string from an Array{Uint16}, since the latter requires null-terminated data.

utf16(data) will work with earlier Julia 0.3 prereleases, too. (Unlike UTF16String(data), the utf16 function also validates the UTF-16 data.) Note also that UTF16String(data) is currently not documented.

@quinnj
Copy link
Member

quinnj commented May 28, 2014

It seems I would want UTF16String then since I'm actually converting C-allocated, null-terminated bytestrings to Julia, right?

@stevengj
Copy link
Member Author

@karbarcca, yes, if you have valid null-terminated Uint16 arrays (which what I assume you mean by "bytestrings") then you can call UTF16String directly to avoid making a copy. However, utf16 will also avoid making a copy if you have an Array{Uint16} that is already null-terminated, but it will still perform validation.)

@JeffBezanson
Copy link
Member

This seems like a good solution.

@nolta
Copy link
Member

nolta commented May 28, 2014

So the plan is to have UTF8Strings with implicit terminators, UTF16Strings with explicit terminators, and UTF32Strings with no terminators? Seems kind of haphazard.

@stevengj
Copy link
Member Author

C interfaces for strings are haphazard... is there much real need for NULL-terminated UTF32Strings? I thought most of the world was converging on either UTF16 (Windows) or UTF8 (everyone else) APIs.

But if we do want UTF32String to be NULL terminated, now would be a good time to make the change.

Note also that UTF16String is not really "explicitly" terminated, in that the NULL code unit is supposed to be mostly hidden in an internal data structure; you don't see it with iterators, sizeof, etcetera, and only see it if you look at s.data.

@vtjnash vtjnash added this to the 0.3 milestone May 29, 2014
@vtjnash
Copy link
Member

vtjnash commented May 29, 2014

@stevengj thanks for putting this together -- it'll be really helpful in improve the windows support

sure, UTF32Strings should probably get this too (ex. wcslen). the lack of usage would argue more for moving it out of base than for leaving it inconsistent

just a thought: you could rename .data to .data0, to force deprecation in ICU.jl so that users discover they need to do a Pkg.update()

@stevengj
Copy link
Member Author

Renaming .data to .data0 seems prudent, though I'm not sure it's worth it to force a deprecation for people using prereleases and who probably update frequently anyway. It does have the disadvantage that UTF16String will need a separate pointer function (see string.jl), whereas currently all of our UTF-xx pointer functions share the same code. This is only 2 lines of duplication, but it is pleasant that all of our UTF-xx string types have the same name for their raw-data array.

I'm not saying that UTF-32 strings are not used, just that I suspect they are rarely needed for passing to C API functions that expect a NULL terminator. (Such C API functions exist, but I'm doubtful that they are often useful in Julia. I do use a NULL-terminated UTF-32 string in one place in PyCall for Python 3.x on Unix, however.) It looks like UTF32String (or its predecessor, CharString) is used in seven packages: GZip (@kmsquire), MAT (@simonster), MUMPS (@lruthotto), Soundex & TextAnalysis (@johnmyleswhite), TOML (@pygy), and DataFrames.

@stevengj
Copy link
Member Author

Note also that our old CharString type (in Julia 0.2) called its data member .chars, so we are already breaking backward compatibility (for code accessing this private member) in 0.3. So, it shouldn't be too traumatic to NULL-terminate UTF32String if that's what we want to do.

@pygy
Copy link
Contributor

pygy commented May 29, 2014

In my case, at least, the CharString call is superfluous. I'll remove it.

@stevengj
Copy link
Member Author

Okay, I'm working on modifying UTF32String to be NULL-terminated too.

In doing, so, I've become convinced that utf16(s) should always append a NULL terminator; if one is already present in s, that should be treated as part of the string. If you have NULL-terminated UTF-16 data in which you don't want the NULL treated as part of the string, you should use the lower-level constructor UTF16String(s). I've updated the code and documentation accordingly.

@vtjnash
Copy link
Member

vtjnash commented May 29, 2014

Nice. I've already rebased my PR to use this, and it works great (fwiw, my usage of utf16/UTF16String is consistent with your updated proposal for null termination)

@stevengj
Copy link
Member Author

I also think we should have:

  • WString and wstring: alias for UTFXXString and utfXX, where XX is the same size as wchar_t.
  • utfXX(ptr [, length]) function analogous to bytestring(ptr [, length]), which infers the length from NULL termination if it is not supplied.

@stevengj
Copy link
Member Author

Also, it seems like it would be nice if utf32(x::Vector{Uint8}) existed and were like utf16(x::Vector{Uint8}) in checking for a BOM to detect a non-native byte ordering.

stevengj added 2 commits May 29, 2014 13:39
…g(c::Char...) in favor of utf32(c...); add conversion to UTF32String from binary data that looks at the BOM
@stevengj
Copy link
Member Author

Okay, updated to NULL-terminate UTF32String.

I also deprecated the old UTF32String(c::Char...) constructor in favor of utf32(c::Integer...). This way, it is a bit more consistent: you only use UTF32String(data) or UTF16String(data) if you have valid NULL-terminated data and want to make a string without making a copy and without appending another NULL. Otherwise, you use utf16 or utf32.


* `CharString` is renamed to `UTF32String` ([#4943]).
* New string type, `UTF16String` ([#4930]), constructed by `utf16(s)`
from another string, a `Uint32` array, or a byte array (possibly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean a Uint16 array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, will fix.

@nalimilan
Copy link
Member

In doing, so, I've become convinced that utf16(s) should always append a NULL terminator; if one is already present in s, that should be treated as part of the string. If you have NULL-terminated UTF-16 data in which you don't want the NULL treated as part of the string, you should use the lower-level constructor UTF16String(s).

Why would anyone want to preserve a NULL final character in a string?

@stevengj
Copy link
Member Author

@nalimilan, NUL is a valid Unicode codepoint. I feel like we shouldn't be in the business of (effectively) deleting it from strings unless the user calls a lower-level function. We don't delete it from anywhere else in the string, so it would be odd to specifically delete it from the end only.

This is really obvious for the utf32(x::Vector{Char}) constructor: you are asking a string consisting of the codepoints in x, regardless of whether those codepoints are NUL or not.

…ng and wstring aliases to make wchar_t* APIs usable
@nalimilan
Copy link
Member

@stevengj Yeah, it's a valid character, but in practice it's not very useful, and strings are already complicated enough that playing with NUL characters inside them is rarely a good idea. But I agree that the fact that we're not removing them if they appear in the middle of the string is an argument for keeping them at the end too.

What's more worrying to me is the subtle difference between utf16 and UTF16String. A separate argument to decide whether to add a NUL character or not would be more explicit.

@stevengj
Copy link
Member Author

I've updated it to add the bytestring-like pointer conversions mentioned above, as well as the WString and wstring aliases. Without such aliases, it's pretty hard to use cross-platform APIs based on wchar_t*.

@stevengj
Copy link
Member Author

@nalimilan, if you call UTF16String by accident, you'll almost certainly get an error since most strings don't have NUL at the end. If you call utf16 by accident on NUL-terminated data, you'll probably notice quickly that your strings have \0 in them. I don't see this causing subtle hard-to-find errors.

@stevengj
Copy link
Member Author

@pygy, yes, it looks like several of the packages use CharString only for utf8(CharString(buf::Vector{Char})), which can be replaced even in Julia 0.2 with utf8(buf).

simonster added a commit to JuliaIO/MAT.jl that referenced this pull request May 29, 2014
Instead of converting first to Vector{Char} and then to UTF-8, convert
directly to UTF-8. Also fixes future breakage by JuliaLang/julia#7016
@nalimilan
Copy link
Member

@stevengj OK, fine, as long as UTF16String() remains a really low-level constructor.

JeffBezanson added a commit that referenced this pull request Jun 4, 2014
automatically NULL-terminate UTF16 data for passing to Windows APIs
@JeffBezanson JeffBezanson merged commit ff6bde7 into JuliaLang:master Jun 4, 2014
@stevengj stevengj deleted the utf16_null branch June 4, 2014 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
system:windows Affects only Windows unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants