automatically NULL-terminate UTF16 data for passing to Windows APIs #7016

stevengj · 2014-05-28T21:04:22Z

As discussed in #7008, this changes the UTF16String type so that its data always ends with a NULL word: utf16(x) conversions append one automatically, and the constructor requires one to be present. The terminator is not treated as part of the string for iteration and so on, but it allows the string to be passed directly to Windows (and similar) API functions expecting NULL-terminated UTF-16 data.

Rationale: the main reason for the UTF16String type is for calling Windows-like APIs, which typically require NULL-terminated data. A secondary usage is reading UTF-16 encoded files or other data, in which case the overhead of the NULL code unit is almost certainly negligible. Normal Julia programs will never pass around zillions of small UTF-16 strings.
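
To make the proposed layout concrete, here is a minimal Python model of the idea. It is a sketch, not the Julia implementation: the class name `Utf16Model` and its methods are hypothetical and exist only for illustration. The code units are stored with one extra trailing zero word, and the size and iteration helpers skip it.

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of ints."""
    raw = text.encode("utf-16-le")  # explicit byte order, so no BOM is added
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


class Utf16Model:
    """Toy model: UTF-16 code units stored with a hidden trailing NUL word."""

    def __init__(self, text):
        # The raw array always carries one extra zero word at the end.
        self.data = utf16_units(text) + [0]

    def sizeof(self):
        # Size in bytes, excluding the hidden terminator.
        return 2 * (len(self.data) - 1)

    def code_units(self):
        # What iteration sees: everything except the terminator.
        return self.data[:-1]

    def __str__(self):
        raw = b"".join(u.to_bytes(2, "little") for u in self.code_units())
        return raw.decode("utf-16-le")


s = Utf16Model("hello")
print(s.sizeof())     # 10
print(len(s.data))    # 6, because the raw array includes the terminator
print(str(s))         # hello
```

An API that scans for a terminator can be handed `data` as-is, while anything that works from `sizeof()` never sees the extra word.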

nolta · 2014-05-28T21:29:50Z

This is going to break ICU.

stevengj · 2014-05-28T21:40:21Z

@nolta: Why? Neither sizeof(s) nor length(s) includes the NULL terminator, so APIs that aren't looking for the terminator won't see it.

Why would any API look beyond the end of a string in memory, except to look for a NULL terminator? (Note that we already internally 0-terminate all byte arrays, precisely for passing NULL-terminated UTF-8 strings, and I've never heard of this causing a problem.)

stevengj · 2014-05-28T21:51:00Z

@nolta, I see, you meant the ICU.jl package, not the ICU library, because you are using length(s.data) to get the number of code units in the string. Well, it's an easy fix: either use length(s.data)-1 or (better) sizeof(s) >> 1. Backwards compatibility should not be a big issue here since UTF16String didn't exist in Julia 0.2, but the sizeof(s) >> 1 version should be backwards compatible with other Julia 0.3 prereleases.

(And of course many functions will ignore the NULL code point anyway even if it is present at the end of the string.)
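
The difference between the two suggested fixes can be sketched in Python. This is an illustrative model under stated assumptions: the helper names are hypothetical, and the "old" layout stands for a prerelease where the raw array had no terminator. Counting from the byte size gives the right answer under both layouts, whereas subtracting one from the raw array length is only correct once the terminator exists.

```python
def units_via_data(data):
    """Count code units from the raw array, assuming it ends in a NUL word."""
    return len(data) - 1


def units_via_sizeof(sizeof_bytes):
    """Count code units from the byte size, which never includes a terminator."""
    return sizeof_bytes >> 1


# New layout: "hi" plus the hidden terminator.
new_data = [0x0068, 0x0069, 0x0000]
new_sizeof = 2 * (len(new_data) - 1)

# Old layout (assumed): "hi" with no terminator in the raw array.
old_data = [0x0068, 0x0069]
old_sizeof = 2 * len(old_data)

print(units_via_sizeof(new_sizeof), units_via_sizeof(old_sizeof))  # 2 2
print(units_via_data(new_data), units_via_data(old_data))          # 2 1
```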

stevengj · 2014-05-28T21:54:17Z

Currently, the only packages using UTF16String appear to be ICU (@nolta), ODBC (@karbarcca), PyCall, Thrift (@tanmaykm), and UTF16 (@nolta).

quinnj · 2014-05-28T22:08:27Z

Hmmm... it was a while ago that I added that usage; I'll have to look into it again to see whether ODBC would be affected. This may actually be a case where "zillions of small UTF16Strings" are being played with (returning a query from a database), but I still don't think it would be much of a problem allocation-wise.

stevengj · 2014-05-28T22:11:25Z

@karbarcca, as the patch is currently written, you should now use utf16(data) rather than UTF16String(data) to construct the string from an Array{Uint16}, since the latter requires null-terminated data.

utf16(data) will work with earlier Julia 0.3 prereleases, too. (Unlike UTF16String(data), the utf16 function also validates the UTF-16 data.) Note also that UTF16String(data) is currently not documented.

quinnj · 2014-05-28T22:13:59Z

It seems I would want UTF16String then since I'm actually converting C-allocated, null-terminated bytestrings to Julia, right?

stevengj · 2014-05-28T22:17:22Z

@karbarcca, yes, if you have valid null-terminated Uint16 arrays (which is what I assume you mean by "bytestrings") then you can call UTF16String directly to avoid making a copy. (However, utf16 will also avoid making a copy if you have an Array{Uint16} that is already null-terminated, but it will still perform validation.)

JeffBezanson · 2014-05-28T23:06:12Z

This seems like a good solution.

nolta · 2014-05-28T23:41:25Z

So the plan is to have UTF8Strings with implicit terminators, UTF16Strings with explicit terminators, and UTF32Strings with no terminators? Seems kind of haphazard.

stevengj · 2014-05-28T23:45:47Z

C interfaces for strings are haphazard... is there much real need for NULL-terminated UTF32Strings? I thought most of the world was converging on either UTF16 (Windows) or UTF8 (everyone else) APIs.

But if we do want UTF32String to be NULL terminated, now would be a good time to make the change.

Note also that UTF16String is not really "explicitly" terminated, in that the NULL code unit is supposed to be mostly hidden in an internal data structure; you don't see it with iterators, sizeof, etcetera, and only see it if you look at s.data.

vtjnash · 2014-05-29T02:37:53Z

@stevengj thanks for putting this together -- it'll be really helpful in improving the windows support

sure, UTF32Strings should probably get this too (ex. wcslen). the lack of usage would argue more for moving it out of base than for leaving it inconsistent

just a thought: you could rename .data to .data0, to force deprecation in ICU.jl so that users discover they need to do a Pkg.update()

stevengj · 2014-05-29T03:33:10Z

Renaming .data to .data0 seems prudent, though I'm not sure it's worth it to force a deprecation for people using prereleases and who probably update frequently anyway. It does have the disadvantage that UTF16String will need a separate pointer function (see string.jl), whereas currently all of our UTF-xx pointer functions share the same code. This is only 2 lines of duplication, but it is pleasant that all of our UTF-xx string types have the same name for their raw-data array.

I'm not saying that UTF-32 strings are not used, just that I suspect they are rarely needed for passing to C API functions that expect a NULL terminator. (Such C API functions exist, but I'm doubtful that they are often useful in Julia. I do use a NULL-terminated UTF-32 string in one place in PyCall for Python 3.x on Unix, however.) It looks like UTF32String (or its predecessor, CharString) is used in seven packages: GZip (@kmsquire), MAT (@simonster), MUMPS (@lruthotto), Soundex & TextAnalysis (@johnmyleswhite), TOML (@pygy), and DataFrames.

stevengj · 2014-05-29T03:43:41Z

Note also that our old CharString type (in Julia 0.2) called its data member .chars, so we are already breaking backward compatibility (for code accessing this private member) in 0.3. So, it shouldn't be too traumatic to NULL-terminate UTF32String if that's what we want to do.

pygy · 2014-05-29T12:16:08Z

In my case, at least, the CharString call is superfluous. I'll remove it.

stevengj · 2014-05-29T15:00:27Z

Okay, I'm working on modifying UTF32String to be NULL-terminated too.

In doing so, I've become convinced that utf16(s) should always append a NULL terminator; if one is already present in s, that should be treated as part of the string. If you have NULL-terminated UTF-16 data in which you don't want the NULL treated as part of the string, you should use the lower-level constructor UTF16String(s). I've updated the code and documentation accordingly.
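
The split between the two entry points described here can be modeled in a few lines of Python. This is a sketch of the semantics only: `utf16_convert` and `utf16_wrap` are hypothetical stand-ins for utf16(s) and UTF16String(data), and the validation that utf16 performs is omitted.

```python
def utf16_convert(units):
    """Model of utf16(s): copy the code units and always append a terminator."""
    return list(units) + [0]


def utf16_wrap(units):
    """Model of UTF16String(data): no copy, but the data must already end in NUL."""
    if not units or units[-1] != 0:
        raise ValueError("data must be NULL-terminated")
    return units


def string_length(data):
    """Code units visible to the user: everything before the final terminator."""
    return len(data) - 1


terminated = [0x0068, 0x0069, 0x0000]  # "hi" followed by a NUL word

a = utf16_convert(terminated)  # the existing NUL becomes part of the string
b = utf16_wrap(terminated)     # the existing NUL is used as the terminator

print(string_length(a))  # 3
print(string_length(b))  # 2
print(b is terminated)   # True, no copy was made
```

Passing unterminated data to `utf16_wrap` raises immediately, which is the failure mode the low-level constructor is expected to have when called by accident.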

vtjnash · 2014-05-29T15:11:28Z

Nice. I've already rebased my PR to use this, and it works great (fwiw, my usage of utf16/UTF16String is consistent with your updated proposal for null termination)

stevengj · 2014-05-29T17:02:54Z

I also think we should have:

WString and wstring: alias for UTFXXString and utfXX, where XX is the same size as wchar_t.
utfXX(ptr [, length]) function analogous to bytestring(ptr [, length]), which infers the length from NULL termination if it is not supplied.
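
A pointer-based conversion of this kind can be sketched with Python's `ctypes`. The function name `utf16_from_ptr` is hypothetical; the sketch only shows the length-inference rule (scan for the first NUL code unit when no length is supplied) and how the platform's wchar_t width, which would decide what a WString alias points to, can be queried.

```python
import ctypes


def utf16_from_ptr(ptr, length=None):
    """Read UTF-16 code units from a pointer.

    If `length` is None, it is inferred by scanning for the first NUL word.
    """
    if length is None:
        length = 0
        while ptr[length] != 0:
            length += 1
    return [ptr[i] for i in range(length)]


# A buffer holding "hi", a NUL word, and one more unit after the NUL.
buf = (ctypes.c_uint16 * 4)(0x0068, 0x0069, 0x0000, 0x007A)
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_uint16))

print(utf16_from_ptr(ptr))     # [104, 105]
print(utf16_from_ptr(ptr, 4))  # [104, 105, 0, 122]

# Width of wchar_t on this platform, in bits (typically 16 on Windows, 32 elsewhere).
wchar_bits = 8 * ctypes.sizeof(ctypes.c_wchar)
print(wchar_bits)
```

With an explicit length, embedded NUL words are kept, which is the same trade-off bytestring(ptr, length) makes for byte strings.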

stevengj · 2014-05-29T17:15:12Z

Also, it seems like it would be nice if utf32(x::Vector{Uint8}) existed and were like utf16(x::Vector{Uint8}) in checking for a BOM to detect a non-native byte ordering.
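
One way such BOM handling could work is sketched below in Python. This is a generic illustration, not the code in this PR: the function name is hypothetical, and stripping the BOM from the result is an assumption of the sketch. A leading byte-order mark selects the byte order; without one, the native order is used.

```python
import sys


def utf32_from_bytes(raw):
    """Decode UTF-32 bytes into code points, honoring a leading BOM if present."""
    if len(raw) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    order, start = sys.byteorder, 0          # default: native byte order
    if raw[:4] == b"\xff\xfe\x00\x00":       # U+FEFF stored little-endian
        order, start = "little", 4
    elif raw[:4] == b"\x00\x00\xfe\xff":     # U+FEFF stored big-endian
        order, start = "big", 4
    return [int.from_bytes(raw[i:i + 4], order) for i in range(start, len(raw), 4)]


big = b"\x00\x00\xfe\xff" + b"\x00\x00\x00\x41"
little = b"\xff\xfe\x00\x00" + b"\x41\x00\x00\x00"
print(utf32_from_bytes(big))     # [65]
print(utf32_from_bytes(little))  # [65]
```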

stevengj · 2014-05-29T19:21:23Z

Okay, updated to NULL-terminate UTF32String.

I also deprecated the old UTF32String(c::Char...) constructor in favor of utf32(c::Integer...). This way, it is a bit more consistent: you only use UTF32String(data) or UTF16String(data) if you have valid NULL-terminated data and want to make a string without making a copy and without appending another NULL. Otherwise, you use utf16 or utf32.

JeffBezanson · 2014-05-29T19:33:06Z

NEWS.md

-
-    * `CharString` is renamed to `UTF32String` ([#4943]).
+    * New string type, `UTF16String` ([#4930]), constructed by `utf16(s)`
+      from another string, a `Uint32` array, or a byte array (possibly


Do you mean a Uint16 array?

stevengj · 2014-05-29

Whoops, will fix.

nalimilan · 2014-05-29T19:40:48Z

> In doing so, I've become convinced that utf16(s) should always append a NULL terminator; if one is already present in s, that should be treated as part of the string. If you have NULL-terminated UTF-16 data in which you don't want the NULL treated as part of the string, you should use the lower-level constructor UTF16String(s).

Why would anyone want to preserve a NULL final character in a string?

stevengj · 2014-05-29T20:22:11Z

@nalimilan, NUL is a valid Unicode codepoint. I feel like we shouldn't be in the business of (effectively) deleting it from strings unless the user calls a lower-level function. We don't delete it from anywhere else in the string, so it would be odd to specifically delete it from the end only.

This is really obvious for the utf32(x::Vector{Char}) constructor: you are asking for a string consisting of the codepoints in x, regardless of whether those codepoints are NUL or not.

nalimilan · 2014-05-29T20:34:23Z

@stevengj Yeah, it's a valid character, but in practice it's not very useful, and strings are already complicated enough that playing with NUL characters inside them is rarely a good idea. But I agree that the fact that we're not removing them if they appear in the middle of the string is an argument for keeping them at the end too.

What's more worrying to me is the subtle difference between utf16 and UTF16String. A separate argument to decide whether to add a NUL character or not would be more explicit.

stevengj · 2014-05-29T20:35:00Z

I've updated it to add the bytestring-like pointer conversions mentioned above, as well as the WString and wstring aliases. Without such aliases, it's pretty hard to use cross-platform APIs based on wchar_t*.

stevengj · 2014-05-29T20:39:12Z

@nalimilan, if you call UTF16String by accident, you'll almost certainly get an error since most strings don't have NUL at the end. If you call utf16 by accident on NUL-terminated data, you'll probably notice quickly that your strings have \0 in them. I don't see this causing subtle hard-to-find errors.

stevengj · 2014-05-29T20:52:34Z

@pygy, yes, it looks like several of the packages use CharString only for utf8(CharString(buf::Vector{Char})), which can be replaced even in Julia 0.2 with utf8(buf).

nalimilan · 2014-05-29T21:07:53Z

@stevengj OK, fine, as long as UTF16String() remains a really low-level constructor.

stevengj added unicode labels May 28, 2014

vtjnash added this to the 0.3 milestone May 29, 2014

stevengj added 2 commits May 29, 2014 13:39

automatically NULL-terminate UTF16 data for passing to Windows (and similar) APIs

10d7ac0

NULL-terminate UTF32String data for consistency; deprecate UTF32String(c::Char...) in favor of utf32(c...); add conversion to UTF32String from binary data that looks at the BOM

a4a847c

NEWS updates for UTF16String and UTF32String

e96719a

JeffBezanson reviewed May 29, 2014

added utfXX(ptr[,len]) functions analogous to bytestring, added WString and wstring aliases to make wchar_t* APIs usable

11a0324

fix utf32/UTF32String mixup

e88192e

stevengj mentioned this pull request May 29, 2014

Improve docs for bystestring regarding null-termination #7036

Closed

simonster added a commit to JuliaIO/MAT.jl that referenced this pull request May 29, 2014

Fix unnecessary CharString use

67f2c4f

Instead of converting first to Vector{Char} and then to UTF-8, convert directly to UTF-8. Also fixes future breakage by JuliaLang/julia#7016

This was referenced May 29, 2014

UTF16String API updates JuliaStrings/ICU.jl#12

Closed

UTF16String API updates JuliaDatabases/ODBC.jl#57

Closed

updated CharString semantics for Julia 0.3 JuliaIO/MAT.jl#26

Closed

add deprecation for UTF32String(s::String)

e7f3fd6

This was referenced May 30, 2014

updated CharString semantics for Julia 0.3 JuliaText/TextAnalysis.jl#18

Closed

replace Base.utf8(CharString(buf)) with utf8(buf) pygy/TOML.jl#3

Closed

avoid deprecation warning in converting UTFxxString to equal-sized integer types

b24d29f

stevengj mentioned this pull request May 30, 2014

replace UTF32String usage for Julia 0.3 & type-stability JuliaData/DataFrames.jl#613

Closed

type typo (and regression test) in utfXX(::Ptr{IntXX})

249d76d

JeffBezanson added a commit that referenced this pull request Jun 4, 2014

Merge pull request #7016 from stevengj/utf16_null

ff6bde7

automatically NULL-terminate UTF16 data for passing to Windows APIs

JeffBezanson merged commit ff6bde7 into JuliaLang:master Jun 4, 2014

stevengj deleted the utf16_null branch June 4, 2014 23:15

ivarne mentioned this pull request Jun 10, 2014

[PkgEval] GZip may have a testing issue on Julia 0.3 (2014-06-10) JuliaIO/GZip.jl#16

Closed

pao mentioned this pull request May 1, 2015

Fix Unicode bugs with UTF-16/UTF-32 conversions (#10959) #11004

Closed
