automatically NULL-terminate UTF16 data for passing to Windows APIs #7016

Merged
merged 8 commits into from
Jun 4, 2014
19 changes: 15 additions & 4 deletions NEWS.md
@@ -162,9 +162,20 @@ Library improvements

* Triple-quoted regex strings, `r"""..."""` ([#4934]).

* New string type, `UTF16String` ([#4930]).
* New string type, `UTF16String` ([#4930]), constructed by
`utf16(s)` from another string, a `Uint16` array or pointer, or
a byte array (possibly prefixed by a byte-order marker to
indicate endian-ness). Its data is internally `NULL`-terminated
for passing to C ([#7016]).

* `CharString` is renamed to `UTF32String` ([#4943]).
* `CharString` is renamed to `UTF32String` ([#4943]), and its data
is now internally `NULL`-terminated for passing to C ([#7016]).
`CharString(c::Char...)` is deprecated in favor of `utf32(c...)`,
and `utf32(s)` otherwise has functionality similar to `utf16(s)`.

* New `WString` and `wstring` synonyms for either `UTF16String`
and `utf16` or `UTF32String` and `utf32`, respectively, depending
on the width of `Cwchar_t` ([#7016]).

* `normalize_string` function to perform Unicode normalization,
case-folding, and other transformations ([#5576]).
@@ -284,8 +295,8 @@ Library improvements
* Very large ranges (e.g. `0:typemax(Int)`) can now be constructed, but some
operations (e.g. `length`) will raise an `OverflowError`.

* Extended API for ``cov`` and ``cor``, which accept keyword arguments ``vardim``,
``corrected``, and ``mean`` ([#6273])
* Extended API for `cov` and `cor`, which accept keyword arguments `vardim`,
`corrected`, and `mean` ([#6273])

* New functions `randsubseq` and `randsubseq!` to create a random subsequence of an array ([#6726])
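
The NUL-termination invariant described in the `UTF16String` and `UTF32String` entries above can be sketched as follows. This is a minimal illustration written in current Julia syntax, not the PR's code: the type name `NulTerminatedUTF16` is hypothetical, and the diff itself uses the 2014-era spellings (`immutable`, `Uint16`).

```julia
# Hypothetical type for illustration only; the PR's real type is UTF16String.
struct NulTerminatedUTF16
    data::Vector{UInt16}   # code units followed by a single 0x0000 terminator
    function NulTerminatedUTF16(data::Vector{UInt16})
        # Reject data that cannot be handed to C as a terminated buffer.
        if isempty(data) || data[end] != 0
            throw(ArgumentError("data must be NUL-terminated"))
        end
        new(data)
    end
end

# The terminator is an implementation detail, so it is excluded from the size.
Base.sizeof(s::NulTerminatedUTF16) = sizeof(s.data) - sizeof(UInt16)

s = NulTerminatedUTF16(UInt16[0x0068, 0x0069, 0x0000])   # "hi" plus terminator
println(sizeof(s))
```

Checking the invariant in the inner constructor means every instance is safe to pass to C, while `sizeof` continues to report only the string's own code units.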

2 changes: 1 addition & 1 deletion base/char.jl
@@ -50,4 +50,4 @@ sizeof(::Type{Char}) = 4
## printing & showing characters ##

print(io::IO, c::Char) = (write(io,c); nothing)
show(io::IO, c::Char) = (print(io,'\''); print_escaped(io,UTF32String(c),"'"); print(io,'\''))
show(io::IO, c::Char) = (print(io,'\''); print_escaped(io,utf32(c),"'"); print(io,'\''))
2 changes: 2 additions & 0 deletions base/deprecated.jl
@@ -398,6 +398,8 @@ const Stat = StatStruct

export CharString
const CharString = UTF32String
@deprecate UTF32String(c::Integer...) utf32(c...)
@deprecate UTF32String(s::String) utf32(s)

export Ranges
const Ranges = Range
2 changes: 2 additions & 0 deletions base/exports.jl
@@ -110,6 +110,7 @@ export
WeakKeyDict,
WeakRef,
Woodbury,
WString,
Zip,

# Ccall types
@@ -868,6 +869,7 @@ export
utf16,
utf32,
warn,
wstring,
xdump,

# random numbers
36 changes: 9 additions & 27 deletions base/string.jl
@@ -561,33 +561,6 @@ end
endof(s::GenericString) = endof(s.string)
next(s::GenericString, i::Int) = next(s.string, i)

## plain old character arrays ##

immutable UTF32String <: DirectIndexString
data::Array{Char,1}

UTF32String(a::Array{Char,1}) = new(a)
UTF32String(c::Char...) = new([ c[i] for i=1:length(c) ])
end
UTF32String(x...) = UTF32String(map(char,x)...)

next(s::UTF32String, i::Int) = (s.data[i], i+1)
endof(s::UTF32String) = length(s.data)
length(s::UTF32String) = length(s.data)

utf32(x) = convert(UTF32String, x)
convert(::Type{UTF32String}, s::UTF32String) = s
convert(::Type{UTF32String}, s::String) = UTF32String(Char[c for c in s])
convert{T<:String}(::Type{T}, v::Vector{Char}) = convert(T, UTF32String(v))
convert(::Type{Array{Char,1}}, s::UTF32String) = s.data
convert(::Type{Array{Char}}, s::UTF32String) = s.data

reverse(s::UTF32String) = UTF32String(reverse(s.data))

sizeof(s::UTF32String) = sizeof(s.data)
convert{T<:Union(Int32,Uint32,Char)}(::Type{Ptr{T}}, s::UTF32String) =
convert(Ptr{T}, s.data)

## substrings reference original strings ##

immutable SubString{T<:String} <: String
@@ -1679,6 +1652,14 @@ function repr(x)
takebuf_string(s)
end

if sizeof(Cwchar_t) == 2
const WString = UTF16String # const, not typealias, to get constructor
const wstring = utf16
elseif sizeof(Cwchar_t) == 4
const WString = UTF32String # const, not typealias, to get constructor
const wstring = utf32
end

# pointer conversions of ASCII/UTF8/UTF16/UTF32 strings:
pointer(x::Union(ByteString,UTF16String,UTF32String)) = pointer(x.data)
pointer{T<:ByteString}(x::SubString{T}) = pointer(x.string.data) + x.offset
@@ -1687,3 +1668,4 @@ pointer{T<:ByteString}(x::SubString{T}, i::Integer) = pointer(x.string.data) + x
pointer(x::Union(UTF16String,UTF32String), i::Integer) = pointer(x)+(i-1)*sizeof(eltype(x.data))
pointer{T<:Union(UTF16String,UTF32String)}(x::SubString{T}) = pointer(x.string.data) + x.offset*sizeof(eltype(x.data))
pointer{T<:Union(UTF16String,UTF32String)}(x::SubString{T}, i::Integer) = pointer(x.string.data) + (x.offset + (i-1))*sizeof(eltype(x.data))

1 change: 1 addition & 0 deletions base/sysimg.jl
@@ -77,6 +77,7 @@ include("char.jl")
include("ascii.jl")
include("utf8.jl")
include("utf16.jl")
include("utf32.jl")
include("iobuffer.jl")
include("string.jl")
include("utf8proc.jl")
67 changes: 50 additions & 17 deletions base/utf16.jl
@@ -1,5 +1,11 @@
immutable UTF16String <: String
data::Array{Uint16,1}
data::Array{Uint16,1} # includes 16-bit NULL termination after string chars
function UTF16String(data::Vector{Uint16})
if length(data) < 1 || data[end] != 0
throw(ArgumentError("UTF16String data must be NULL-terminated"))
end
new(data)
end
end

utf16_is_lead(c::Uint16) = (c & 0xfc00) == 0xd800
@@ -9,15 +15,14 @@ utf16_get_supplementary(lead::Uint16, trail::Uint16) = char((lead-0xd7f7)<<10 +

function endof(s::UTF16String)
d = s.data
i = length(d)
i = length(d) - 1
i == 0 && return i
utf16_is_surrogate(d[i]) ? i-1 : i
end

function next(s::UTF16String, i::Int)
if !utf16_is_surrogate(s.data[i])
return char(s.data[i]), i+1
elseif length(s.data) > i && utf16_is_lead(s.data[i]) && utf16_is_trail(s.data[i+1])
elseif length(s.data)-1 > i && utf16_is_lead(s.data[i]) && utf16_is_trail(s.data[i+1])
return utf16_get_supplementary(s.data[i], s.data[i+1]), i+2
end
error("invalid UTF-16 character index")
@@ -34,24 +39,27 @@ function encode16(s::String)
push!(buf, uint16(0xdc00 + c & 0x3ff))
end
end
push!(buf, 0) # NULL termination
UTF16String(buf)
end

utf16(x) = convert(UTF16String, x)
convert(::Type{UTF16String}, s::UTF16String) = s
convert(::Type{UTF16String}, s::String) = encode16(s)
convert(::Type{UTF8String}, s::UTF16String) =
sprint(length(s.data), io->for c in s; write(io,c::Char); end)
convert(::Type{Array{Uint16,1}}, s::UTF16String) = s.data
convert(::Type{Array{Uint16}}, s::UTF16String) = s.data

sizeof(s::UTF16String) = sizeof(s.data)
# TODO: optimize this
convert(::Type{UTF8String}, s::UTF16String) =
sprint(length(s.data)-1, io->for c in s; write(io,c::Char); end)

sizeof(s::UTF16String) = sizeof(s.data) - sizeof(Uint16)
convert{T<:Union(Int16,Uint16)}(::Type{Ptr{T}}, s::UTF16String) =
convert(Ptr{T}, s.data)
convert(Ptr{T}, pointer(s))

function is_valid_utf16(data::Array{Uint16})
function is_valid_utf16(data::AbstractArray{Uint16})
i = 1
n = length(data)
n = length(data) # this may include NULL termination; that's okay
while i < n # check for unpaired surrogates
if utf16_is_lead(data[i]) && utf16_is_trail(data[i+1])
i += 2
@@ -66,21 +74,46 @@ end

is_valid_utf16(s::UTF16String) = is_valid_utf16(s.data)

function convert(::Type{UTF16String}, data::Array{Uint16})
function convert(::Type{UTF16String}, data::AbstractVector{Uint16})
!is_valid_utf16(data) && throw(ArgumentError("invalid UTF16 data"))
UTF16String(data)
len = length(data)
d = Array(Uint16, len + 1)
d[end] = 0 # NULL terminate
UTF16String(copy!(d,1, data,1, len))
end

function convert(T::Type{UTF16String}, bytes::Array{Uint8})
isempty(bytes) && return UTF16String(Uint16[])
convert(T::Type{UTF16String}, data::AbstractArray{Uint16}) =
convert(T, reshape(data, length(data)))

convert(T::Type{UTF16String}, data::AbstractArray{Int16}) =
convert(T, reinterpret(Uint16, data))

function convert(T::Type{UTF16String}, bytes::AbstractArray{Uint8})
isempty(bytes) && return UTF16String(Uint16[0])
isodd(length(bytes)) && throw(ArgumentError("odd number of bytes"))
data = reinterpret(Uint16, bytes)
# check for byte-order mark (BOM):
if data[1] == 0xfeff # native byte order
convert(T, data[2:end])
d = Array(Uint16, length(data))
copy!(d,1, data,2, length(data)-1)
elseif data[1] == 0xfffe # byte-swapped
convert(T, Uint16[bswap(data[i]) for i=2:length(data)])
d = Array(Uint16, length(data))
for i = 2:length(data)
d[i-1] = bswap(data[i])
end
else
convert(T, copy(data)) # assume native byte order
d = Array(Uint16, length(data) + 1)
copy!(d,1, data,1, length(data)) # assume native byte order
end
d[end] = 0 # NULL terminate
!is_valid_utf16(d) && throw(ArgumentError("invalid UTF16 data"))
UTF16String(d)
end

utf16(p::Ptr{Uint16}, len::Integer) = utf16(pointer_to_array(p, len))
utf16(p::Ptr{Int16}, len::Integer) = utf16(convert(Ptr{Uint16}, p), len)
function utf16(p::Union(Ptr{Uint16}, Ptr{Int16}))
len = 0
while unsafe_load(p, len+1) != 0; len += 1; end
utf16(p, len)
end
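
The byte-array conversion in this file handles three cases: a native-order BOM, a byte-swapped BOM, and no BOM at all. The sketch below restates that logic in current Julia syntax as a rough guide. The helper name `utf16_units` is hypothetical, it returns a plain vector rather than a `UTF16String`, and it omits the `is_valid_utf16` check that the PR performs.

```julia
# Hypothetical helper, not part of the PR: turn raw bytes into
# NUL-terminated UTF-16 code units, honoring an optional byte-order mark.
function utf16_units(bytes::AbstractVector{UInt8})
    isempty(bytes) && return UInt16[0]
    isodd(length(bytes)) && throw(ArgumentError("odd number of bytes"))
    data = collect(reinterpret(UInt16, bytes))
    if data[1] == 0xfeff            # BOM in native byte order: drop it
        units = data[2:end]
    elseif data[1] == 0xfffe        # BOM is byte-swapped: drop it, swap the rest
        units = bswap.(data[2:end])
    else                            # no BOM: assume native byte order
        units = data
    end
    push!(units, 0x0000)            # NUL terminate
    return units
end

# Build "hi" with a native-order BOM so the example does not depend on
# the endianness of the machine it runs on.
bytes = collect(reinterpret(UInt8, UInt16[0xfeff, 0x0068, 0x0069]))
println(utf16_units(bytes))
```

The PR's version preallocates the output and copies into it; the slicing used here trades a little efficiency for brevity.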
98 changes: 98 additions & 0 deletions base/utf32.jl
@@ -0,0 +1,98 @@
## UTF-32 in the native byte order, i.e. plain old character arrays ##

immutable UTF32String <: DirectIndexString
data::Array{Char,1} # includes 32-bit NULL termination after string chars

function UTF32String(a::Array{Char,1})
if length(a) < 1 || a[end] != 0
throw(ArgumentError("UTF32String data must be NULL-terminated"))
end
new(a)
end
end

next(s::UTF32String, i::Int) = (s.data[i], i+1)
endof(s::UTF32String) = length(s.data) - 1
length(s::UTF32String) = length(s.data) - 1

function utf32(c::Integer...)
a = Array(Char, length(c) + 1)
for i = 1:length(c)
a[i] = c[i]
end
a[end] = 0
UTF32String(a)
end

utf32(x) = convert(UTF32String, x)
convert(::Type{UTF32String}, s::UTF32String) = s

function convert(::Type{UTF32String}, s::String)
a = Array(Char, length(s) + 1)
i = 0
for c in s
a[i += 1] = c
end
a[end] = 0 # NULL terminate
UTF32String(a)
end

function convert(::Type{UTF32String}, data::AbstractVector{Char})
len = length(data)
d = Array(Char, len + 1)
d[end] = 0 # NULL terminate
UTF32String(copy!(d,1, data,1, len))
end

convert{T<:Union(Int32,Uint32)}(::Type{UTF32String}, data::AbstractVector{T}) =
convert(UTF32String, reinterpret(Char, data))

convert{T<:String}(::Type{T}, v::AbstractVector{Char}) = convert(T, utf32(v))

# specialize for performance reasons:
function convert{T<:ByteString}(::Type{T}, data::AbstractVector{Char})
s = IOBuffer(Array(Uint8,length(data)), true, true)
truncate(s,0)
for x in data
print(s, x)
end
convert(T, takebuf_string(s))
end

convert(::Type{Array{Char,1}}, s::UTF32String) = s.data
convert(::Type{Array{Char}}, s::UTF32String) = s.data

reverse(s::UTF32String) = UTF32String(reverse!(copy(s.data), 1, length(s)))

sizeof(s::UTF32String) = sizeof(s.data) - sizeof(Char)
convert{T<:Union(Int32,Uint32,Char)}(::Type{Ptr{T}}, s::UTF32String) =
convert(Ptr{T}, pointer(s))

function convert(T::Type{UTF32String}, bytes::AbstractArray{Uint8})
isempty(bytes) && return UTF32String(Char[0])
length(bytes) & 3 != 0 && throw(ArgumentError("need multiple of 4 bytes"))
data = reinterpret(Char, bytes)
# check for byte-order mark (BOM):
if data[1] == 0x0000feff # native byte order
d = Array(Char, length(data))
copy!(d,1, data,2, length(data)-1)
elseif data[1] == 0xfffe0000 # byte-swapped
d = Array(Char, length(data))
for i = 2:length(data)
d[i-1] = bswap(data[i])
end
else
d = Array(Char, length(data) + 1)
copy!(d,1, data,1, length(data)) # assume native byte order
end
d[end] = 0 # NULL terminate
UTF32String(d)
end

utf32(p::Ptr{Char}, len::Integer) = utf32(pointer_to_array(p, len))
utf32(p::Union(Ptr{Uint32}, Ptr{Int32}), len::Integer) = utf32(convert(Ptr{Char}, p), len)
function utf32(p::Union(Ptr{Char}, Ptr{Uint32}, Ptr{Int32}))
len = 0
while unsafe_load(p, len+1) != 0; len += 1; end
utf32(p, len)
end
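
The single-argument `utf16(p)` and `utf32(p)` methods find the string length by scanning memory until they reach a zero code unit. A minimal sketch of that scan in current Julia syntax follows; the function name `nul_terminated_length` is hypothetical, and the caller is assumed to guarantee that the buffer really is terminated.

```julia
# Hypothetical helper, not part of the PR: count the code units that precede
# the terminating zero. Unsafe if the memory is not actually NUL-terminated.
function nul_terminated_length(p::Ptr{T}) where {T<:Union{UInt16,Int16,UInt32,Int32}}
    len = 0
    while unsafe_load(p, len + 1) != 0   # unsafe_load uses 1-based indexing
        len += 1
    end
    return len
end

buf = UInt16[0x0068, 0x0069, 0x0000]     # "hi" plus terminator
# Keep buf rooted while a raw pointer into it is in use.
n = GC.@preserve buf nul_terminated_length(pointer(buf))
println(n)
```

Because nothing bounds the scan, this approach only suits pointers returned by APIs that document NUL termination; otherwise the two-argument form with an explicit length is the safer choice.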
7 changes: 7 additions & 0 deletions doc/manual/calling-c-and-fortran-code.rst
@@ -299,6 +299,13 @@ can be called via the following Julia code::
argv = [ "a.out", "arg1", "arg2" ]
ccall(:main, Int32, (Int32, Ptr{Ptr{Uint8}}), length(argv), argv)

For ``wchar_t*`` arguments, the Julia type should be ``Ptr{Cwchar_t}``,
and data can be converted to/from ordinary Julia strings by the
``wstring(s)`` function (equivalent to either ``utf16(s)`` or ``utf32(s)``
depending upon the width of ``Cwchar_t``). Note also that ASCII, UTF-8,
UTF-16, and UTF-32 string data in Julia is internally NUL-terminated, so
it can be passed to C functions expecting NUL-terminated data without making
a copy.

Accessing Data through a Pointer
--------------------------------
11 changes: 8 additions & 3 deletions doc/manual/strings.rst
@@ -356,9 +356,14 @@ exception handling required:
y

UTF-8 is not the only encoding that Julia supports, and adding support
for new encodings is quite easy, but discussion of other encodings and
how to implement support for them is beyond the scope of this document
for the time being. For further discussion of UTF-8 encoding issues, see
for new encodings is quite easy. In particular, Julia also provides
``UTF16String`` and ``UTF32String`` types, constructed by the
``utf16(s)`` and ``utf32(s)`` functions respectively, for UTF-16 and
UTF-32 encodings. It also provides aliases ``WString`` and
``wstring(s)`` for either UTF-16 or UTF-32 strings, depending on the
size of ``Cwchar_t``. Additional discussion of other encodings and how to
implement support for them is beyond the scope of this document for
the time being. For further discussion of UTF-8 encoding issues, see
the section below on `byte array literals <#Byte+Array+Literals>`_,
which goes into some greater detail.
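
The idea behind ``wstring(s)``, choosing UTF-16 or UTF-32 according to the platform's ``Cwchar_t``, can be approximated in current Julia as sketched below. The helper name `wide_units` is hypothetical and it returns a plain vector rather than a `WString`; it relies on `transcode`, which is not the mechanism this PR uses.

```julia
# Hypothetical stand-in for wstring: NUL-terminated wide code units whose
# element type follows the platform's wchar_t (16-bit or 32-bit).
function wide_units(s::AbstractString)
    units = transcode(Cwchar_t, String(s))
    push!(units, zero(Cwchar_t))    # NUL terminate for C
    return units
end

w = wide_units("hi")
println(eltype(w), " with ", length(w), " code units")
```

On platforms with a 2-byte ``wchar_t`` the result holds UTF-16 code units, and with a 4-byte ``wchar_t`` it holds UTF-32 code points, mirroring the ``sizeof(Cwchar_t)`` branch in `base/string.jl`.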
