hex2bytes for Vector{UInt8} #23267

Merged
merged 18 commits into from
Aug 22, 2017

Contributor

@sambitdash sambitdash commented Aug 15, 2017

function hex2bytes(d::Vector{UInt8}, s::Vector{UInt8}, nInBytes::Int=length(s)) -> Int

  1. The loop is not optimized for SIMD as the branching in the internal loops is too high
  2. Computation is carried out on the word boundary.

Benchmark numbers attached:

julia> @benchmark f_hsS_NOSIMD()   <- hex2bytes(::AbstractString)
BenchmarkTools.Trial: 
  memory estimate:  5.00 MiB
  allocs estimate:  2
  --------------
  minimum time:     56.887 ms (0.00% GC)
  median time:      58.451 ms (0.00% GC)
  mean time:        58.693 ms (0.16% GC)
  maximum time:     65.453 ms (0.00% GC)
  --------------
  samples:          86
  evals/sample:     1

julia> 

julia> @benchmark f_hsB_NOSIMD()
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     43.811 ms (0.00% GC)
  median time:      44.620 ms (0.00% GC)
  mean time:        44.876 ms (0.00% GC)
  maximum time:     50.646 ms (0.00% GC)
  --------------
  samples:          112
  evals/sample:     1

julia> @benchmark f_hsB_SIMD()
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     44.923 ms (0.00% GC)
  median time:      45.786 ms (0.00% GC)
  mean time:        46.127 ms (0.00% GC)
  maximum time:     53.770 ms (0.00% GC)
  --------------
  samples:          109
  evals/sample:     1

Fix #23161

@sambitdash
Contributor Author

sambitdash commented Aug 15, 2017

Sanitizing input vs. Rejecting Bad Input

All inputs are sanitized rather than rejected when they contain bad data.

SIMD performance gains are almost 10% due to this:

@inline get_sanitized_number_from_hex(c::UInt) = begin
    const DIGIT_NINE     = UInt('9')
    return (c > DIGIT_NINE) ? ((c & 0x07) + 9) : (c & 0x0F)
end

julia> @benchmark f_hsB_SIMD_SANE()
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     37.489 ms (0.00% GC)
  median time:      37.915 ms (0.00% GC)
  mean time:        38.010 ms (0.00% GC)
  maximum time:     42.396 ms (0.00% GC)
  --------------
  samples:          132
  evals/sample:     1

julia> @benchmark f_hsB_NOSIMD_SANE()
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     42.419 ms (0.00% GC)
  median time:      42.976 ms (0.00% GC)
  mean time:        43.062 ms (0.00% GC)
  maximum time:     47.542 ms (0.00% GC)
  --------------
  samples:          117
  evals/sample:     1
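
To make the trade-off concrete, here is a small sketch of the sanitizing arithmetic quoted above. It is rewritten for `UInt8` and drops the `const`, and the name `sanitized` is hypothetical; it is an illustration, not the PR's code. Valid digits decode as expected, but an invalid byte such as `'z'` is silently mapped to a value rather than raising an error, which is the case a rejecting implementation would catch.

```julia
# Illustrative only: same bit tricks as the quoted helper, on UInt8 values.
sanitized(c::UInt8) = c > UInt8('9') ? (c & 0x07) + 0x09 : c & 0x0f

sanitized(UInt8('7'))  # 0x07
sanitized(UInt8('a'))  # 0x0a
sanitized(UInt8('F'))  # 0x0f
sanitized(UInt8('z'))  # 0x0b: not a hex digit, but decoded as if it were 'b'
```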

@JeffBezanson
Member

Thanks, @sambitdash !

Noting that this fixes #23161

copied into the destination array. The size of destination array must be at least half of
the `nInBytes` parameter.
"""
function hex2bytes(d::Vector{UInt8}, s::Vector{UInt8}, nInBytes::Int=length(s))
Member

Since this modifies its argument, it should be called hex2bytes!. It would be good to also add a 1-argument version that allocates the output for you, hex2bytes(s::Vector{UInt8}).

end

len2 = div(nInBytes, 2)
if size(d)[1] < len2
Member

Should use length(d) here.

Member

Also this code uses @inbounds but doesn't check that nInBytes <= length(s).

end

i = 0
j = 1
Member

This will return 1 if the input is empty.

@JeffBezanson
Member

Please add some tests in test/strings/util.jl.

@fredrikekre added the "needs tests" label (unit tests are required for this change) on Aug 15, 2017
end

@inline get_number_from_hex(c::UInt) = begin
const DIGIT_ZERO = UInt('0')
Member

Remove all const.

return j
end

@inline get_number_from_hex(c::UInt) = begin
Member

@inline function number_from_hex(c::UInt)
    # ...
end

@@ -464,6 +464,55 @@ function hex2bytes(s::AbstractString)
end

"""
hex2bytes(d::Vector{UInt8}, s::Vector{UInt8}, nInBytes::Int=length(s))
Member

nInBytes is a bit of a weird capitalization. Perhaps use ninbytes, or perhaps even better just n.

results are populated into a destination array. The function returns the number of bytes
copied into the destination array. The size of destination array must be at least half of
the `nInBytes` parameter.
"""
Member

Could use an example here:


# Examples
```jldoctest
julia> hex2bytes(...)
end
```

Member

@stevengj commented Aug 19, 2017

hex2bytes now has an example, moved from hex2bytes!.

Typically, when we have both an in-place and out-of-place version of a function, we only give examples in the simpler out-of-place version.

@sambitdash
Contributor Author

Added test cases.

@StefanKarpinski
Member

I'm not entirely sold on the n argument. It's going to be hard to remember whether it refers to the number of bytes decoded or the number of hex digits to consume. In fact, I would assume that it indicated the number of bytes to decode, so the number of input bytes would be 2n – but it means the opposite. Is that parameter really necessary? Can't the number of digits to decode be implied by the length of the output array? If one wants to write into part of a larger output array, it can be done with a view, e.g. hex2bytes!(@view(output[a:b]), input).
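
A minimal sketch of the view-based call suggested here, assuming the `hex2bytes!` that ships in Base on Julia 1.x (where the destination length must be exactly half the input length). The variable names and the range `2:4` are made up for illustration.

```julia
output = zeros(UInt8, 8)
input  = Vector{UInt8}("01abEF")   # six ASCII hex digits

# Decode three bytes into positions 2:4 of the larger array.
# The view's length, not an extra `n` argument, fixes how many digits are consumed.
hex2bytes!(@view(output[2:4]), input)

output   # positions 2:4 now hold 0x01, 0xab, 0xef; the rest stay 0x00
```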

@StefanKarpinski
Member

It also should not return the number of bytes decoded – that's completely deterministic based on the input, unlike I/O functions, where it's a useful return value. Instead, APIs like this typically return the destination object so that you can easily write code that chains operations.
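
For illustration, returning the destination is what makes a call like the following possible. This is a sketch assuming the Julia 1.x Base versions of `hex2bytes!` and `bytes2hex`; `src` and `buf` are hypothetical names.

```julia
src = Vector{UInt8}("07bf")
buf = Vector{UInt8}(undef, length(src) ÷ 2)

# Because hex2bytes! returns its destination, the result can be passed straight on.
roundtrip = bytes2hex(hex2bytes!(buf, src))   # "07bf"

hex2bytes!(buf, src) === buf                  # true: the same array comes back
```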

@sambitdash
Contributor Author

@StefanKarpinski while relying on the destination buffer makes sense, when you are reading from a file of 2020 bytes and your temporary buffer is 1000 bytes, the last 20 bytes read may need a @view or some mechanism to specify the n variable. So maybe it would be hex2bytes(d::AbstractArray{UInt8}, s::AbstractArray{UInt8}). One can use @view for both source and destination, though developers may overlook the @view approach. I will modify the signature accordingly.

hex2bytes(s::Vector{UInt8})

Convert the hexadecimal bytes array to its binary representation. Returns an
`Array{UInt8,1}`, i.e. an array of bytes.
Member

Perhaps use Vector{UInt8} to match the above signature?

@@ -464,11 +464,103 @@ function hex2bytes(s::AbstractString)
end

"""
hex2bytes(s::Vector{UInt8})

Convert the hexadecimal bytes array to its binary representation. Returns an
Member

returns -> return

`Array{UInt8,1}`, i.e. an array of bytes.
"""
@inline function hex2bytes(s::Vector{UInt8})
d = Vector{UInt8}(div(length(s), 2))
Member

Probably error here if the length is not even?

Contributor Author

Not needed, as the underlying method will throw an exception. Per @StefanKarpinski, all the interfaces are changed to AbstractVector. Many of the comments may not be relevant in that case.

of the `n` parameter.

# Examples
```
Member

```jldoctest


# Examples
```
julia> s = UInt8["01abEF"...]
Member

no indentation here and below

bytes2hex(bin_arr::Array{UInt8, 1}) -> String

Convert an array of bytes to its hexadecimal representation.
All characters are in lower-case.

it
Member

?

@@ -247,3 +247,24 @@ bin_val = hex2bytes("07bf")

#non-hex characters
@test_throws ArgumentError hex2bytes("0123456789abcdefABCDEFGH")

function test_23161()
Member

perhaps rewrite as a @testset?

@sambitdash
Contributor Author

All tests are added and all the review comments are incorporated.

0x00
0x00

julia> hex2bytes!(d, s)
Member

This returns d now, not 3, right? The docstring above should be changed in regard to this as well.

Contributor Author

Example updated.

Member

Also

The docstring above should be changed in regard to this as well.

Contributor Author

done.

@@ -762,6 +762,7 @@ export
graphemes,
hex,
hex2bytes,
hex2bytes!,
Contributor

this will need to be added to the stdlib doc index somewhere to show up in the rendered docs https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#adding-a-new-docstring-to-base

Contributor Author

done.

n = 0
c1, i = next(s, i)
done(s, i) && throw(ArgumentError(
"string length must be even: length($(repr(s))) == $(length(s))"))
Member

Can this check be done on entry? Should probably not print the string, enough with the length.

Contributor Author

@sambitdash commented Aug 18, 2017

Checking this at the beginning means potentially making multiple passes over the buffer, since for an AbstractVector it is not clear whether the size is precomputed; that depends on the implementation details. Is there a significant benefit to a precheck? Secondly, the functionality cannot be achieved without a full scan, due to the nature of the algorithm used. Hence, my personal preference is to react only on failure and not check the input specifically.

Since these are low-level functions, my normal approach would be to normalize or sanitize the data: make the array length even by adding an extra zero, and bound the data between 0x0 and 0xf by applying proper filters rather than throwing an exception on failure. But that's a separate discussion.

n += number_from_hex(c2)
d[j+=1] = (n & 0xFF)
end
resize!(d, j)
Member

This won't work with arbitrary AbstractVector, ref #23267 (comment)

Contributor Author

@sambitdash commented Aug 18, 2017

Dependency on length removed. The array should not be printed, as it can be large.

Member

julia> x = Vector{UInt8}(10); resize!(view(x, 1:10), 8)
ERROR: MethodError: no method matching resize!(::SubArray{UInt8,1,Array{UInt8,1},Tuple{UnitRange{Int64}},true}, ::Int64)
Closest candidates are:
  resize!(::Array{T,1} where T, ::Integer) at array.jl:1020
  resize!(::BitArray{1}, ::Integer) at bitarray.jl:836

return (DIGIT_ZERO <= c <= DIGIT_NINE) ? c - DIGIT_ZERO :
(LATIN_UPPER_A <= c <= LATIN_UPPER_F) ? c - LATIN_UPPER_A + 10 :
(LATIN_A <= c <= LATIN_F) ? c - LATIN_A + 10 :
throw(ArgumentError("Not a hexadecimal number"))
Member

perhaps "$c is not a hexadecimal number"

Contributor Author

done

0x45
0x46

julia> d =zeros(UInt8, 3)
Member

missing space

Contributor Author

fixed in last check-in.

LATIN_A = UInt('a')
LATIN_F = UInt('f')

return (DIGIT_ZERO <= c <= DIGIT_NINE) ? c - DIGIT_ZERO :
Member

Three nested ternaries are a bit hard to reason about. Perhaps something like

DIGIT_ZERO    <= c <= DIGIT_NINE    && return c - DIGIT_ZERO
LATIN_UPPER_A <= c <= LATIN_UPPER_F && return c - LATIN_UPPER_A + 10
LATIN_A       <= c <= LATIN_F       && return c - LATIN_A + 10
throw(ArgumentError("not a hexadecimal number: '$(Char(c))'"))

Contributor Author

This is a matter of personal preference and choice. A few lines above, hex2bytes(AbstractVector) has three nested ternaries. Hence ignoring this comment.

Member

Yeah, we can fix it up at some other point, doesn't need to block this PR.

@sambitdash
Contributor Author

Here is a benchmark of a version that sanitizes its input rather than rejecting errors, and that computes on word boundaries using reinterpret.

julia> f_hsB_Array_noview64()
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  4
  --------------
  minimum time:     12.517 ms (0.00% GC)
  median time:      12.641 ms (0.00% GC)
  mean time:        12.675 ms (0.00% GC)
  maximum time:     14.972 ms (0.00% GC)
  --------------
  samples:          395
  evals/sample:     1

Source code provided for reference:

@inline sanitized_number_from_hex64(c::UInt64) = begin
    n::UInt64 = (c & 0x4F4F4F4F4F4F4F4F)

    f1::UInt64 = 0x0000000000000040
    f2::UInt64 = 0xffffffffffffff07
    f3::UInt64 = 0xffffffffffffff07
    ff::UInt64 = 0xffffffffffffffff
    v::UInt64  = 0x0000000000000009

    for i = 1:8
        if (n & f1 == f1)
            n = (n & f3) + v
        end
        f1 <<= 8
        f2 <<= 8
        f3 = f2 + (ff >> 8(8-i))
        v  <<= 8
    end
    d::UInt = 0
    c1::UInt = 0
    c2::UInt = 0
    for i = 1:4
        f1 = 0xff
        c1 = (n & 0xff)
        n >>= 8
        c2 = (n & 0xff)
        n >>= 8
        d += ((c1 << 4 + c2)&0xff) << 8(i-1)
    end
    return UInt32(d)
end

function hex2bytes64!(d::Vector{UInt8}, s::Vector{UInt8})
    r = rem(length(s), 8)
    si = reinterpret(UInt64, s)
    len = div(length(s), 8)
    di = reinterpret(UInt32, d)
    @simd for i = 1:len
        di[i] = sanitized_number_from_hex64(si[i])
    end
    r_2 = div(r,2)
    i = 8len
    for j = 1:r_2
        @inbounds c1 = UInt(s[i+=1])
        @inbounds c2 = UInt(s[i+=1])
        n = sanitized_number_from_hex(c1)
        n <<= 4
        n += sanitized_number_from_hex(c2)
        @inbounds d[8len+j] = (n & 0xFF)
    end
    return d
end

@inline sanitized_number_from_hex(c::UInt) = begin
    DIGIT_NINE     = UInt('9')
    return (c > DIGIT_NINE) ? ((c & 0x07) + 9) : (c & 0x0F)
end

f_hsB_Array_noview64() = begin
    const mb_10 = (10 << 20)

    arr=UInt8[rand(['0','1','2','3','4','5','6','7','8','9','0','a','b','c','d','e','f']) for x = 1:mb_10]
    arr1=zeros(UInt8, (10<<19))

    @benchmark hex2bytes64!(arr1,arr)
end

@stevengj
Member

Is there a type instability? What does @code_warntype say?

For example, I notice that the + 10 is a problem because that incurs a promotion to UInt. Add 0x0a instead.

@stevengj
Member

(Adding a bunch of explicit type declarations in Julia is usually a sign that you are making a mistake somewhere. Often it is masking an unintended type conversion somewhere. And it doesn't make sense to do all computations with 64-bit quantities when all the actual values are 8-bit; probably the performance problem is because of your unintended conversions to 64-bit via your use of Int literals.)

@sambitdash
Contributor Author

@stevengj There is no performance problem with computations on UInt64 quantities, only with the UInt8 functions as suggested. Secondly, is it documented that 0x0a will be interpreted differently from 10? typeof at the REPL does report UInt8 for it. Thirdly, is there a benchmark showing that the changes will actually improve performance? If not, I will be reluctant to make any further changes to the code.

@stevengj
Member

stevengj commented Aug 19, 2017

I verified on my machine that changing + 10 to + 0x0a in the UInt8 version eliminates the type instability (@code_warntype would have warned you about it!) and makes the routine almost 10x faster.

With some additional cleanups, my suggested code is:

@inline number_from_hex(c) =
    (UInt8('0') <= c <= UInt8('9')) ? c - UInt8('0') :
    (UInt8('A') <= c <= UInt8('F')) ? c - (UInt8('A') - 0x0a) :
    (UInt8('a') <= c <= UInt8('f')) ? c - (UInt8('a') - 0x0a) :
    throw(ArgumentError("byte is not an ASCII hexadecimal digit"))

function hex2bytes!(d::AbstractVector{UInt8}, s::AbstractVector{UInt8})
    if 2length(d) != length(s)
        isodd(length(s)) && throw(ArgumentError("input hex array must have even length"))
        throw(ArgumentError("output array must be half length of input array"))
    end
    j = start(d) - 1
    for i = start(s):2:endof(s)
        @inbounds d[j += 1] = number_from_hex(s[i]) << 4 + number_from_hex(s[i+1])
    end
    return d
end

@stevengj
Member

@sambitdash, the performance problem is not with UInt8, but rather with the type instability; see my improved version above and benchmarks thereof. Regarding integer literals, see the discussion in the manual: the literal 10 has type Int and 0x0a has type UInt8, and the result type of + is determined by promoting the operands.
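
A short sketch of the promotion behaviour described here. The two helper names are hypothetical and they only handle `'0'`-`'9'` and `'a'`-`'f'`; the comments on `Int` assume a 64-bit machine where `Int` is `Int64`.

```julia
# Hypothetical helpers for illustration; they differ only in the literal added.
with_int_literal(c::UInt8)   = c <= UInt8('9') ? c - UInt8('0') : c - UInt8('a') + 10
with_uint8_literal(c::UInt8) = c <= UInt8('9') ? c - UInt8('0') : c - UInt8('a') + 0x0a

typeof(0x01 + 10)                       # Int: UInt8 + Int promotes to Int
typeof(0x01 + 0x0a)                     # UInt8: both operands are UInt8

typeof(with_int_literal(UInt8('7')))    # UInt8 (digit branch)
typeof(with_int_literal(UInt8('b')))    # Int (letter branch), so the return type depends on the value
typeof(with_uint8_literal(UInt8('b')))  # UInt8 in both branches
```

Running `@code_warntype with_int_literal(0x61)` should flag the mixed return type, which is the instability being discussed.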

@sambitdash
Contributor Author

@stevengj thanks. As a maintainer you have the rights to make edits. Hence, please submit the changes as desired.

@stevengj
Member

(Of course, you can make it faster still by assuming valid data, etcetera, but I don't think the additional gains are worth it.)

@stevengj
Member

@sambitdash, I don't think I can push changes to this PR, because the PR is pulled from your fork.

@fredrikekre
Member

Seems like the "Allow edits from maintainers." checkbox is ticked though, at least I can make edits :)

There are some test case failures reported related to views.
@sambitdash
Contributor Author

@stevengj and @fredrikekre , I checked in the code @stevengj provided earlier. Those conditions break the test cases on views due to linear indexing assumptions.

@fredrikekre, please make changes necessary related to the code and test cases as desired.

further condense implementation and documentation, merging with `hex2bytes(s::AbstractString)`
3-element Array{UInt8,1}:
0x01
0xab
0xef
```
"""
function hex2bytes end

hex2bytes(s::AbstractString) = hex2bytes(Vector{UInt8}(String(s)))
Member

@stevengj commented Aug 19, 2017

This makes an additional copy of the data for s::String, but is still faster than the old implementation on my machine because it operates on bytes rather than unicode characters. More importantly, it avoids code duplication.

(In any case, it's not clear to me how performance-critical this function is. The caller can always switch to byte arrays if it matters.)

@stevengj
Member

@fredrikekre, ah, thanks.

use ===, not ==, to test that operation occurs in-place, and `hex2bytes!` now throws an error for incorrect output length
@stevengj
Member

stevengj commented Aug 19, 2017

(Note that 1d views still use start:end indexing. The main thing to be careful of is that you don't want to assume that indices start at 1, which may not be true for custom array types like OffsetArrays. The other annoyance, which I always forget, is that start(a) is no longer an index, so you need first(eachindex(a)). Not sure if there is a simpler way.)
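
As a rough sketch of index-agnostic iteration, the loop from the suggested implementation can be written with `firstindex` and `lastindex`, assuming Julia 1.x where those functions are available. The name `sketch_hex2bytes!` is hypothetical and this is not the code that was merged; `number_from_hex` mirrors the helper suggested earlier in the thread.

```julia
# Decode one ASCII hex digit, staying in UInt8 throughout.
@inline function number_from_hex(c::UInt8)
    UInt8('0') <= c <= UInt8('9') && return c - UInt8('0')
    UInt8('A') <= c <= UInt8('F') && return c - (UInt8('A') - 0x0a)
    UInt8('a') <= c <= UInt8('f') && return c - (UInt8('a') - 0x0a)
    throw(ArgumentError("byte is not an ASCII hexadecimal digit"))
end

# Hypothetical sketch: never assumes that indices start at 1.
function sketch_hex2bytes!(d::AbstractVector{UInt8}, s::AbstractVector{UInt8})
    if 2length(d) != length(s)
        isodd(length(s)) && throw(ArgumentError("input hex array must have even length"))
        throw(ArgumentError("output array must be half length of input array"))
    end
    j = firstindex(d) - 1
    for i in firstindex(s):2:lastindex(s)
        # `<<` binds tighter than `+`, so this is (high << 4) + low.
        d[j += 1] = number_from_hex(s[i]) << 4 + number_from_hex(s[i+1])
    end
    return d
end

buf = zeros(UInt8, 5)
sketch_hex2bytes!(view(buf, 2:4), Vector{UInt8}("01abEF"))
buf   # 0x00, 0x01, 0xab, 0xef, 0x00
```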

@stevengj
Member

(It's annoying to make more than 1-line changes via the github web UI since I have to wait for CI to run in order to detect any typos.)

@stevengj
Member

Travis failures seem unrelated (a timeout and a problem with test/file.jl)

@StefanKarpinski
Member

This ended up being a very nice (first?) PR to Julia – thanks for bearing with it, @sambitdash. One simple API item that seems to be missing still is a hex2bytes! method that takes strings as the second argument. However, that can be added at any point should someone need it, so I'm not concerned. Thanks for the excellent work!

@fredrikekre removed the "needs tests" label (unit tests are required for this change) on Aug 22, 2017
@sambitdash
Contributor Author

sambitdash commented Aug 23, 2017

Thanks @stevengj for all the help. Without your assistance I would not have understood the Julia type system and the associated code generation, and would have attempted type conversions / casting to optimize. Thanks everyone for your help in accessing base and understanding the style guidelines.

Note: Having been a C/C++ programmer early in life, it takes some iterations to unlearn old habits and embrace newer paradigms.
