-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hex2bytes for Vector{UInt8} #23267
hex2bytes for Vector{UInt8} #23267
Conversation
Sanitizing input vs. Rejecting Bad Input All inputs were sanitized instead of rejecting for bad input. SIMD performance gains are almost 10% due to this:
|
Thanks, @sambitdash ! Noting that this fixes #23161 |
base/strings/util.jl
Outdated
copied into the destination array. The size of destination array must be at least half of | ||
the `nInBytes` parameter. | ||
""" | ||
function hex2bytes(d::Vector{UInt8}, s::Vector{UInt8}, nInBytes::Int=length(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this modifies its argument, it should be called hex2bytes!
. It would be good to also add a 1-argument version that allocates the output for you, hex2bytes(s::Vector{UInt8})
.
base/strings/util.jl
Outdated
end | ||
|
||
len2 = div(nInBytes, 2) | ||
if size(d)[1] < len2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use length(d)
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this code uses @inbounds
but doesn't check that nInBytes <= length(s)
.
base/strings/util.jl
Outdated
end | ||
|
||
i = 0 | ||
j = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will return 1 if the input is empty.
Please add some tests in test/strings/util.jl. |
base/strings/util.jl
Outdated
end | ||
|
||
@inline get_number_from_hex(c::UInt) = begin | ||
const DIGIT_ZERO = UInt('0') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove all const
.
base/strings/util.jl
Outdated
return j | ||
end | ||
|
||
@inline get_number_from_hex(c::UInt) = begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@inline function number_from_hex(c::UInt)
# ...
end
base/strings/util.jl
Outdated
@@ -464,6 +464,55 @@ function hex2bytes(s::AbstractString) | |||
end | |||
|
|||
""" | |||
hex2bytes(d::Vector{UInt8}, s::Vector{UInt8}, nInBytes::Int=length(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nInBytes
is a bit of weird capitalization. Perhaps use ninbytes
or perhaps even better just n
,
base/strings/util.jl
Outdated
results are populated into a destination array. The function returns the number of bytes | ||
copied into the destination array. The size of destination array must be at least half of | ||
the `nInBytes` parameter. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use and example here:
# Examples
```jldoctest
julia> hex2bytes(...)
end
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hex2bytes
now has an example, moved from hex2bytes!
.
Typically, when we have both an in-place and out-of-place version of a function, we only give examples in the simpler out-of-place version.
Added test cases. |
I'm not entirely sold on the |
It also should not return the number of bytes decoded – that's completely deterministic based on the input, unlike I/O functions, where it's a useful return value. Instead, APIs like this typically return the destination object so that you can easily write code that chains operations. |
@StefanKarpinski while destination buffer makes sense, when you are reading from a file of 2020 bytes and your temporary buffer is 1000 bytes, the last 20 bytes read may need a @view or a mechanism to specify the n variable. So may be it would be hex2bytes(d::AbstractArray{UInt8}, s::AbstractArray{UInt8}). One can use @view for both source and destination. But developers may miss to pick up the @view approach. I will modify the signature accordingly. |
base/strings/util.jl
Outdated
hex2bytes(s::Vector{UInt8}) | ||
|
||
Convert the hexadecimal bytes array to its binary representation. Returns an | ||
`Array{UInt8,1}`, i.e. an array of bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps use Vector{UInt8}
to match the above signature?
base/strings/util.jl
Outdated
@@ -464,11 +464,103 @@ function hex2bytes(s::AbstractString) | |||
end | |||
|
|||
""" | |||
hex2bytes(s::Vector{UInt8}) | |||
|
|||
Convert the hexadecimal bytes array to its binary representation. Returns an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returns
-> return
base/strings/util.jl
Outdated
`Array{UInt8,1}`, i.e. an array of bytes. | ||
""" | ||
@inline function hex2bytes(s::Vector{UInt8}) | ||
d = Vector{UInt8}(div(length(s), 2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably error here if the length is not even?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed as the underlying method will throw exception. Per @StefanKarpinski all the interfaces are changed to AbstractVector. Many of comments may not be relevant in that case.
base/strings/util.jl
Outdated
of the `n` parameter. | ||
|
||
# Examples | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```jldoctest
base/strings/util.jl
Outdated
|
||
# Examples | ||
``` | ||
julia> s = UInt8["01abEF"...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no indentation here and below
base/strings/util.jl
Outdated
bytes2hex(bin_arr::Array{UInt8, 1}) -> String | ||
|
||
Convert an array of bytes to its hexadecimal representation. | ||
All characters are in lower-case. | ||
|
||
it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
test/strings/util.jl
Outdated
@@ -247,3 +247,24 @@ bin_val = hex2bytes("07bf") | |||
|
|||
#non-hex characters | |||
@test_throws ArgumentError hex2bytes("0123456789abcdefABCDEFGH") | |||
|
|||
function test_23161() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps rewrite as a @testset
?
All tests are added and all the review comments are incorporated. |
base/strings/util.jl
Outdated
0x00 | ||
0x00 | ||
|
||
julia> hex2bytes!(d, s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns d
now, not 3, right? The docstring above should be changed in regard to this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also
The docstring above should be changed in regard to this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -762,6 +762,7 @@ export | |||
graphemes, | |||
hex, | |||
hex2bytes, | |||
hex2bytes!, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will need to be added to the stdlib doc index somewhere to show up in the rendered docs https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#adding-a-new-docstring-to-base
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
base/strings/util.jl
Outdated
n = 0 | ||
c1, i = next(s, i) | ||
done(s, i) && throw(ArgumentError( | ||
"string length must be even: length($(repr(s))) == $(length(s))")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this check be done on entry? Should probably not print the string, enough with the length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check on beginning means you have a potential case of multiple passes of the buffer as there is no clarity in a AbstractVector if the size is pre-computed. And depend on the implementation details. Is there a significant benefit in precheck? Secondly the functionality cannot be achieved without a full scan due to the nature of the alogorithm used. Hence, my personal preference will be react only on failure and not check specifically for input.
Since, these are low level functions my normal approach will be to normalize or sanitize the data such that array length is made even by adding an extra zero. And bound the data btw "0x0-0xf" by applying proper filters than exception on failure. But that's a separate discussion.
base/strings/util.jl
Outdated
n += number_from_hex(c2) | ||
d[j+=1] = (n & 0xFF) | ||
end | ||
resize!(d, j) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work with arbitrary AbstractVector
, ref #23267 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dependency on length removed. The array should not be printed as it an be large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
julia> x = Vector{UInt8}(10); resize!(view(x, 1:10), 8)
ERROR: MethodError: no method matching resize!(::SubArray{UInt8,1,Array{UInt8,1},Tuple{UnitRange{Int64}},true}, ::Int64)
Closest candidates are:
resize!(::Array{T,1} where T, ::Integer) at array.jl:1020
resize!(::BitArray{1}, ::Integer) at bitarray.jl:836
base/strings/util.jl
Outdated
return (DIGIT_ZERO <= c <= DIGIT_NINE) ? c - DIGIT_ZERO : | ||
(LATIN_UPPER_A <= c <= LATIN_UPPER_F) ? c - LATIN_UPPER_A + 10 : | ||
(LATIN_A <= c <= LATIN_F) ? c - LATIN_A + 10 : | ||
throw(ArgumentError("Not a hexadecimal number")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps "$c is not a hexadecimal number"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
in exception handling as it can raise exception.
base/strings/util.jl
Outdated
0x45 | ||
0x46 | ||
|
||
julia> d =zeros(UInt8, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in last check-in.
base/strings/util.jl
Outdated
LATIN_A = UInt('a') | ||
LATIN_F = UInt('f') | ||
|
||
return (DIGIT_ZERO <= c <= DIGIT_NINE) ? c - DIGIT_ZERO : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 nested tertiaries is a bit hard to reason about. Perhaps something like
DIGIT_ZERO <= c <= DIGIT_NINE && return c - DIGIT_ZERO
LATIN_UPPER_A <= c <= LATIN_UPPER_F && return c - LATIN_UPPER_A + 10
LATIN_A <= c <= LATIN_F && return c - LATIN_A + 10
throw(ArgumentError("not a hexadecimal number: '$(Char(c))'"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a matter of personal preference and choice. A few lines above in hex2bytes(AbstractVector) has 3 nested tertiaries. Hence ignoring this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a matter of personal preference and choice. A few lines above in hex2bytes(AbstractVector) has 3 nested tertiaries. Hence ignoring this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can fix it up at some other point, doesn't need to block this PR.
Here is the benchmark of code which was sanitized for input and not errors rejected and implemented with word boundary computation using reinterpret.
Source code provided for reference:
|
Is there a type instability? What does For example, I notice that the |
(Adding a bunch of explicit type declarations in Julia is usually a sign that you are making a mistake somewhere. Often it is masking an unintended type conversion somewhere. And it doesn't make sense to do all computations with 64-bit quantities when all the actual values are 8-bit; probably the performance problem is because of your unintended conversions to 64-bit via your use of |
@stevengj There is no performance problem with computations with UInt64 quantities but with usage of UInt8 functions as suggested. Secondly, is it documented that 0x0a will be interpreted differently than 10, although typeof on REPL provides UInt8. Thirdly, is there a benchmark to suggest the changes will be actually improving on performance. If not I will be reluctant to make any further changes to the code. |
I verified on my machine that changing With some additional cleanups, my suggested code is: @inline number_from_hex(c) =
(UInt8('0') <= c <= UInt8('9')) ? c - UInt8('0') :
(UInt8('A') <= c <= UInt8('F')) ? c - (UInt8('A') - 0x0a) :
(UInt8('a') <= c <= UInt8('f')) ? c - (UInt8('a') - 0x0a) :
throw(ArgumentError("byte is not an ASCII hexadecimal digit"))
function hex2bytes!(d::AbstractVector{UInt8}, s::AbstractVector{UInt8})
if 2length(d) != length(s)
isodd(length(s)) && throw(ArgumentError("input hex array must have even length"))
throw(ArgumentError("output array must be half length of input array"))
end
j = start(d) - 1
for i = start(s):2:endof(s)
@inbounds d[j += 1] = number_from_hex(s[i]) << 4 + number_from_hex(s[i+1])
end
return d
end |
@sambitdash, the performance problem is not with |
@stevengj thanks. As a maintainer you have the rights to the make edits. Hence, please submit the changes as desired. |
(Of course, you can make it faster still by assuming valid data, etcetera, but I don't think the additional gains are worth it.) |
@sambitdash, I don't think I can push changes to this PR, because the PR is pulled from your fork. |
Seems like the "Allow edits from maintainers." checkbox is ticked though, at least I can make edits :) |
There are some test case failures reported related to views.
@stevengj and @fredrikekre , I checked in the code @stevengj provided earlier. Those conditions break the test cases on views due to linear indexing assumptions. @fredrikekre, please make changes necessary related to the code and test cases as desired. |
further condense implementation and documentation, merging with `hex2bytes(s::AbstractString)`
base/strings/util.jl
Outdated
3-element Array{UInt8,1}: | ||
0x01 | ||
0xab | ||
0xef | ||
``` | ||
""" | ||
function hex2bytes end | ||
|
||
hex2bytes(s::AbstractString) = hex2bytes(Vector{UInt8}(String(s))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes an additional copy of the data for s::String
, but is still faster than the old implementation on my machine because it operates on bytes rather than unicode characters. More importantly, it avoids code duplication.
(In any case, it's not clear to me how performance-critical this function is. The caller can always switch to byte arrays if it matters.)
@fredrikekre, ah, thanks. |
use ===, not ==, to test that operation occurs in-place, and `hex2bytes!` now throws an error for incorrect output length
(Note that 1d views still use |
(It's annoying to make more than 1-line changes via the github web UI since I have to wait for CI to run in order to detect any typos.) |
Travis failures seem unrelated (a timeout and a problem with |
This ended up being a very nice (first?) PR to Julia – thanks for bearing with it, @sambitdash. One simple API item that seems to be missing still is a |
Thanks @stevengj for all the help. Without your assistance I would not have understood the Julia type system and the associated code generation and would have attempted type conversions / casting to optimize. Thanks everyone for your help in accessing Note: Being a C/C++ programmer early in life takes some iterations to unlearn and embrace newer paradigms. |
function hex2bytes(d::Vector{UInt8}, s::Vector{UInt8}, nInBytes::Int=length(s)) -> Int
Benchmark numbers attached:
Fix #23161