efficient bit rotate functions: ror, rol #11592
What is the trickier implementation? |
Oops, edited to add that. |
I was on the edge of my seat there for a minute :) |
#!/usr/bin/julia -f
macro time_func(f, args...)
    args = eval(current_module(), Expr(:tuple, args...))::Tuple
    argnames = Symbol[gensym() for i in 1:length(args)]
    types = map(typeof, args)
    quote
        function wrapper($(argnames...))
            $(Expr(:meta, :noinline))
            $f($(argnames...))
        end
        function timing_wrapper()
            println($f, $types)
            wrapper($(args...))
            gc()
            @time for i in 1:1000000000
                wrapper($(args...))
            end
            gc()
        end
        timing_wrapper()
    end
end
ror1(x::UInt64, k::Int8) = (x >>> (0x3f & k)) | (x << (0x3f & -k))
function ror2(x::UInt64, k::Int8)
    Base.llvmcall("""
    %3 = tail call i64 asm \"rorq \$1,\$0\", \"=r,{cx},0,~{dirflag},~{fpsr},~{flags}\"(i8 %1, i64 %0)
    ret i64 %3
    """, UInt64, Tuple{UInt64, UInt8}, x, k)
end
for i in 1:10
    println("$i: $(hex(ror1(UInt64(1), Int8(i))))")
end
for i in 1:10
    println("$i: $(hex(ror2(UInt64(1), Int8(i))))")
end
code_native(ror1, Tuple{UInt64, Int8})
code_native(ror2, Tuple{UInt64, Int8})
@time_func(ror1, UInt64(1), Int8(10))
@time_func(ror2, UInt64(1), Int8(10))
Output:
|
P.S. And by "copied from
uint64_t
rotr64(uint64_t x, uint8_t r)
{
    asm("rorq %1,%0" : "+r" (x) : "c" (r));
    return x;
}
Which is in turn adapted from here |
So, you add this for Intel platforms, and fall back to the old for ARM, Power, etc.? (or get somebody to figure out the correct inline asm for those platforms)? 👍 For my bit twiddling, this would be very nice, esp. since I'd no longer have to deal with inline asm for each platform myself. |
Well, IMHO, although the inline assembly works, it should not be done in Julia in most cases. In general, it should really be the job of LLVM to emit the best assembly, and recent versions already can. Inline assembly can probably be used for something that directly addresses special hardware features, but probably not for general-purpose functions, and especially not this one, since LLVM can already do it. |
Ok, I thought people had said that llvm didn't support it... It does as of what version? |
It doesn't support it as an llvm instruction but the x86_64 backend can recognize the code and emit |
@yuyichao, you said:
I asked as of what version it was supported? (i.e. the earliest version, not what you used). Is that supported in 3.3? |
I suppose @StefanKarpinski was on 3.3 and as I just checked it doesn't optimize the function to |
cc: @VicDrastik |
Just an update. I happened to notice that LLVM (3.7) does not emit The difference in performance on my Haswell laptop is ~10%. Not sure if this is good enough... c.c. @Keno |
Isn't there an |
Mostly because Julia does not create optimal code when doing cyclic bit shifts, see JuliaLang/julia#11592 and JuliaLang/julia#19923
Now that we've deprecated the old
julia> ror(x::Int, k::Int) = (x >>> (0x3f & k)) | (x << (0x3f & -k))
ror (generic function with 1 method)
julia> @code_native ror(rand(Int), 3)
.section __TEXT,__text,regular,pure_instructions
; Function ror {
; Location: REPL[1]:1
; Function |; {
; Location: REPL[1]:1
movl %esi, %ecx
decl %eax
rorl %cl, %edi
;}
decl %eax
movl %edi, %eax
retl
nopl (%eax)
;} |
Yes, I think we should certainly expose these. That still seems like a lot of instructions for a simple |
I get
|
Looks like @mbauman is on a 32bit machine? |
Nope, that's the official 1.0.0 binary on my Mac (an old westmere/nehalem system). On master I see the same as you, @KristofferC. |
Yep, same as you with 1.0.1 binary for me. |
Also get the |
|
The whole code fragment operates on 32-bit values, so it's self-consistent. |
Since you are on Mac, I presume that you're seeing #28046 |
1.1.0 release shows the correct output now (#28046) |
True, so we can now generate efficient native code, and also have it disassemble correctly on Macs, but shouldn't this issue also include exposing |
Seems like a good idea. |
bump |
Someone just needs to make a PR defining these, adding some tests and NEWS. |
Should these be implemented for all different integer types ( |
At least the unsigned ones. I'm not sure what the right definition of ror and rol for signed types is except if you want to just rotate them as if they were unsigned, i.e. cast, rotate, cast back. |
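To make the "cast, rotate, cast back" option concrete, here is a minimal sketch for 64-bit values. The names `ror_u` and `ror_s` are hypothetical and chosen only for this illustration; the rotate itself reuses the masked form quoted earlier in the thread.

```julia
# Hypothetical helper names; nothing here is an existing Base function.

# Masked rotate-right on the unsigned type (same form as ror1 above).
ror_u(x::UInt64, k::Integer) = (x >>> (0x3f & k)) | (x << (0x3f & -k))

# Signed version: reinterpret the bits as unsigned, rotate, reinterpret back.
ror_s(x::Int64, k::Integer) = reinterpret(Int64, ror_u(reinterpret(UInt64, x), k))

println(ror_s(Int64(1), 1) == typemin(Int64))   # the low bit wraps around into the sign bit
println(ror_s(Int64(-1), 5) == Int64(-1))       # all-ones is unchanged by any rotation
```

Since `>>>` is a logical shift on signed integers in Julia as well, applying the masked expression directly to an `Int64` should produce the same bit pattern; the explicit casts mainly document the intent.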
I had this in my files: (I use rotate with signed ints)
|
Nice. The colons before the |
I removed the colons. I do not see how to use |
for T in Base.BitInteger_types
    mask = UInt8(sizeof(T) << 3 - 1)
    @eval begin
        ror(x::$T, k::Integer) = (x >>> ($mask & k)) | (x << ($mask & -k))
        rol(x::$T, k::Integer) = (x << ($mask & k)) | (x >>> ($mask & -k))
    end
end |
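As a rough idea of the tests a PR could carry, the sketch below redefines the same two functions locally and checks a few properties any rotate should satisfy. It makes two assumptions for the sake of being self-contained: it spells out the unsigned types explicitly instead of using the internal `Base.BitInteger_types`, and it writes the mask as `8 * sizeof(T) - 1`, which is the same value as `sizeof(T) << 3 - 1`.

```julia
# Sketch only: ror/rol are defined locally here, mirroring the proposal above.
const ROT_TYPES = (UInt8, UInt16, UInt32, UInt64, UInt128)

for T in ROT_TYPES
    mask = UInt8(8 * sizeof(T) - 1)   # bit width minus one, e.g. 0x3f for UInt64
    @eval begin
        ror(x::$T, k::Integer) = (x >>> ($mask & k)) | (x << ($mask & -k))
        rol(x::$T, k::Integer) = (x << ($mask & k)) | (x >>> ($mask & -k))
    end
end

for T in ROT_TYPES
    nbits = 8 * sizeof(T)
    x = one(T)
    for k in -2*nbits:2*nbits
        @assert rol(ror(x, k), k) == x            # rol undoes ror
        @assert ror(x, k + nbits) == ror(x, k)    # periodic in the bit width
        @assert ror(x, -k) == rol(x, k)           # negative counts rotate the other way
        @assert count_ones(ror(x, k)) == 1        # no bits are lost or duplicated
    end
end
```

These properties hold for any shift count, including counts larger than the word size and negative ones, which is what the masking buys over the unmasked form.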
elegant |
So here's a kooky idea. We don't have great words for Perhaps this is too clever by half, but I like the idea of exposing this through a special
struct Bits{T <: Integer} <: AbstractVector{Bool}
    data::T
end
Base.size(::Bits{T}) where {T} = (sizeof(T)*8,)
Base.getindex(b::Bits, i::Int) = b.data & (1 << (i-1)) != 0
This is likely more useful than
Introduce one new name, get rid of 6 of the 37 remaining Base exports that are The biggest downside is that when you're doing bit-twiddling, you pretty much always want to know that you're using those strangely-named intrinsics. This makes them look like any other array function (because they are). |
For the record, this is exactly what the The dual of |
The obvious way to implement bit rotation is this:
This has a couple of issues, however. First, rotating by more than a word is broken:
Second, the native code is awful:
Both can be improved with a slightly trickier implementation:
That's way better machine code – but ror is an x86 instruction – this should just boil down to that. Given that LLVM does not expose rotate instructions, what do we have to do here to get this to emit a single x86 instruction?
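The code snippets from this post did not survive in the text above, so the following is an assumed reconstruction rather than the original code. `naive_ror` is a guess at what "the obvious way" looks like, and `masked_ror` uses the masked expression quoted in the comments; both names are hypothetical. The sketch only illustrates the first issue, rotating by more than a word.

```julia
# Assumed reconstruction; neither function name comes from the original post.

# Unmasked form: shift right by k, shift left by the remaining bits.
naive_ror(x::UInt64, k::Int) = (x >>> k) | (x << (64 - k))

# Masked form: both shift counts are reduced modulo 64 first.
masked_ror(x::UInt64, k::Int) = (x >>> (0x3f & k)) | (x << (0x3f & -k))

x = UInt64(1)
# For k = 65 the unmasked form shifts everything out on the right, and the
# negative left-shift count (64 - 65 = -1) becomes a right shift, so the bit is lost.
println(naive_ror(x, 65) == 0)
# The masked form treats 65 as 1 and wraps the bit around to the top.
println(masked_ror(x, 65) == 0x8000000000000000)
# For counts within a word the two agree.
println(naive_ror(x, 1) == masked_ror(x, 1))
```

The difference comes from Julia's shift semantics: shifting by at least the bit width yields zero, and a negative count shifts in the opposite direction, so the unmasked form has no wrap-around outside `0:64`.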