
Use memcmp to optimize == for bit integer arrays #16877

Merged (2 commits) on Oct 17, 2016

Conversation

@TotalVerb (Contributor) commented Jun 11, 2016

After #16855, it's faster to compare two arrays for equality with String(a) == String(b) than with a == b. No joke!

julia> using BenchmarkTools

julia> const A = repeat([0x01:0x7F;], outer=10000);

julia> const B = repeat([0x01:0x7F;], outer=10000);

julia> @benchmark A == B
BenchmarkTools.Trial: 
  samples:          2271
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     1.99 ms (0.00% GC)
  median time:      2.27 ms (0.00% GC)
  mean time:        2.20 ms (0.00% GC)
  maximum time:     2.46 ms (0.00% GC)

julia> @benchmark String(A) == String(B)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  32.00 bytes
  allocs estimate:  2
  minimum time:     76.32 μs (0.00% GC)
  median time:      80.60 μs (0.00% GC)
  mean time:        82.95 μs (0.00% GC)
  maximum time:     314.34 μs (0.00% GC)

In fact, the fastest (non-String) way to compare byte arrays for equality seems to be lexcmp, which uses memcmp under the hood:

julia> @benchmark lexcmp(A, B) == 0
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     76.49 μs (0.00% GC)
  median time:      86.93 μs (0.00% GC)
  mean time:        89.47 μs (0.00% GC)
  maximum time:     450.16 μs (0.00% GC)
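The memcmp call can be pulled out of lexcmp entirely. A minimal standalone sketch of the same idea, assuming byte vectors (bytes_equal is a hypothetical name for illustration, not Base API):

```julia
# Hypothetical helper: equality of byte vectors via memcmp, roughly what
# lexcmp does internally. Julia converts the Vector arguments to pointers
# for the ccall and keeps them rooted for its duration.
function bytes_equal(a::Vector{UInt8}, b::Vector{UInt8})
    length(a) == length(b) &&
        ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), a, b, length(a)) == 0
end
```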

I think it makes sense to move this optimization up one level, from strings to arrays. I'm not entirely sure what the scope should be: currently it's one-dimensional arrays of integral types that are at most 64 bits and not Bool, which is a bit arbitrary. It doesn't work for floating-point types because of NaN behaviour, and I'm a little unclear about the semantics of Bool in Julia, but I think it might work for that too, if safe.
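The floating-point caveat is easy to demonstrate: elementwise == and bitwise equality disagree on NaN (equal bits, unequal values) and on signed zeros (equal values, unequal bits), so memcmp would get both cases wrong:

```julia
# Why memcmp cannot implement == for floating point.
a, b = [NaN], [NaN]
a == b                                             # false: NaN != NaN
reinterpret(UInt64, a) == reinterpret(UInt64, b)   # true: identical bit patterns

c, d = [0.0], [-0.0]
c == d                                             # true: 0.0 == -0.0
reinterpret(UInt64, c) == reinterpret(UInt64, d)   # false: sign bit differs
```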

@@ -128,9 +128,7 @@ isless(a::AbstractString, b::AbstractString) = cmp(a,b) < 0
cmp(a::String, b::String) = lexcmp(a.data, b.data)
cmp(a::Symbol, b::Symbol) = Int(sign(ccall(:strcmp, Int32, (Cstring, Cstring), a, b)))
Member:
Shouldn't Symbol benefit from the same optimization as String? Looks like it could have an effect in a lot of areas.

@TotalVerb (author):

Isn't this function already using ccall? Or do you mean a different one?

Member:

I was thinking about == for Symbol, but actually it calls === so I guess that's OK already.
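For reference, Symbols are interned, so identity comparison is both correct and constant-time; no memcmp is needed:

```julia
# Symbols with the same name are the same interned object, so == for
# Symbol can simply fall back to pointer identity (===).
s = Symbol("foo")   # constructed at runtime
t = :foo            # literal
s === t             # same interned object
```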

@timholy (Member) commented Jun 11, 2016

Wow.

For me, the following Julia implementation:

function iseq3(A::AbstractArray, B::AbstractArray)
    if size(A) != size(B)
        return false
    end
    if isa(A,Range) != isa(B,Range)
        return false
    end
    failures = 0
    @simd for I in eachindex(A)
        @inbounds failures += A[I] != B[I]
    end
    return failures == 0
end

is (for the arrays you were testing on) much better than our current implementation (by 3-4x), but the ccall is another 4x better than that.
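The sketch above uses the old Range type from 0.x; on current Julia the same idea runs with AbstractRange. A runnable adaptation, assuming the same branchless-accumulator trick that lets @simd vectorize the loop:

```julia
# Adaptation of the sketch above for current Julia (Range is now AbstractRange).
# Accumulating mismatch counts instead of branching keeps the loop SIMD-friendly.
function iseq3(A::AbstractArray, B::AbstractArray)
    size(A) == size(B) || return false
    isa(A, AbstractRange) == isa(B, AbstractRange) || return false
    failures = 0
    @simd for I in eachindex(A)
        @inbounds failures += A[I] != B[I]
    end
    return failures == 0
end
```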

@@ -653,6 +653,11 @@ function lexcmp(a::Array{UInt8,1}, b::Array{UInt8,1})
return c < 0 ? -1 : c > 0 ? +1 : cmp(length(a),length(b))
end

# use memcmp for == on integer types
=={T<:Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,Int64,UInt64}}(a::Array{T,1}, b::Array{T,1}) =
Member:

You could replace this union by the predefined BitInteger64, or BitInteger if you add [U]Int128 (which you should!)

@TotalVerb (author):

Awesome! I think I will probably extend this to n-dimensional arrays too; no reason to restrict to one-dimension.

@TotalVerb (author) commented Jun 11, 2016

Issues addressed. This final version, on my system, has the following benchmark results (each benchmark randomly generates an array, copies it, and compares the two copies):

NEW ==
rand(UInt8, 0): 8
rand(UInt8, 1): 7
rand(UInt8, 2): 7
rand(UInt8, 3): 7
rand(UInt8, 4): 8
rand(UInt8, 5): 7
rand(UInt8, 10): 8
rand(UInt8, 20): 7
rand(UInt8, 50): 9
rand(UInt8, 100): 11
rand(UInt8, 1000): 36
rand(UInt8, 10000): 292
rand(UInt8, 100000): 3552
rand(UInt8, 1000000): 56017
rand(UInt8, 10000000): 1003760

OLD ==
rand(UInt8, 0): 8
rand(UInt8, 1): 9
rand(UInt8, 2): 10
rand(UInt8, 3): 11
rand(UInt8, 4): 13
rand(UInt8, 5): 15
rand(UInt8, 10): 22
rand(UInt8, 20): 38
rand(UInt8, 50): 93
rand(UInt8, 100): 171
rand(UInt8, 1000): 1583
rand(UInt8, 10000): 15696
rand(UInt8, 100000): 156832
rand(UInt8, 1000000): 1566205
rand(UInt8, 10000000): 15829086

NEW ==
rand(Int64, 0): 7
rand(Int64, 1): 6
rand(Int64, 2): 6
rand(Int64, 3): 7
rand(Int64, 4): 7
rand(Int64, 5): 7
rand(Int64, 10): 10
rand(Int64, 20): 12
rand(Int64, 50): 20
rand(Int64, 100): 30
rand(Int64, 1000): 237
rand(Int64, 10000): 3230
rand(Int64, 100000): 43568
rand(Int64, 1000000): 757348
rand(Int64, 10000000): 8574296

OLD ==
rand(Int64, 0): 8
rand(Int64, 1): 9
rand(Int64, 2): 10
rand(Int64, 3): 11
rand(Int64, 4): 13
rand(Int64, 5): 15
rand(Int64, 10): 22
rand(Int64, 20): 38
rand(Int64, 50): 93
rand(Int64, 100): 171
rand(Int64, 1000): 1583
rand(Int64, 10000): 15700
rand(Int64, 100000): 156662
rand(Int64, 1000000): 1603819
rand(Int64, 10000000): 16082368

It is a bit concerning that the time taken by the new version appears slightly superlinear, but it is so much faster that the superlinearity hardly matters. It may be an artifact of using the minimum time from BenchmarkTools's @benchmark macro, or it could be due to cache effects.

The results on Int64 show that most of the improvement seems to come from SIMD optimizations, which are more powerful on bytes than on words.

@TotalVerb TotalVerb changed the title [RFC] Use memcmp to optimize == for one-dimensional integral vectors Use memcmp to optimize == for one-dimensional integral vectors Jun 12, 2016
@TotalVerb TotalVerb changed the title Use memcmp to optimize == for one-dimensional integral vectors Use memcmp to optimize == for bit integer arrays Jun 28, 2016
@kshyatt kshyatt added the performance Must go faster label Jul 3, 2016
@TotalVerb (author):

I've rebased this. This performance optimization should still be valid on v0.6.

@StefanKarpinski StefanKarpinski added this to the 0.6.0 milestone Sep 14, 2016
@tkelman (Contributor) commented Sep 15, 2016

dunno if this is covered by existing benchmarks, but @nanosoldier runbenchmarks(ALL, vs = ":master")

# use memcmp for == on bit integer types
=={T<:BitInteger,N}(a::Array{T,N}, b::Array{T,N}) =
    size(a) == size(b) &&
    ccall(:memcmp, Int32, (Ptr{T}, Ptr{T}, UInt), a, b, sizeof(T) * length(a)) == 0
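A standalone version of this definition can be tried without touching Base. A sketch under modern where-clause syntax, with a hypothetical name (memcmp_eq) to avoid pirating Base's ==, and the bit-integer union spelled out since Base.BitInteger is internal:

```julia
# Hypothetical standalone analogue of the merged method: any rank, any
# bit-integer element type; sizes must match, then compare raw bytes.
const MyBitInteger = Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,
                           Int64,UInt64,Int128,UInt128}

memcmp_eq(a::Array{T,N}, b::Array{T,N}) where {T<:MyBitInteger,N} =
    size(a) == size(b) &&
    ccall(:memcmp, Int32, (Ptr{T}, Ptr{T}, UInt), a, b, sizeof(T) * length(a)) == 0
```

Note that requiring size(a) == size(b), not just equal lengths, is what makes the n-dimensional generalization correct.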
Contributor:

I find it more readable to lead with the 0 == as opposed to having to match parens over the entire length of the line to see where the ccall starts and ends. Also if one of these is using function form, they're similar enough that may as well use the same form in both.

@TotalVerb (author):

This was just lazy rebasing, sorry. Fixed.

@TotalVerb (author):

Oops, I think this might have accidentally invalidated Nanosoldier, though maybe it will still report the results of the run?

Contributor:

It just hides the status. It's probably already started by now and should still post a comment, and if the code is functionally the same and your rebase was just a style change, shouldn't matter.

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@stevengj (Member) commented Oct 6, 2016

Bump.

@TotalVerb (author):

I can't confirm right now, but those regressions look like noise to me.

@JeffBezanson JeffBezanson merged commit 6a76dc7 into JuliaLang:master Oct 17, 2016