
Use memcmp to optimize == for bit integer arrays #16877

Merged (2 commits) on Oct 17, 2016

Conversation

@TotalVerb (Contributor) commented Jun 11, 2016

After #16855, it's faster to compare two arrays for equality with String(a) == String(b) than with a == b. No joke!

julia> using BenchmarkTools

julia> const A = repeat([0x01:0x7F;], outer=10000);

julia> const B = repeat([0x01:0x7F;], outer=10000);

julia> @benchmark A == B
BenchmarkTools.Trial: 
  samples:          2271
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     1.99 ms (0.00% GC)
  median time:      2.27 ms (0.00% GC)
  mean time:        2.20 ms (0.00% GC)
  maximum time:     2.46 ms (0.00% GC)

julia> @benchmark String(A) == String(B)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  32.00 bytes
  allocs estimate:  2
  minimum time:     76.32 μs (0.00% GC)
  median time:      80.60 μs (0.00% GC)
  mean time:        82.95 μs (0.00% GC)
  maximum time:     314.34 μs (0.00% GC)

In fact, the fastest (non-String) way to compare byte arrays for equality seems to be lexcmp, which uses memcmp under the hood:

julia> @benchmark lexcmp(A, B) == 0
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     76.49 μs (0.00% GC)
  median time:      86.93 μs (0.00% GC)
  mean time:        89.47 μs (0.00% GC)
  maximum time:     450.16 μs (0.00% GC)
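The memcmp call can be pulled out of lexcmp entirely. A minimal standalone sketch of the same idea, assuming byte vectors (bytes_equal is a hypothetical name for illustration, not Base API):

```julia
# Hypothetical helper: equality of byte vectors via memcmp, roughly what
# lexcmp does internally. Julia converts the Vector arguments to pointers
# for the ccall and keeps them rooted for its duration.
function bytes_equal(a::Vector{UInt8}, b::Vector{UInt8})
    length(a) == length(b) &&
        ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), a, b, length(a)) == 0
end
```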

I think it makes sense to move this optimization up one level, from strings to arrays. I'm not entirely sure what the scope should be: currently it's one-dimensional arrays of integral types that are at most 64 bits and not Bool, which is a bit arbitrary. It doesn't work for floating-point types because of NaN behaviour, and I'm a little unclear about the semantics of Bool in Julia, but I think it might work for that too, if safe.
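The floating-point caveat is easy to demonstrate: elementwise == and bitwise equality disagree on NaN (equal bits, unequal values) and on signed zeros (equal values, unequal bits), so memcmp would get both cases wrong:

```julia
# Why memcmp cannot implement == for floating point.
a, b = [NaN], [NaN]
a == b                                             # false: NaN != NaN
reinterpret(UInt64, a) == reinterpret(UInt64, b)   # true: identical bit patterns

c, d = [0.0], [-0.0]
c == d                                             # true: 0.0 == -0.0
reinterpret(UInt64, c) == reinterpret(UInt64, d)   # false: sign bit differs
```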

@@ -128,9 +128,7 @@ isless(a::AbstractString, b::AbstractString) = cmp(a,b) < 0
cmp(a::String, b::String) = lexcmp(a.data, b.data)
cmp(a::Symbol, b::Symbol) = Int(sign(ccall(:strcmp, Int32, (Cstring, Cstring), a, b)))
Member:
Shouldn't Symbol benefit from the same optimization as String? Looks like it could have an effect in a lot of areas.

@TotalVerb (author):

Isn't this function already using ccall? Or do you mean a different one?

Member:

I was thinking about == for Symbol, but actually it calls === so I guess that's OK already.
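For reference, Symbols are interned, so identity comparison is both correct and constant-time; no memcmp is needed:

```julia
# Symbols with the same name are the same interned object, so == for
# Symbol can simply fall back to pointer identity (===).
s = Symbol("foo")   # constructed at runtime
t = :foo            # literal
s === t             # same interned object
```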

@timholy (Member) commented Jun 11, 2016

Wow.

For me, the following Julia implementation:

function iseq3(A::AbstractArray, B::AbstractArray)
    if size(A) != size(B)
        return false
    end
    if isa(A,Range) != isa(B,Range)
        return false
    end
    failures = 0
    @simd for I in eachindex(A)
        @inbounds failures += A[I] != B[I]
    end
    return failures == 0
end

is (for the arrays you were testing on) much better than our current implementation (by 3-4x), but the ccall is another 4x better than that.
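The sketch above uses the old Range type from 0.x; on current Julia the same idea runs with AbstractRange. A runnable adaptation, assuming the same branchless-accumulator trick that lets @simd vectorize the loop:

```julia
# Adaptation of the sketch above for current Julia (Range is now AbstractRange).
# Accumulating mismatch counts instead of branching keeps the loop SIMD-friendly.
function iseq3(A::AbstractArray, B::AbstractArray)
    size(A) == size(B) || return false
    isa(A, AbstractRange) == isa(B, AbstractRange) || return false
    failures = 0
    @simd for I in eachindex(A)
        @inbounds failures += A[I] != B[I]
    end
    return failures == 0
end
```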

@@ -653,6 +653,11 @@ function lexcmp(a::Array{UInt8,1}, b::Array{UInt8,1})
return c < 0 ? -1 : c > 0 ? +1 : cmp(length(a),length(b))
end

# use memcmp for == on integer types
=={T<:Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,Int64,UInt64}}(a::Array{T,1}, b::Array{T,1}) =
Member:

You could replace this union by the predefined BitInteger64, or BitInteger if you add [U]Int128 (which you should!)

@TotalVerb (author):

Awesome! I think I will probably extend this to n-dimensional arrays too; no reason to restrict to one-dimension.

@TotalVerb (author) commented Jun 11, 2016

Issues addressed. This final version, on my system, has the following benchmark results (each benchmark randomly generates an array, copies it, and compares the two copies):

NEW ==
rand(UInt8, 0): 8
rand(UInt8, 1): 7
rand(UInt8, 2): 7
rand(UInt8, 3): 7
rand(UInt8, 4): 8
rand(UInt8, 5): 7
rand(UInt8, 10): 8
rand(UInt8, 20): 7
rand(UInt8, 50): 9
rand(UInt8, 100): 11
rand(UInt8, 1000): 36
rand(UInt8, 10000): 292
rand(UInt8, 100000): 3552
rand(UInt8, 1000000): 56017
rand(UInt8, 10000000): 1003760

OLD ==
rand(UInt8, 0): 8
rand(UInt8, 1): 9
rand(UInt8, 2): 10
rand(UInt8, 3): 11
rand(UInt8, 4): 13
rand(UInt8, 5): 15
rand(UInt8, 10): 22
rand(UInt8, 20): 38
rand(UInt8, 50): 93
rand(UInt8, 100): 171
rand(UInt8, 1000): 1583
rand(UInt8, 10000): 15696
rand(UInt8, 100000): 156832
rand(UInt8, 1000000): 1566205
rand(UInt8, 10000000): 15829086

NEW ==
rand(Int64, 0): 7
rand(Int64, 1): 6
rand(Int64, 2): 6
rand(Int64, 3): 7
rand(Int64, 4): 7
rand(Int64, 5): 7
rand(Int64, 10): 10
rand(Int64, 20): 12
rand(Int64, 50): 20
rand(Int64, 100): 30
rand(Int64, 1000): 237
rand(Int64, 10000): 3230
rand(Int64, 100000): 43568
rand(Int64, 1000000): 757348
rand(Int64, 10000000): 8574296

OLD ==
rand(Int64, 0): 8
rand(Int64, 1): 9
rand(Int64, 2): 10
rand(Int64, 3): 11
rand(Int64, 4): 13
rand(Int64, 5): 15
rand(Int64, 10): 22
rand(Int64, 20): 38
rand(Int64, 50): 93
rand(Int64, 100): 171
rand(Int64, 1000): 1583
rand(Int64, 10000): 15700
rand(Int64, 100000): 156662
rand(Int64, 1000000): 1603819
rand(Int64, 10000000): 16082368

It is a bit concerning that the time taken by the new version appears slightly superlinear, but it is so much faster that the superlinearity hardly matters. It may be an artifact of using the minimum time from BenchmarkTools's @benchmark macro, or it could be due to cache effects.

The results on Int64 show that most of the improvement seems to come from SIMD optimizations, which are more powerful on bytes than on words.

@TotalVerb TotalVerb changed the title [RFC] Use memcmp to optimize == for one-dimensional integral vectors Use memcmp to optimize == for one-dimensional integral vectors Jun 12, 2016
@TotalVerb TotalVerb changed the title Use memcmp to optimize == for one-dimensional integral vectors Use memcmp to optimize == for bit integer arrays Jun 28, 2016
@kshyatt kshyatt added the performance Must go faster label Jul 3, 2016
@TotalVerb (author):

I've rebased this. This performance optimization should still be valid on v0.6.

@StefanKarpinski StefanKarpinski added this to the 0.6.0 milestone Sep 14, 2016
@tkelman (Contributor) commented Sep 15, 2016

dunno if this is covered by existing benchmarks, but @nanosoldier runbenchmarks(ALL, vs = ":master")

# use memcmp for == on bit integer types
=={T<:BitInteger,N}(a::Array{T,N}, b::Array{T,N}) =
    size(a) == size(b) &&
    ccall(:memcmp, Int32, (Ptr{T}, Ptr{T}, UInt), a, b, sizeof(T) * length(a)) == 0
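A standalone version of this definition can be tried without touching Base. A sketch under modern where-clause syntax, with a hypothetical name (memcmp_eq) to avoid pirating Base's ==, and the bit-integer union spelled out since Base.BitInteger is internal:

```julia
# Hypothetical standalone analogue of the merged method: any rank, any
# bit-integer element type; sizes must match, then compare raw bytes.
const MyBitInteger = Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,
                           Int64,UInt64,Int128,UInt128}

memcmp_eq(a::Array{T,N}, b::Array{T,N}) where {T<:MyBitInteger,N} =
    size(a) == size(b) &&
    ccall(:memcmp, Int32, (Ptr{T}, Ptr{T}, UInt), a, b, sizeof(T) * length(a)) == 0
```

Note that requiring size(a) == size(b), not just equal lengths, is what makes the n-dimensional generalization correct.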
Contributor:

I find it more readable to lead with the 0 == as opposed to having to match parens over the entire length of the line to see where the ccall starts and ends. Also if one of these is using function form, they're similar enough that may as well use the same form in both.

@TotalVerb (author):

This was just lazy rebasing, sorry. Fixed.

@TotalVerb (author):

Oops, I think this might have accidentally invalidated Nanosoldier, though maybe it will still report the results of the run?

Contributor:

It just hides the status. It's probably already started by now and should still post a comment, and if the code is functionally the same and your rebase was just a style change, shouldn't matter.

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@stevengj (Member) commented Oct 6, 2016

Bump.

@TotalVerb (author):

I can't confirm right now, but those regressions look like noise to me.

@JeffBezanson JeffBezanson merged commit 6a76dc7 into JuliaLang:master Oct 17, 2016