Daniel Lemire's blog post "Getting good performance in Go by rewriting parts in C?" was brought to my attention by a #golang tweet from Will Fitzgerald requesting input -- bad move, reading it.
I entered a nudge-in-the-right-direction comment but Prof. Lemire immediately handed me a D- and told me to complete the assignment like everybody else, so here goes...
This repository is a fork of https://github.com/willf/bitset and is set up as a sandbox for purposes only of exploring the performance of the Count() function in the bitset package. Please use the willf repository for your development; this one will stagnate.
Let me start by saying that I am not an expert in the Go language or performance of user code, but I have tackled a few of the issues related to cgo performance and interfacing with foreign code in https://github.com/tHinqa/outside, my attempt at a flexible go-between for Go programs and dynamic libraries (.dll|.so).
Daniel's reported benchmark baseline for the Count() function is 12 microseconds.
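For orientation, Count() simply walks the bitset's word slice and totals a per-word population count. A minimal sketch of that shape (the set field matches willf/bitset; popcount stands in for whichever per-word routine is under discussion, and details may differ from the actual package):

func (b *BitSet) Count() uint {
	cnt := uint64(0)
	for _, word := range b.set { // one popcount per 64-bit word
		cnt += popcount(word)
	}
	return uint(cnt)
}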
I hit a minor stumbling block when I first ran the test suite: it crashed trying to allocate a 500MB bitset. This was running Win32 in a fairly crowded address space. I simply changed the setting for 386 processors to use 250MB and it worked fine.
In bitset_test.go
func TestBitSetHuge(t *testing.T) {
	var v *BitSet
	if runtime.GOARCH == "386" { // arm?
		v = New(1 << 31)
	} else {
		v = New(1<<32 - 1)
	}
	...
Now you must be thinking "Jerk; where's the Linux?" -- fear not: 64-bit Linux is my second environment, NixOS 64 in VirtualBox. While that may seem back to front, it works for my brain-dead DVB-H TV receiver, which a) does not have a Linux decrypter and b) refuses to run in VirtualBox/Win32 for 'security reasons' (read: paranoia).
On my antiquated Core2 Duo 2.33GHz processor, the baselines (with all unnecessary services/processes shut down) for the 64-bit hamming popcount_2 are:

NixOS64    1.00    17.0 microsec
Win32XP    1.82    31.0 microsec

Further results are given relative to the normalized NixOS64 1.00 starting value.
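(The timings come from the standard Go benchmark harness. A sketch of the kind of benchmark behind these numbers, in bitset_test.go style -- the fill pattern is an arbitrary assumption, and the exact benchmark in Lemire's post may differ:)

func BenchmarkCount(b *testing.B) {
	s := New(100000) // 100,000-bit set, as in the original post
	for i := uint(0); i < 100000; i += 3 {
		s.Set(i) // arbitrary fill pattern
	}
	b.ResetTimer()
	for j := 0; j < b.N; j++ {
		s.Count()
	}
}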
Now, ignoring the more abysmal Win32 performance, let's evaluate that a bit:

2,330,000,000 cycles (2.33GHz) x 0.000017s ≈ 39,610 cycles per Count() call
100,000 bits / 64 bits per popcount ≈ 1,563 popcount calls
39,610 / 1,563 ≈ 25 cycles per popcount call

Now to me that seemed pretty good, as the hamming code has 19 operations, ignoring the entry and return sequences and the loop code. Since the generated instruction count is probably ~50% more than that, the code produced by Go seems, on face value, reasonable.
We'll get closer to the actual code a bit later, but first...
Put simply, (u)int64 use must be carefully monitored for a 32-bit environment. To improve performance, I hacked the hamming popcount_2 to use 32-bit operations and accumulators internally:
const ms1 uint32 = 0x55555555
const ms2 uint32 = 0x33333333
const ms4 uint32 = 0x0f0f0f0f

func hamming_popcount_2(x uint64) uint64 {
	x1 := uint32(x & 0xffffffff)
	x2 := uint32(x >> 32)
	x1 -= (x1 >> 1) & ms1
	x1 = (x1 & ms2) + ((x1 >> 2) & ms2)
	x1 = (x1 + (x1 >> 4)) & ms4
	x1 += x1 >> 8
	x1 += x1 >> 16
	//x1 += x1 >> 31 dropped
	x2 -= (x2 >> 1) & ms1
	x2 = (x2 & ms2) + ((x2 >> 2) & ms2)
	x2 = (x2 + (x2 >> 4)) & ms4
	x2 += x2 >> 8
	x2 += x2 >> 16
	return uint64((x1 + x2) & 0x7f)
}
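For reference, the 64-bit original is the classic shift-and-mask popcount; reconstructed here from the 32-bit version above, so the actual file's details may differ slightly:

func hamming_popcount_2(x uint64) uint64 {
	x -= (x >> 1) & 0x5555555555555555
	x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f
	x += x >> 8
	x += x >> 16
	x += x >> 32
	return x & 0x7f
}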
The 64-bit original was preserved in bitset_amd64.go and the 32-bit hack placed in bitset_386.go, thereby automating selection of the correct version at compile time (the suffix convention is illustrated just below). With 32-bit hamming, a more pleasing:
Win32XP 1.35 32-bit hamming
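(Incidentally, the _386/_amd64 filename suffixes act as implicit build constraints: the compiler simply ignores files whose suffix doesn't match GOARCH. A contrived illustration with hypothetical files, not part of this repository:)

// hello_386.go -- compiled only when GOARCH=386
package bitset

func archHello() string { return "compiled for 386" }

// hello_amd64.go -- compiled only when GOARCH=amd64
package bitset

func archHello() string { return "compiled for amd64" }

Exactly one definition is ever in a given build, so the name can be used unconditionally.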
For every special CPU instruction that gcc exposes as an intrinsic, there is fallback code for CPUs that lack it. For the POPCNT instruction it is (transcribed into Go):
var popcount_tab = [256]uint32{
	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
	...
}

func gcc_popcount_2(x uint64) uint64 {
	ret := popcount_tab[x&0xff]
	ret += popcount_tab[(x>>8)&0xff]
	ret += popcount_tab[(x>>16)&0xff]
	ret += popcount_tab[(x>>24)&0xff]
	ret += popcount_tab[(x>>32)&0xff]
	ret += popcount_tab[(x>>40)&0xff]
	ret += popcount_tab[(x>>48)&0xff]
	ret += popcount_tab[(x>>56)&0xff]
	return uint64(ret)
}
A simple lookup of precalculated values; only the element type of the array needs to change for 32- or 64-bit use. The code is in bitset_sandbox.go.
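(As an aside, the 256 literals could also be generated at startup from the recurrence count(i) = count(i/2) + (i&1); a hypothetical alternative, not what bitset_sandbox.go actually does:)

var popcount_tab [256]uint32

func init() {
	// popcount_tab[0] is 0 by zero-initialization;
	// each entry adds the low bit to the count of i/2.
	for i := 1; i < len(popcount_tab); i++ {
		popcount_tab[i] = popcount_tab[i/2] + uint32(i&1)
	}
}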
The benchmark:
NixOS64 1.22 64-bit gcc
NixOS64 1.25 32-bit gcc
Win32XP 1.95 64-bit gcc
Win32XP 1.44 32-bit gcc
This tells us that the hamming algorithm is probably the best one to stick with.
To demonstrate an alternative to using cgo, we'll use gcc to produce assembler code, and incorporate it as a Go function. A gentle introduction to Go asm is in $(GOROOT)/doc/asm.html.
First we start with a C function; in our case, the naive popcount:
// gcc -c -O3
unsigned long long popcount64(unsigned long long x) {
	unsigned long long ret = 0;
	for ( ; x != 0 ; ) {
		ret += x & 1;
		x >>= 1;
	}
	return ret;
}
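Its Go counterpart, benchmarked below as "naive Go 64-bit +=", would look something like this (reconstructed; the if/++ variant presumably replaces the add with a conditional increment):

func naive_popcount_2(x uint64) uint64 {
	ret := uint64(0)
	for ; x != 0; x >>= 1 {
		ret += x & 1 // add the low bit, then shift it away
	}
	return ret
}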
We could have scrounged the assembler from gcc itself using "-S", but it produces very cluttered code, so instead we use objdump.
// objdump -d
0: 31 c0 xor %eax,%eax
2: 48 85 ff test %rdi,%rdi
5: 74 19 je 20 <popcount64+0x20>
7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
e: 00 00
10: 48 89 fa mov %rdi,%rdx
13: 83 e2 01 and $0x1,%edx
16: 48 01 d0 add %rdx,%rax
19: 48 d1 ef shr %rdi
1c: 75 f2 jne 10 <popcount64+0x10>
1e: f3 c3 repz retq
20: f3 c3 repz retq
We produce a Go assembler function in bitset_sandbox_amd64.s
#define NOSPLIT 4

// func asmnaive_popcount_2(x uint64) uint64
TEXT ·asmnaive_popcount_2(SB), NOSPLIT, $0
	MOVQ x+0(FP), DI
	XORL AX, AX        // ret = 0
	TESTQ DI, DI       // note TESTQ, not TESTL: the high 32 bits matter
	JE Z
	// unaligned
A:	MOVQ DI, DX
	ANDL $1, DX
	ADDQ DX, AX        // ret += x & 1
	SHRQ $1, DI        // x >>= 1
	JNE A
Z:	MOVQ AX, ret+8(FP)
	RET
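For Go callers to reach the TEXT symbol, the function is declared, bodiless, in a Go file of the same package; the toolchain resolves the declaration against the assembler definition:

// in bitset_sandbox_amd64.go (or similar)
func asmnaive_popcount_2(x uint64) uint64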
See bitset_sandbox_386.s, bitset_sandbox_386.go and bitset_sandbox_amd64.s for the rest. At the same time I introduced BenchmarkCountAll0 and BenchmarkCountAll1 variants to show how the naive loop's running time depends on the data (it stops at the highest set bit), in contrast to the other two algorithms, which are constant-time per word. (For the fun of it :)
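(A sketch of what the all-ones variant might look like; the all-zeros one is identical but with no bits set:)

func BenchmarkCountAll1(b *testing.B) {
	s := New(100000)
	for i := uint(0); i < 100000; i++ {
		s.Set(i) // every bit set: worst case for the naive loop
	}
	b.ResetTimer()
	for j := 0; j < b.N; j++ {
		s.Count()
	}
}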
Benchmarks:

NixOS64    6.15    naive Go 64-bit +=             (0s: 0.47, 1s: 14.76)
           5.14    naive Go 64-bit if/++          (0s: 0.47, 1s: 15.39)
           4.22    naive Go 64-bit (32 accum) +=  (0s: 0.47, 1s: 9.79)
           2.56    naive Go 32-bit +=             (0s: 0.59, 1s: 8.11)
           2.99    naive Go 32-bit if/++          (0s: 0.59, 1s: 9.66)
Win32XP    7.17    naive Go 64-bit +=             (0s: 0.69, 1s: 19.29)
           6.62    naive Go 64-bit if/++          (0s: 0.69, 1s: 19.82)
           6.34    naive Go 64-bit (32 accum) +=  (0s: 0.69, 1s: 15.29)
           2.37    naive Go 32-bit +=             (0s: 0.68, 1s: 6.62)
           2.54    naive Go 32-bit if/++          (0s: 0.68, 1s: 6.89)
NixOS64    1.53    naive assembler 64-bit +=      (0s: 0.44, 1s: 8.48)
Win32XP    2.99    naive assembler 32-bit +=      (0s: 1.04, 1s: 8.27)
Conclusion: there is a case for using asm functions for performance (64-bit: 6.15 vs 1.53), with the corollary that Go doesn't always produce efficient code. Also, contrary to what Munch found, calls have a significant overhead on at least some (many?) older machines -- the 32-bit asm has a Go intermediary to split the uint64 and call, adding two calls per call.
Realizing this, I asm'd the hamming code and at the same time produced a pure-Go inlined hamming. The results:
NixOS64    0.96    hamming assembler 64-bit
           0.67    hamming inline 64-bit
Win32XP    1.29    hamming assembler 32-bit
           1.05    hamming inline 32-bit
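The inline variant just folds the hamming steps into Count's own loop, saving a function call per word. A sketch, assuming the 64-bit constants and a set []uint64 field as in willf/bitset (the actual code may differ):

func (b *BitSet) Count() uint {
	cnt := uint64(0)
	for _, x := range b.set {
		// hamming popcount_2, inlined per word
		x -= (x >> 1) & 0x5555555555555555
		x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
		x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f
		x += x >> 8
		x += x >> 16
		x += x >> 32
		cnt += x & 0x7f
	}
	return uint(cnt)
}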
While the assembler did provide a slight improvement over the plain function versions (0.96 vs 1.00, 1.29 vs 1.35), pure inlined Go beat Go + asm. Not sure we could do much better, so...
As Russ Cox says in $(GOROOT)/doc/asm.html, if the instruction doesn't exist, make it.
For the POPCNT CPU instruction:
// gcc -mpopcnt -mlzcnt -O3 -c
#include <x86intrin.h>

int pop32(unsigned int x) {
	return __builtin_popcount(x);
}

int pop64(unsigned long long x) {
	return __builtin_popcountll(x);
}
// objdump -d
0000000000000000 <pop32>:
0: f3 0f b8 c7 popcnt %edi,%eax
4: c3 retq
0000000000000010 <pop64>:
10: f3 48 0f b8 c7 popcnt %rdi,%rax
15: c3 retq
Becomes:
#define NOSPLIT 4

// func Popcnt32(x uint32) uint32
TEXT ·Popcnt32(SB), NOSPLIT, $0
	MOVL x+0(FP), DI
	// POPCNTL DI, AX
	BYTE $0xf3; BYTE $0x0f; BYTE $0xb8; BYTE $0xc7
	MOVL AX, ret+8(FP)
	RET

// func Popcnt64(x uint64) uint64
TEXT ·Popcnt64(SB), NOSPLIT, $0
	MOVQ x+0(FP), DI
	// POPCNTQ DI, AX
	BYTE $0xf3; BYTE $0x48; BYTE $0x0f; BYTE $0xb8; BYTE $0xc7
	MOVQ AX, ret+8(FP)
	RET
bitset_intrinsics_... is where you will find them. A huge caveat, however: they are untested, as the POPCNT instruction was only introduced with the Core i series of CPUs.
bitset.go contains a small init() to set the operable algorithm: I have included a CPU check function so that a single executable can use POPCNT where available and fall back to hamming otherwise. Hope it works! Since there is no real new code here, all additions may be used as seen fit, with no copyright claim by me.
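(For the curious: POPCNT support is advertised by CPUID leaf 1, ECX bit 23. A sketch of how such an init() dispatch might be arranged -- the names popcount and hasPopcnt are hypothetical, with hasPopcnt implemented in assembler:)

var popcount func(uint64) uint64

// hasPopcnt reports whether CPUID leaf 1 sets ECX bit 23 (POPCNT);
// implemented in assembler.
func hasPopcnt() bool

func init() {
	if hasPopcnt() {
		popcount = Popcnt64
	} else {
		popcount = hamming_popcount_2
	}
}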
Final reminder: this code makes no provision for architectures other than amd64 and 386.
=====================================================
Package bitset implements bitsets, a mapping between non-negative integers and boolean values. It should be more efficient than map[uint]bool.
It provides methods for setting, clearing, flipping, and testing individual integers. It also provides set intersection, union, difference, complement, and symmetric difference operations, as well as tests to check whether any, all, or no bits are set, and queries for a bitset's current length and number of positive bits.
BitSets are expanded to the size of the largest set bit; the memory allocation is approximately Max bits, where Max is the largest set bit. BitSets are never shrunk. On creation, a hint can be given for the number of bits that will be used.
Many of the methods, including Set, Clear, and Flip, return a BitSet pointer, which allows for chaining.
Example use:
import "bitset"
var b BitSet
b.Set(10).Set(11)
if b.Test(1000) {
b.Clear(1000)
}
for i,e := v.NextSet(0); e; i,e = v.NextSet(i + 1) {
frmt.Println("The following bit is set:",i);
}
if B.Intersection(bitset.New(100).Set(10)).Count() > 1 {
fmt.Println("Intersection works.")
}
As an alternative to BitSets, one should check out the 'big' package in the standard library (math/big), which provides a (less set-theoretical) view of bitsets.
Discussions: golang-nuts Google Group