Daniel Lemire's blog post "Getting good performance in Go by rewriting parts in C?" was brought to my attention by a #golang tweet from Will Fitzgerald requesting input -- bad move, reading it.
I entered a nudge-in-the-right-direction comment but Prof. Lemire immediately handed me a D- and told me to complete the assignment like everybody else, so here goes...
This repository is a fork of https://github.com/willf/bitset and is set up as a sandbox for purposes only of exploring the performance of the Count() function in the bitset package. Please use the willf repository for your development; this one will stagnate.
Let me start by saying that I am not an expert in the Go language or performance of user code, but I have tackled a few of the issues related to cgo performance and interfacing with foreign code in https://github.com/tHinqa/outside, my attempt at a flexible go-between for Go programs and dynamic libraries (.dll|.so).
Daniel's reported benchmark baseline for the Count() function is 12 microseconds.
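For orientation, Count() simply walks the bitset's word slice and totals a per-word population count. A minimal sketch of that shape (the set field matches willf/bitset; popcount stands in for whichever per-word routine is under discussion, and details may differ from the actual package):

func (b *BitSet) Count() uint {
	cnt := uint64(0)
	for _, word := range b.set { // one popcount per 64-bit word
		cnt += popcount(word)
	}
	return uint(cnt)
}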
I hit a minor stumbling block when I first ran the test suite: it crashed trying to allocate a 500MB bitset. This was running Win32 in a fairly crowded address space. I simply changed the setting for 386 processors to use 250MB and it worked fine.
In bitset_test.go
func TestBitSetHuge(t *testing.T) {
	var v *BitSet
	if runtime.GOARCH == "386" { // arm?
		v = New(1 << 31)
	} else {
		v = New(1<<32 - 1)
	}
	...
Now you must be thinking "Jerk; where's the Linux?" -- fear not: 64-bit Linux is my second environment, NixOS 64 in VirtualBox. While that may seem back to front, it works for my brain-dead DVB-H TV receiver, which a) does not have a Linux decrypter and b) refuses to run in VirtualBox/Win32 for 'security reasons' (read: paranoia).
On my antiquated Core2 Duo 2.33GHz processor, the baselines (with all unnecessary services/processes shut down) for the 64-bit hamming popcount_2 are:

NixOS64    1.00    17.0 microsec
Win32XP    1.82    31.0 microsec

Further results are given relative to the normalized NixOS64 1.00 starting value.
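(The timings come from the standard Go benchmark harness. A sketch of the kind of benchmark behind these numbers, in bitset_test.go style -- the fill pattern is an arbitrary assumption, and the exact benchmark in Lemire's post may differ:)

func BenchmarkCount(b *testing.B) {
	s := New(100000) // 100,000-bit set, as in the original post
	for i := uint(0); i < 100000; i += 3 {
		s.Set(i) // arbitrary fill pattern
	}
	b.ResetTimer()
	for j := 0; j < b.N; j++ {
		s.Count()
	}
}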
Now, ignoring the more abysmal Win32 performance, let's evaluate that a bit:

2,330,000,000 cycles (2.33GHz) x 0.000017s ≈ 39,610 cycles per Count() call
100,000 bits / 64 bits per popcount ≈ 1,563 popcount calls
39,610 / 1,563 ≈ 25 cycles per popcount call

Now to me that seemed pretty good, as the hamming code has 19 operations, ignoring the entry and return sequences and the loop code. Since the generated instruction count is probably ~50% more than that, the code produced by Go seems, on face value, reasonable.
We'll get closer to the actual code a bit later, but first...
Put simply, (u)int64 use must be carefully monitored for a 32-bit environment. To improve performance, I hacked the hamming popcount_2 to use 32-bit operations and accumulators internally:
const ms1 uint32 = 0x55555555
const ms2 uint32 = 0x33333333
const ms4 uint32 = 0x0f0f0f0f

func hamming_popcount_2(x uint64) uint64 {
	x1 := uint32(x & 0xffffffff)
	x2 := uint32(x >> 32)
	x1 -= (x1 >> 1) & ms1
	x1 = (x1 & ms2) + ((x1 >> 2) & ms2)
	x1 = (x1 + (x1 >> 4)) & ms4
	x1 += x1 >> 8
	x1 += x1 >> 16
	//x1 += x1 >> 31 dropped
	x2 -= (x2 >> 1) & ms1
	x2 = (x2 & ms2) + ((x2 >> 2) & ms2)
	x2 = (x2 + (x2 >> 4)) & ms4
	x2 += x2 >> 8
	x2 += x2 >> 16
	return uint64((x1 + x2) & 0x7f)
}
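For reference, the 64-bit original is the classic shift-and-mask popcount; reconstructed here from the 32-bit version above, so the actual file's details may differ slightly:

func hamming_popcount_2(x uint64) uint64 {
	x -= (x >> 1) & 0x5555555555555555
	x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f
	x += x >> 8
	x += x >> 16
	x += x >> 32
	return x & 0x7f
}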
The 64-bit original was preserved in bitset_amd64.go and the 32-bit hack placed in bitset_386.go, thereby automating selection of the correct version at compile time (the suffix convention is illustrated just below). With 32-bit hamming, a more pleasing:
Win32XP 1.35 32-bit hamming
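(Incidentally, the _386/_amd64 filename suffixes act as implicit build constraints: the compiler simply ignores files whose suffix doesn't match GOARCH. A contrived illustration with hypothetical files, not part of this repository:)

// hello_386.go -- compiled only when GOARCH=386
package bitset

func archHello() string { return "compiled for 386" }

// hello_amd64.go -- compiled only when GOARCH=amd64
package bitset

func archHello() string { return "compiled for amd64" }

Exactly one definition is ever in a given build, so the name can be used unconditionally.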
For every special CPU instruction that gcc exposes as an intrinsic, there is fallback code for CPUs that lack it. For the POPCNT instruction it is (transcribed into Go):
var popcount_tab = [256]uint32{
	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
	...
}

func gcc_popcount_2(x uint64) uint64 {
	ret := popcount_tab[x&0xff]
	ret += popcount_tab[(x>>8)&0xff]
	ret += popcount_tab[(x>>16)&0xff]
	ret += popcount_tab[(x>>24)&0xff]
	ret += popcount_tab[(x>>32)&0xff]
	ret += popcount_tab[(x>>40)&0xff]
	ret += popcount_tab[(x>>48)&0xff]
	ret += popcount_tab[(x>>56)&0xff]
	return uint64(ret)
}
A simple lookup of precalculated values; only the element type of the array needs to change for 32- or 64-bit use. The code is in bitset_sandbox.go.
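(As an aside, the 256 literals could also be generated at startup from the recurrence count(i) = count(i/2) + (i&1); a hypothetical alternative, not what bitset_sandbox.go actually does:)

var popcount_tab [256]uint32

func init() {
	// popcount_tab[0] is 0 by zero-initialization;
	// each entry adds the low bit to the count of i/2.
	for i := 1; i < len(popcount_tab); i++ {
		popcount_tab[i] = popcount_tab[i/2] + uint32(i&1)
	}
}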
The benchmark:
NixOS64 1.22 64-bit gcc
NixOS64 1.25 32-bit gcc
Win32XP 1.95 64-bit gcc
Win32XP 1.44 32-bit gcc
This tells us that the hamming algorithm is probably the best one to stick with.
To demonstrate an alternative to using cgo, we'll use gcc to produce assembler code, and incorporate it as a Go function. A gentle introduction to Go asm is in $(GOROOT)/doc/asm.html.
First we start with a C function; in our case, the naive popcount:
// gcc -c -O3
unsigned long long popcount64(unsigned long long x) {
	unsigned long long ret = 0;
	for ( ; x != 0 ; ) {
		ret += x & 1;
		x >>= 1;
	}
	return ret;
}
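Its Go counterpart, benchmarked below as "naive Go 64-bit +=", would look something like this (reconstructed; the if/++ variant presumably replaces the add with a conditional increment):

func naive_popcount_2(x uint64) uint64 {
	ret := uint64(0)
	for ; x != 0; x >>= 1 {
		ret += x & 1 // add the low bit, then shift it away
	}
	return ret
}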
We could have scrounged the assembler from gcc itself using "-S", but it produces very cluttered code, so instead we use objdump.
// objdump -d
0: 31 c0 xor %eax,%eax
2: 48 85 ff test %rdi,%rdi
5: 74 19 je 20 <popcount64+0x20>
7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
e: 00 00
10: 48 89 fa mov %rdi,%rdx
13: 83 e2 01 and $0x1,%edx
16: 48 01 d0 add %rdx,%rax
19: 48 d1 ef shr %rdi
1c: 75 f2 jne 10 <popcount64+0x10>
1e: f3 c3 repz retq
20: f3 c3 repz retq
We produce a Go assembler function in bitset_sandbox_amd64.s
#define NOSPLIT 4

// func asmnaive_popcount_2(x uint64) uint64
TEXT ·asmnaive_popcount_2(SB), NOSPLIT, $0
	MOVQ x+0(FP), DI
	XORL AX, AX        // ret = 0
	TESTQ DI, DI       // note TESTQ, not TESTL: the high 32 bits matter
	JE Z
	// unaligned
A:	MOVQ DI, DX
	ANDL $1, DX
	ADDQ DX, AX        // ret += x & 1
	SHRQ $1, DI        // x >>= 1
	JNE A
Z:	MOVQ AX, ret+8(FP)
	RET
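For Go callers to reach the TEXT symbol, the function is declared, bodiless, in a Go file of the same package; the toolchain resolves the declaration against the assembler definition:

// in bitset_sandbox_amd64.go (or similar)
func asmnaive_popcount_2(x uint64) uint64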
See bitset_sandbox_386.s, bitset_sandbox_386.go and bitset_sandbox_amd64.s for the rest. At the same time I introduced BenchmarkCountAll0 and BenchmarkCountAll1 variants to show how the naive loop's running time depends on the data (it stops at the highest set bit), in contrast to the other two algorithms, which are constant-time per word. (For the fun of it :)
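(A sketch of what the all-ones variant might look like; the all-zeros one is identical but with no bits set:)

func BenchmarkCountAll1(b *testing.B) {
	s := New(100000)
	for i := uint(0); i < 100000; i++ {
		s.Set(i) // every bit set: worst case for the naive loop
	}
	b.ResetTimer()
	for j := 0; j < b.N; j++ {
		s.Count()
	}
}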
Benchmarks:

NixOS64    6.15    naive Go 64-bit +=             (0s: 0.47, 1s: 14.76)
           5.14    naive Go 64-bit if/++          (0s: 0.47, 1s: 15.39)
           4.22    naive Go 64-bit (32 accum) +=  (0s: 0.47, 1s: 9.79)
           2.56    naive Go 32-bit +=             (0s: 0.59, 1s: 8.11)
           2.99    naive Go 32-bit if/++          (0s: 0.59, 1s: 9.66)
Win32XP    7.17    naive Go 64-bit +=             (0s: 0.69, 1s: 19.29)
           6.62    naive Go 64-bit if/++          (0s: 0.69, 1s: 19.82)
           6.34    naive Go 64-bit (32 accum) +=  (0s: 0.69, 1s: 15.29)
           2.37    naive Go 32-bit +=             (0s: 0.68, 1s: 6.62)
           2.54    naive Go 32-bit if/++          (0s: 0.68, 1s: 6.89)
NixOS64    1.53    naive assembler 64-bit +=      (0s: 0.44, 1s: 8.48)
Win32XP    2.99    naive assembler 32-bit +=      (0s: 1.04, 1s: 8.27)
Conclusion: there is a case for using asm functions for performance (64-bit: 6.15 vs 1.53), with the corollary that Go doesn't always produce efficient code. Also, contrary to what Munch found, calls have a significant overhead on at least some (many?) older machines -- the 32-bit asm has a Go intermediary to split the uint64 and call, adding two calls per call.
Realizing this, I asm'd the hamming code and at the same time produced a pure-Go inlined hamming. The results:
NixOS64    0.96    hamming assembler 64-bit
           0.67    hamming inline 64-bit
Win32XP    1.29    hamming assembler 32-bit
           1.05    hamming inline 32-bit
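The inline variant just folds the hamming steps into Count's own loop, saving a function call per word. A sketch, assuming the 64-bit constants and a set []uint64 field as in willf/bitset (the actual code may differ):

func (b *BitSet) Count() uint {
	cnt := uint64(0)
	for _, x := range b.set {
		// hamming popcount_2, inlined per word
		x -= (x >> 1) & 0x5555555555555555
		x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
		x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f
		x += x >> 8
		x += x >> 16
		x += x >> 32
		cnt += x & 0x7f
	}
	return uint(cnt)
}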
While the assembler did provide a slight improvement over the plain function versions (0.96 vs 1.00, 1.29 vs 1.35), pure inlined Go beat Go + asm. Not sure we could do much better, so...
As Russ Cox says in $(GOROOT)/doc/asm.html, if the instruction doesn't exist, make it.
For the POPCNT CPU instruction:
// gcc -mpopcnt -mlzcnt -O3 -c
#include <x86intrin.h>

int pop32(unsigned int x) {
	return __builtin_popcount(x);
}

int pop64(unsigned long long x) {
	return __builtin_popcountll(x);
}
// objdump -d
0000000000000000 <pop32>:
0: f3 0f b8 c7 popcnt %edi,%eax
4: c3 retq
0000000000000010 <pop64>:
10: f3 48 0f b8 c7 popcnt %rdi,%rax
15: c3 retq
Becomes:
#define NOSPLIT 4

// func Popcnt32(x uint32) uint32
TEXT ·Popcnt32(SB), NOSPLIT, $0
	MOVL x+0(FP), DI
	// POPCNTL DI, AX
	BYTE $0xf3; BYTE $0x0f; BYTE $0xb8; BYTE $0xc7
	MOVL AX, ret+8(FP)
	RET

// func Popcnt64(x uint64) uint64
TEXT ·Popcnt64(SB), NOSPLIT, $0
	MOVQ x+0(FP), DI
	// POPCNTQ DI, AX
	BYTE $0xf3; BYTE $0x48; BYTE $0x0f; BYTE $0xb8; BYTE $0xc7
	MOVQ AX, ret+8(FP)
	RET
bitset_intrinsics_... is where you will find them. A huge caveat, however: they are untested, as the POPCNT instruction was only introduced with the Core i series of CPUs.
bitset.go contains a small init() to set the operable algorithm: I have included a CPU check function so that a single executable can use POPCNT where available and fall back to hamming otherwise. Hope it works! Since there is no real new code here, all additions may be used as seen fit, with no copyright claim by me.
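(For the curious: POPCNT support is advertised by CPUID leaf 1, ECX bit 23. A sketch of how such an init() dispatch might be arranged -- the names popcount and hasPopcnt are hypothetical, with hasPopcnt implemented in assembler:)

var popcount func(uint64) uint64

// hasPopcnt reports whether CPUID leaf 1 sets ECX bit 23 (POPCNT);
// implemented in assembler.
func hasPopcnt() bool

func init() {
	if hasPopcnt() {
		popcount = Popcnt64
	} else {
		popcount = hamming_popcount_2
	}
}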
Final reminder: this code makes no provision for architectures other than amd64 and 386.
=====================================================
Package bitset implements bitsets, a mapping between non-negative integers and boolean values. It should be more efficient than map[uint]bool.
It provides methods for setting, clearing, flipping, and testing individual integers. It also provides set intersection, union, difference, complement, and symmetric difference operations, as well as tests to check whether any, all, or no bits are set, and queries for a bitset's current length and number of positive bits.
BitSets are expanded to the size of the largest set bit; the memory allocation is approximately Max bits, where Max is the largest set bit. BitSets are never shrunk. On creation, a hint can be given for the number of bits that will be used.
Many of the methods, including Set, Clear, and Flip, return a BitSet pointer, which allows for chaining.
Example use:
import "bitset"
var b BitSet
b.Set(10).Set(11)
if b.Test(1000) {
b.Clear(1000)
}
for i,e := v.NextSet(0); e; i,e = v.NextSet(i + 1) {
frmt.Println("The following bit is set:",i);
}
if B.Intersection(bitset.New(100).Set(10)).Count() > 1 {
fmt.Println("Intersection works.")
}
As an alternative to BitSets, one should check out the 'big' package in the standard library (math/big), which provides a (less set-theoretical) view of bitsets.
Discussions: golang-nuts Google Group