Description
Please answer these questions before submitting your issue. Thanks!
-
Version: go version go1.7beta2 windows/amd64
-
Environment:
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=F:\Go
set GORACE=
set GOROOT=F:\Go
set GOTOOLDIR=F:\Go\pkg\tool\windows_amd64
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\Super\AppData\Local\Temp\go-build631535858=/tmp/go-build -gno-record-gcc-switches
set CXX=g++
set CGO_ENABLED=1
However, these issues likely also affect other operating systems and architectures, including at least 386 (32-bit x86).
-
Runnable Program: Go Playground link:
// SoE project main.go
package main

import (
	"fmt"
	"math"
)

func primesOdds(top uint32) func() uint32 {
	topndx := int((top - 3) / 2)
	topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2
	cmpsts := make([]uint32, (topndx/32)+1)
	for i := 0; i <= topsqrtndx; i++ {
		if cmpsts[i>>5]&(1<<(uint32(i)&0x1F)) == 0 {
			p := i + i + 3
			for j := (p*p - 3) >> 1; j <= topndx; j += p {
				cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
			}
		}
	}
	i := -1
	return func() uint32 {
		oi := i
		if i <= topndx {
			i++
		}
		for i <= topndx && cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) != 0 {
			i++
		}
		if oi < 0 {
			return 2
		} else {
			return (uint32(oi) << 1) + 3
		}
	}
}

func main() {
	iter := primesOdds(1000000)
	count := 0
	for v := iter(); v <= 1000000; v = iter() {
		count++
	}
	fmt.Printf("%v\r\n", count)
}
-
What I saw:
The issue is with the assembly code produced by "go tool compile -S main.go > main.asm"; the relevant portion of that file follows:
0x00e9 00233 (main.go:15) LEAQ (CX)(CX*1), R8 ;; **ISSUE 1** - change
0x00ed 00237 (main.go:15) LEAQ 3(CX)(CX*1), R9 ;; prime 'p' calculation; good
0x00f2 00242 (main.go:16) IMULQ R9, R9 ;; 'sqr' calculation - good
0x00f6 00246 (main.go:16) ADDQ $-3, R9 ;; 'sqr' - 3 is good
0x00fa 00250 (main.go:16) SARQ $1, R9 ;; including shortcut divide by 2 - good
0x00fd 00253 (main.go:16) CMPQ R9, SI ;; advance range compare
0x0100 00256 (main.go:16) JGT $0, 319 ;; check - good
0x0102 00258 (main.go:17) MOVQ R9, R11 ;; **ISSUE 2** - change
0x0105 00261 (main.go:17) SARQ $5, R9 ;; calculate word address - good
0x0109 00265 (main.go:17) CMPQ R9, DX ;; *** only here if array bounds check
0x010c 00268 (main.go:17) JCC PANIC ;; *** only here if array bounds check
0x0112 00274 (main.go:17) MOVL (AX)(R9*4), R12 ;; **ISSUE 3** - not this way
0x0116 00278 (main.go:17) MOVQ R11, R13 ;; **ISSUE 4** - not this way
0x0119 00281 (main.go:17) ANDQ $31, R11 ;; **ISSUE 5** - unnecessary
0x011d 00285 (main.go:17) MOVQ R11, CX ;; part of **ISSUE 2** - change
0x0120 00288 (main.go:17) MOVL R10, R14 ;; **ISSUE 6** - unnecessary
0x0123 00291 (main.go:17) SHLL CX, R10 ;; 1 << ('j'&0x1F) - good
0x0126 00294 (main.go:17) ORL R10, R12 ;; part of **ISSUE 3**
0x0129 00297 (main.go:17) MOVL R12, (AX)(R9*4) ;; part of **ISSUE 3**
0x012d 00301 (main.go:16) LEAQ 3(R8)(R13*1), R9 ;; part of **ISSUE 1**
0x0132 00306 (main.go:13) MOVQ "".i+64(SP), CX ;; **ISSUE 7**; unnecessary
0x0137 00311 (main.go:17) MOVL R14, R10 ;; part of **ISSUE 6**
0x013a 00314 (main.go:16) CMPQ R9, SI ;; end of loop
0x013d 00317 (main.go:16) JLE $0, 258 ;; check - good
0x013f 00319 (main.go:13) LEAQ 1(CX), R9 ;; part of **ISSUE #7** - not this way
-
Expected: The program prints 78498, which is the expected result, so the output is not the issue.
The issue is the generated assembly shown above. I would expect the same loops to be compiled to something closer to the following hand-written listing:
OuterLoop:
	LEAQ 3(CX)(CX*1), R8 ;; prime 'p' calculation; good, **left in R8 as per ISSUE 1**
	MOVQ R8, R9 ;; **ISSUE 1** fixed
	IMULQ R9, R9 ;; 'sqr' calculation - good
	ADDQ $-3, R9 ;; 'sqr' - 3 is good
	SARQ $1, R9 ;; including shortcut divide by 2 - good - left in R9
	CMPQ R9, SI ;; advance range compare
	JGT PastInner ;; check - good
InnerLoop:
	MOVQ R9, R11 ;; **ISSUE 2** fixed
	MOVQ R9, CX ;; **ISSUE 4** fixed
	SARQ $5, R11 ;; calculate word address - good
	MOVQ $1, R10 ;; **ISSUE 6** fixed
	CMPQ R11, DX ;; *** only here if array bounds check
	JCC PANIC ;; *** only here if array bounds check
	SHLL CX, R10 ;; 1 << ('j'&0x1F) - good
	ADDQ R8, R9 ;; part of **ISSUE 1** fixed
	ORL R10, (AX)(R11*4) ;; **ISSUE 3** fixed
	CMPQ R9, SI ;; end of loop
	JLE InnerLoop ;; check - good
PastInner:
	MOVQ "".i+64(SP), CX ;; **ISSUE 7** fixed; 'i' may well already be in another register
	LEAQ 1(CX), R9 ;; now more available registers, if other register, just ADD $1
ISSUE 1: The compiler preserves "2 * 'i'" across the inner loop, which forces a full 'p' calculation inside the loop using the LEAQ instruction at the end of the loop body ("LEAQ 3(R8)(R13*1), R9"). If it preserved the full 'p' ('i' + 'i' + 3) instead, 'p' would not need to be recalculated inside the loop, and a simple add instruction could take the place of that LEAQ, which is a little faster.
ISSUE 2: The compiler copies the original value into a new register before clobbering the original register, presumably to save latency (ignoring that the CPU will likely use Out-of-Order Execution, OOE, anyway). A simple reordering of instructions would achieve the same thing and would not require the advanced value to be calculated or moved back into the original register at the end of the loop. This is a common pattern.
ISSUE 3: Ignores that the "cmpsts[j>>5] |= ..." can be encoded with a single instruction "ORL R..., (BASEREG)(INDEXREG*4)" to save some complexity and time.
ISSUE 4: The same as ISSUE 2: a simple change in instruction order means that no register swap is needed, which also removes the need for the more complex LEA.
ISSUE 5: When a uint32 value is shifted by a uint32 count, the compiler correctly eliminates the logical "and" with 0x1F (31), as the CPU limits the shift count to this range anyway. The issue is that if the shift count is a uint, the mask is not eliminated as it should be (the workaround is to use uint32 for shift counts). We should check whether the mask is also eliminated, as it should be, when an int32 is shifted; in fact, any masking above 31 (above 63 for 64-bit registers) is unnecessary.
ISSUE 6: This is cmd/compile: complicated bounds check elimination #16092, where a register is preserved instead of using a simple MOV immediate. The pattern is pervasive well beyond this loop, as if the compiler had an "avoid immediate MOV at all costs" rule.
ISSUE 7: This instruction is completely unnecessary: it restores a value to the CX register inside the loop, where that value is never used and CX is clobbered again on every iteration. The CX register should instead be reloaded, if necessary, outside the loop. This is issue cmd/compile: regalloc restoring to dead register #14761.
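The source patterns behind ISSUE 3 and ISSUE 5 can be isolated in small functions so that their generated code is easier to find in the "go tool compile -S" output. The following is only a minimal sketch and not part of the original program: the function names setBitUint32 and setBitUint are hypothetical, and the //go:noinline directives are there only to keep each function visible as its own symbol in the listing. Both functions compute the same result; they differ only in the type of the shift count, which is the difference ISSUE 5 describes.

```go
package main

import "fmt"

// setBitUint32 marks bit j in a bit-packed []uint32 using a uint32 shift
// count. This is the statement form used in the culling loop above.
// (Hypothetical helper, for inspecting the generated code only.)
//
//go:noinline
func setBitUint32(words []uint32, j int) {
	words[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
}

// setBitUint does the same with a uint shift count, the form for which
// ISSUE 5 reports that the "& 0x1F" mask is not eliminated.
// (Hypothetical helper, for inspecting the generated code only.)
//
//go:noinline
func setBitUint(words []uint32, j int) {
	words[j>>5] |= uint32(1) << (uint(j) & 0x1F)
}

func main() {
	a := make([]uint32, 4)
	b := make([]uint32, 4)
	for _, j := range []int{0, 31, 32, 77, 127} {
		setBitUint32(a, j)
		setBitUint(b, j)
	}
	// Both variants must set exactly the same bits.
	fmt.Println(a)
	fmt.Println(b)
}
```

Comparing the two function bodies in the assembly listing should show whether the mask (ISSUE 5) and the separate load, OR, and store sequence (ISSUE 3) are present for a given compiler version.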
The general observation is that the compiler tends to overuse LEA instructions. These are very effective when necessary, but they are slightly slower than the simpler instructions they replace, which doesn't seem to be taken into account.
Summary: The golang compiler is quite a way from being optimal, and won't come close to "Cee" (C/C++) efficiency until it comes much closer than this. The changes here aren't that complex, and some simple rule changes/additions should suffice. While version 1.7 is better than 1.6, it is still nowhere near as efficient as it needs to be. Rule changes/additions as suggested above can make tight inner loops run up to twice as fast or more on some architectures and in some situations, although not by that much here on amd64, as a high-end CPU will re-order instructions to minimize the execution time.
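To put numbers on the effect of any compiler change, the culling loop can be timed directly. The following is a rough sketch under stated assumptions, not a rigorous benchmark: countPrimes and timeBest are hypothetical helpers that are not part of the original program, the sieve is the same odds-only bit-packed algorithm rewritten as a plain counting function, and the best of several runs is taken only to reduce noise.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// countPrimes counts the primes up to and including top (top >= 3) using
// the same odds-only, bit-packed culling loop as the program above.
// (Hypothetical helper for timing purposes.)
func countPrimes(top uint32) int {
	topndx := int((top - 3) / 2)
	topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2
	cmpsts := make([]uint32, (topndx/32)+1)
	for i := 0; i <= topsqrtndx; i++ {
		if cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) == 0 {
			p := i + i + 3
			for j := (p*p - 3) >> 1; j <= topndx; j += p {
				cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
			}
		}
	}
	count := 1 // accounts for the only even prime, 2
	for i := 0; i <= topndx; i++ {
		if cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) == 0 {
			count++
		}
	}
	return count
}

// timeBest runs f the given number of times and returns the shortest
// observed wall-clock duration. (Hypothetical helper.)
func timeBest(runs int, f func()) time.Duration {
	best := time.Duration(math.MaxInt64)
	for r := 0; r < runs; r++ {
		start := time.Now()
		f()
		if d := time.Since(start); d < best {
			best = d
		}
	}
	return best
}

func main() {
	count := 0
	best := timeBest(10, func() { count = countPrimes(1000000) })
	fmt.Printf("primes: %d, best of 10 runs: %v\n", count, best)
}
```

Building this with two compiler versions and comparing the reported times gives a first impression only; for careful measurements, the standard testing package's benchmark support is the better tool.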