cmd/compile: new backend; multiple efficiency issues...

Please answer these questions before submitting your issue. Thanks!
1. **Version:**  go version go1.7beta2 windows/amd64
2. **Environment:**
   set GOARCH=amd64
   set GOBIN=
   set GOEXE=.exe
   set GOHOSTARCH=amd64
   set GOHOSTOS=windows
   set GOOS=windows
   set GOPATH=F:\Go\
   set GORACE=
   set GOROOT=F:\Go
   set GOTOOLDIR=F:\Go\pkg\tool\windows_amd64
   set CC=gcc
   set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\Super\AppData\Local\Temp\go-build631535858=/tmp/go-build -gno-record-gcc-switches
   set CXX=g++
   set CGO_ENABLED=1
   
   However, these issues likely also affect other operating systems and architectures, at least 386 (x86).
3. **Runnable Program:**   [**Go Playground link:**](https://play.golang.org/p/lTaHmi9hiR)
   
   ``` go
   // SoE project main.go
   package main
   
   import (
       "fmt"
       "math"
   )
   
   func primesOdds(top uint32) func() uint32 {
       topndx := int((top - 3) / 2)
       topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2
       cmpsts := make([]uint32, (topndx/32)+1)
       for i := 0; i <= topsqrtndx; i++ {
           if cmpsts[i>>5]&(1<<(uint32(i)&0x1F)) == 0 {
               p := i + i + 3
               for j := (p*p - 3) >> 1; j <= topndx; j += p {
                   cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
               }
           }
       }
       i := -1
       return func() uint32 {
           oi := i
           if i <= topndx {
               i++
           }
           for i <= topndx && cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) != 0 {
               i++
           }
           if oi < 0 {
               return 2
           } else {
               return (uint32(oi) << 1) + 3
           }
       }
   }
   
   func main() {
       iter := primesOdds(1000000)
       count := 0
       for v := iter(); v <= 1000000; v = iter() {
           count++
       }
       fmt.Printf("%v\r\n", count)
   }
   ```
4.  **What I saw:**
   
   The issue is with the assembly code produced by running "go tool compile -S main.go > main.asm"; a portion of that file follows:
   
   ```
   0x00e9 00233 (main.go:15)   LEAQ    (CX)(CX*1), R8  ;; **ISSUE 1** - change
   0x00ed 00237 (main.go:15)   LEAQ    3(CX)(CX*1), R9 ;; prime 'p' calculation; good
   0x00f2 00242 (main.go:16)   IMULQ   R9, R9  ;; 'sqr' calculation - good
   0x00f6 00246 (main.go:16)   ADDQ    $-3, R9 ;; 'sqr' - 3 is good
   0x00fa 00250 (main.go:16)   SARQ    $1, R9  ;; including shortcut divide by 2 - good
   0x00fd 00253 (main.go:16)   CMPQ    R9, SI  ;; advance range compare
   0x0100 00256 (main.go:16)   JGT $0, 319     ;; check - good
   0x0102 00258 (main.go:17)   MOVQ    R9, R11 ;; **ISSUE 2** - change
   0x0105 00261 (main.go:17)   SARQ    $5, R9  ;; calculate word address - good
   0x0109 00265 (main.go:17)   CMPQ    R9, DX  ;; *** only here if array bounds check
   0x010c 00268 (main.go:17)   JCC PANIC       ;; *** only here if array bounds check
   0x0112 00274 (main.go:17)   MOVL    (AX)(R9*4), R12 ;; **ISSUE 3** - not this way
   0x0116 00278 (main.go:17)   MOVQ    R11, R13    ;; **ISSUE 4** - not this way
   0x0119 00281 (main.go:17)   ANDQ    $31, R11        ;; **ISSUE 5** - unnecessary
   0x011d 00285 (main.go:17)   MOVQ    R11, CX     ;; part of **ISSUE 2** - change
   0x0120 00288 (main.go:17)   MOVL    R10, R14        ;; **ISSUE 6** - unnecessary
   0x0123 00291 (main.go:17)   SHLL    CX, R10     ;; 1 << ('j'&0x1F) - good
   0x0126 00294 (main.go:17)   ORL R10, R12                ;; part of **ISSUE 3**
   0x0129 00297 (main.go:17)   MOVL    R12, (AX)(R9*4) ;; part of **ISSUE 3**
   0x012d 00301 (main.go:16)   LEAQ    3(R8)(R13*1), R9    ;; part of **ISSUE 1**
   0x0132 00306 (main.go:13)   MOVQ    "".i+64(SP), CX ;; **ISSUE 7**; unnecessary
   0x0137 00311 (main.go:17)   MOVL    R14, R10        ;; part of **ISSUE 6**
   0x013a 00314 (main.go:16)   CMPQ    R9, SI  ;; end of loop
   0x013d 00317 (main.go:16)   JLE $0, 258     ;; check - good
   0x013f 00319 (main.go:13)   LEAQ    1(CX), R9       ;; part of **ISSUE #7** - not this way
   ```
5. **Expected:**  The program prints 78498, which is correct; the output is not the issue.
   
   The issue is the generated assembly. I would expect something closer to the following:
   
   ```
   OuterLoop:
   LEAQ    3(CX)(CX*1), R8 ;; prime 'p' calculation; good, **left in R8 as per ISSUE 1**
   MOVQ    R8, R9  ;; **ISSUE 1** fixed
   IMULQ   R9, R9  ;; 'sqr' calculation - good
   ADDQ    $-3, R9 ;; 'sqr' - 3 is good
   SARQ    $1, R9  ;; including shortcut divide by 2 - good - left in R9
   CMPQ    R9, SI  ;; advance range compare
   JGT PastInner       ;; check - good
   InnerLoop:
   MOVQ    R9, R11 ;; **ISSUE 2** fixed
   MOVQ    R9,CX   ;; **ISSUE 4** fixed
   SARQ    $5, R11 ;; calculate word address - good
   MOVQ    $1,R10  ;; **ISSUE 6** fixed
   CMPQ    R11, DX ;; *** only here if array bounds check
   JCC PANIC       ;; *** only here if array bounds check
   SHLL    CX, R10 ;; 1 << ('j'&0x1F) - good
   ADDQ    R8, R9  ;; part of **ISSUE  1** fixed
   ORL R10, (AX)(R11*4)    ;; **ISSUE 3** fixed
   CMPQ    R9, SI  ;; end of loop
   JLE InnerLoop   ;; check - good
   PastInner:
   MOVQ    "".i+64(SP), CX ;; **ISSUE 7** fixed; 'i' may well already be in another register
   LEAQ    1(CX), R9       ;; now more available registers, if other register, just ADD $1
   ```
   
   **ISSUE 1:**  The compiler preserves "2 \* 'i'" across the loop, which requires a full 'p' calculation inside the loop using the LEAQ instruction at 301. Preserving the full 'p' ('i' + 'i' + 3) instead would eliminate the need to recalculate 'p' inside the loop and would allow a simple add instruction at that point, which is a little faster.
   
   **ISSUE 2:**  The compiler copies the original value to a new register before clobbering the original register in order to save latency (ignoring that the CPU will likely use Out of Order Execution, OOE, anyway). A simple reordering of instructions would achieve the same thing and would not require the advanced value to be calculated/moved back into the original register at the end of the loop.  This is a common pattern.
   
   **ISSUE 3:**  The compiler ignores that "cmpsts[j>>5] |= ..." can be encoded as a single instruction, "ORL R..., (BASEREG)(INDEXREG*4)", which saves some complexity and time.
   
   **ISSUE 4:**  The same as ISSUE 2: a simple change in instruction order means that no register swap is needed, which also removes the need for the more complex LEA use.
   
   **ISSUE 5:**  When a uint32 value is shifted by a uint32 count, the compiler **correctly** eliminates a logical "and" by 0x1F (31), as the CPU limits the shift count to this range anyway. The issue is that when the shift count is a uint, the "and" is not eliminated as it should be (the workaround is to use uint32 for shift counts).  We should also check whether the mask is eliminated when an int32 is shifted with a count masked by 31, as it should be; in fact any masking above 31 (above 63 for 64-bit registers) is unnecessary.
   
   **ISSUE 6** is #16092, where a register is preserved instead of using a simple MOV immediate.  This is pervasive well beyond this loop: it is as if the compiler had a rule to "avoid immediate MOV at all costs".
   
   **ISSUE 7:**  This instruction is completely unnecessary: it restores a value to the CX register that is never used inside the loop and is clobbered on every iteration.  Correctly, the CX register should be reloaded, if necessary, outside the loop. This is issue #14761.
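
To look at ISSUE 3 and ISSUE 5 in isolation, the inner-loop statement can be pulled out into two small functions that differ only in the type of the shift count. This is only a sketch: the function names `setBitU32` and `setBitUint` are hypothetical and not part of the program above, and the code the compiler emits for them will depend on the Go version and target. Saving the file and running "go tool compile -S" on it should show whether the "and" by 31 survives in either form and whether the update becomes a single read-modify-write instruction.

```go
package main

import "fmt"

// setBitU32 marks bit j using a uint32 shift count, the form used in the
// original program. (Hypothetical helper name, for illustration only.)
func setBitU32(cmpsts []uint32, j int) {
	cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
}

// setBitUint is the same operation with a uint shift count, the form that
// ISSUE 5 describes as keeping the redundant mask. (Hypothetical helper name.)
func setBitUint(cmpsts []uint32, j int) {
	cmpsts[j>>5] |= uint32(1) << (uint(j) & 0x1F)
}

func main() {
	a := make([]uint32, 4)
	b := make([]uint32, 4)
	for j := 0; j < 128; j += 3 {
		setBitU32(a, j)
		setBitUint(b, j)
	}
	// Both forms must set exactly the same bits; only the generated code differs.
	fmt.Println(a[0] == b[0], a[1] == b[1], a[2] == b[2], a[3] == b[3])
}
```

The two functions are semantically identical, so any difference in the assembly output comes from how the compiler treats the shift count type.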

The general observation is that the compiler tends to overuse LEA instructions. They are very effective when necessary, but they are slightly slower than the simpler instructions they replace, and that cost does not seem to be taken into account.

**Summary:**  The Go compiler is still quite a way from optimal, and won't come close to C/C++ efficiency until issues like these are addressed.  The changes here aren't that complex, and some simple rule changes/additions should suffice.  While version 1.7 is better than 1.6, it is still nowhere near as efficient as it needs to be.  Rule changes/additions as suggested above can make tight inner loops run up to twice as fast or more on some architectures and in some situations, although not by that much here on amd64, as a high-end CPU will reorder instructions to minimize the execution time.


cmd/compile: new backend; multiple efficiency issues... #16192
