cmd/compile: new backend; multiple efficiency issues... #16192
Issue 2: yes, it looks like there are cases where we should perform the op on the copy instead of the original. I'll have to think about how to detect when that is beneficial.
I think Issues 1 and 4 are both symptoms of this: to me it looks like it would be better to keep the original and modify the copy more often than not.
x86 doesn't have so many modes, but I think it has the read/modify/write instructions? Regarding addressing complexity, all the addressing modes are already used by the LEA instruction, although the execution-time estimates don't seem quite right yet; if you solve that one, then read/modify/write is solved except for latency. Your concern about OR (BX),AX is misplaced, I think, as its latency just depends on whether the memory is in the cache or not, as for all memory accesses; OR AX,(BX) is a concern only if the destination is in the dependency chain, in which case it is quite long, but it is almost always shorter than, or about the same as, separate MOV (BX),AX / XOR DX,AX / MOV AX,(BX) instructions. Note that 1) there is a latency on the first MOV, but the manipulations of the AX register can hide it, and 2) these instructions are usually used in this order where there is no dependency on the final (BX) destination. I'm just saying: GCC can do it in the specific case where the modify/assignment operators are used, so why can't we? GCC does not use them when the modify/assignment operators are not used, although Clang/LLVM does. We'll never be able to do some kinds of extreme loop unrolling effectively without it (modulo-style things).
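To make the pattern concrete, here is a minimal Go sketch (not taken from the program in this issue; the names are illustrative only) of the source statement being discussed, with the two candidate instruction sequences sketched in comments. The assembly shown is an approximation of the idea, not verified compiler output:

package cull

func orWord(cmpsts []uint32, j int, mask uint32) {
	// Separate load / modify / store, roughly the shape being generated today:
	//   MOVL (base)(idx*4), AX
	//   ORL  CX, AX
	//   MOVL AX, (base)(idx*4)
	// Fused read/modify/write alternative, a single instruction:
	//   ORL  CX, (base)(idx*4)
	cmpsts[j>>5] |= mask
}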
Yes, I think you're right.
The SSA backend encodes each addressing mode in a separate opcode. So for integer loads, we already have: I'd like to solve this problem at some point. Maybe we bite the bullet and just do all those opcodes (and corresponding rewrite rules). Maybe we autogenerate them somehow. It's low priority at the moment because I don't think it buys a whole lot (it's the same micro-ops, just encoded more efficiently) and it's only x86. It's not needed for all the other architectures, and those are higher priority at the moment.
You are right that in the general case using the read/modify/write opcodes doesn't make much difference to timing, as the latency of the first move is generally masked by the computation of the modify value, so the difference is likely only ten to fifteen percent for a tight loop like this one. However, if one is optimizing using loop-unrolling techniques (where the modify value is either an immediate or a pre-calculated register), it can make a very large difference: the current load, followed by the very short modify, followed by the store takes 3 clock cycles, where a single read/modify/write instruction would take only 1 clock cycle (both with no latency/dependency on the memory contents of the store/write). Often, for maximum speed, the base address will be an unsafe.Pointer, which produces the same load, then modify, then store template as currently used here (in fact it has another issue: it computes the simple base address for the MOVs with an LEA instruction when that isn't necessary, which makes the timing even worse; I'll raise an issue about that once I document it properly).
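As a rough illustration of the unsafe.Pointer case mentioned above (this is my own sketch, not the original program; the function and variable names are hypothetical), the word is addressed straight off a base pointer, yet the |= still goes through a separate load, modify, and store in the generated code rather than one read/modify/write instruction:

package cull

import "unsafe"

// orWordUnsafe sets the given mask bits in the 32-bit word selected by
// bit index j, addressing the word directly from a raw base pointer.
func orWordUnsafe(base unsafe.Pointer, j uintptr, mask uint32) {
	word := (*uint32)(unsafe.Pointer(uintptr(base) + (j>>5)*4))
	*word |= mask // candidate for a single ORL with a memory operand
}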
As I said, it seems you are already handling all of the addressing modes for LEA, although not the load/store variations; couldn't you just apply those rules to all the variations of the other opcodes? Autogenerating the code seems to be the way to do it, as the rules would seem to be the same for each different load or store across the different opcodes. I can't comment on your priorities, and the current read-then-modify-then-write implementation is "adequate", but we'll never compete with "Cee" (C/C++) extreme optimizations without this.
Just as a further goal to shoot for, Rust/LLVM produces the following assembly code for the same algorithm (in Rust I had to use an unsafe mutable pointer reference to eliminate the array bounds check); it is about as good as it gets without hand optimizing:
On my Skylake Intel i5-6500, this takes about 3.2 CPU cycles per loop. The CPU is quite likely executing these instructions out of order (OOE) to minimize latencies when registers are updated and immediately used in the next instruction. Note that the above code recognizes that the loop variable can sit in the rcx register and thus be used directly for the variable left shift of 32-bit registers (this also works when shifting 64-bit registers), as the mask of the lower five bits (six bits for 64-bit) is unnecessary and has been eliminated. golang version 1.7 eliminates the mask, but does not then recognize that it can also use rcx as the loop variable, which would eliminate one register move and the use of one extra register. A total of only six registers are used in the above inner loop, meaning it could be just as efficient in x86 code. Note that it seems stable golang version 1.7, at least for the x86_64 target, is very good at eliminating the bounds check in this case, recognizing that …
I'll leave it to @randall77 to decide about all the specific recommendations here and whether we want to tackle any of them for 1.9. A quick update on the overall execution speed, though--things have at least been getting steadily better. For 1.7.5, 1.8.1, and tip at b53acd8:
name \ time/op    p17          p18          ptip
PrimesOdds-8      3.17ns ± 1%  2.56ns ± 1%  2.23ns ± 1%
This uses the same code as before, but converted straightforwardly into a benchmark.
package p
import (
	"math"
	"testing"
)

func primesOdds(top uint32) func() uint32 {
	topndx := int((top - 3) / 2)
	topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2
	cmpsts := make([]uint32, (topndx/32)+1)
	for i := 0; i <= topsqrtndx; i++ {
		if cmpsts[i>>5]&(1<<(uint32(i)&0x1F)) == 0 {
			p := i + i + 3
			for j := (p*p - 3) >> 1; j <= topndx; j += p {
				cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
			}
		}
	}
	i := -1
	return func() uint32 {
		oi := i
		if i <= topndx {
			i++
		}
		for i <= topndx && cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) != 0 {
			i++
		}
		if oi < 0 {
			return 2
		} else {
			return (uint32(oi) << 1) + 3
		}
	}
}

func BenchmarkPrimesOdds(b *testing.B) {
	iter := primesOdds(1000000)
	for i := 0; i < b.N; i++ {
		iter()
	}
}
Josh, would you consider adding this benchmark to the Go1 suite?
I'd love to see the Go1 suite expand. (And maybe shrink, too--I think fmt is over-represented.) I'm sure we could find a bunch of candidates (#16122 also comes to mind--recency effects). But we should do a considered, bulk addition. Would you mind filing a new issue, where we could collect a set of proposed additions and then consider them en masse?
@josharian, maybe an issue label would do here?
My problem with the mini benchmark from @josharian is that the timing includes the determination of the results and not just the culling of the composites, as was the intention; thus, the improved results may only indicate better iteration rather than an improved tight composite-culling loop, which is what was intended to be tested. The benchmark timing should only include the primesOdds(1000000) line and not the iter() lines.
Sorry for misinterpreting. To be clear, is this the correct translation?
var sink func() uint32

func BenchmarkPrimesOdds(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = primesOdds(1000000)
	}
}
For that, with tip at c20e545, I see a regression from 1.8 to 1.9:
I see (where tip == 1caa062):
Version: go version go1.7beta2 windows/amd64
Environment:
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=F:\Go
set GORACE=
set GOROOT=F:\Go
set GOTOOLDIR=F:\Go\pkg\tool\windows_amd64
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\Super\AppData\Local\Temp\go-build631535858=/tmp/go-build -gno-record-gcc-switches
set CXX=g++
set CGO_ENABLED=1
However, these issues likely also apply to other operating systems and architectures, at least 386 as well.
Runnable Program: Go Playground link:
Expected: 78498 is expected and that is what is output, so the program's output is not the issue.
What I saw: The issue is the assembly code as viewed when "go tool compile -S > main.asm" is run, with a portion of that file as follows:
ISSUE 1: Preserves "2 * 'i'", which requires a full 'p' calculation inside the loop using an LEAQ instruction at 272, instead of preserving the full 'p' ('i' + 'i' + 3), which would eliminate recalculating 'p' inside the loop and would allow a simple, slightly faster add instruction at line 272.
ISSUE 2: Preserves the original value in a new register before clobbering the original register in order to save latency (ignoring that the CPU will likely use Out-of-Order Execution (OOE) anyway), where a simple reordering of instructions would achieve the same thing and not require the advanced contents to be copied back to the original register at the end of the loop. This is a common pattern.
ISSUE 3: Ignores that "cmpsts[j>>5] |= ..." can be encoded with a single instruction, "ORL R..., (BASEREG)(INDEXREG*4)", saving some complexity and time.
ISSUE 4: The same as ISSUE 2, where a simple change of instruction order would mean no register swap needs to be made and would avoid the more complex use of LEA.
ISSUE 5: When a uint32 value is shifted by a uint32 count, the compiler correctly eliminates a logical "and" with 0x1F (31), as the CPU limits the shift to that range anyway; the issue is that when shifting by a uint count, it doesn't eliminate it as it should (the workaround is to use uint32 for shift counts; a short source-level sketch follows this list). We should check whether an int32 shifted by 31 bits also gets the mask eliminated, as it should; in fact, any masking above 31 (above 63 for 64-bit registers) is unnecessary.
ISSUE 6 is cmd/compile: complicated bounds check elimination #16092, where a register is preserved instead of using a simple MOV immediate. This is pervasive well beyond this example, as if the compiler had an "avoid immediate MOVs at all costs" rule.
ISSUE 7: This instruction is completely unnecessary: it restores a value to the CX register that never gets used in the loop and gets clobbered on each iteration. Correctly, the CX register should be reloaded, if necessary, outside the loop. This is issue cmd/compile: regalloc restoring to dead register #14761.
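As referenced in ISSUE 5, here is a small Go sketch of the shift-count masking pattern (the function and variable names are mine, not from the original program; the comments only restate the behaviour reported above and are not verified compiler output):

package cull

func shiftMask32(x uint32, n uint32, m uint) (uint32, uint32) {
	a := x << (n & 0x1F) // uint32 count: the AND is reportedly eliminated
	b := x << (m & 0x1F) // uint count: the AND is reportedly kept (workaround: use a uint32 count)
	return a, b
}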
The general observation is that the compiler tends to overuse LEA instructions. They are very effective when genuinely needed, but they are slightly slower than the simpler instructions they often replace, and that cost doesn't seem to be taken into account.
Summary: The golang compiler is still quite a way from optimal, and it won't come close to "Cee" (C/C++) efficiency until it gets much closer than this. The changes here aren't that complex, and some simple rule changes/additions should suffice. While version 1.7 is better than 1.6, it is still nowhere near as efficient as it needs to be. Rule changes/additions as suggested above can make tight inner loops run up to twice as fast or more on some architectures and in some situations, although not that much here for amd64, as a high-end CPU will reorder instructions to minimize execution time.