cmd/compile: poor register allocator behavior in compression code #16122

flanglet · 2016-06-20T00:20:30Z

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?
go1.6 windows/amd64 and go1.7beta1 windows/amd64
What operating system and processor architecture are you using (go env)?
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=E:\Users\fred\Documents\Prog\kanzi\go
set GORACE=
set GOROOT=E:\Program Files\go
set GOTOOLDIR=E:\Program Files\go\pkg\tool\windows_amd64
set GO15VENDOREXPERIMENT=1
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
set CXX=g++
set CGO_ENABLED=1
What did you do?
I ran "go build TestZRLT.go" then "TestZRLT.exe" both for Go 1.6 and 1.7 beta1
The source code is very simple: https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/test/TestZRLT.go.
It runs a correctness and a performance test tor the Zero Run Length Transform:
https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/function/ZRLT.go.
What did you expect to see?
I expected to see no performance regression from 1.6 to 1.7beta1
What did you see instead?
ZRLT encoding is much faster with 1.7beta1 but decoding is much slower.

Output for 1.6:
Speed test
Iterations: 50000

ZRLT encoding [ms]: 10694
Throughput [MB/s]: 222
ZRLT decoding [ms]: 7419
Throughput [MB/s]: 321

ZRLT encoding [ms]: 10753
Throughput [MB/s]: 221
ZRLT decoding [ms]: 7472
Throughput [MB/s]: 319

ZRLT encoding [ms]: 10724
Throughput [MB/s]: 222
ZRLT decoding [ms]: 7393
Throughput [MB/s]: 322

Output for 1.7beta1:
Speed test
Iterations: 50000

ZRLT encoding [ms]: 6834
Throughput [MB/s]: 348
ZRLT decoding [ms]: 11560
Throughput [MB/s]: 206

ZRLT encoding [ms]: 6828
Throughput [MB/s]: 349
ZRLT decoding [ms]: 11589
Throughput [MB/s]: 205

ZRLT encoding [ms]: 6790
Throughput [MB/s]: 351
ZRLT decoding [ms]: 11558
Throughput [MB/s]: 206

I narrowed down the issue to the run length decoding loop:

for val <= 1 {
    runLength = (runLength << 1) | int(val)
    srcIdx++

    if srcIdx >= srcEnd {
            break
    }

    val = src[srcIdx]
}

If I replace 'for val <= 1 {' with 'for val&1 == val {', the decoding becomes much faster (although not as fast as with Go 1.6)

Output for 1.7beta1 with code change:

Speed test
Iterations: 50000

ZRLT encoding [ms]: 6800
Throughput [MB/s]: 350
ZRLT decoding [ms]: 7669
Throughput [MB/s]: 310

ZRLT encoding [ms]: 6813
Throughput [MB/s]: 349
ZRLT decoding [ms]: 7689
Throughput [MB/s]: 310

ZRLT encoding [ms]: 6775
Throughput [MB/s]: 351
ZRLT decoding [ms]: 7662
Throughput [MB/s]: 311

The text was updated successfully, but these errors were encountered:

adg · 2016-06-20T01:23:13Z

cc @randall77 @josharian @dr2chase

randall77 · 2016-06-21T02:37:04Z

I'm not seeing what you are seeing.
The performance of 1.6.2 and tip are nearly identical for decode (tip is quite a bit better for encode).
I also don't see any variance between val <= 1 and val&1 == val.

That said, the code for the inner loop in decode isn't very good. The register allocator uses more copies than are necessary.

The indexing produces code like:

  y = x + 1
  load from p + x + 1

The latter is a single complex-addressing mode load, and the x+1 was folded into it.
In between those two instructions, both y and x are live. So the y=x+1 can't be done in place and a bunch of moves need to fix things up afterward.

I'll try to fix this up during the 1.8 cycle.

flanglet · 2016-06-21T03:39:25Z

OK. I suppose the issue has been addressed between 1.7.beta1 and tip then. What remains is an optimization.

rsc · 2016-10-21T04:32:44Z

@randall77, anything left here for Go 1.8?

randall77 · 2016-10-21T16:28:54Z

No, I did not get to fixing anything else here. Punting to 1.9.

josharian · 2017-05-16T19:18:04Z

I grabbed commit 3872c76b9b410c44428e74b2065f3e2291cb8095 of github.com/flanglet/kanzi based on the date of the commit and of this issue, extracted all relevant code into a single file, and updated it to use regular test/benchmark form: https://gist.github.com/josharian/e0bc6e238d4914a44289b44bc4ae3640.

This shows a steady regression over time (tip at b53acd8):

name \ time/op  17          18          tip
Speed/encode-8  314µs ± 1%  406µs ± 1%  438µs ± 2%
Speed/decode-8  314µs ± 1%  333µs ± 1%  356µs ± 1%

josharian · 2017-05-16T19:24:04Z

CL 43491 applied to tip helps a fair amount:

name            old time/op  new time/op  delta
Speed/encode-8   438µs ± 2%   374µs ± 0%  -14.71%  (p=0.000 n=10+9)
Speed/decode-8   356µs ± 1%   320µs ± 1%   -9.93%  (p=0.000 n=10+10)

That brings it back below 1.8 levels, although still not as good as 1.7.

josharian · 2017-05-18T18:10:43Z

CL 43491 is in, so we're back below the 1.8 level. Changing to milestone 1.10 for the original goal of getting back to 1.7 levels.

dr2chase · 2017-05-22T16:13:17Z

I spent some time (intermission at a concert) noodling with pictures of loops, and it seems to me that there's families of loops where we can arrange to have "optimal" answers, and we should probably arrange to do that in the block ordering code and then see what's left for heuristics.

For example, if there is a block in the loop that dominates all the exits and itself is conditional with an exit successor, rotate that block to the bottom, and its in-loop successor to the beginning of the run of blocks for the loop.

If we have a diamond with an exit on one of the two arms, A -> (B,C), B -> (X,D), C -> D, D -> A, order it C D A B.

In "layout", we'd do this by detecting transition into a loop, decoding the loop type, and inferring the best start block if it fits one of our models, and I think that the rest just falls out -- i.e., if you start with C, then the only logical successor is D then A, then B. This might subsume the loop rotation phase.

flanglet · 2017-12-11T20:24:23Z

go1.10beta1 shows a 15% performance regression on this code vs.
go 1.9.2 (see bench at [https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/benchmark/Functions_test.go]).

go1.9.2 windows/amd64

go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof

BenchmarkZRLT-8            10000           1032659 ns/op
PASS
ok      command-line-arguments  10.680s

Duration: 10.60s, Total samples = 10.42s (98.28%)
Showing nodes accounting for 10.41s, 99.90% of 10.42s total
Dropped 2 nodes (cum <= 0.05s)
      flat  flat%   sum%        cum   cum%
     6.39s 61.32% 61.32%      6.39s 61.32%  kanzi/function.(*ZRLT).Forward 
     4.02s 38.58% 99.90%      4.02s 38.58%  kanzi/function.(*ZRLT).Inverse  <===
         0     0% 99.90%     10.41s 99.90%  command-line-arguments.BenchmarkZRLT
         0     0% 99.90%     10.41s 99.90%  testing.(*B).launch 
         0     0% 99.90%     10.41s 99.90%  testing.(*B).runN

go1.10beta1 windows/amd64

go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof

BenchmarkZRLT-8            10000           1185367 ns/op
PASS
ok      command-line-arguments  12.179s

Duration: 12.10s, Total samples = 12.61s (104.20%)

Showing nodes accounting for 12.27s, 97.30% of 12.61s total
Dropped 46 nodes (cum <= 0.06s)
Showing top 10 nodes out of 24 
      flat  flat%   sum%        cum   cum%
     6.36s 50.44% 50.44%      6.36s 50.44%  kanzi/function.(*ZRLT).Forward
     5.58s 44.25% 94.69%      5.58s 44.25%  kanzi/function.(*ZRLT).Inverse   <===
     0.08s  0.63% 95.32%      0.08s  0.63%  runtime.stdcall1
     0.06s  0.48% 95.80%      0.13s  1.03%  runtime/pprof.(*profMap).lookup
     0.04s  0.32% 96.11%      0.07s  0.56%  runtime.mapaccess1_fast64
     0.04s  0.32% 96.43%      0.13s  1.03%  runtime.ready
     0.04s  0.32% 96.75%      0.17s  1.35%  runtime/pprof.(*profileBuilder).addC
PUData
     0.03s  0.24% 96.99%      0.07s  0.56%  runtime/pprof.readProfile
     0.02s  0.16% 97.15%      0.08s  0.63%  runtime.findrunnable
     0.02s  0.16% 97.30%      0.12s  0.95%  runtime.park_m

Problematic lines: 166 and 194


ROUTINE ======================== kanzi/function.(*ZRLT).Inverse 
     5.58s      5.58s (flat, cum) 44.25% of Total
         .          .    147:   runLength := 1
         .          .    148:   srcIdx, dstIdx := 0, 0
         .          .    149:   var err error
         .          .    150:
         .          .    151:   if srcIdx < srcEnd {
     410ms      410ms    152:           for dstIdx < dstEnd {
      90ms       90ms    153:                   if runLength > 1 {
     260ms      260ms    154:                           runLength--
      60ms       60ms    155:                           dst[dstIdx] = 0
         .          .    156:                           dstIdx++
         .          .    157:                           continue
         .          .    158:                   }
         .          .    159:
     340ms      340ms    160:                   if srcIdx >= srcEnd {
         .          .    161:                           break
         .          .    162:                   }
         .          .    163:
     770ms      770ms    164:                   val := src[srcIdx]
         .          .    165:
     1.14s      1.14s    166:                   if val <= 1 {
         .          .    167:                           // Generate the run leng
th bit by bit (but force MSB)
         .          .    168:                           runLength = 1
         .          .    169:
     100ms      100ms    170:                           for val <= 1 {
      50ms       50ms    171:                                   runLength = (run
Length << 1) | int(val)
      40ms       40ms    172:                                   srcIdx++
         .          .    173:
         .          .    174:                                   if srcIdx >= src
End {
         .          .    175:                                           break
         .          .    176:                                   }
         .          .    177:
      30ms       30ms    178:                                   val = src[srcIdx
]
         .          .    179:                           }
         .          .    180:
      50ms       50ms    181:                           continue
         .          .    182:                   }
         .          .    183:
         .          .    184:                   // Regular data processing
     340ms      340ms    185:                   if val == 0xFF {
         .          .    186:                           srcIdx++
         .          .    187:
         .          .    188:                           if srcIdx >= srcEnd {
         .          .    189:                                   break
         .          .    190:                           }
         .          .    191:
         .          .    192:                           dst[dstIdx] = 0xFE + src
[srcIdx]
         .          .    193:                   } else {
     1.62s      1.62s    194:                           dst[dstIdx] = val - 1
         .          .    195:                   }
         .          .    196:
         .          .    197:                   srcIdx++
     280ms      280ms    198:                   dstIdx++
         .          .    199:           }
         .          .    200:   }
         .          .    201:
         .          .    202:   // If runLength is not 1, add trailing 0s
         .          .    203:   end := dstIdx + runLength - 1

randall77 · 2017-12-11T22:26:40Z

@flanglet I don't see the same performance issues you do.
(I'm on linux/amd64, in case that makes a difference.)

$ ~/go1.9.2/bin/go test -test.bench=ZRLT -test.count=10 > log192
$ ~/goTip/bin/go test -test.bench=ZRLT -test.count=10 > logTip
$ benchstat log192 logTip 
name  old time/op  new time/op  delta
ZRLT   541µs ± 9%   547µs ± 9%   ~     (p=0.393 n=10+10)

There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.

flanglet · 2017-12-12T01:03:57Z

I don't see the same performance issues you do.
(I'm on linux/amd64, in case that makes a difference.)

I also do not see much difference on a Linux machine. But the performance drop is reproducible every time the benchmark runs on the Windows 7 machine.

There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.

Increase the test count. I will work on reducing the noise.

randall77 · 2017-12-12T01:39:44Z

@flanglet : Then I don't understand what is going on. The register allocator is identical on linux and windows, so this may be an entirely separate problem from the one that started this issue.
Are your linux and windows boxes two separate machines? Might it be the CPU, not the OS?

flanglet · 2017-12-12T02:01:36Z

The Windows 7 machine has an Intel Core7 2600 (Sandy bridge) CPU.
The Linux machine has an Intel Core7 7700K (Kaby Lake) CPU.
So. yeah, it might be CPU related.

randall77 · 2017-12-12T02:09:34Z

@TocarIP , might be up your alley.

TocarIP · 2017-12-12T22:19:51Z

Couldn't reproduce on Sandy Bridge
model name : Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
I see:
ZRLT-24 935µs ± 0% 931µs ± 1% -0.40% (p=0.009 n=10+10)

flanglet · 2017-12-12T23:56:52Z

Well, that is weird. It happens every time on my machine.
FWIW, I downloaded the go package from https://golang.org/dl/#go1.10beta1. Maybe you tested a slightly different version ?
Anyway, I will wait for newer releases and keep on testing this case.

adg changed the title ~~Go 1.7beta1 performance drop vs. Go 1.6 in compression code~~ cmd/compile: Go 1.7beta1 performance drop vs. Go 1.6 in compression code Jun 20, 2016

adg added this to the Go1.7Maybe milestone Jun 20, 2016

randall77 modified the milestones: Go1.8, Go1.7Maybe Jun 21, 2016

randall77 self-assigned this Jun 21, 2016

quentinmit changed the title ~~cmd/compile: Go 1.7beta1 performance drop vs. Go 1.6 in compression code~~ cmd/compile: poor register allocator behavior in compression code Jun 30, 2016

quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Jun 30, 2016

randall77 added the Performance label Oct 7, 2016

randall77 modified the milestones: Go1.9, Go1.8 Oct 21, 2016

This was referenced May 16, 2017

cmd/compile: looprotate picking wrong jump target #20355

Closed

cmd/compile: new backend; multiple efficiency issues... #16192

Open

proposal: expand the go1 bench suite #20384

Closed

josharian modified the milestones: Go1.10, Go1.9 May 18, 2017

bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017

bradfitz modified the milestones: Go1.11, Unplanned Jun 19, 2018

randall77 mentioned this issue Apr 2, 2019

cmd/compile: loops compiled differently when using range or index #31205

Closed

rsc unassigned randall77 Jun 23, 2022

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022

randall77 mentioned this issue Jan 24, 2023

cmd/compile: performance regression on ppc64 due to change in scheduling CL 270940 #57976

Closed

Woodpile37 mentioned this issue Mar 21, 2023

[Snyk] Upgrade webpack from 5.64.4 to 5.75.0 Woodpile37/go#2

Merged

snyk-bot mentioned this issue Mar 25, 2023

[Snyk] Upgrade webpack from 5.64.4 to 5.75.0 Woodpile37/go#4

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: poor register allocator behavior in compression code #16122

cmd/compile: poor register allocator behavior in compression code #16122

flanglet commented Jun 20, 2016

adg commented Jun 20, 2016

randall77 commented Jun 21, 2016

flanglet commented Jun 21, 2016

rsc commented Oct 21, 2016

randall77 commented Oct 21, 2016

josharian commented May 16, 2017

josharian commented May 16, 2017 •

edited

Loading

josharian commented May 18, 2017

dr2chase commented May 22, 2017

flanglet commented Dec 11, 2017

randall77 commented Dec 11, 2017

flanglet commented Dec 12, 2017

randall77 commented Dec 12, 2017

flanglet commented Dec 12, 2017

randall77 commented Dec 12, 2017

TocarIP commented Dec 12, 2017

flanglet commented Dec 12, 2017

cmd/compile: poor register allocator behavior in compression code #16122

cmd/compile: poor register allocator behavior in compression code #16122

Comments

flanglet commented Jun 20, 2016

adg commented Jun 20, 2016

randall77 commented Jun 21, 2016

flanglet commented Jun 21, 2016

rsc commented Oct 21, 2016

randall77 commented Oct 21, 2016

josharian commented May 16, 2017

josharian commented May 16, 2017 • edited Loading

josharian commented May 18, 2017

dr2chase commented May 22, 2017

flanglet commented Dec 11, 2017

randall77 commented Dec 11, 2017

flanglet commented Dec 12, 2017

randall77 commented Dec 12, 2017

flanglet commented Dec 12, 2017

randall77 commented Dec 12, 2017

TocarIP commented Dec 12, 2017

flanglet commented Dec 12, 2017

josharian commented May 16, 2017 •

edited

Loading