Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pgo: profile-guided optimization for golang v1.17 #49688

Closed
wants to merge 5,533 commits into from

Conversation

rajbarik
Copy link
Contributor

This PR implements profile-guided optimization for Golang v1.17. At a very high-level, our implementation consumes PPROF CPU profiles via the compiler flag '-fpprofprofileuse' and performs three key optimizations based on profiles: code specialization ('-specialize'), inlining ('-pprofinline'), code layout ('-codelayout'). Code specialization targets interface method calls to specialize them to direct function calls. Inlining decisions are made based on profiles. Codelayout optimization currently reorders functions based on profiles. BasicBlock layout optimization is currently WIP.

Inlining based on profiles yields decent wins (avg 15%, max 40%).

For bitset, we see ~19% gain from inlining. Please find the steps below to reproduce:

(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -cpuprofile LemireIterate.pprof
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1780585980 ns/op
PASS
ok github.com/bits-and-blooms/bitset 12.828s

(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-fpprofprofileuse LemireIterate.pprof -specialize -pprofinline"
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1597644922 ns/op
PASS
ok github.com/bits-and-blooms/bitset 10.303s

Please note: due to Go compiler dependency issues, we had to duplicate a lot of functionalities from PPROF and from inline/inl.go in order to make this work. We are currently working closely with Google-go folks to eliminate these redundancies.

@google-cla google-cla bot added the cla: no Used by googlebot to label PRs as having an invalid CLA. The text of this label should not change. label Nov 19, 2021
@rajbarik rajbarik changed the base branch from master to release-branch.go1.17 November 19, 2021 18:28
@mvdan
Copy link
Member

mvdan commented Nov 20, 2021

Did you see #28262? Usually, big features like these are designed and discussed in the issue tracker first. This is a very large amount of code to show up with all of a sudden :)

@ianlancetaylor
Copy link
Contributor

@mvdan and others: for background, the authors of this CL have been discussing their PGO implementation privately with the Go compiler team. This pull request is the mechanism they are using to share their code changes. This pull request certainly isn't going to go in as is--for one thing, it's against the 1.17 branch--but will be used for further discussion and, I hope, a later updated implementation on mainline.

So, it's new to the community (and new to me too, for that matter) but it's not entirely out of the blue.

@mvdan
Copy link
Member

mvdan commented Nov 22, 2021

Thanks for the quick reply, @ianlancetaylor! I had some thoughts but they weren't related to the code changes, so I've left a comment on the issue instead.

@rajbarik
Copy link
Contributor Author

Thank you for proving the context @ianlancetaylor. This PR is still WIP as we (Uber) are working closely with @cherrymui @aclements and @jeremyfaller on cleaning up compiler dependencies (which would avoid a large amount of code duplication), supporting the common profiling format, adding bblayout optimization, and integrating our code in the main branch.

@josharian
Copy link
Contributor

Usually changes of this magnitude start with a publicly circulated design proposal.

@jeremyfaller
Copy link
Contributor

Usually changes of this magnitude start with a publicly circulated design proposal.

This will be no different. I don't think anyone is implying we would skip the proposal process. Uber's code drop is to get their work and ours on the same page, so we can build from there. I don't know what form it will take, but we have to start by sharing. :)

@prattmic
Copy link
Member

prattmic commented Dec 2, 2021

Minor note: @google-cla thinks there is no CLA signed yet. Per https://opensource.google/docs/cla/#external-contributors, I believe you need to get added to the appropriate "Google CLA" Google Group within Uber to be seen as from Uber by the bot. This will be needed eventually before changes can go in, though as noted above, design docs / proposals will come well before that.

@lni
Copy link

lni commented Dec 3, 2021

@rajbarik, thanks for sharing this very interesting WIP patch!

When building your feature branch at https://github.com/rajbarik/go/tree/pgo, I got the following errors

~/src/go/src$ ./make.bash
Building Go cmd/dist using /usr/local/go. (go1.17.2 darwin/arm64)
Building Go toolchain1 using /usr/local/go.
# bootstrap/cmd/compile/internal/pgo
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:230: fn.NodeFreq undefined (type *ir.Func has no field or method NodeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:230: undefined: ir.NodeFrequency
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:258: call.EdgeFreq undefined (type *ir.CallExpr has no field or method EdgeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:259: call.EdgeFreq undefined (type *ir.CallExpr has no field or method EdgeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:259: undefined: ir.EdgeFrequency
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:262: call.EdgeFreq undefined (type *ir.CallExpr has no field or method EdgeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:262: undefined: ir.EdgeFrequency
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:303: f.NodeFreq undefined (type *ir.Func has no field or method NodeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:303: undefined: ir.NodeFrequency
go tool dist: FAILED: /usr/local/go/bin/go install -gcflags=-l -tags=math_big_pure_go compiler_bootstrap bootstrap/cmd/...: exit status 2

quick grep search shows ir.NodeFrequency is indeed missing from the ir package.

Did I miss something here? Thanks.

@rajbarik
Copy link
Contributor Author

rajbarik commented Dec 3, 2021

@lni My apologies. I have updated the PR now and have tested it to work.

@cherrymui fyi.

@rajbarik
Copy link
Contributor Author

rajbarik commented Dec 3, 2021

Minor note: @google-cla thinks there is no CLA signed yet. Per https://opensource.google/docs/cla/#external-contributors, I believe you need to get added to the appropriate "Google CLA" Google Group within Uber to be seen as from Uber by the bot. This will be needed eventually before changes can go in, though as noted above, design docs / proposals will come well before that.

Thanks for pointing this out. I am working on getting this fixed.

@lni
Copy link

lni commented Dec 4, 2021

@rajbarik thanks a lot for the quick response.

I've managed to build your patched version of go, interestingly I couldn't reproduce the above reported performance improvements. That LemireIterateManyb benchmark from bitset is slower when PGO is enabled. Please see my results below.

$ /Users/lni/src/go/bin/go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -cpuprofile LemireIterate.pprof
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkLemireIterateManyb 	       3	   2676432 ns/op
PASS
ok  	github.com/bits-and-blooms/bitset	0.230s

$ /Users/lni/src/go/bin/go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-fpprofprofileuse LemireIterate.pprof -specialize -pprofinline"
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkLemireIterateManyb 	       3	   2775468 ns/op
PASS
ok  	github.com/bits-and-blooms/bitset	0.233s

two observed differences in our benchmark outputs,

  1. your ns/op is significantly higher than mine, 1780585980 ns/op vs. 2676432 ns/op
  2. different processors, i7-8850H processor at 2.6Ghz vs. i9-9980HK at 2.4Ghz

I've tried on a M1 ARM processor based mbp as well and I got similar results.

Did I still miss anything here? Thanks!

@rajbarik
Copy link
Contributor Author

rajbarik commented Dec 6, 2021

@lni Great to hear that you are able to build and test the patch.
AFAICT, performance improvements may vary between hardware generations and SKUs. You may want to increase the benchtime to 10x or more to increase the execution time [in order to eliminate any noises]. Also, I recommend adding "-m" to gcflags and mining the logs to ensure inlining via pprof profiles is indeed working as expected [before and after]. Please see example below:

Before pprofinline

(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-m" -cpuprofile LemireIterate.pprof 2> before_log.txt
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1483586757 ns/op
PASS
ok github.com/bits-and-blooms/bitset 11.476s

(bitset) go tool pprof LemireIterate.pprof
Type: cpu
Time: Dec 6, 2021 at 9:42am (PST)
Duration: 10.74s, Total samples = 9.29s (86.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 9.21s, 99.14% of 9.29s total
Dropped 19 nodes (cum <= 0.05s)
flat flat% sum% cum cum%
3.29s 35.41% 35.41% 3.39s 36.49% github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
2.69s 28.96% 64.37% 9.11s 98.06% github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
1.57s 16.90% 81.27% 3s 32.29% github.com/bits-and-blooms/bitset.(*BitSet).Set
1.38s 14.85% 96.12% 1.43s 15.39% github.com/bits-and-blooms/bitset.(*BitSet).extendSetMaybe

After pprofinline

bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-m -fpprofprofileuse LemireIterate.pprof -pprofinline" -cpuprofile LemireIterate_after.pprof 2> after_log.txt
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1311964698 ns/op
PASS
ok github.com/bits-and-blooms/bitset 8.782s
(bitset) go tool pprof LemireIterate_after.pprof
Type: cpu
Time: Dec 6, 2021 at 9:46am (PST)
Duration: 8.02s, Total samples = 6.97s (86.95%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 6.93s, 99.43% of 6.97s total
Dropped 9 nodes (cum <= 0.03s)
flat flat% sum% cum cum%
3.38s 48.49% 48.49% 3.40s 48.78% github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
3.31s 47.49% 95.98% 6.81s 97.70% github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb

after_log.txt content

PprofInline: [hot] inline (*BitSet).extendSetMaybe in (*BitSet).Set with weight 5.20
PprofInline: [hot] inline (*BitSet).Set in BenchmarkLemireIterateManyb with weight 10.92
PprofInline: [hot] inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb with weight 12.34

Please let me know if they help.

@lni
Copy link

lni commented Dec 8, 2021

@rajbarik Thanks for the help above, sadly I still couldn't reproduce the performance improvements.

Based on your suggestion above, I changed benchtime to 1000x and repeated the test -

lni$ /Users/lni/src/go/bin/go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m" -cpuprofile LemireIterate.pprof 2> before_log.txt
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkLemireIterateManyb 	    1000	   2275634 ns/op
PASS
ok  	github.com/bits-and-blooms/bitset	3.114s

lni$ go tool pprof LemireIterate.pprof
Type: cpu
Time: Dec 8, 2021 at 11:07pm (CST)
Duration: 2.52s, Total samples = 2.09s (82.86%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for 2.07s, 99.04% of 2.09s total
Dropped 14 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     1.61s 77.03% 77.03%      1.63s 77.99%  github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
     0.41s 19.62% 96.65%      2.05s 98.09%  github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
     0.03s  1.44% 98.09%      0.03s  1.44%  runtime.asyncPreempt
     0.02s  0.96% 99.04%      0.02s  0.96%  runtime.kevent
         0     0% 99.04%      0.02s  0.96%  runtime.netpoll
         0     0% 99.04%      0.02s  0.96%  runtime.startTheWorldWithSema
         0     0% 99.04%      0.03s  1.44%  runtime.systemstack
         0     0% 99.04%      2.05s 98.09%  testing.(*B).launch
         0     0% 99.04%      2.05s 98.09%  testing.(*B).runN

lni$ /Users/lni/src/go/bin/go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m -fpprofprofileuse LemireIterate.pprof -pprofinline" -cpuprofile LemireIterate_after.pprof 2> after_log.txt
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkLemireIterateManyb 	    1000	   2290498 ns/op
PASS
ok  	github.com/bits-and-blooms/bitset	2.599s

lni$ go tool pprof LemireIterate_after.pprof
Type: cpu
Time: Dec 8, 2021 at 11:08pm (CST)
Duration: 2.49s, Total samples = 2.05s (82.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top40
Showing nodes accounting for 2.03s, 99.02% of 2.05s total
Dropped 9 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     1.48s 72.20% 72.20%      1.52s 74.15%  github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
     0.48s 23.41% 95.61%      2.04s 99.51%  github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
     0.05s  2.44% 98.05%      0.05s  2.44%  runtime.asyncPreempt
     0.02s  0.98% 99.02%      0.02s  0.98%  runtime.memclrNoHeapPointers
         0     0% 99.02%      0.02s  0.98%  github.com/bits-and-blooms/bitset.(*BitSet).Set
         0     0% 99.02%      0.02s  0.98%  github.com/bits-and-blooms/bitset.(*BitSet).extendSetMaybe
         0     0% 99.02%      0.03s  1.46%  runtime.makeslice
         0     0% 99.02%      0.03s  1.46%  runtime.mallocgc
         0     0% 99.02%      0.02s  0.98%  runtime.memclrNoHeapPointersChunked
         0     0% 99.02%      2.02s 98.54%  testing.(*B).launch
         0     0% 99.02%      0.02s  0.98%  testing.(*B).run1.func1
         0     0% 99.02%      2.04s 99.51%  testing.(*B).runN

lni$ cat after_log.txt | grep NextSetMany
PprofInline: can't determine the hotness, but inline trailingZeroes64 in (*BitSet).NextSetMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany

lni$ cat after_log.txt | grep "\[hot\]"
PprofInline: [hot] inline (*BitSet).extendSetMaybe in (*BitSet).Set with weight 3.57
PprofInline: [hot] inline strconv.FormatInt in (*BitSet).String with weight 3.57
PprofInline: [hot] inline (*BitSet).Set in BenchmarkLemireIterateManyb with weight 7.14
PprofInline: [hot] inline (*BitSet).String in TestStringLong with weight 3.57
PprofInline: [hot] inline New in TestBitSetHuge with weight 3.57

Please let me know if there anything else I should try. Thanks!

@zhouguangyuan0718
Copy link
Contributor

I‘m trying. ButI couldn't reproduce the performance improvements, too.
The result is a little different with lni.

$ go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m" -cpuprofile LemireIterate.pprof 2> before_log.txt
goos: linux
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
BenchmarkLemireIterateManyb         1000           2482742 ns/op
PASS
ok      github.com/bits-and-blooms/bitset       2.754s
$ go tool pprof LemireIterate.pprof
File: bitset.test
Type: cpu
Time: Dec 10, 2021 at 1:18am (CST)
Duration: 2.72s, Total samples = 2.60s (95.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)  top30
Showing nodes accounting for 2.58s, 99.23% of 2.60s total
Dropped 12 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     1.95s 75.00% 75.00%      2.07s 79.62%  github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
     0.42s 16.15% 91.15%      2.50s 96.15%  github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
     0.12s  4.62% 95.77%      0.12s  4.62%  github.com/bits-and-blooms/bitset.trailingZeroes64 (inline)
     0.09s  3.46% 99.23%      0.09s  3.46%  runtime.memclrNoHeapPointers
         0     0% 99.23%      0.09s  3.46%  github.com/bits-and-blooms/bitset.New
         0     0% 99.23%      0.09s  3.46%  github.com/bits-and-blooms/bitset.TestBitSetHuge
         0     0% 99.23%      0.02s  0.77%  runtime.(*mcache).allocLarge
         0     0% 99.23%      0.10s  3.85%  runtime.makeslice
         0     0% 99.23%      0.10s  3.85%  runtime.mallocgc
         0     0% 99.23%      0.08s  3.08%  runtime.memclrNoHeapPointersChunked
         0     0% 99.23%      2.49s 95.77%  testing.(*B).launch
         0     0% 99.23%      2.50s 96.15%  testing.(*B).runN
         0     0% 99.23%      0.10s  3.85%  testing.tRunner
(pprof) q
go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m -fpprofprofileuse LemireIterate.pprof -pprofinline" -cpuprofile LemireIterate_after.pprof 2> after_log.txt
goos: linux
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
BenchmarkLemireIterateManyb         1000           2504440 ns/op
PASS
ok      github.com/bits-and-blooms/bitset       2.852s
$  go tool pprof LemireIterate.pprof
File: bitset.test
Type: cpu
Time: Dec 10, 2021 at 1:18am (CST)
Duration: 2.72s, Total samples = 2.60s (95.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for 2.58s, 99.23% of 2.60s total
Dropped 12 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     1.95s 75.00% 75.00%      2.07s 79.62%  github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
     0.42s 16.15% 91.15%      2.50s 96.15%  github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
     0.12s  4.62% 95.77%      0.12s  4.62%  github.com/bits-and-blooms/bitset.trailingZeroes64 (inline)
     0.09s  3.46% 99.23%      0.09s  3.46%  runtime.memclrNoHeapPointers
         0     0% 99.23%      0.09s  3.46%  github.com/bits-and-blooms/bitset.New
         0     0% 99.23%      0.09s  3.46%  github.com/bits-and-blooms/bitset.TestBitSetHuge
         0     0% 99.23%      0.02s  0.77%  runtime.(*mcache).allocLarge
         0     0% 99.23%      0.10s  3.85%  runtime.makeslice
         0     0% 99.23%      0.10s  3.85%  runtime.mallocgc
         0     0% 99.23%      0.08s  3.08%  runtime.memclrNoHeapPointersChunked
         0     0% 99.23%      2.49s 95.77%  testing.(*B).launch
         0     0% 99.23%      2.50s 96.15%  testing.(*B).runN
         0     0% 99.23%      0.10s  3.85%  testing.tRunner
(pprof) q
$  cat after_log.txt | grep NextSetMany
PprofInline Caller=github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb Callee=github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany Weight=2200000000
PprofInline: can't determine the hotness, but inline trailingZeroes64 in (*BitSet).NextSetMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb
PprofInline: [hot] inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb with weight 28.03
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany
$   cat after_log.txt | grep "\[hot\]"
PprofInline: [hot] inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb with weight 28.03

@rajbarik
I want to know what else I should try, too.
Thank you.

@zhouguangyuan0718
Copy link
Contributor

For github.com/bits-and-blooms/bitset. I run benchmark for all the testcase. And add a test flag -count=100.

$ go test -bench=. -cpu 1 -parallel 1 -benchtime=3x -count=100 -cpuprofile all.pprof > before.txt
$ go test -bench=. -cpu 1 -parallel 1 -benchtime=3x -count=100 -gcflags="-fpprofprofileuse all.pprof -specialize -pprofinline" >after.txt
$ benchstat ./before.txt after.txt
name                                             old time/op   new time/op  delta
Set                                               91.4ns ±52%  55.8ns ±40%  -38.89%  (p=0.000 n=95+95)
GetTest                                           81.6ns ±38%  62.2ns ±39%  -23.72%  (p=0.000 n=98+98)
SetExpand                                        1.02µs ±153%  1.95µs ±15%  +90.68%  (p=0.000 n=98+91)
Count                                             1.16µs ± 1%  1.16µs ± 2%     ~     (p=0.777 n=90+96)
Iterate                                           13.1µs ± 0%  13.1µs ± 0%   -0.06%  (p=0.003 n=92+86)
SparseIterate                                     14.5µs ±30%  13.6µs ± 1%   -6.49%  (p=0.000 n=99+84)
LemireCreate                                      14.9ms ± 6%  15.7ms ± 4%   +5.62%  (p=0.000 n=95+97)
LemireCount                                       1.54ms ± 9%  1.51ms ± 7%   -1.85%  (p=0.000 n=98+100)
LemireIterate                                     4.20ms ± 1%  4.16ms ± 1%   -1.11%  (p=0.000 n=88+99)
LemireIterateb                                    4.22ms ± 1%  4.17ms ± 1%   -1.11%  (p=0.000 n=92+100)
LemireIterateManyb                                2.94ms ± 5%  2.98ms ± 2%   +1.05%  (p=0.000 n=94+96)
FlorianUekermannIterateMany                        724ns ±10%   679ns ± 9%   -6.24%  (p=0.000 n=89+98)
FlorianUekermannIterateManyReg                    1.46µs ± 4%  1.40µs ± 4%   -4.45%  (p=0.000 n=100+97)
FlorianUekermannIterateManyComp                    788ns ±11%   926ns ± 9%  +17.47%  (p=0.000 n=97+100)
FlorianUekermannLowDensityIterateMany             1.16ms ± 1%  1.19ms ± 1%   +2.54%  (p=0.000 n=91+97)
FlorianUekermannLowDensityIterateManyReg          1.14ms ± 1%  1.13ms ± 1%   -0.74%  (p=0.000 n=90+98)
FlorianUekermannLowDensityIterateManyComp         1.34ms ± 1%  1.35ms ± 1%   +0.54%  (p=0.000 n=94+93)
FlorianUekermannMidDensityIterateMany             12.7ms ± 1%  12.9ms ± 1%   +1.79%  (p=0.000 n=89+100)
FlorianUekermannMidDensityIterateManyReg          19.0ms ± 1%  18.9ms ± 0%   -0.21%  (p=0.000 n=94+97)
FlorianUekermannMidDensityIterateManyComp         13.8ms ± 1%  13.7ms ± 0%   -0.45%  (p=0.000 n=92+99)
FlorianUekermannMidStrongDensityIterateMany       38.3ms ± 1%  38.4ms ± 0%   +0.36%  (p=0.000 n=96+98)
FlorianUekermannMidStrongDensityIterateManyReg    71.1ms ± 0%  74.8ms ± 0%   +5.26%  (p=0.000 n=95+97)
FlorianUekermannMidStrongDensityIterateManyComp   46.3ms ± 2%  46.4ms ± 1%   +0.21%  (p=0.000 n=99+92)

For some testcase, it improved. But some testcases got a worse result.

@lni
Copy link

lni commented Dec 11, 2021

@rajbarik

The bitset version I am using is 5a829244ffd64e4120015ce1bf8285b0b6168d55 which is the latest in its default branch. Any chance you can quickly check which version of bitset are you running?

Also wondering is there any other tests/benchmarks that are known to benefit from the PGO patch? Will be happy to try those as well. Thanks!

@rajbarik
Copy link
Contributor Author

@lni I will take a look at this next week and update you. Please stay tuned.

@lni
Copy link

lni commented Dec 17, 2021

@rajbarik Thanks a lot for looking into this.

We are building a new HTAP database in Go called MatrixOne. Our profiling results indicate that some hot functions are not inlined as expected when using the official compiler, in our experiments, we manually batched their invocations and immediately observed noticeable performance improvements. Profiling guided inlining as described in your PGO patch sounds like a pretty cool potential solution to the problem we face.

Looking forward to your update, will be very happy to further cooperate on this exciting PGO feature!

@rajbarik
Copy link
Contributor Author

rajbarik commented Dec 20, 2021

@lni @zhouguangyuan0718 My apologies, I had modified the BenchmarkLemireIterateManyb function in bitset_benchmark_test.go to increase its execution time by setting every bit in the bitset. I had modified one line; please see the commented line below.

// go test -bench=BenchmarkLemireIterateManyb
// see http://lemire.me/blog/2016/09/22/swift-versus-java-the-bitset-performance-test/
func BenchmarkLemireIterateManyb(b *testing.B) {
  bitmap := New(1000000000)
  for v := uint(0); v <= 1000000000; v += 1 { // This is the only change to the code which replaces 100 with 1
    bitmap.Set(v)
  }
  buffer := make([]uint, 256)
  b.ResetTimer()
  sum := uint(0)
  for i := 0; i < b.N; i++ {
    j := uint(0)
    j, buffer = bitmap.NextSetMany(j, buffer)
    for ; len(buffer) > 0; j, buffer = bitmap.NextSetMany(j, buffer) {
      for k := range buffer {
        sum += buffer[k]
      }
      j++
    }
  }

  if sum == 0 { // added just to fool ineffassign
    return
  }
}

With this change, here is what I see on my Intel Mac laptop:

go test -bench=BenchmarkLemireIterateManyb -cpu 1 -parallel 1 -count 5 -benchtime=3x -cpuprofile LemireIterate.pprof > before.txt
go test -bench=BenchmarkLemireIterateManyb -cpu 1 -parallel 1 -count 5 -benchtime=3x '-gcflags=-fpprofprofileuse LemireIterate.pprof -pprofinline' -cpuprofile LemireIterate_after.pprof > after.txt
benchstat before.txt after.txt
name                old time/op  new time/op  delta
LemireIterateManyb   1.46s ± 4%   1.31s ± 2%  -9.87%  (p=0.008 n=5+5)

Another benchmark where I see ~25% improvement is Tally. Steps below:

go test -bench=ScopeReporting/size1000000 -cpu 1 -parallel 1 -count 5 -benchtime=1x -cpuprofile scope.pprof > before.txt
go test -bench=ScopeReporting/size1000000 -cpu 1 -parallel 1 -count 5 -benchtime=1x '-gcflags=-fpprofprofileuse scope.pprof -pprofinline' -cpuprofile scope_after.pprof > after.txt
benchstat before.txt after.txt
name                        old time/op  new time/op  delta
ScopeReporting/size1000000  2.75ms ± 5%  2.04ms ± 5%  -25.75%  (p=0.008 n=5+5)

Please let me know if you are able to reproduce bitset and tally now.

@lni Regarding MatrixOne, I am able to run the unit-tests. Is there any particular unit-test I should dig deeper? Are there integration/component tests?

@mvdan
Copy link
Member

mvdan commented Aug 22, 2022

It seems like PGO is being talked about internally again (#43930 (comment)). Is there a chance that we could have more details about the current plan and design? For those of us interested in the Go compiler while outside Google, it's being particularly hard to follow what is happening, let alone contribute in any meaningful way :)

I understand that it may still be early for a formal proposal, but perhaps there could be a regular GitHub issue thread where we can subscribe for detailed updates every few weeks. Ideally the discussions about PGO would simply happen in a public place, removing the need for that back and forth between private and public communication.

@mvdan
Copy link
Member

mvdan commented Aug 24, 2022

It's only been two days, but my last comment is already buried by the commit push activity :) I had to click on "see more" three times to find it again. Can we please at least have one tracking issue to talk about PGO?

@josharian
Copy link
Contributor

#28262

@mvdan
Copy link
Member

mvdan commented Aug 24, 2022

Oh, thanks Josh. It has been some time and I even forgot I already asked this last year :)

wdvxdr1123 and others added 2 commits August 31, 2022 09:35
Fixes golang#45928.

Change-Id: Ifbb0effbca4ab7c0eb56069fee40edb564553c35
Reviewed-on: https://go-review.googlesource.com/c/go/+/410336
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
And add some documentation for the debug query param.

Fixes golang#27737
Fixes golang#53971

Change-Id: I629aaa2d4a43175381eb04872f1caad238519a41
Reviewed-on: https://go-review.googlesource.com/c/go/+/421635
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
cuiweixie and others added 25 commits September 8, 2022 15:52
Change-Id: I9ec725009376f5865adedca6c159b14140dde097
Reviewed-on: https://go-review.googlesource.com/c/go/+/426086
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Daniel Theophanes <kardianos@gmail.com>
Run-TryBot: Daniel Theophanes <kardianos@gmail.com>
For golang#46505.

Change-Id: I9bc9da5dd4b76cb2d8ff41390e1567678e72d88d
Reviewed-on: https://go-review.googlesource.com/c/go/+/428938
Run-TryBot: Robert Griesemer <gri@google.com>
Reviewed-by: Robert Findley <rfindley@google.com>
Auto-Submit: Robert Griesemer <gri@google.com>
Reviewed-by: Robert Griesemer <gri@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
pageAlloc.chunks used to require an atomic store when growing the heap
because the scavenger would look at the list without locking the heap
lock. However, the scavenger doesn't do that anymore, and it looks like
nothing really does at all.

This change updates the comment and makes the store non-atomic.

Change-Id: Ib452d147861060f9f6e74e2d98ee111cf89ce8f5
Reviewed-on: https://go-review.googlesource.com/c/go/+/429219
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
All subfields use atomic types to ensure alignment, so there's no more
need for these fields.

Change-Id: Iada4253f352a074073ce603f1f6b07cbd5b7c58a
Reviewed-on: https://go-review.googlesource.com/c/go/+/429220
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
This change fixes an old TODO that made it a uint64 because it would
make alignment within mheap more complicated. Now that we don't have to
worry about it since we're using atomic types as much as possible,
switch to using a Uintptr. This likely will improve performance a tiny
bit on 32-bit platforms, but really it's mostly cleanup.

Change-Id: Ie705799a111ccad977fc1f43de8b50cf611be303
Reviewed-on: https://go-review.googlesource.com/c/go/+/429221
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
This is an optimization that prevents N^2 behavior within a chunk, but
was accidentally skipped. There should be no functional change as a
result of this CL.

Fixes golang#54892.

Change-Id: I861967a2268699fdc3464bd41bc56618b5628e6b
Reviewed-on: https://go-review.googlesource.com/c/go/+/429255
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
This was left over from the old pacer, and never removed when the old
pacer was removed in Go 1.19.

Change-Id: I79e5f0420c6100c66bd06129a68f5bbab7c1ea8f
Reviewed-on: https://go-review.googlesource.com/c/go/+/429256
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
For golang#46505.

Change-Id: I1a30fd895496befd16626afb48717ac837ed5778
Reviewed-on: https://go-review.googlesource.com/c/go/+/429315
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Robert Griesemer <gri@google.com>
Because most of these APIs are recently supported, we can only do some
advancement work as much as possible under the premise of compatibility.

For golang#54854.

Change-Id: Id15d11288bf23902570d54eaf2704a5264210b2e
Reviewed-on: https://go-review.googlesource.com/c/go/+/429115
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: hopehook <hopehook@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
FixedZone is transitively called by Time.UnmarshalJSON or Time.UnmarshalText
for any RFC 3339 timestamp that is not in UTC.
This function is relatively slow since it allocates 3 times.

Given that RFC 3339 never has a zone name and most offsets are by the hour,
we can cache unnamed zones on hour offsets.
Caching a Location should be safe since it has no exported fields or methods
that can mutate the Location. It is functionally immutable.

The only way to observe that the Location was cached is either
by pointer comparison or by shallow copying the struct value.
Neither operation seems sensible to do with a *time.Location.

Performance:

	name           old time/op  new time/op  delta
	UnmarshalText  268ns ± 2%   182ns ± 1%  -32.01%  (p=0.000 n=10+10)

Change-Id: Iab5432f34bdbb485512bb8b5464e076c03fd106f
Reviewed-on: https://go-review.googlesource.com/c/go/+/425116
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Joseph Tsai <joetsai@digital-static.net>
Auto-Submit: Joseph Tsai <joetsai@digital-static.net>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Change-Id: I213b078effa4b7049c44498d651de5b938e5404b
Reviewed-on: https://go-review.googlesource.com/c/go/+/428779
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
Reviewed-by: hopehook <hopehook@golangcn.org>
Run-TryBot: hopehook <hopehook@golangcn.org>
Auto-Submit: Tobias Klauser <tobias.klauser@gmail.com>
When the type assertion fails, the test mistakenly prints the expected
(rather than the actual) type.

When the error string doesn't match, the text mistakenly prints the
original (rather than the converted) error (although there might not be
any difference in the output, the code looks wrong).

Fix both issues.

Change-Id: Ia7dd0632fc677f458fec25d899c46268a12f76e7
Reviewed-on: https://go-review.googlesource.com/c/go/+/428916
Run-TryBot: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Change-Id: I25c8e8b701d6489f360fea30d09090826276b950
GitHub-Last-Rev: c2c8319
GitHub-Pull-Request: golang#54924
Reviewed-on: https://go-review.googlesource.com/c/go/+/428976
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
It's set but the output is never used.

Change-Id: I36ecb9c5f087a85289529907ede9f9bfc295d739
Reviewed-on: https://go-review.googlesource.com/c/go/+/428637
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
As discussed in CL 401434 there are substantial misuses of these in the
wild, and they are a potential source of unsafety even for code that
does not use them directly.

Since proposal golang#53003 has already been implemented, now is the right
time to deprecate reflect.{SliceHeader, StringHeader}.

For golang#53003.

Change-Id: I724cf46d4b22d2ed3cbf2b948e6aac5ee4bf0f6e
Reviewed-on: https://go-review.googlesource.com/c/go/+/428757
Run-TryBot: hopehook <hopehook@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
…ice, String}

After we deprecated reflect.{SliceHeader, StringHeader}, it is recommended
to use unsafe.{Slice, String} to replace its work. However, the compiler
and linker cannot be migrated for the time being.

As a temporary strategy, using the "internal/unsafeheader" package like
other code is the most suitable choice at present.

For golang#53003.

Change-Id: I69d0ef72e2d95caabd0706bbb247a719d225c758
Reviewed-on: https://go-review.googlesource.com/c/go/+/429755
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: hopehook <hopehook@golangcn.org>
…ckage

Follow CL 428777.

Change-Id: I5ce49322e92c5d6539bb08248e3366187c30dcd8
Reviewed-on: https://go-review.googlesource.com/c/go/+/428780
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Tobias Klauser <tobias.klauser@gmail.com>
First, we know that Go source files almost always weigh at least a few
kilobytes, so we can kickstart the output buffer to be a reasonable size
and reduce the initial number of incremental allocations and copies when
appending bytes or strings to output.

Second, in nodeSize we use a nested printer, but we don't actually need
its printed bytes - we only need to know how many bytes it prints.
For that reason, use a throwaway buffer: the part of our output buffer
between length and capacity, as we haven't used it yet.

Third, use a sync.Pool to reuse allocated printers.
The current API doesn't allow reusing printers,
and some programs like gofmt will print many files in sequence.

Those changes combined result in a modest reduction in allocations and
CPU usage. The benchmark uses testdata/parser.go, which has just over
two thousand lines of code, which is pretty standard size-wise.

We also split the Print benchmark to cover both a medium-sized ast.File
as well as a pretty small ast.Decl node. The latter is a somewhat common
scenario in gopls, which has code actions which alter small bits of the
AST and print them back out to rewrite only a few lines in a file.

	name          old time/op    new time/op     delta
	PrintFile-16    5.43ms ± 1%     4.85ms ± 3%  -10.68%  (p=0.000 n=9+10)
	PrintDecl-16    19.1µs ± 0%     18.5µs ± 1%   -3.04%  (p=0.000 n=10+10)

	name          old speed      new speed       delta
	PrintFile-16  9.56MB/s ± 1%  10.69MB/s ± 3%  +11.81%  (p=0.000 n=8+10)
	PrintDecl-16  1.67MB/s ± 0%   1.73MB/s ± 1%   +3.05%  (p=0.000 n=10+10)

	name          old alloc/op   new alloc/op    delta
	PrintFile-16     332kB ± 0%      107kB ± 2%  -67.87%  (p=0.000 n=10+10)
	PrintDecl-16    3.92kB ± 0%     3.28kB ± 0%  -16.38%  (p=0.000 n=10+10)

	name          old allocs/op  new allocs/op   delta
	PrintFile-16     3.45k ± 0%      2.42k ± 0%  -29.90%  (p=0.000 n=10+10)
	PrintDecl-16      56.0 ± 0%       46.0 ± 0%  -17.86%  (p=0.000 n=10+10)

Change-Id: I475a3babca77532b2d51888f49710f74763d81d2
Reviewed-on: https://go-review.googlesource.com/c/go/+/424924
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@google.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
Change-Id: Ie543e1b1df667cfaf3aafa4be727881461ee8b7d
GitHub-Last-Rev: ed993db
GitHub-Pull-Request: golang#54888
Reviewed-on: https://go-review.googlesource.com/c/go/+/428716
Auto-Submit: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Change-Id: I9de5aafb36d05bdc90bbdba516367eb2b200a7e0
Reviewed-on: https://go-review.googlesource.com/c/go/+/428777
Auto-Submit: Tobias Klauser <tobias.klauser@gmail.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Implement CLONE_INTO_CGROUP feature, allowing to put a child in a
specified cgroup in a clean and simple way. Note that the feature only
works for cgroup v2, and requires Linux kernel 5.7 or newer.

Using the feature requires a new syscall, clone3. Currently this is the
only reason to use clone3, but the code is structured in a way so that
other cases may be easily added in the future.

Add a test case.

While at it, try to simplify the syscall calling code in
forkAndExecInChild1, which became complicated over time because:

1. It was using either rawVforkSyscall or RawSyscall6 depending on
   whether CLONE_NEWUSER was set.

2. On Linux/s390, the first two arguments to clone(2) system call are
   swapped (which deserved a mention in Linux ABI hall of shame). It
   was worked around in rawVforkSyscall on s390, but had to be
   implemented via a switch/case when using RawSyscall6, making the code
   less clear.

Let's

 - modify rawVforkSyscall to have two arguments (which is also required
   for clone3);

 - remove the arguments workaround from s390 asm, instead implementing
   arguments swap in the caller (which still looks ugly but at least
   it's done once and is clearly documented now);

 - use rawVforkSyscall for all cases (since it is essentially similar to
   RawSyscall6, except for having less parameters, not returning r2, and
   saving/restoring the return address before/after syscall on 386 and
   amd64).

Updates golang#51246.

Change-Id: Ifcd418ebead9257177338ffbcccd0bdecb94474e
Reviewed-on: https://go-review.googlesource.com/c/go/+/417695
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Change-Id: Ib9746cb4f27625cb22620271b280d2da242b2fba
Reviewed-on: https://go-review.googlesource.com/c/go/+/428437
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Andy Pan <panjf2000@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Update CL 425881 and CL 428396

I browsed the source code related to copy_file_range in the kernel and found that the latest kernel may still return EXDEV errors in copy_file_range(2) due to certain cases, for details see: https://elixir.bootlin.com/linux/v5.19.7/source/fs/read_write.c#L1559, https://elixir.bootlin.com/linux/v5.19.7/source/fs/read_write.c#L1479, and
https://elixir.bootlin.com/linux/v5.19.7/source/fs/read_write.c#L1439.

Therefore, the EXDEV still needs to be kept, but the ENOSYS error can be safely removed.

Change-Id: I47026b8dd33f7ffc4de1306af6b67c7b4d2062d2
Reviewed-on: https://go-review.googlesource.com/c/go/+/428555
Run-TryBot: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Andy Pan <panjf2000@gmail.com>
Change-Id: I9c4c3ada3a8f5d8d198cc42a4afc06972ee00c61
GitHub-Last-Rev: 4ed8011
GitHub-Pull-Request: golang#54916
Reviewed-on: https://go-review.googlesource.com/c/go/+/428921
Run-TryBot: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Change-Id: I7ea150e4896fc9b2e3a6dbdd9a1c2b651e74b844
Reviewed-on: https://go-review.googlesource.com/c/go/+/428778
Auto-Submit: Tobias Klauser <tobias.klauser@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@prattmic
Copy link
Member

FYI, this has moved to https://go.dev/cl/429863, and there is a proposal at #55022.

@prattmic prattmic closed this Sep 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: no Used by googlebot to label PRs as having an invalid CLA. The text of this label should not change.
Projects
None yet
Development

Successfully merging this pull request may close these issues.