Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf!: concurrent merge with optional debugging #61

Merged
merged 12 commits into from
Apr 18, 2023

Conversation

mark-rushakoff
Copy link
Member

@mark-rushakoff mark-rushakoff commented Apr 14, 2023

Background

The non-determinism tests in the SDK have been taking excessively long to run. Profiling the test showed that there are many calls to codec/types.NewInterfaceRegistry, which calls proto.MergedRegistry in this module. Prior to these optimizations, that method would be limited to a single CPU for about 5ms on my machine when not performing validations, or about 225ms with validations on; with the optimizations, that drops to about 1.5ms or 55ms respectively, on my machine.

While the simulations need to be cleaned up to not call that method so frequently, this was still worth optimizing because:

  • other tests will continue to instantiate a new merged registry as needed
  • the merged registry is currently required when starting simd either in client or server mode. The validations are currently off, but should probably be on as a sanity check -- shaving off 150+ milliseconds of CLI startup time is worth doing for faster automated recovery and for a better CLI experience.

This PR supersedes #56. It differs by offering cleaner control over child goroutines to process file descriptors; backfilling tests and benchmarks; and using an explicitly limited number of goroutines that reuse allocation-heavy values and process data arriving on channels, rather than one goroutine per file descriptor. Thank you to @mmsqe for the work in #56 that influenced this change.

API change

Calls to proto.MergedRegistry() do not need to change.

Existing calls to proto.MergedFileDescriptors() need to change to call one of proto.MergedGlobalFileDescriptors() or proto.MergedGlobalFileDescriptorsWithValidation(). I chose to move the existing debug parameter into an explicit part of the method name rather than a boolean argument. This will make call sites clear whether they will fail due to import path mismatches or incorrectly duplicated file descriptors, without the reader needing context on the literal true or false argument.

For finer control over a merge, proto.MergedFileDescriptors has been updated to accept arguments (globalFiles *protoregistry.Files, appFiles map[string][]byte). I'm not sure if this will be useful outside of tests, but it was necessary for testing and benchmarking these changes.

Performance data

In a draft PR on the SDK, this brings the time for test-sim-nondeterminism's test stage down to under 9 minutes: https://github.com/cosmos/cosmos-sdk/actions/runs/4724996838/jobs/8382891823. With the previous implementation, runs taking longer than 45 minutes were common.

Here are benchmarks comparing the previous and new implementations, against a representative but incomplete copy of the protobuf file descriptors as of cosmos/cosmos-sdk@c4489d9:

$ go test ./test/mergedregistry/ -cpu=1,2,4,8,10 -bench=Repres -benchtime=5s
goos: darwin
goarch: arm64
pkg: github.com/cosmos/gogoproto/test/mergedregistry
BenchmarkRepresentativeMergedFileDescriptors/sequential            	    1790	   3259606 ns/op	 1636177 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-2          	    2041	   2902895 ns/op	 1636184 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-4          	    2101	   2864636 ns/op	 1636200 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-8          	    2031	   2933869 ns/op	 1636229 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-10         	    2024	   2959004 ns/op	 1636226 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent            	    1783	   3364500 ns/op	 1637314 B/op	   42165 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-2          	    3027	   1978660 ns/op	 1694548 B/op	   42176 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-4          	    4794	   1301653 ns/op	 1808718 B/op	   42198 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-8          	    4257	   1356715 ns/op	 2030442 B/op	   42241 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-10         	    4261	   1396325 ns/op	 2135427 B/op	   42262 allocs/op

See these trace screenshots for some visualizations of the performance data. Note that the time scale differs in each trace, and that work is properly spread across all available procs.

Screen Shot 2023-04-14 at 3 40 00 PM

Screen Shot 2023-04-14 at 3 43 26 PM

And extract MergedFileDescriptors's default behavior to
MergedGlobalFileDescriptors, and update MergedFileDescriptors to accept
arguments. Now that MergedFileDescriptors is testable, backfill some
tests for the error cases.
Add a basic benchmark for MergedGlobalFileDescriptors, and add a new
package so that we can import jhump/protoreflect in order to dynamically
create new file descriptors. Right now, this is a crude implementation
that produces very simple file descriptors; but if we need to improve
the test cases, we now have a framework in place.
And add tests and benchmarks.

I'm slightly stuck on the benchmarks, in deserializing the "global"
file descriptors from simd. Once I figure that out, we will have
benchmarks representative of real load.
This doesn't include every single global file, but it includes enough to
make it a representative benchmark now.
Now that we are no longer using jhump/protoreflect for this, these tests
can live in the same package as the other benchmarks.

Also mark file descriptor json files as binary so that they no longer
show up in diffs or git grep.
I was focused on benchmarks, so wasn't running this test -- which ended
up breaking due to function renames.
And since I didn't include the benchmark data in a commit message
earlier, this is what it looks like, before I remove the old style in
the next commit. On my 10-core M1 Max machine, with a few different CPU
values to exercise varying core counts:

$ go test ./test/mergedregistry/ -cpu=1,2,4,8,10 -bench=Repres -benchtime=5s
goos: darwin
goarch: arm64
pkg: github.com/cosmos/gogoproto/test/mergedregistry
BenchmarkRepresentativeMergedFileDescriptors/sequential            	    1790	   3259606 ns/op	 1636177 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-2          	    2041	   2902895 ns/op	 1636184 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-4          	    2101	   2864636 ns/op	 1636200 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-8          	    2031	   2933869 ns/op	 1636229 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-10         	    2024	   2959004 ns/op	 1636226 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent            	    1783	   3364500 ns/op	 1637314 B/op	   42165 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-2          	    3027	   1978660 ns/op	 1694548 B/op	   42176 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-4          	    4794	   1301653 ns/op	 1808718 B/op	   42198 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-8          	    4257	   1356715 ns/op	 2030442 B/op	   42241 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-10         	    4261	   1396325 ns/op	 2135427 B/op	   42262 allocs/op
PASS

A single core machine unsurprisingly would perform slightly better on
the sequential approach. With two cores, the concurrent approach is
measurably better. I'm slightly surprised to see throughput go lower at
8 and 10 cores. But this approach is still more than twice as fast in
the 4-10 core case compared to the sequential approach, so I'm going to
leave the GOMAXPROCS count as-is. This should still make a measurable
improvement in the simulation tests for the SDK, and it should be much
less of a bottleneck than the sequential approach.
Benchmarks, and manually run experiments using simd and other
cosmos-sdk tests, have shown that the concurrent implementation is
faster regardless of whether validation is included. This commit's
parent can be used if any other performance comparison is needed later.
@mark-rushakoff mark-rushakoff changed the title perf: concurrent merge with optional debugging perf!: concurrent merge with optional debugging Apr 17, 2023
// GOMAXPROCS is the number of CPU cores available, by default.
// Respect that setting as the number of CPU-bound goroutines,
// and for channel sizes.
nProcs := runtime.GOMAXPROCS(0)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

limit with CPU-bound is a good idea, but might need rename to mergeFileDescriptors

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thank you. Comments corrected in cf9db71.

Just a missed search and replace.
Copy link

@mmsqe mmsqe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the refactor.

So that we don't need a separate commit after #61 is merged.
Copy link
Member

@julienrbrt julienrbrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@mark-rushakoff mark-rushakoff merged commit a0d0171 into main Apr 18, 2023
@mark-rushakoff mark-rushakoff deleted the mr/concurrent-merge-with-debug branch April 18, 2023 13:48
@facundomedica facundomedica restored the mr/concurrent-merge-with-debug branch April 21, 2023 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants