perf!: concurrent merge with optional debugging #61
Conversation
And extract MergedFileDescriptors's default behavior to MergedGlobalFileDescriptors, and update MergedFileDescriptors to accept arguments. Now that MergedFileDescriptors is testable, backfill some tests for the error cases.
Add a basic benchmark for MergedGlobalFileDescriptors, and add a new package so that we can import jhump/protoreflect in order to dynamically create new file descriptors. Right now, this is a crude implementation that produces very simple file descriptors; but if we need to improve the test cases, we now have a framework in place.
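(As an aside for readers who have not built descriptors programmatically: the sketch below is not this repo's helper and does not use jhump/protoreflect; it illustrates the same idea with google.golang.org/protobuf's protodesc package, producing one very simple file descriptor with a single message. Every name in it is made up for illustration.)

package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protodesc"
	"google.golang.org/protobuf/reflect/protoregistry"
	"google.golang.org/protobuf/types/descriptorpb"
)

func main() {
	// Describe a trivial proto3 file containing one message with one string field.
	fdp := &descriptorpb.FileDescriptorProto{
		Name:    proto.String("test/dynamic_0.proto"), // made-up path
		Package: proto.String("test.dynamic"),
		Syntax:  proto.String("proto3"),
		MessageType: []*descriptorpb.DescriptorProto{{
			Name: proto.String("Simple"),
			Field: []*descriptorpb.FieldDescriptorProto{{
				Name:   proto.String("id"),
				Number: proto.Int32(1),
				Type:   descriptorpb.FieldDescriptorProto_TYPE_STRING.Enum(),
				Label:  descriptorpb.FieldDescriptorProto_LABEL_OPTIONAL.Enum(),
			}},
		}},
	}

	// Resolve the proto into a usable file descriptor. This file has no
	// imports, so the global registry is enough as the dependency resolver.
	fd, err := protodesc.NewFile(fdp, protoregistry.GlobalFiles)
	if err != nil {
		panic(err)
	}
	fmt.Println(fd.Path(), "->", fd.Messages().Get(0).FullName())
}

A test framework along the lines described above would generate many such FileDescriptorProtos with varying names and feed them into the merge under test.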
And add tests and benchmarks. I'm slightly stuck on the benchmarks: deserializing the "global" file descriptors from simd. Once I figure that out, we will have benchmarks representative of real load.
CI still uses go1.19.
This doesn't include every single global file, but it includes enough to make it a representative benchmark now.
Now that we are no longer using jhump/protoreflect for this, these tests can live in the same package as the other benchmarks. Also mark file descriptor json files as binary so that they no longer show up in diffs or git grep.
I was focused on benchmarks, so wasn't running this test -- which ended up breaking due to function renames.
And since I didn't include the benchmark data in a commit message earlier, this is what it looks like, before I remove the old style in the next commit. On my 10-core M1 Max machine, with a few different CPU values to exercise varying core counts:

$ go test ./test/mergedregistry/ -cpu=1,2,4,8,10 -bench=Repres -benchtime=5s
goos: darwin
goarch: arm64
pkg: github.com/cosmos/gogoproto/test/mergedregistry
BenchmarkRepresentativeMergedFileDescriptors/sequential        1790   3259606 ns/op   1636177 B/op   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-2      2041   2902895 ns/op   1636184 B/op   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-4      2101   2864636 ns/op   1636200 B/op   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-8      2031   2933869 ns/op   1636229 B/op   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-10     2024   2959004 ns/op   1636226 B/op   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent        1783   3364500 ns/op   1637314 B/op   42165 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-2      3027   1978660 ns/op   1694548 B/op   42176 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-4      4794   1301653 ns/op   1808718 B/op   42198 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-8      4257   1356715 ns/op   2030442 B/op   42241 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-10     4261   1396325 ns/op   2135427 B/op   42262 allocs/op
PASS

A single-core machine would unsurprisingly perform slightly better with the sequential approach. With two cores, the concurrent approach is measurably better. I'm slightly surprised to see throughput go lower at 8 and 10 cores. But this approach is still more than twice as fast in the 4-10 core case compared to the sequential approach, so I'm going to leave the GOMAXPROCS count as-is. This should still make a measurable improvement in the simulation tests for the SDK, and it should be much less of a bottleneck than the sequential approach.
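(For context on how such a comparison can be set up: the following is a minimal, hypothetical sketch of a Go benchmark with sequential and concurrent sub-benchmarks, not the benchmark code from this PR. mergeSequential and mergeConcurrent are trivial stand-ins, and the synthetic inputs replace the representative descriptor fixtures the real benchmark loads.)

package mergedregistry_test

import "testing"

// mergeSequential and mergeConcurrent are trivial stand-ins for the two
// implementations being compared; the real ones merge file descriptors.
func mergeSequential(inputs [][]byte) int {
	total := 0
	for _, b := range inputs {
		total += len(b)
	}
	return total
}

func mergeConcurrent(inputs [][]byte) int {
	// Placeholder; a worker-pool sketch appears later in this thread.
	return mergeSequential(inputs)
}

func BenchmarkMergeComparison(b *testing.B) {
	// Synthetic inputs; the real benchmark loads representative descriptors.
	inputs := make([][]byte, 1024)
	for i := range inputs {
		inputs[i] = make([]byte, 512)
	}

	b.Run("sequential", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			mergeSequential(inputs)
		}
	})
	b.Run("concurrent", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			mergeConcurrent(inputs)
		}
	})
}

// Running "go test -cpu=1,2,4,8,10 -bench=MergeComparison -benchtime=5s"
// repeats each sub-benchmark with GOMAXPROCS set to each listed value,
// which is what produces the -2, -4, ... suffixes in the output above.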
Benchmarks, and manually run experiments using simd and other cosmos-sdk tests, have shown that the concurrent implementation is faster regardless of whether validation is included. This commit's parent can be used if any other performance comparison is needed later.
// GOMAXPROCS is the number of CPU cores available, by default.
// Respect that setting as the number of CPU-bound goroutines,
// and for channel sizes.
nProcs := runtime.GOMAXPROCS(0)
Limiting to CPU-bound goroutines is a good idea, but the comment might need a rename to mergeFileDescriptors.
Good catch, thank you. Comments corrected in cf9db71.
Just a missed search and replace.
LGTM, thanks for the refactor.
So that we don't need a separate commit after #61 is merged.
lgtm!
Background
The non-determinism tests in the SDK have been taking excessively long to run. Profiling the test showed that there are many calls to codec/types.NewInterfaceRegistry, which calls proto.MergedRegistry in this module. Prior to these optimizations, that method would be limited to a single CPU for about 5ms on my machine when not performing validations, or about 225ms with validations on; with the optimizations, that drops to about 1.5ms or 55ms respectively.

While the simulations need to be cleaned up to not call that method so frequently, this was still worth optimizing because the method also runs on every startup of simd, either in client or server mode. The validations are currently off, but should probably be on as a sanity check -- shaving off 150+ milliseconds of CLI startup time is worth doing for faster automated recovery and for a better CLI experience.

This PR supersedes #56. It differs by offering cleaner control over child goroutines to process file descriptors; backfilling tests and benchmarks; and using an explicitly limited number of goroutines that reuse allocation-heavy values and process data arriving on channels, rather than one goroutine per file descriptor. Thank you to @mmsqe for the work in #56 that influenced this change.
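(To illustrate the shape of that design, here is a condensed, hypothetical sketch of a GOMAXPROCS-bounded worker pool consuming serialized file descriptors from a channel. It is not the code in this PR: the real implementation does more work per descriptor, reuses allocation-heavy values within each worker, and collects errors rather than dropping them.)

package main

import (
	"fmt"
	"runtime"
	"sync"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/descriptorpb"
)

// mergeAppFiles unmarshals each serialized FileDescriptorProto using a fixed
// number of CPU-bound workers rather than one goroutine per file.
func mergeAppFiles(appFiles map[string][]byte) []*descriptorpb.FileDescriptorProto {
	// GOMAXPROCS(0) reports the current setting without changing it; use it
	// for both the worker count and the channel sizes.
	nProcs := runtime.GOMAXPROCS(0)

	in := make(chan []byte, nProcs)
	out := make(chan *descriptorpb.FileDescriptorProto, nProcs)

	var wg sync.WaitGroup
	wg.Add(nProcs)
	for i := 0; i < nProcs; i++ {
		go func() {
			defer wg.Done()
			for raw := range in {
				fd := new(descriptorpb.FileDescriptorProto)
				if err := proto.Unmarshal(raw, fd); err != nil {
					continue // the real implementation reports errors instead of dropping them
				}
				out <- fd
			}
		}()
	}

	// Feed the workers, then close the output once they have all finished.
	go func() {
		for _, raw := range appFiles {
			in <- raw
		}
		close(in)
		wg.Wait()
		close(out)
	}()

	fds := make([]*descriptorpb.FileDescriptorProto, 0, len(appFiles))
	for fd := range out {
		fds = append(fds, fd)
	}
	return fds
}

func main() {
	fmt.Println(len(mergeAppFiles(nil)), "descriptors merged")
}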
API change
Calls to proto.MergedRegistry() do not need to change.

Existing calls to proto.MergedFileDescriptors() need to change to call one of proto.MergedGlobalFileDescriptors() or proto.MergedGlobalFileDescriptorsWithValidation(). I chose to move the existing debug parameter into an explicit part of the method name rather than a boolean argument. This makes it clear at each call site whether the merge can fail due to import path mismatches or incorrectly duplicated file descriptors, without the reader needing context on a literal true or false argument.

For finer control over a merge, proto.MergedFileDescriptors has been updated to accept the arguments (globalFiles *protoregistry.Files, appFiles map[string][]byte). I'm not sure if this will be useful outside of tests, but it was necessary for testing and benchmarking these changes.
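(A hedged before/after sketch of a call site follows. It assumes the merged result is still a *descriptorpb.FileDescriptorSet plus an error, as with the pre-existing function, and the "before" line is a hypothetical reconstruction of the old bool-argument signature.)

package main

import (
	"fmt"

	"github.com/cosmos/gogoproto/proto"
)

func main() {
	// Before this PR (hypothetical reconstruction of the old call):
	//   fds, err := proto.MergedFileDescriptors(true) // debug checks enabled
	//
	// After this PR, the variant is chosen by name instead of by a bool:
	fds, err := proto.MergedGlobalFileDescriptorsWithValidation() // was debug == true
	// fds, err := proto.MergedGlobalFileDescriptors()            // was debug == false
	if err != nil {
		panic(err)
	}
	fmt.Println(len(fds.File), "file descriptors merged") // assumes a FileDescriptorSet result
}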
Performance data
In a draft PR on the SDK, this brings the time for test-sim-nondeterminism's test stage down to under 9 minutes: https://github.com/cosmos/cosmos-sdk/actions/runs/4724996838/jobs/8382891823. With the previous implementation, runs taking longer than 45 minutes were common.

Here are benchmarks comparing the previous and new implementations, against a representative but incomplete copy of the protobuf file descriptors as of cosmos/cosmos-sdk@c4489d9:
See these trace screenshots for some visualizations of the performance data. Note that the time scale differs in each trace, and that work is properly spread across all available procs.